
Mathematical Infrastructure

Stochastic Calculus for ML

Brownian motion, Ito integrals, Ito's lemma, and stochastic differential equations: the mathematical machinery behind diffusion models, score-based generative models, and Langevin dynamics.

Advanced · Tier 3 · Stable · Supporting · ~35 min

Why This Matters

Diffusion models (DDPM, score-based models) generate data by reversing a stochastic differential equation. Langevin dynamics, used in MCMC sampling and score matching, is a specific SDE. Understanding these models requires stochastic calculus: the extension of ordinary calculus to processes driven by Brownian motion.

The central surprise of stochastic calculus is that the chain rule changes. When you apply a smooth function to a process driven by Brownian motion, you pick up an extra second-order term that does not appear in ordinary calculus. This is Ito's lemma, and it is the single most important formula in this topic.

Brownian Motion

Definition

Standard Brownian Motion

A continuous-time stochastic process $\{W_t\}_{t \geq 0}$ satisfying:

  1. $W_0 = 0$
  2. Independent increments: $W_t - W_s$ is independent of $\{W_u : u \leq s\}$ for $s < t$
  3. Gaussian increments: $W_t - W_s \sim \mathcal{N}(0, t - s)$
  4. Continuous paths: $t \mapsto W_t$ is continuous almost surely

Key properties that make Brownian motion different from smooth functions:

  • Nowhere differentiable: $W_t$ is continuous but has no derivative at any point, almost surely. This is why Riemann-Stieltjes integration fails.
  • Quadratic variation: $\sum_{i} (W_{t_{i+1}} - W_{t_i})^2 \to t$ as the partition becomes finer. For smooth functions, quadratic variation is zero. This nonzero quadratic variation is the source of the extra term in Ito's lemma.
  • Scaling: $\{c^{-1/2} W_{ct}\}_{t \geq 0}$ is also a standard Brownian motion. The $\sqrt{t}$ scaling of increments (not linear in $t$) is characteristic.
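A quick numerical check makes these properties concrete. The sketch below (NumPy; step counts and seed are arbitrary choices) builds a Brownian path from i.i.d. Gaussian increments and verifies that the quadratic variation approaches $t$ while the endpoint $W_1$ has variance $1$:

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 1.0, 100_000
dt = T / n

# Brownian increments are N(0, dt), i.e. standard deviation sqrt(dt)
dW = rng.normal(0.0, np.sqrt(dt), size=n)

# Quadratic variation: the sum of squared increments approaches T
qv = np.sum(dW**2)
print(f"quadratic variation ~ {qv:.4f} (expect {T})")

# Endpoint distribution: W_1 ~ N(0, 1); check across 5000 coarse paths
W1 = rng.normal(0.0, np.sqrt(1.0 / 100), size=(5000, 100)).sum(axis=1)
print(f"Var[W_1] ~ {W1.var():.3f} (expect 1.0)")
```

A smooth function sampled on the same grid would give a quadratic variation of order $1/n$, not a constant; that contrast is exactly what forces the extra term in Ito's lemma.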

The Ito Integral

Ordinary calculus defines $\int_0^T f(t) \, dg(t)$ when $g$ has bounded variation. Brownian paths have unbounded variation (they wiggle too much), so this definition fails.

Definition

Ito Integral

For an adapted process $f_t$ satisfying $\mathbb{E}[\int_0^T f_t^2 \, dt] < \infty$, the Ito integral is:

$$\int_0^T f_t \, dW_t = \lim_{n \to \infty} \sum_{i=0}^{n-1} f_{t_i}(W_{t_{i+1}} - W_{t_i})$$

The limit is in $L^2$. The crucial feature: $f$ is evaluated at the left endpoint $t_i$, not the right endpoint or midpoint. This choice makes the integral a martingale but means Ito's calculus differs from Stratonovich's.

The left-endpoint evaluation is not arbitrary. It ensures that the integrand is independent of the increment $W_{t_{i+1}} - W_{t_i}$, which is required for the integral to be a martingale and for the Ito isometry to hold.
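The left-endpoint sum can be checked numerically. Ito's lemma gives the closed form $\int_0^T W_t \, dW_t = (W_T^2 - T)/2$; the sketch below (single path, arbitrary seed) compares the discrete Ito sum against it:

```python
import numpy as np

rng = np.random.default_rng(1)
T, n = 1.0, 200_000
dt = T / n
dW = rng.normal(0.0, np.sqrt(dt), size=n)
W = np.concatenate([[0.0], np.cumsum(dW)])  # W[i] = W_{t_i}, W[0] = 0

# Ito (left-endpoint) Riemann sum for  int_0^T W_t dW_t
ito_sum = np.sum(W[:-1] * dW)

# Closed form from Ito's lemma: (W_T^2 - T) / 2.  The -T/2 comes from the
# quadratic-variation term; the ordinary chain rule would give W_T^2 / 2.
closed_form = (W[-1] ** 2 - T) / 2
print(ito_sum, closed_form)
```

Replacing `W[:-1]` with the midpoint average `(W[:-1] + W[1:]) / 2` gives the Stratonovich sum, which converges to $W_T^2 / 2$ instead.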

Main Theorems

Theorem

Ito Isometry

Statement

$$\mathbb{E}\left[\left(\int_0^T f_t \, dW_t\right)^2\right] = \mathbb{E}\left[\int_0^T f_t^2 \, dt\right]$$

Intuition

The Ito integral maps a square-integrable adapted process to a random variable, and it does so isometrically: the variance of the integral equals the expected time-integral of $f_t^2$. This is because the cross-terms in the square vanish via adaptedness plus the martingale property of Brownian increments (conditioning on the earlier sigma-algebra zeros out the forward increment).

Proof Sketch

For simple (step) functions, expand the square of the sum. Cross-terms have the form $f_{t_i} f_{t_j} (W_{t_{i+1}} - W_{t_i})(W_{t_{j+1}} - W_{t_j})$ with $i < j$. These vanish by conditioning on $\mathcal{F}_{t_j}$: adaptedness makes $f_{t_i}$, $f_{t_j}$, and $(W_{t_{i+1}} - W_{t_i})$ all $\mathcal{F}_{t_j}$-measurable, and the forward increment $(W_{t_{j+1}} - W_{t_j})$ has zero conditional mean by the martingale property of Brownian motion. Only the diagonal terms survive, giving $\sum_i \mathbb{E}[f_{t_i}^2] \, (t_{i+1} - t_i)$. Extend by density to general integrands.

Why It Matters

The Ito isometry is the tool for computing variances of stochastic integrals. It also provides the $L^2$ framework needed to define the integral for general integrands by approximation with simple processes.

Failure Mode

The isometry fails if $f_t$ is not adapted (i.e., it looks into the future). It also fails for the Stratonovich integral, where the integrand is evaluated at the midpoint rather than the left endpoint.
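As a sanity check, the isometry can be verified by Monte Carlo for the adapted integrand $f_t = W_t$, where the right-hand side is $\mathbb{E}[\int_0^T W_t^2 \, dt] = \int_0^T t \, dt = T^2/2$ (path count and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
T, n, n_paths = 1.0, 1_000, 20_000
dt = T / n
dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n))
W_left = np.cumsum(dW, axis=1) - dW   # left endpoints W_{t_i} (with W_0 = 0)

# Left side: E[(int_0^T W_t dW_t)^2], estimated over many paths
lhs = np.mean(np.sum(W_left * dW, axis=1) ** 2)

# Right side: E[int_0^T W_t^2 dt] = int_0^T t dt = T^2 / 2
rhs = T**2 / 2
print(lhs, rhs)
```

Evaluating the integrand at the right endpoint (`np.cumsum(dW, axis=1)`) breaks the independence between integrand and increment, and the two sides no longer agree.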

Theorem

Ito's Lemma

Statement

If $X_t$ satisfies $dX_t = \mu_t \, dt + \sigma_t \, dW_t$ and $f \in C^2$, then:

$$df(X_t) = f'(X_t) \, dX_t + \frac{1}{2} f''(X_t) \, \sigma_t^2 \, dt$$

Equivalently:

$$df(X_t) = \left[\mu_t f'(X_t) + \frac{1}{2} \sigma_t^2 f''(X_t)\right] dt + \sigma_t f'(X_t) \, dW_t$$

Intuition

This is the chain rule for stochastic processes. In ordinary calculus, $df(x) = f'(x) \, dx$ and we stop at first order because $(dx)^2$ is negligible. For Brownian motion, $(dW_t)^2 = dt$ (heuristically), so the second-order Taylor term $\frac{1}{2} f''(X_t)(dX_t)^2$ contributes a non-negligible $dt$ term. This is the extra correction.

Proof Sketch

Apply a second-order Taylor expansion: $f(X_{t+\Delta t}) - f(X_t) \approx f'(X_t) \, \Delta X + \frac{1}{2} f''(X_t) (\Delta X)^2$. Compute $(\Delta X)^2 = (\mu \Delta t + \sigma \Delta W)^2$. Summed over the partition, the $(\Delta W)^2$ terms converge to $\int \sigma^2 \, dt$ as the partition refines (quadratic variation of Brownian motion). The $\Delta t \cdot \Delta W$ and $(\Delta t)^2$ terms vanish.

Why It Matters

Ito's lemma is the computational workhorse of stochastic calculus. Every derivation involving SDEs uses it: computing the dynamics of transformed processes, deriving the Fokker-Planck equation, proving properties of diffusion models. You will use this constantly.

Failure Mode

If $f$ is not $C^2$, the lemma does not apply in its standard form (though generalizations exist via Tanaka's formula). If the underlying process is not an Ito process (e.g., it has jumps), you need the jump-diffusion version of Ito's lemma.
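One consequence worth checking numerically: for $f(x) = e^x$, Ito's lemma gives $d(e^{W_t}) = e^{W_t} \, dW_t + \frac{1}{2} e^{W_t} \, dt$. Taking expectations kills the martingale $dW$ term, leaving $m'(t) = m(t)/2$ for $m(t) = \mathbb{E}[e^{W_t}]$, so $\mathbb{E}[e^{W_t}] = e^{t/2}$ rather than $1$. A Monte Carlo sketch (sample count and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
t, n_samples = 1.0, 400_000

# W_t ~ N(0, t); Ito's lemma predicts E[e^{W_t}] = e^{t/2}, not e^0 = 1
W_t = rng.normal(0.0, np.sqrt(t), size=n_samples)
mc = np.exp(W_t).mean()
print(mc, np.exp(t / 2))
```

The naive chain rule would predict a constant mean of $1$; the observed excess is exactly the second-order Ito correction.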

Stochastic Differential Equations

Definition

Stochastic Differential Equation

An SDE has the form:

$$dX_t = \mu(X_t, t) \, dt + \sigma(X_t, t) \, dW_t$$

where $\mu$ is the drift (deterministic tendency) and $\sigma$ is the diffusion coefficient (noise intensity). This is shorthand for the integral equation $X_t = X_0 + \int_0^t \mu(X_s, s) \, ds + \int_0^t \sigma(X_s, s) \, dW_s$.
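In practice SDEs are simulated with the Euler-Maruyama scheme, which replaces $dt$ by a small step and $dW_t$ by a $\mathcal{N}(0, \Delta t)$ draw. A minimal sketch (the function name and parameters are our own), tested on an Ornstein-Uhlenbeck process, whose stationary distribution is $\mathcal{N}(0, \sigma^2/(2\theta))$:

```python
import numpy as np

def euler_maruyama(mu, sigma, x0, T, n_steps, rng):
    """Simulate dX_t = mu(X_t, t) dt + sigma(X_t, t) dW_t on [0, T].

    x0 is an array of initial values; one independent path per entry.
    """
    dt = T / n_steps
    x = np.asarray(x0, dtype=float).copy()
    for k in range(n_steps):
        dW = rng.normal(0.0, np.sqrt(dt), size=x.shape)
        x = x + mu(x, k * dt) * dt + sigma(x, k * dt) * dW
    return x

# Ornstein-Uhlenbeck test case: dX = -theta X dt + sig dW has stationary
# distribution N(0, sig^2 / (2 theta)) -- here N(0, 0.125).
rng = np.random.default_rng(4)
theta, sig = 1.0, 0.5
x_T = euler_maruyama(lambda x, t: -theta * x,
                     lambda x, t: sig * np.ones_like(x),
                     x0=np.zeros(50_000), T=8.0, n_steps=800, rng=rng)
print(x_T.mean(), x_T.var())  # compare with 0 and 0.125
```

Euler-Maruyama inherits the well-posedness caveats of the theorem below: for non-Lipschitz coefficients like the CIR square root, the naive scheme can even step outside the state space.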

Theorem

Existence and Uniqueness for SDEs

Statement

Under Lipschitz and linear growth conditions on $\mu$ and $\sigma$, the SDE has a unique strong solution $X_t$ that is adapted to the filtration generated by $W_t$ and satisfies $\mathbb{E}[\sup_{0 \leq t \leq T} X_t^2] < \infty$.

Intuition

This is the stochastic analog of Picard-Lindelof for ODEs. Lipschitz continuity prevents solutions from splitting apart (uniqueness), and linear growth prevents solutions from exploding to infinity in finite time (existence).

Proof Sketch

Picard iteration: define $X_t^{(0)} = X_0$ and $X_t^{(n+1)} = X_0 + \int_0^t \mu(X_s^{(n)}, s) \, ds + \int_0^t \sigma(X_s^{(n)}, s) \, dW_s$. Use the Ito isometry and the Lipschitz condition to show that the iterates form a Cauchy sequence in $L^2$ under the norm $\mathbb{E}[\sup_{[0,T]} |\cdot|^2]^{1/2}$. Completeness gives convergence to a unique limit.

Why It Matters

This theorem guarantees that the forward and reverse SDEs in diffusion models are well-defined. Without existence and uniqueness, the generative process would not be mathematically sound.

Failure Mode

Many practically important SDEs violate Lipschitz continuity. The CIR process $dX_t = a(b - X_t) \, dt + \sigma \sqrt{X_t} \, dW_t$ has diffusion coefficient $\sigma(x) = \sigma \sqrt{x}$, which is not Lipschitz at $x = 0$. In such cases, existence and uniqueness can still be established by other methods, but the standard theorem does not apply directly.

Connections to ML

Diffusion models: the forward process $dX_t = f(t) X_t \, dt + g(t) \, dW_t$ gradually adds noise to data. The reverse process (Anderson, 1982) is also an SDE:

$$dX_t = \left[f(t) X_t - g(t)^2 \nabla_x \log p_t(X_t)\right] dt + g(t) \, d\bar{W}_t$$

where $\nabla_x \log p_t$ is the score function. The neural network learns to approximate this score. The forward SDE's marginal $p_t(x)$ satisfies the Fokker-Planck equation; the reverse SDE's existence relies on Anderson's time-reversal theorem, which requires $\nabla_x \log p_t$ to be well-defined (finite Fisher information at each time).
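For Gaussian data the marginals $p_t$ stay Gaussian and the score is available in closed form, so the reverse SDE can be simulated exactly as written. The sketch below is a toy setup of our own: forward SDE $dX_t = -\tfrac{1}{2} X_t \, dt + dW_t$ (so $f(t) = -\tfrac12$, $g(t) = 1$), data $\mathcal{N}(2, 0.25)$. It runs the reverse SDE from the near-standard-normal marginal at $t = T$ back to $t = 0$ and recovers the data distribution:

```python
import numpy as np

rng = np.random.default_rng(5)
m0, s0 = 2.0, 0.5          # toy "data" distribution N(2, 0.25)
T, n = 6.0, 1_200
dt = T / n

def mean_var(t):
    # Under dX = -X/2 dt + dW a Gaussian marginal keeps these moments
    return m0 * np.exp(-t / 2), s0**2 * np.exp(-t) + 1.0 - np.exp(-t)

def score(x, t):
    m, v = mean_var(t)
    return -(x - m) / v    # exact score of the Gaussian marginal p_t

# Start from the true marginal at t = T (close to the N(0, 1) prior)
m_T, v_T = mean_var(T)
x = rng.normal(m_T, np.sqrt(v_T), size=50_000)

# Reverse SDE dX = [f X - g^2 score] dt + g dW-bar, integrated backward
# in time; per backward Euler step: X += (X/2 + score) dt + sqrt(dt) Z
for k in range(n):
    t = T - k * dt
    x = (x + (0.5 * x + score(x, t)) * dt
         + rng.normal(0.0, np.sqrt(dt), size=x.shape))

print(x.mean(), x.var())   # should be close to the data moments (2, 0.25)
```

A trained score network simply replaces `score(x, t)`; everything else in the sampler is unchanged.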

Langevin dynamics: the SDE $dX_t = \nabla \log p(X_t) \, dt + \sqrt{2} \, dW_t$ has $p$ as its stationary distribution under mild conditions (log-concavity gives exponential convergence; without it, convergence can be arbitrarily slow). Discretizing with step size $\eta$ gives the unadjusted Langevin algorithm (ULA); the discretization error is $O(\eta)$ in KL divergence, which is why Metropolis correction (MALA) or decreasing step sizes are needed for exact sampling.
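A minimal ULA sketch for a standard Gaussian target ($\nabla \log p(x) = -x$) shows both the sampling and the $O(\eta)$ bias: for this target the update is linear, and the chain's stationary variance works out to $1/(1 - \eta/2)$ rather than exactly $1$ (step size, chain count, and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(6)
eta, n_steps, n_chains = 0.05, 2_000, 20_000

def grad_log_p(x):
    return -x              # target p = N(0, 1)

x = 3.0 * rng.normal(size=n_chains)   # deliberately bad initialization
for _ in range(n_steps):
    x = x + eta * grad_log_p(x) + np.sqrt(2 * eta) * rng.normal(size=n_chains)

# ULA is biased: here the stationary variance is 1 / (1 - eta/2) ~ 1.026,
# not exactly 1.  A Metropolis correction (MALA) would remove the bias.
print(x.mean(), x.var())
```

Shrinking `eta` shrinks the bias linearly, at the cost of slower mixing; this is the step-size trade-off the text describes.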

SGD as SDE: with small learning rate $\eta$, SGD on a loss $L$ approximately follows $dX_t = -\nabla L(X_t) \, dt + \sqrt{\eta} \, \Sigma(X_t)^{1/2} \, dW_t$, where $\Sigma(x)$ is the (per-step) minibatch gradient covariance. Equivalently, the infinitesimal generator has diffusion tensor $\eta \, \Sigma(x)$, but the amplitude of the noise scales like $\sqrt{\eta}$, not $\eta$. Li et al. (2019) showed this approximation is valid up to $O(\eta)$ in weak error after the standard time rescaling $t = k\eta$ (so one continuous-time unit corresponds to $1/\eta$ SGD steps). The SDE viewpoint explains why SGD with larger $\eta$ finds flatter minima: the noise amplitude $\sqrt{\eta} \, \Sigma^{1/2}$ pushes iterates out of sharp basins.

Geometric Brownian motion: the SDE $dS_t = \mu S_t \, dt + \sigma S_t \, dW_t$ models stock prices in the Black-Scholes framework. Applying Ito's lemma to $\log S_t$ gives $d(\log S_t) = (\mu - \sigma^2/2) \, dt + \sigma \, dW_t$, so $S_t = S_0 \exp((\mu - \sigma^2/2) t + \sigma W_t)$. The $-\sigma^2/2$ correction is a direct consequence of the Ito second-order term.
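The closed form can be checked against a direct Euler-Maruyama simulation driven by the same noise, and the $\mu - \sigma^2/2$ log-drift is visible in the sample mean of $\log S_T$ (a sketch with arbitrary parameters and seed):

```python
import numpy as np

rng = np.random.default_rng(7)
S0, mu, sigma, T, n = 1.0, 0.1, 0.3, 1.0, 2_000
dt = T / n
dW = rng.normal(0.0, np.sqrt(dt), size=(10_000, n))

# Exact solution via Ito's lemma (note the -sigma^2/2 drift correction)
S_exact = S0 * np.exp((mu - sigma**2 / 2) * T + sigma * dW.sum(axis=1))

# Euler-Maruyama driven by the same increments stays close to exact paths
S = np.full(10_000, S0)
for k in range(n):
    S = S + mu * S * dt + sigma * S * dW[:, k]

print(np.abs(S - S_exact).mean())                       # discretization error
print(np.log(S_exact).mean(), (mu - sigma**2 / 2) * T)  # log-drift check
```

Naively exponentiating the drift alone would predict $\mathbb{E}[\log S_T] = \mu T$; the simulation shows the smaller Ito-corrected drift instead.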

Common Confusions

Watch Out

Ito vs Stratonovich

The Ito integral evaluates the integrand at the left endpoint; the Stratonovich integral evaluates at the midpoint. They give different results for the same integrand. Ito is standard in probability and finance because Ito integrals are martingales. Stratonovich is common in physics because it preserves the ordinary chain rule. For diffusion models in ML, the Ito convention is standard.

Watch Out

dW_t squared is not zero

In ordinary calculus, $(dx)^2 = 0$ because it is second-order. In stochastic calculus, $(dW_t)^2 = dt$ (in the formal sense of quadratic variation). This is why Ito's lemma has an extra term. Forgetting this is the most common error.

Summary

  • Brownian motion has continuous but nowhere differentiable paths
  • Ito integrals use left-endpoint evaluation, making them martingales
  • Ito's lemma: $df = f' \, dX + \frac{1}{2} f'' \sigma^2 \, dt$ (the extra second-order term is the key difference from ordinary calculus)
  • SDEs exist and are unique under Lipschitz and linear growth conditions
  • Diffusion models, Langevin dynamics, and SGD analysis all use SDEs

Exercises

Exercise · Core

Problem

Let $W_t$ be a standard Brownian motion. Use Ito's lemma to find $d(W_t^2)$. What are the drift and diffusion coefficients of the process $Y_t = W_t^2$?

Exercise · Advanced

Problem

The Ornstein-Uhlenbeck process satisfies $dX_t = -\theta X_t \, dt + \sigma \, dW_t$ with $\theta > 0$. Use Ito's lemma on $Y_t = X_t e^{\theta t}$ to find the explicit solution for $X_t$.

References

Canonical:

  • Oksendal, Stochastic Differential Equations (6th ed., 2003), Chapters 3-5 (Ito integral, Ito's formula, SDEs)
  • Karatzas & Shreve, Brownian Motion and Stochastic Calculus (2nd ed., 1991), Chapter 3 (Ito integral construction and isometry)
  • Revuz & Yor, Continuous Martingales and Brownian Motion (3rd ed., 2005), Chapters IV-V (Ito calculus in full generality)

Current:

  • Song, Sohl-Dickstein, Kingma, Kumar, Ermon, Poole, Score-Based Generative Modeling through Stochastic Differential Equations, ICLR 2021, arXiv:2011.13456, Section 2
  • Ho, Jain, Abbeel, Denoising Diffusion Probabilistic Models, NeurIPS 2020, arXiv:2006.11239
  • Li, Tai, E, Stochastic Modified Equations and Dynamics of Stochastic Gradient Algorithms I: Mathematical Foundations, J. Mach. Learn. Res. 20(40), 2019, pp. 1-47
  • Anderson, Reverse-Time Diffusion Equation Models, Stochastic Processes and their Applications 12(3), 1982, pp. 313-326

Last reviewed: April 26, 2026
