
Mathematical Infrastructure

Stochastic Calculus for ML

Brownian motion, Ito integrals, Ito's lemma, and stochastic differential equations: the mathematical machinery behind diffusion models, score-based generative models, and Langevin dynamics.

Advanced · Tier 3 · Stable · Supporting · ~35 min

Why This Matters

Diffusion models (DDPM, score-based models) generate data by reversing a stochastic differential equation. Langevin dynamics, used in MCMC sampling and score matching, is a specific SDE. Understanding these models requires stochastic calculus: the extension of ordinary calculus to processes driven by Brownian motion.

The central surprise of stochastic calculus is that the chain rule changes. When you apply a smooth function to a process driven by Brownian motion, you pick up an extra second-order term that does not appear in ordinary calculus. This is Ito's lemma, and it is the single most important formula in this topic.

Brownian Motion

Definition

Standard Brownian Motion

A continuous-time stochastic process $\{W_t\}_{t \geq 0}$ satisfying:

  1. $W_0 = 0$
  2. Independent increments: $W_t - W_s$ is independent of $\{W_u : u \leq s\}$ for $s < t$
  3. Gaussian increments: $W_t - W_s \sim \mathcal{N}(0, t - s)$
  4. Continuous paths: $t \mapsto W_t$ is continuous almost surely

Key properties that make Brownian motion different from smooth functions:

  • Nowhere differentiable: $W_t$ is continuous but has no derivative at any point, almost surely. This is why Riemann-Stieltjes integration fails.
  • Quadratic variation: $\sum_{i} (W_{t_{i+1}} - W_{t_i})^2 \to t$ as the partition becomes finer. For smooth functions, quadratic variation is zero. This nonzero quadratic variation is the source of the extra term in Ito's lemma.
  • Scaling: $\{c^{-1/2} W_{ct}\}_{t \geq 0}$ is also a standard Brownian motion. The $\sqrt{t}$ scaling of increments (not linear in $t$) is characteristic.
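A quick numerical check makes these properties concrete. The sketch below (NumPy; step counts and seed are arbitrary choices) builds a Brownian path from i.i.d. Gaussian increments and verifies that the quadratic variation approaches $t$ while the endpoint $W_1$ has variance $1$:

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 1.0, 100_000
dt = T / n

# Brownian increments are N(0, dt), i.e. standard deviation sqrt(dt)
dW = rng.normal(0.0, np.sqrt(dt), size=n)

# Quadratic variation: the sum of squared increments approaches T
qv = np.sum(dW**2)
print(f"quadratic variation ~ {qv:.4f} (expect {T})")

# Endpoint distribution: W_1 ~ N(0, 1); check across 5000 coarse paths
W1 = rng.normal(0.0, np.sqrt(1.0 / 100), size=(5000, 100)).sum(axis=1)
print(f"Var[W_1] ~ {W1.var():.3f} (expect 1.0)")
```

A smooth function sampled on the same grid would give a quadratic variation of order $1/n$, not a constant; that contrast is exactly what forces the extra term in Ito's lemma.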

The Ito Integral

Ordinary calculus defines $\int_0^T f(t) \, dg(t)$ when $g$ has bounded variation. Brownian paths have unbounded variation (they wiggle too much), so this definition fails.

Definition

Ito Integral

For an adapted process $f_t$ satisfying $\mathbb{E}[\int_0^T f_t^2 \, dt] < \infty$, the Ito integral is:

$$\int_0^T f_t \, dW_t = \lim_{n \to \infty} \sum_{i=0}^{n-1} f_{t_i}(W_{t_{i+1}} - W_{t_i})$$

The limit is in $L^2$. The crucial feature: $f$ is evaluated at the left endpoint $t_i$, not the right endpoint or midpoint. This choice makes the integral a martingale but means Ito's calculus differs from Stratonovich's.

The left-endpoint evaluation is not arbitrary. It ensures that the integrand is independent of the increment $W_{t_{i+1}} - W_{t_i}$, which is required for the integral to be a martingale and for the Ito isometry to hold.
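The left-endpoint sum can be checked numerically. Ito's lemma gives the closed form $\int_0^T W_t \, dW_t = (W_T^2 - T)/2$; the sketch below (single path, arbitrary seed) compares the discrete Ito sum against it:

```python
import numpy as np

rng = np.random.default_rng(1)
T, n = 1.0, 200_000
dt = T / n
dW = rng.normal(0.0, np.sqrt(dt), size=n)
W = np.concatenate([[0.0], np.cumsum(dW)])  # W[i] = W_{t_i}, W[0] = 0

# Ito (left-endpoint) Riemann sum for  int_0^T W_t dW_t
ito_sum = np.sum(W[:-1] * dW)

# Closed form from Ito's lemma: (W_T^2 - T) / 2.  The -T/2 comes from the
# quadratic-variation term; the ordinary chain rule would give W_T^2 / 2.
closed_form = (W[-1] ** 2 - T) / 2
print(ito_sum, closed_form)
```

Replacing `W[:-1]` with the midpoint average `(W[:-1] + W[1:]) / 2` gives the Stratonovich sum, which converges to $W_T^2 / 2$ instead.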

Main Theorems

Theorem

Ito Isometry

Statement

$$\mathbb{E}\left[\left(\int_0^T f_t \, dW_t\right)^2\right] = \mathbb{E}\left[\int_0^T f_t^2 \, dt\right]$$

Intuition

The Ito integral maps a square-integrable adapted process to a random variable, and it does so isometrically: the variance of the integral equals the expected time-integral of $f_t^2$. This is because the cross-terms in the square vanish via adaptedness plus the martingale property of Brownian increments (conditioning on the earlier sigma-algebra zeros out the forward increment).

Proof Sketch

For simple (step) functions, expand the square of the sum. Cross-terms have the form $f_{t_i} f_{t_j} (W_{t_{i+1}} - W_{t_i})(W_{t_{j+1}} - W_{t_j})$ with $i < j$. These vanish by conditioning on $\mathcal{F}_{t_j}$: adaptedness makes $f_{t_i}$, $f_{t_j}$, and $(W_{t_{i+1}} - W_{t_i})$ all $\mathcal{F}_{t_j}$-measurable, and the forward increment $(W_{t_{j+1}} - W_{t_j})$ has zero conditional mean by the martingale property of Brownian motion. Only the diagonal terms survive, giving $\sum_i \mathbb{E}[f_{t_i}^2] \, (t_{i+1} - t_i)$. Extend by density to general integrands.

Why It Matters

The Ito isometry is the tool for computing variances of stochastic integrals. It also provides the $L^2$ framework needed to define the integral for general integrands by approximation with simple processes.

Failure Mode

The isometry fails if $f_t$ is not adapted (i.e., it looks into the future). It also fails for the Stratonovich integral, where the integrand is evaluated at the midpoint rather than the left endpoint.
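As a sanity check, the isometry can be verified by Monte Carlo for the adapted integrand $f_t = W_t$, where the right-hand side is $\mathbb{E}[\int_0^T W_t^2 \, dt] = \int_0^T t \, dt = T^2/2$ (path count and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
T, n, n_paths = 1.0, 1_000, 20_000
dt = T / n
dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n))
W_left = np.cumsum(dW, axis=1) - dW   # left endpoints W_{t_i} (with W_0 = 0)

# Left side: E[(int_0^T W_t dW_t)^2], estimated over many paths
lhs = np.mean(np.sum(W_left * dW, axis=1) ** 2)

# Right side: E[int_0^T W_t^2 dt] = int_0^T t dt = T^2 / 2
rhs = T**2 / 2
print(lhs, rhs)
```

Evaluating the integrand at the right endpoint (`np.cumsum(dW, axis=1)`) breaks the independence between integrand and increment, and the two sides no longer agree.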

Theorem

Ito's Lemma

Statement

If $X_t$ satisfies $dX_t = \mu_t \, dt + \sigma_t \, dW_t$ and $f \in C^2$, then:

$$df(X_t) = f'(X_t) \, dX_t + \frac{1}{2} f''(X_t) \, \sigma_t^2 \, dt$$

Equivalently:

$$df(X_t) = \left[\mu_t f'(X_t) + \frac{1}{2} \sigma_t^2 f''(X_t)\right] dt + \sigma_t f'(X_t) \, dW_t$$

Intuition

This is the chain rule for stochastic processes. In ordinary calculus, $df(x) = f'(x) \, dx$ and we stop at first order because $(dx)^2$ is negligible. For Brownian motion, $(dW_t)^2 = dt$ (heuristically), so the second-order Taylor term $\frac{1}{2} f''(X_t)(dX_t)^2$ contributes a non-negligible $dt$ term. This is the extra correction.

Proof Sketch

Apply a second-order Taylor expansion: $f(X_{t+\Delta t}) - f(X_t) \approx f'(X_t) \, \Delta X + \frac{1}{2} f''(X_t) (\Delta X)^2$. Compute $(\Delta X)^2 = (\mu \Delta t + \sigma \Delta W)^2$. Summed over the partition, the $(\Delta W)^2$ terms converge to $\int \sigma^2 \, dt$ as the partition refines (quadratic variation of Brownian motion). The $\Delta t \cdot \Delta W$ and $(\Delta t)^2$ terms vanish.

Why It Matters

Ito's lemma is the computational workhorse of stochastic calculus. Every derivation involving SDEs uses it: computing the dynamics of transformed processes, deriving the Fokker-Planck equation, proving properties of diffusion models. You will use this constantly.

Failure Mode

If $f$ is not $C^2$, the lemma does not apply in its standard form (though generalizations exist via Tanaka's formula). If the underlying process is not an Ito process (e.g., it has jumps), you need the jump-diffusion version of Ito's lemma.
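One consequence worth checking numerically: for $f(x) = e^x$, Ito's lemma gives $d(e^{W_t}) = e^{W_t} \, dW_t + \frac{1}{2} e^{W_t} \, dt$. Taking expectations kills the martingale $dW$ term, leaving $m'(t) = m(t)/2$ for $m(t) = \mathbb{E}[e^{W_t}]$, so $\mathbb{E}[e^{W_t}] = e^{t/2}$ rather than $1$. A Monte Carlo sketch (sample count and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
t, n_samples = 1.0, 400_000

# W_t ~ N(0, t); Ito's lemma predicts E[e^{W_t}] = e^{t/2}, not e^0 = 1
W_t = rng.normal(0.0, np.sqrt(t), size=n_samples)
mc = np.exp(W_t).mean()
print(mc, np.exp(t / 2))
```

The naive chain rule would predict a constant mean of $1$; the observed excess is exactly the second-order Ito correction.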

Stochastic Differential Equations

Definition

Stochastic Differential Equation

An SDE has the form:

$$dX_t = \mu(X_t, t) \, dt + \sigma(X_t, t) \, dW_t$$

where $\mu$ is the drift (deterministic tendency) and $\sigma$ is the diffusion coefficient (noise intensity). This is shorthand for the integral equation $X_t = X_0 + \int_0^t \mu(X_s, s) \, ds + \int_0^t \sigma(X_s, s) \, dW_s$.
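In practice SDEs are simulated with the Euler-Maruyama scheme, which replaces $dt$ by a small step and $dW_t$ by a $\mathcal{N}(0, \Delta t)$ draw. A minimal sketch (the function name and parameters are our own), tested on an Ornstein-Uhlenbeck process, whose stationary distribution is $\mathcal{N}(0, \sigma^2/(2\theta))$:

```python
import numpy as np

def euler_maruyama(mu, sigma, x0, T, n_steps, rng):
    """Simulate dX_t = mu(X_t, t) dt + sigma(X_t, t) dW_t on [0, T].

    x0 is an array of initial values; one independent path per entry.
    """
    dt = T / n_steps
    x = np.asarray(x0, dtype=float).copy()
    for k in range(n_steps):
        dW = rng.normal(0.0, np.sqrt(dt), size=x.shape)
        x = x + mu(x, k * dt) * dt + sigma(x, k * dt) * dW
    return x

# Ornstein-Uhlenbeck test case: dX = -theta X dt + sig dW has stationary
# distribution N(0, sig^2 / (2 theta)) -- here N(0, 0.125).
rng = np.random.default_rng(4)
theta, sig = 1.0, 0.5
x_T = euler_maruyama(lambda x, t: -theta * x,
                     lambda x, t: sig * np.ones_like(x),
                     x0=np.zeros(50_000), T=8.0, n_steps=800, rng=rng)
print(x_T.mean(), x_T.var())  # compare with 0 and 0.125
```

Euler-Maruyama inherits the well-posedness caveats of the theorem below: for non-Lipschitz coefficients like the CIR square root, the naive scheme can even step outside the state space.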

Theorem

Existence and Uniqueness for SDEs

Statement

Under Lipschitz and linear growth conditions on $\mu$ and $\sigma$, the SDE has a unique strong solution $X_t$ that is adapted to the filtration generated by $W_t$ and satisfies $\mathbb{E}[\sup_{0 \leq t \leq T} X_t^2] < \infty$.

Intuition

This is the stochastic analog of Picard-Lindelof for ODEs. Lipschitz continuity prevents solutions from splitting apart (uniqueness), and linear growth prevents solutions from exploding to infinity in finite time (existence).

Proof Sketch

Picard iteration: define $X_t^{(0)} = X_0$ and $X_t^{(n+1)} = X_0 + \int_0^t \mu(X_s^{(n)}, s) \, ds + \int_0^t \sigma(X_s^{(n)}, s) \, dW_s$. Use the Ito isometry and the Lipschitz condition to show that the iterates form a Cauchy sequence in $L^2$ under the norm $\mathbb{E}[\sup_{[0,T]} |\cdot|^2]^{1/2}$. Completeness gives convergence to a unique limit.

Why It Matters

This theorem guarantees that the forward and reverse SDEs in diffusion models are well-defined. Without existence and uniqueness, the generative process would not be mathematically sound.

Failure Mode

Many practically important SDEs violate Lipschitz continuity. The CIR process $dX_t = a(b - X_t) \, dt + \sigma \sqrt{X_t} \, dW_t$ has diffusion coefficient $\sigma(x) = \sigma \sqrt{x}$, which is not Lipschitz at $x = 0$. In such cases, existence and uniqueness can still be established by other methods, but the standard theorem does not apply directly.

Connections to ML

Diffusion models: the forward process $dX_t = f(t) X_t \, dt + g(t) \, dW_t$ gradually adds noise to data. The reverse process (Anderson, 1982) is also an SDE:

$$dX_t = \left[f(t) X_t - g(t)^2 \nabla_x \log p_t(X_t)\right] dt + g(t) \, d\bar{W}_t$$

where $\nabla_x \log p_t$ is the score function. The neural network learns to approximate this score. The forward SDE's marginal $p_t(x)$ satisfies the Fokker-Planck equation; the reverse SDE's existence relies on Anderson's time-reversal theorem, which requires $\nabla_x \log p_t$ to be well-defined (finite Fisher information at each time).
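For Gaussian data the marginals $p_t$ stay Gaussian and the score is available in closed form, so the reverse SDE can be simulated exactly as written. The sketch below is a toy setup of our own: forward SDE $dX_t = -\tfrac{1}{2} X_t \, dt + dW_t$ (so $f(t) = -\tfrac12$, $g(t) = 1$), data $\mathcal{N}(2, 0.25)$. It runs the reverse SDE from the near-standard-normal marginal at $t = T$ back to $t = 0$ and recovers the data distribution:

```python
import numpy as np

rng = np.random.default_rng(5)
m0, s0 = 2.0, 0.5          # toy "data" distribution N(2, 0.25)
T, n = 6.0, 1_200
dt = T / n

def mean_var(t):
    # Under dX = -X/2 dt + dW a Gaussian marginal keeps these moments
    return m0 * np.exp(-t / 2), s0**2 * np.exp(-t) + 1.0 - np.exp(-t)

def score(x, t):
    m, v = mean_var(t)
    return -(x - m) / v    # exact score of the Gaussian marginal p_t

# Start from the true marginal at t = T (close to the N(0, 1) prior)
m_T, v_T = mean_var(T)
x = rng.normal(m_T, np.sqrt(v_T), size=50_000)

# Reverse SDE dX = [f X - g^2 score] dt + g dW-bar, integrated backward
# in time; per backward Euler step: X += (X/2 + score) dt + sqrt(dt) Z
for k in range(n):
    t = T - k * dt
    x = (x + (0.5 * x + score(x, t)) * dt
         + rng.normal(0.0, np.sqrt(dt), size=x.shape))

print(x.mean(), x.var())   # should be close to the data moments (2, 0.25)
```

A trained score network simply replaces `score(x, t)`; everything else in the sampler is unchanged.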

Langevin dynamics: the SDE $dX_t = \nabla \log p(X_t) \, dt + \sqrt{2} \, dW_t$ has $p$ as its stationary distribution under mild conditions (log-concavity gives exponential convergence; without it, convergence can be arbitrarily slow). Discretizing with step size $\eta$ gives the unadjusted Langevin algorithm (ULA); the discretization error is $O(\eta)$ in KL divergence, which is why Metropolis correction (MALA) or decreasing step sizes are needed for exact sampling.
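A minimal ULA sketch for a standard Gaussian target ($\nabla \log p(x) = -x$) shows both the sampling and the $O(\eta)$ bias: for this target the update is linear, and the chain's stationary variance works out to $1/(1 - \eta/2)$ rather than exactly $1$ (step size, chain count, and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(6)
eta, n_steps, n_chains = 0.05, 2_000, 20_000

def grad_log_p(x):
    return -x              # target p = N(0, 1)

x = 3.0 * rng.normal(size=n_chains)   # deliberately bad initialization
for _ in range(n_steps):
    x = x + eta * grad_log_p(x) + np.sqrt(2 * eta) * rng.normal(size=n_chains)

# ULA is biased: here the stationary variance is 1 / (1 - eta/2) ~ 1.026,
# not exactly 1.  A Metropolis correction (MALA) would remove the bias.
print(x.mean(), x.var())
```

Shrinking `eta` shrinks the bias linearly, at the cost of slower mixing; this is the step-size trade-off the text describes.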

SGD as SDE: with small learning rate $\eta$, SGD on a loss $L$ approximately follows $dX_t = -\nabla L(X_t) \, dt + \sqrt{\eta} \, \Sigma(X_t)^{1/2} \, dW_t$, where $\Sigma(x)$ is the (per-step) minibatch gradient covariance. Equivalently, the infinitesimal generator has diffusion tensor $\eta \, \Sigma(x)$, but the amplitude of the noise scales like $\sqrt{\eta}$, not $\eta$. Li et al. (2019) showed this approximation is valid up to $O(\eta)$ in weak error after the standard time rescaling $t = k\eta$ (so one continuous-time unit corresponds to $1/\eta$ SGD steps). The SDE viewpoint explains why SGD with larger $\eta$ finds flatter minima: the noise amplitude $\sqrt{\eta} \, \Sigma^{1/2}$ pushes iterates out of sharp basins.

Geometric Brownian motion: the SDE $dS_t = \mu S_t \, dt + \sigma S_t \, dW_t$ models stock prices in the Black-Scholes framework. Applying Ito's lemma to $\log S_t$ gives $d(\log S_t) = (\mu - \sigma^2/2) \, dt + \sigma \, dW_t$, so $S_t = S_0 \exp((\mu - \sigma^2/2) t + \sigma W_t)$. The $-\sigma^2/2$ correction is a direct consequence of the Ito second-order term.
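The closed form can be checked against a direct Euler-Maruyama simulation driven by the same noise, and the $\mu - \sigma^2/2$ log-drift is visible in the sample mean of $\log S_T$ (a sketch with arbitrary parameters and seed):

```python
import numpy as np

rng = np.random.default_rng(7)
S0, mu, sigma, T, n = 1.0, 0.1, 0.3, 1.0, 2_000
dt = T / n
dW = rng.normal(0.0, np.sqrt(dt), size=(10_000, n))

# Exact solution via Ito's lemma (note the -sigma^2/2 drift correction)
S_exact = S0 * np.exp((mu - sigma**2 / 2) * T + sigma * dW.sum(axis=1))

# Euler-Maruyama driven by the same increments stays close to exact paths
S = np.full(10_000, S0)
for k in range(n):
    S = S + mu * S * dt + sigma * S * dW[:, k]

print(np.abs(S - S_exact).mean())                       # discretization error
print(np.log(S_exact).mean(), (mu - sigma**2 / 2) * T)  # log-drift check
```

Naively exponentiating the drift alone would predict $\mathbb{E}[\log S_T] = \mu T$; the simulation shows the smaller Ito-corrected drift instead.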

Common Confusions

Watch Out

Ito vs Stratonovich

The Ito integral evaluates the integrand at the left endpoint; the Stratonovich integral evaluates at the midpoint. They give different results for the same integrand. Ito is standard in probability and finance because Ito integrals are martingales. Stratonovich is common in physics because it preserves the ordinary chain rule. For diffusion models in ML, the Ito convention is standard.

Watch Out

dW_t squared is not zero

In ordinary calculus, $(dx)^2 = 0$ because it is second-order. In stochastic calculus, $(dW_t)^2 = dt$ (in the formal sense of quadratic variation). This is why Ito's lemma has an extra term. Forgetting this is the most common error.

Summary

  • Brownian motion has continuous but nowhere differentiable paths
  • Ito integrals use left-endpoint evaluation, making them martingales
  • Ito's lemma: $df = f' \, dX + \frac{1}{2} f'' \sigma^2 \, dt$ (the extra second-order term is the key difference from ordinary calculus)
  • SDEs exist and are unique under Lipschitz and linear growth conditions
  • Diffusion models, Langevin dynamics, and SGD analysis all use SDEs

Exercises

Exercise · Core

Problem

Let $W_t$ be a standard Brownian motion. Use Ito's lemma to find $d(W_t^2)$. What are the drift and diffusion coefficients of the process $Y_t = W_t^2$?

Exercise · Advanced

Problem

The Ornstein-Uhlenbeck process satisfies $dX_t = -\theta X_t \, dt + \sigma \, dW_t$ with $\theta > 0$. Use Ito's lemma on $Y_t = X_t e^{\theta t}$ to find the explicit solution for $X_t$.

References

Canonical:

  • Oksendal, Stochastic Differential Equations (6th ed., 2003), Chapters 3-5 (Ito integral, Ito's formula, SDEs)
  • Karatzas & Shreve, Brownian Motion and Stochastic Calculus (2nd ed., 1991), Chapter 3 (Ito integral construction and isometry)
  • Revuz & Yor, Continuous Martingales and Brownian Motion (3rd ed., 2005), Chapters IV-V (Ito calculus in full generality)

Current:

  • Song, Sohl-Dickstein, Kingma, Kumar, Ermon, Poole, Score-Based Generative Modeling through Stochastic Differential Equations, ICLR 2021, arXiv:2011.13456, Section 2
  • Ho, Jain, Abbeel, Denoising Diffusion Probabilistic Models, NeurIPS 2020, arXiv:2006.11239
  • Li, Tai, E, Stochastic Modified Equations and Dynamics of Stochastic Gradient Algorithms I: Mathematical Foundations, J. Mach. Learn. Res. 20(40), 2019, pp. 1-47
  • Anderson, Reverse-Time Diffusion Equation Models, Stochastic Processes and their Applications 12(3), 1982, pp. 313-326

Last reviewed: April 26, 2026
