Mathematical Infrastructure
Stochastic Calculus for ML
Brownian motion, Ito integrals, Ito's lemma, and stochastic differential equations: the mathematical machinery behind diffusion models, score-based generative models, and Langevin dynamics.
Why This Matters
Diffusion models (DDPM, score-based models) generate data by reversing a stochastic differential equation. Langevin dynamics, used in MCMC sampling and score matching, is a specific SDE. Understanding these models requires stochastic calculus: the extension of ordinary calculus to processes driven by Brownian motion.
The central surprise of stochastic calculus is that the chain rule changes. When you apply a smooth function to a process driven by Brownian motion, you pick up an extra second-order term that does not appear in ordinary calculus. This is Ito's lemma, and it is the single most important formula in this topic.
Brownian Motion
Standard Brownian Motion
A continuous-time stochastic process $(W_t)_{t \ge 0}$ with $W_0 = 0$ satisfying:
- Independent increments: $W_t - W_s$ is independent of $\{W_r : r \le s\}$ for $s < t$
- Gaussian increments: $W_t - W_s \sim \mathcal{N}(0, t - s)$
- Continuous paths: $t \mapsto W_t$ is continuous almost surely
Key properties that make Brownian motion different from smooth functions:
- Nowhere differentiable: $t \mapsto W_t$ is continuous but has no derivative at any point, almost surely. This is why Riemann-Stieltjes integration fails.
- Quadratic variation: $\sum_i (W_{t_{i+1}} - W_{t_i})^2 \to t$ as the partition of $[0, t]$ becomes finer. For smooth functions, quadratic variation is zero. This nonzero quadratic variation is the source of the extra term in Ito's lemma.
- Scaling: for any $c > 0$, $W_{ct}/\sqrt{c}$ is also a standard Brownian motion. The $\sqrt{\Delta t}$ scaling of increments (not linear in $\Delta t$) is characteristic.
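The quadratic-variation property can be checked directly by simulation. A minimal NumPy sketch (grid size and path count are arbitrary choices): sample increments $\Delta W \sim \mathcal{N}(0, \Delta t)$ and observe that the sum of squared increments over $[0, 1]$ concentrates near $1$, not near $0$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_steps, n_paths, T = 10_000, 200, 1.0
dt = T / n_steps

# Each row is one Brownian path, represented by its N(0, dt) increments.
dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))

# Quadratic variation: sum of squared increments along each path.
quad_var = np.sum(dW**2, axis=1)
mean_qv = float(np.mean(quad_var))  # concentrates at T = 1 as dt -> 0
```

For a smooth function the same sum would shrink to zero as the grid refines; here it converges to the elapsed time $T$.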
The Ito Integral
Ordinary calculus defines $\int_0^T f(t)\, dg(t)$ as a Riemann-Stieltjes integral when $g$ has bounded variation. Brownian paths have unbounded variation (they wiggle too much), so this definition fails.
Ito Integral
For an adapted process $f_t$ satisfying $\mathbb{E}\int_0^T f_t^2\, dt < \infty$, the Ito integral is:
$$\int_0^T f_t\, dW_t = \lim_{n \to \infty} \sum_{i=0}^{n-1} f_{t_i}\,(W_{t_{i+1}} - W_{t_i})$$
The limit is in $L^2$. The crucial feature: $f$ is evaluated at the left endpoint $t_i$, not the right endpoint or midpoint. This choice makes the integral a martingale but means Ito's calculus differs from Stratonovich's.
The left-endpoint evaluation is not arbitrary. It ensures that the integrand $f_{t_i}$ is independent of the increment $W_{t_{i+1}} - W_{t_i}$, which is required for the integral to be a martingale and for the Ito isometry to hold.
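The endpoint choice is visible numerically. A small sketch (assuming NumPy): for $f_t = W_t$, left-endpoint sums converge to $(W_T^2 - T)/2$, while right-endpoint sums differ from them by exactly the quadratic variation, which is $\approx T$.

```python
import numpy as np

rng = np.random.default_rng(1)
n_steps, n_paths, T = 2_000, 2_000, 1.0
dt = T / n_steps

dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
W = np.cumsum(dW, axis=1)                                # W at right endpoints
W_left = np.hstack([np.zeros((n_paths, 1)), W[:, :-1]])  # W at left endpoints

ito = np.sum(W_left * dW, axis=1)   # left-endpoint (Ito) Riemann sums
right = np.sum(W * dW, axis=1)      # right-endpoint sums

# Closed form for the Ito integral: int_0^T W dW = (W_T^2 - T) / 2.
err = float(np.mean(np.abs(ito - (W[:, -1]**2 - T) / 2)))
gap = float(np.mean(right - ito))   # = mean sum of dW^2 ~ T
```

The persistent gap of size $T$ between the two conventions is the discrete shadow of the nonzero quadratic variation.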
Main Theorems
Ito Isometry
Statement
$$\mathbb{E}\left[\left(\int_0^T f_t\, dW_t\right)^2\right] = \mathbb{E}\left[\int_0^T f_t^2\, dt\right]$$
Intuition
The Ito integral converts an $L^2$ integrand into a random variable, and it does so isometrically: the second moment of the integral equals the expected time-integral of $f_t^2$ (for deterministic $f$, the variance of the integral equals the integral of the variance). This is because the cross-terms in the square vanish via adaptedness plus the martingale property of Brownian increments (conditioning on the earlier sigma-algebra zeros out the forward increment).
Proof Sketch
For simple (step) functions, expand the square of the sum. Cross-terms have the form $\mathbb{E}[f_{t_i} f_{t_j}\, \Delta W_i\, \Delta W_j]$ with $i < j$. These vanish by conditioning on $\mathcal{F}_{t_j}$: adaptedness makes $f_{t_i}$, $f_{t_j}$, and $\Delta W_i$ all $\mathcal{F}_{t_j}$-measurable, and the forward increment $\Delta W_j$ has zero conditional mean by the martingale property of Brownian motion. Only the diagonal terms survive, giving $\sum_i \mathbb{E}[f_{t_i}^2]\,(t_{i+1} - t_i)$. Extend by density to general integrands.
Why It Matters
The Ito isometry is the tool for computing variances of stochastic integrals. It also provides the framework needed to define the integral for general integrands by approximation with simple processes.
Failure Mode
The isometry fails if $f_t$ is not adapted (i.e., it looks into the future). It also fails for the Stratonovich integral, where the integrand is evaluated at the midpoint rather than the left endpoint.
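The isometry can be checked by Monte Carlo. A sketch (assuming NumPy): for $f_t = W_t$, both sides of the isometry equal $\int_0^T \mathbb{E}[W_t^2]\, dt = \int_0^T t\, dt = T^2/2$.

```python
import numpy as np

rng = np.random.default_rng(2)
n_steps, n_paths, T = 500, 10_000, 1.0
dt = T / n_steps

dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
W = np.cumsum(dW, axis=1)
W_left = np.hstack([np.zeros((n_paths, 1)), W[:, :-1]])

I = np.sum(W_left * dW, axis=1)  # Ito integral of W_t dW_t along each path
lhs = float(np.mean(I**2))       # E[(int_0^T W dW)^2]
rhs = T**2 / 2                   # E[int_0^T W_t^2 dt] = int_0^T t dt
```

The agreement of `lhs` and `rhs` (up to Monte Carlo and discretization error) is exactly the isometry; replacing `W_left` with `W` in the sum breaks it, as the Failure Mode above predicts.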
Ito's Lemma
Statement
If $X_t$ satisfies $dX_t = \mu_t\, dt + \sigma_t\, dW_t$ and $f(t, x)$ is $C^{1,2}$, then:
$$df(t, X_t) = \left(\frac{\partial f}{\partial t} + \mu_t \frac{\partial f}{\partial x} + \frac{1}{2}\sigma_t^2 \frac{\partial^2 f}{\partial x^2}\right) dt + \sigma_t \frac{\partial f}{\partial x}\, dW_t$$
Equivalently:
$$df = \frac{\partial f}{\partial t}\, dt + \frac{\partial f}{\partial x}\, dX_t + \frac{1}{2}\frac{\partial^2 f}{\partial x^2}\,(dX_t)^2, \quad \text{with the rules } (dt)^2 = dt\, dW_t = 0,\ (dW_t)^2 = dt$$
Intuition
This is the chain rule for stochastic processes. In ordinary calculus, $df = f'(x)\, dx$ and we stop at first order because $(dx)^2$ is negligible. For Brownian motion, $(dW_t)^2 = dt$ (heuristically), so the second-order Taylor term contributes a non-negligible $\frac{1}{2}\sigma_t^2 f''\, dt$. This is the extra correction.
Proof Sketch
Apply a second-order Taylor expansion: $\Delta f = f'(X_t)\,\Delta X + \frac{1}{2}f''(X_t)\,(\Delta X)^2 + o((\Delta X)^2)$. Compute $(\Delta X)^2 = \sigma_t^2\,(\Delta W)^2 + 2\mu_t\sigma_t\,\Delta t\,\Delta W + \mu_t^2\,(\Delta t)^2$. The $(\Delta W)^2$ term converges to $\Delta t$ as the partition refines (quadratic variation of Brownian motion). The $\Delta t\,\Delta W$ and $(\Delta t)^2$ terms vanish, being of order $(\Delta t)^{3/2}$ and $(\Delta t)^2$.
Why It Matters
Ito's lemma is the computational workhorse of stochastic calculus. Every derivation involving SDEs uses it: computing the dynamics of transformed processes, deriving the Fokker-Planck equation, proving properties of diffusion models. You will use this constantly.
Failure Mode
If $f$ is not $C^2$, the lemma does not apply in its standard form (though generalizations exist via Tanaka's formula). If the underlying process is not an Ito process (e.g., it has jumps), you need the jump-diffusion version of Ito's lemma.
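Ito's lemma can be verified path by path. A numerical sketch (assuming NumPy) for $f(x) = x^3$, where the lemma gives $d(W_t^3) = 3W_t^2\, dW_t + 3W_t\, dt$; dropping the $3W_t\, dt$ correction (the ordinary chain rule) leaves a visible error.

```python
import numpy as np

rng = np.random.default_rng(3)
n_steps, n_paths, T = 2_000, 2_000, 1.0
dt = T / n_steps

dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
W = np.cumsum(dW, axis=1)
W_left = np.hstack([np.zeros((n_paths, 1)), W[:, :-1]])

# Ito's lemma for f(x) = x^3: d(W^3) = 3 W^2 dW + 3 W dt.
ito_sum = np.sum(3 * W_left**2 * dW + 3 * W_left * dt, axis=1)
naive_sum = np.sum(3 * W_left**2 * dW, axis=1)  # ordinary chain rule only

err_ito = float(np.mean(np.abs(W[:, -1]**3 - ito_sum)))      # small
err_naive = float(np.mean(np.abs(W[:, -1]**3 - naive_sum)))  # order-1 error
```

The naive sum misses the accumulated correction $3\int_0^T W_t\, dt$, an order-one quantity, while the Ito sum tracks $W_T^3$ up to discretization error.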
Stochastic Differential Equations
Stochastic Differential Equation
An SDE has the form:
$$dX_t = \mu(X_t, t)\, dt + \sigma(X_t, t)\, dW_t$$
where $\mu$ is the drift (deterministic tendency) and $\sigma$ is the diffusion coefficient (noise intensity). This is shorthand for the integral equation $X_t = X_0 + \int_0^t \mu(X_s, s)\, ds + \int_0^t \sigma(X_s, s)\, dW_s$.
Existence and Uniqueness for SDEs
Statement
Under Lipschitz and linear growth conditions on $\mu$ and $\sigma$, the SDE has a unique strong solution $X_t$ that is adapted to the filtration generated by $W$ and satisfies $\mathbb{E}\int_0^T X_t^2\, dt < \infty$.
Intuition
This is the stochastic analog of Picard-Lindelof for ODEs. Lipschitz continuity prevents solutions from splitting apart (uniqueness), and linear growth prevents solutions from exploding to infinity in finite time (existence).
Proof Sketch
Picard iteration: define $X_t^{(0)} = X_0$ and $X_t^{(n+1)} = X_0 + \int_0^t \mu(X_s^{(n)}, s)\, ds + \int_0^t \sigma(X_s^{(n)}, s)\, dW_s$. Use the Ito isometry and the Lipschitz condition to show that the iterates form a Cauchy sequence in $L^2$. Completeness gives convergence to a unique limit.
Why It Matters
This theorem guarantees that the forward and reverse SDEs in diffusion models are well-defined. Without existence and uniqueness, the generative process would not be mathematically sound.
Failure Mode
Many practically important SDEs violate Lipschitz continuity. The CIR process ($dX_t = \kappa(\theta - X_t)\, dt + \sigma\sqrt{X_t}\, dW_t$) has diffusion coefficient $\sigma\sqrt{x}$, which is not Lipschitz at $x = 0$. In such cases, existence and uniqueness can still be established by other methods, but the standard theorem does not apply directly.
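In practice, solutions are approximated on a grid. A minimal Euler-Maruyama sketch (assuming NumPy) for the Ornstein-Uhlenbeck SDE $dX_t = -\theta X_t\, dt + \sigma\, dW_t$, whose coefficients satisfy the Lipschitz and linear growth conditions; the simulated ensemble approaches the known stationary distribution $\mathcal{N}(0, \sigma^2/2\theta)$.

```python
import numpy as np

rng = np.random.default_rng(4)
theta, sigma = 1.0, 0.5
n_steps, n_paths, T = 5_000, 5_000, 10.0
dt = T / n_steps

X = np.full(n_paths, 2.0)  # X_0 = 2 for every path
for _ in range(n_steps):
    dW = rng.normal(0.0, np.sqrt(dt), size=n_paths)
    X = X + (-theta * X) * dt + sigma * dW  # Euler-Maruyama step

sample_mean = float(np.mean(X))  # E[X_T] = 2 exp(-theta T), essentially 0 here
sample_var = float(np.var(X))    # stationary variance sigma^2 / (2 theta) = 0.125
```

Euler-Maruyama is the stochastic analog of the forward Euler method: replace $dt$ by a step and $dW_t$ by an independent $\mathcal{N}(0, \Delta t)$ draw.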
Connections to ML
Diffusion models: the forward process $dX_t = f(X_t, t)\, dt + g(t)\, dW_t$ gradually adds noise to data. The reverse process (Anderson, 1982) is also an SDE:
$$dX_t = \left[f(X_t, t) - g(t)^2\, \nabla_x \log p_t(X_t)\right] dt + g(t)\, d\bar{W}_t$$
where $\nabla_x \log p_t(x)$ is the score function and $\bar{W}_t$ is a Brownian motion running backward in time. The neural network learns to approximate this score. The forward SDE's marginal $p_t$ satisfies the Fokker-Planck equation; the reverse SDE's existence relies on Anderson's time-reversal theorem, which requires $\nabla_x \log p_t$ to be well-defined (finite Fisher information at each time).
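For a Gaussian data distribution the score is available in closed form, so the reverse SDE can be simulated without a neural network. A sketch (assuming NumPy; the toy data distribution $\mathcal{N}(2, 0.25)$ and the OU forward process $dX_t = -\tfrac{1}{2}X_t\, dt + dW_t$ are illustrative choices, not the DDPM schedule): integrating the reverse SDE from noise recovers the data distribution.

```python
import numpy as np

rng = np.random.default_rng(8)
n_chains, n_steps, T = 20_000, 5_000, 5.0
dt = T / n_steps
m, s = 2.0, 0.5  # data distribution N(m, s^2)

def marginal(t):
    # Forward OU process dX = -X/2 dt + dW started at N(m, s^2):
    # X_t ~ N(m * a, s^2 * a^2 + 1 - a^2) with a = exp(-t/2).
    a = np.exp(-t / 2)
    return m * a, s**2 * a**2 + 1 - a**2

mu_T, var_T = marginal(T)
x = mu_T + np.sqrt(var_T) * rng.normal(size=n_chains)  # sample from p_T (near N(0,1))

t = T
for _ in range(n_steps):
    mu_t, var_t = marginal(t)
    score = (mu_t - x) / var_t  # exact Gaussian score: grad_x log p_t(x)
    drift = -x / 2 - score      # f(x,t) - g^2 * score, with g = 1
    x = x - dt * drift + np.sqrt(dt) * rng.normal(size=n_chains)  # backward Euler-Maruyama step
    t -= dt

rev_mean = float(np.mean(x))  # should recover m = 2
rev_var = float(np.var(x))    # should recover s^2 = 0.25
```

A trained diffusion model does exactly this, with the analytic `score` replaced by a learned approximation.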
Langevin dynamics: the SDE $dX_t = \nabla \log \pi(X_t)\, dt + \sqrt{2}\, dW_t$ has $\pi$ as its stationary distribution under mild conditions (log-concavity gives exponential convergence; without it, convergence can be arbitrarily slow). Discretizing with step size $\epsilon$ gives the unadjusted Langevin algorithm (ULA); the discretization error is $O(\epsilon)$ in KL divergence, which is why Metropolis correction (MALA) or decreasing step sizes are needed for exact sampling.
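A minimal ULA sketch (assuming NumPy) for the target $\pi = \mathcal{N}(0, 1)$, whose score is $\nabla \log \pi(x) = -x$. For this linear case the discrete chain's stationary variance can be computed exactly as $1/(1 - \epsilon/2)$, slightly above the true value $1$: a concrete instance of the step-size bias mentioned above.

```python
import numpy as np

rng = np.random.default_rng(5)
eps, n_steps, n_chains = 0.01, 5_000, 20_000

x = 3.0 * rng.normal(size=n_chains)  # deliberately over-dispersed start
for _ in range(n_steps):
    score = -x  # grad log pi(x) for pi = N(0, 1)
    x = x + eps * score + np.sqrt(2 * eps) * rng.normal(size=n_chains)  # ULA step

sample_var = float(np.var(x))  # near 1/(1 - eps/2), i.e. biased by O(eps)
```

Shrinking $\epsilon$ shrinks the bias but slows mixing; MALA removes the bias instead by accepting or rejecting each proposal.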
SGD as SDE: with small learning rate $\eta$, SGD on a loss $L(\theta)$ approximately follows $d\theta_t = -\nabla L(\theta_t)\, dt + \sqrt{\eta}\,\Sigma(\theta_t)^{1/2}\, dW_t$, where $\Sigma$ is the (per-step) minibatch gradient covariance. Equivalently, the infinitesimal generator has diffusion tensor $\frac{\eta}{2}\Sigma$, but the amplitude of the noise scales like $\sqrt{\eta}$, not $\eta$. Li et al. (2019) showed this approximation is valid up to $O(\eta)$ in weak error after the standard time rescaling $t = k\eta$ (so one continuous-time unit corresponds to $1/\eta$ SGD steps). The SDE viewpoint explains why SGD with larger $\eta$ finds flatter minima: the $\sqrt{\eta}$ noise amplitude pushes iterates out of sharp basins.
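The scaling can be checked on a toy quadratic (a sketch with assumed parameters, not the setting of Li et al.): for $L(\theta) = \theta^2/2$ with additive unit-variance gradient noise, the limiting SDE is an OU process with stationary variance $\eta\sigma^2/2$, and plain SGD matches that prediction to $O(\eta^2)$.

```python
import numpy as np

rng = np.random.default_rng(7)
eta, noise_std = 0.05, 1.0
n_steps, n_runs = 10_000, 10_000

theta = np.ones(n_runs)
for _ in range(n_steps):
    grad = theta + noise_std * rng.normal(size=n_runs)  # noisy gradient of theta^2 / 2
    theta = theta - eta * grad                          # plain SGD step

sde_var = eta * noise_std**2 / 2  # OU stationary variance predicted by the SDE
sgd_var = float(np.var(theta))    # empirical SGD iterate variance
```

Doubling $\eta$ doubles the stationary spread of the iterates, which is the mechanism behind the flat-minima argument.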
Geometric Brownian motion: the SDE $dS_t = \mu S_t\, dt + \sigma S_t\, dW_t$ models stock prices in the Black-Scholes framework. Applying Ito's lemma to $\log S_t$ gives $d\log S_t = (\mu - \frac{1}{2}\sigma^2)\, dt + \sigma\, dW_t$, so $S_t = S_0 \exp\left((\mu - \frac{1}{2}\sigma^2)t + \sigma W_t\right)$. The $-\frac{1}{2}\sigma^2$ correction is a direct consequence of the Ito second-order term.
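A quick check of the closed form (a sketch, assuming NumPy): sampling $S_T$ from the explicit solution with $\mu = 0.1$, $\sigma = 0.3$, the log-price grows at rate $\mu - \frac{1}{2}\sigma^2$ while the price itself grows at rate $\mu$; the gap is the Ito correction.

```python
import numpy as np

rng = np.random.default_rng(6)
mu, sigma, S0, T = 0.1, 0.3, 1.0, 1.0
n_paths = 200_000

W_T = np.sqrt(T) * rng.normal(size=n_paths)  # W_T ~ N(0, T)

# Exact solution obtained from Ito's lemma applied to log S_t.
S_T = S0 * np.exp((mu - 0.5 * sigma**2) * T + sigma * W_T)

mean_log = float(np.mean(np.log(S_T)))  # (mu - sigma^2/2) T = 0.055
mean_S = float(np.mean(S_T))            # S0 exp(mu T), about 1.105
```

Naively exponentiating the mean log-price underestimates the mean price; the lognormal convexity restores exactly the $\frac{1}{2}\sigma^2$.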
Common Confusions
Ito vs Stratonovich
The Ito integral evaluates the integrand at the left endpoint; the Stratonovich integral evaluates at the midpoint. They give different results for the same integrand. Ito is standard in probability and finance because Ito integrals are martingales. Stratonovich is common in physics because it preserves the ordinary chain rule. For diffusion models in ML, the Ito convention is standard.
$(dW_t)^2$ is not zero
In ordinary calculus, $(dx)^2 = 0$ because it is second-order. In stochastic calculus, $(dW_t)^2 = dt$ (in the formal sense of quadratic variation). This is why Ito's lemma has an extra term. Forgetting this is the most common error.
Summary
- Brownian motion has continuous but nowhere differentiable paths
- Ito integrals use left-endpoint evaluation, making them martingales
- Ito's lemma: $df = \left(\partial_t f + \mu\,\partial_x f + \frac{1}{2}\sigma^2\,\partial_{xx} f\right) dt + \sigma\,\partial_x f\, dW_t$ (the extra second-order term is the key difference from ordinary calculus)
- SDEs exist and are unique under Lipschitz and linear growth conditions
- Diffusion models, Langevin dynamics, and SGD analysis all use SDEs
Exercises
Problem
Let $W_t$ be a standard Brownian motion. Use Ito's lemma to find $d(W_t^2)$. What are the drift and diffusion coefficients of the process $Y_t = W_t^2$?
Problem
The Ornstein-Uhlenbeck process satisfies $dX_t = -\theta X_t\, dt + \sigma\, dW_t$ with $X_0 = x_0$. Use Ito's lemma on $Y_t = e^{\theta t} X_t$ to find the explicit solution for $X_t$.
References
Canonical:
- Oksendal, Stochastic Differential Equations (6th ed., 2003), Chapters 3-5 (Ito integral, Ito's formula, SDEs)
- Karatzas & Shreve, Brownian Motion and Stochastic Calculus (2nd ed., 1991), Chapter 3 (Ito integral construction and isometry)
- Revuz & Yor, Continuous Martingales and Brownian Motion (3rd ed., 2005), Chapters IV-V (Ito calculus in full generality)
Current:
- Song, Sohl-Dickstein, Kingma, Kumar, Ermon, Poole, Score-Based Generative Modeling through Stochastic Differential Equations, ICLR 2021, arXiv:2011.13456, Section 2
- Ho, Jain, Abbeel, Denoising Diffusion Probabilistic Models, NeurIPS 2020, arXiv:2006.11239
- Li, Tai, E, Stochastic Modified Equations and Dynamics of Stochastic Gradient Algorithms I: Mathematical Foundations, J. Mach. Learn. Res. 20(40), 2019, pp. 1-47
- Anderson, Reverse-Time Diffusion Equation Models, Stochastic Processes and their Applications 12(3), 1982, pp. 313-326
Next Topics
- Diffusion models: the primary ML application of SDEs and score functions
- Langevin dynamics: SDE-based MCMC sampling
- SGD as SDE: continuous-time analysis of stochastic gradient descent
- Score matching: learning the score function that drives the reverse SDE
Last reviewed: April 26, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Measure-Theoretic Probability (layer 0B, tier 1)
- Classical ODEs: Existence, Stability, and Numerical Methods (layer 1, tier 1)
- Martingale Theory (layer 0B, tier 2)
Derived topics
- Score Matching (layer 3, tier 1)
- Diffusion Models (layer 4, tier 1)
- Ito's Lemma (layer 3, tier 2)
- Langevin Dynamics (layer 3, tier 2)
- SGD as a Stochastic Differential Equation (layer 3, tier 2)