
ML Methods

Score Matching

Hyvärinen 2005: train a model to estimate the score (gradient of log density) without computing the normalization constant. Integration by parts converts the intractable density-matching loss into a tractable gradient-based objective. Sliced score matching makes the Jacobian-trace term scale, denoising score matching reparameterizes the loss as ε-regression, and Tweedie's formula identifies the score with a posterior-mean denoiser. Together these are the training half of every modern diffusion model and energy-based model.

Advanced · Tier 1 · Stable · Core spine · ~80 min

Why This Matters

Most expressive probability models cannot be trained by maximum likelihood because the normalization constant $Z(\theta) = \int e^{-E_\theta(x)}\,dx$ is intractable. Energy-based models, score-based diffusion, and most generative samplers all face this obstacle. Score matching (Hyvärinen 2005) is the workaround: instead of fitting the density $p_\theta(x)$ directly, fit the score $\nabla_x \log p(x)$. The score is invariant to the partition function: $\nabla_x \log Z(\theta) = 0$, so matching scores never requires knowing $Z$.

Hyvärinen turned this idea into a tractable loss via integration by parts: the squared-distance objective between model and data scores reduces, up to a $\theta$-independent constant, to a quantity computable from samples plus a single Jacobian-trace term. Vincent (2011) replaced the Jacobian trace with a denoising target, giving the denoising score matching loss that scales to high dimensions. Song, Garg, Shi, and Ermon (2020) gave the sliced variant that estimates the trace stochastically, recovering tractability without any noise injection. Robbins (1956) and Efron (2011) supply the third piece, Tweedie's formula, which identifies the score of a Gaussian-noised density with a posterior-mean denoiser, fixing the operational meaning of what a trained score network actually represents.

Song and Ermon (2019) added multiple noise scales and turned the loss into the noise-conditional score network (NCSN); Song et al. (2021) wrapped the whole setup inside a forward noising SDE and used Anderson's time-reversal theorem to turn the learned score into a generative sampler. The thread from Hyvärinen 2005 to Stable Diffusion is straight: every score-based generative model is doing denoising score matching at multiple noise levels, then plugging the learned score into a reverse SDE or probability-flow ODE. If you understand score matching, you understand the training half of diffusion models.

Mental Model

The score $s(x) := \nabla_x \log p(x)$ is a vector field on $\mathbb{R}^d$ that points from low-density regions toward high-density regions. It is the direction of steepest log-density increase and the gradient field that Langevin dynamics follows to sample from $p$: $dX_t = s(X_t)\,dt + \sqrt{2}\,dW_t$ has $p$ as its stationary distribution.

Score matching trains a parametric vector field $s_\theta(x) \in \mathbb{R}^d$ to approximate this field. The natural loss is the squared $L^2$ error between data and model scores, weighted by the data density:

$$J_{\text{ESM}}(\theta) = \tfrac{1}{2}\, \mathbb{E}_{x \sim p_{\text{data}}} \big[\, \|s_\theta(x) - \nabla_x \log p_{\text{data}}(x)\|^2 \,\big].$$

This is the explicit score matching loss. It is intractable because we do not know $\nabla_x \log p_{\text{data}}$. The story of score matching is the story of progressively more practical surrogates. The first three (ESM, ISM, SSM) all target the same population minimizer: the score of the clean data density $p_{\text{data}}$. Denoising score matching (DSM) is different: it targets the score of the noisy density $q_\sigma = p_{\text{data}} \ast \mathcal{N}(0, \sigma^2 I)$, and only recovers the clean score in the limit $\sigma \to 0$. So DSM is not an unbiased estimator of the ESM loss.

| Estimator | Cost per sample | Targets the score of | Where to use |
| --- | --- | --- | --- |
| Explicit (ESM) | requires $\nabla \log p_{\text{data}}$ | $p_{\text{data}}$ | infeasible (oracle only) |
| Implicit (Hyvärinen) | full Jacobian trace, $O(d)$ backward passes | $p_{\text{data}}$ | low-dim models, EBMs, energy parameterization |
| Sliced (Song et al.) | one Jacobian-vector product, $O(1)$ in $d$ | $p_{\text{data}}$ | medium-dim, no noise allowed |
| Denoising (Vincent) | one forward pass, MSE regression | $q_\sigma = p_{\text{data}} \ast \mathcal{N}(0, \sigma^2 I)$ | high-dim image/audio/video; the standard |

The first three rows are equivalent at the population level (modulo constants); they estimate the same target score. DSM trades that equivalence for a tractable conditional regression and learns a smoothed score; this is exactly what diffusion sampling needs, but it is a different estimand from ESM, not a cheaper unbiased estimator of it.

Score matching keeps the vector field and swaps the supervision

The object is always the same: a field that points from low-density regions toward high-density ones. The methods differ only in how they estimate or re-express the target without knowing the partition function.

[Figure: two panels. Left panel: the shared target, the vector field pointing toward high-density regions. Right panel, three rows: explicit target (the unknown data score $\nabla \log p_{\text{data}}(x)$), implicit objective (replace the cross term with $\operatorname{div}\, s_\theta(x)$), denoising objective (regress the score of a Gaussian-corrupted density and reuse it at many noise scales). Modern diffusion training lives on the bottom row because it scales.]

Explicit score matching asks for the inaccessible object directly

This form needs ∇ log p_data(x) as supervision. It is conceptually clean but useless in practice because the data score is exactly the thing we do not know.

Same geometry, cheaper supervision

Each variant still tries to learn the vector field that points toward high-density regions. What changes is only the training signal used to estimate that field.

Why diffusion uses the denoising row

High-dimensional image models cannot afford an exact Jacobian trace. Denoising score matching turns the problem into ordinary regression against a noisy target, which scales to modern diffusion training.

Sampling connection

Once the score field is learned, Langevin dynamics, reverse SDEs, or probability-flow ODEs can use that field to move samples back toward the data manifold.

Use the diagram as a map of the three objectives. The left panel keeps the same target geometry in view, while the right panel switches only the training signal: explicit score matching asks for the unknown data score, implicit score matching swaps that for a divergence term via integration by parts, and denoising score matching regresses on a noisy conditional score.

Hyvärinen's Implicit Score Matching

Definition

Score Function

For a differentiable density $p$ on $\mathbb{R}^d$, the score (or Stein score) is the gradient of its log density, $s(x) = \nabla_x \log p(x)$. It is invariant to multiplicative constants: if $p(x) = \tilde p(x) / Z$ then $\nabla_x \log p(x) = \nabla_x \log \tilde p(x)$. This invariance is what makes score matching a viable training signal for unnormalized models.

Definition

Implicit Score Matching Loss

The implicit score matching loss for a score model $s_\theta : \mathbb{R}^d \to \mathbb{R}^d$ is

$$J_{\text{ISM}}(\theta) = \mathbb{E}_{x \sim p_{\text{data}}} \Big[\, \tfrac{1}{2}\, \|s_\theta(x)\|^2 + \mathrm{tr}\!\left(\nabla_x s_\theta(x)\right) \,\Big],$$

where $\nabla_x s_\theta(x) \in \mathbb{R}^{d \times d}$ is the Jacobian of $s_\theta$ at $x$ and $\mathrm{tr}(\nabla_x s_\theta) = \sum_{i=1}^d \partial_{x_i} (s_\theta)_i$ is its trace, the divergence of the score field. The loss is computable from samples alone. There is no reference to $\nabla_x \log p_{\text{data}}$, no normalization constant.
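To make the loss concrete, here is a minimal PyTorch sketch (the framework choice and the `score_model` callable are illustrative assumptions, not from the original): it computes the batch ISM loss with an exact Jacobian trace, one backward pass per coordinate, which is exactly the $O(d)$ cost discussed below.

```python
import torch

def ism_loss(score_model, x):
    """Implicit score matching: E[ 0.5*||s(x)||^2 + tr(grad_x s(x)) ].
    Exact trace costs one backward pass per data dimension."""
    x = x.detach().requires_grad_(True)
    s = score_model(x)                        # shape (batch, d)
    sq_norm = 0.5 * (s ** 2).sum(dim=1)       # 0.5 * ||s_theta(x)||^2
    trace = torch.zeros(x.shape[0], device=x.device)
    for i in range(x.shape[1]):               # d backward passes
        # per-sample d s_i / d x; assumes score_model maps each row
        # independently (no cross-batch coupling such as batch norm)
        grad_i = torch.autograd.grad(s[:, i].sum(), x, create_graph=True)[0]
        trace = trace + grad_i[:, i]          # pick out d s_i / d x_i
    return (sq_norm + trace).mean()
```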

Theorem

Hyvärinen's Implicit Score Matching Theorem

Statement

Under standard regularity conditions (differentiability of $p_{\text{data}}$ and $s_\theta$, finite expectations of both squared score norms, and decay of $p_{\text{data}}(x)\, s_\theta(x)$ to zero as $\|x\| \to \infty$),

$$\tfrac{1}{2}\, \mathbb{E}_{p_{\text{data}}}\big[\|s_\theta(x) - \nabla_x \log p_{\text{data}}(x)\|^2\big] = \mathbb{E}_{p_{\text{data}}}\Big[\tfrac{1}{2} \|s_\theta(x)\|^2 + \mathrm{tr}(\nabla_x s_\theta(x))\Big] + C(p_{\text{data}}),$$

where $C(p_{\text{data}}) = \tfrac{1}{2}\, \mathbb{E}_{p_{\text{data}}} [\|\nabla_x \log p_{\text{data}}\|^2]$ is half the Fisher information of the data distribution, independent of $\theta$. Minimizing $J_{\text{ISM}}$ is therefore equivalent to minimizing $J_{\text{ESM}}$.

Intuition

The cross term $\mathbb{E}_{p_{\text{data}}}[s_\theta(x) \cdot \nabla_x \log p_{\text{data}}(x)]$ in the explicit loss is what we cannot evaluate. But $s_\theta \cdot \nabla \log p = s_\theta \cdot (\nabla p / p)$, so $\int s_\theta \cdot (\nabla p / p)\, p\, dx = \int s_\theta \cdot \nabla p\, dx$. Integration by parts converts $\int s_\theta \cdot \nabla p\, dx$ into $-\int (\nabla \cdot s_\theta)\, p\, dx = -\mathbb{E}_{p_{\text{data}}}[\mathrm{tr}(\nabla s_\theta)]$. The intractable density gradient becomes a tractable divergence of the model itself.

Proof Sketch

Expand the squared-distance loss: $\|s_\theta - \nabla \log p\|^2 = \|s_\theta\|^2 - 2\, s_\theta \cdot \nabla \log p + \|\nabla \log p\|^2$. The third term integrates to $2 C(p_{\text{data}})$, independent of $\theta$. For the cross term, use $\nabla \log p = \nabla p / p$ to write $\mathbb{E}_p[s_\theta \cdot \nabla \log p] = \int s_\theta \cdot \nabla p\, dx$. Apply integration by parts component-wise: $\int (s_\theta)_i\, \partial_i p\, dx = [(s_\theta)_i\, p]_{-\infty}^{\infty} - \int p\, \partial_i (s_\theta)_i\, dx = -\int p\, \partial_i (s_\theta)_i\, dx$ (the boundary term vanishes by the decay assumption). Summing over $i$ gives $-\mathbb{E}_p[\mathrm{tr}(\nabla s_\theta)]$. Substituting back gives $J_{\text{ESM}} = \tfrac{1}{2}\mathbb{E}_p\|s_\theta\|^2 + \mathbb{E}_p[\mathrm{tr}(\nabla s_\theta)] + C$, which is $J_{\text{ISM}} + C$.

Why It Matters

This single integration-by-parts identity lets you train a model of the score without ever knowing the data density, without normalization constants, and without samples from the model. It is the foundation that makes energy-based models trainable: for a parametrization $s_\theta(x) = -\nabla_x E_\theta(x)$, $J_{\text{ISM}}$ depends on $E_\theta$ only through its gradients, never through $Z(\theta)$. Every later score-matching method (sliced, denoising, multi-noise) exists to estimate this loss more cheaply or to replace it with a tractable surrogate.

Failure Mode

The Jacobian trace $\mathrm{tr}(\nabla_x s_\theta)$ requires $d$ backward passes to compute exactly (one per coordinate of $s_\theta$). For $d \sim 10^4$ to $10^6$ (images, video, audio) this is prohibitive. The implicit loss is theoretically clean but computationally infeasible in high dimensions. The decay assumption also fails for distributions on bounded support (e.g., images in $[0,1]^d$), creating uncontrolled boundary terms; Hyvärinen (2007) extends the framework with reflection or transformation tricks for non-negative data, but the standard Gaussian-tail assumption is genuinely restrictive.

Sliced Score Matching

Theorem

Sliced Score Matching (Song-Garg-Shi-Ermon 2020)

Statement

Define the sliced score matching loss

$$J_{\text{SSM}}(\theta) = \mathbb{E}_{x \sim p_{\text{data}}}\, \mathbb{E}_{v \sim p_v} \Big[\,\tfrac{1}{2}\, (v^\top s_\theta(x))^2 + v^\top \nabla_x s_\theta(x)\, v \,\Big].$$

Then $J_{\text{SSM}}$ is an unbiased estimator of $J_{\text{ISM}}$ up to a $\theta$-independent constant. In particular, $\mathbb{E}_v[v^\top \nabla_x s_\theta(x)\, v] = \mathrm{tr}(\nabla_x s_\theta(x))$ (Hutchinson's trace estimator) and $\mathbb{E}_v[(v^\top s_\theta(x))^2] = \|s_\theta(x)\|^2$.

Intuition

The Jacobian trace is $\sum_i \partial_i (s_\theta)_i$, which costs $d$ backward passes. Hutchinson's trick replaces the deterministic trace with the expectation over a random direction $v$ of the bilinear form $v^\top \nabla s_\theta\, v$, which costs one Jacobian-vector product (a single extra backward pass through $v^\top s_\theta$). The payoff is that you no longer need to enumerate coordinates: high dimension stops being an obstacle, at the price of extra estimator variance.

Proof Sketch

For any matrix $A \in \mathbb{R}^{d \times d}$ and random $v$ with $\mathbb{E}[vv^\top] = I_d$, $\mathbb{E}[v^\top A v] = \mathbb{E}[\mathrm{tr}(A vv^\top)] = \mathrm{tr}(A\, \mathbb{E}[vv^\top]) = \mathrm{tr}(A)$. Apply this with $A = \nabla_x s_\theta(x)$ for the trace term. Similarly, $\mathbb{E}[(v^\top u)^2] = u^\top \mathbb{E}[vv^\top]\, u = \|u\|^2$ for $u = s_\theta(x)$, recovering the squared-norm term. Substituting into $J_{\text{ISM}}$ yields $J_{\text{SSM}}$.
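A sketch of the sliced estimator under the same assumptions as before (PyTorch, a per-sample `score_model`): a single Jacobian-vector product replaces the coordinate loop of the implicit loss.

```python
import torch

def ssm_loss(score_model, x, n_projections=1):
    """Sliced score matching: E_v[ 0.5*(v.s)^2 + v.(grad s).v ],
    with E[v v^T] = I. One extra backward pass per projection,
    regardless of dimension d."""
    x = x.detach().requires_grad_(True)
    s = score_model(x)                        # (batch, d)
    total = torch.zeros(x.shape[0], device=x.device)
    for _ in range(n_projections):
        v = torch.randn_like(x)               # Gaussian directions
        vs = (v * s).sum()                    # sum over batch of v.s
        # per-sample J^T v; v.(J^T v) = v.(J v) because it is a scalar
        jv = torch.autograd.grad(vs, x, create_graph=True)[0]
        total = total + 0.5 * (v * s).sum(dim=1) ** 2 + (jv * v).sum(dim=1)
    return (total / n_projections).mean()
```

The variance-reduced variant (SSM-VR) would replace `0.5 * (v * s).sum(dim=1) ** 2` with the exact `0.5 * (s ** 2).sum(dim=1)`, since the squared-norm term needs no projection.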

Why It Matters

Sliced score matching preserves the noise-free training signal of implicit score matching while eliminating the $O(d)$ scaling. It is the right estimator when noise injection is undesirable: discrete data after relaxation, structured data with manifold support, or settings where you care about the score of $p_{\text{data}}$ directly rather than a noisy $q_\sigma$. The variance-reduced variant (SSM-VR) further cuts the variance by computing the squared-norm term exactly instead of through projections.

Failure Mode

The single-direction estimator has high variance per sample. Practitioners often average over $K$ random projections per data point, recovering some of the cost of full trace evaluation. For very high-dimensional data (megapixel images), denoising score matching still wins on cost per fixed-quality gradient because its variance comes from data sampling rather than from projection sampling, and a noise-conditional model is what downstream samplers actually need.

Denoising Score Matching

Theorem

Vincent's Denoising Score Matching Identity

Statement

Define the denoising score matching loss

$$J_{\text{DSM}}(\theta) = \tfrac{1}{2}\, \mathbb{E}_{x \sim p_{\text{data}},\; \tilde{x} \sim q_\sigma(\cdot \mid x)} \big[\, \|s_\theta(\tilde{x}) - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x)\|^2 \,\big].$$

Then $J_{\text{DSM}}(\theta) = J_{\text{ESM}}^{q_\sigma}(\theta) + C(q_\sigma)$, where $J_{\text{ESM}}^{q_\sigma}$ is the explicit score matching loss against the noisy marginal $q_\sigma(\tilde{x})$ and $C$ is independent of $\theta$. Minimizing $J_{\text{DSM}}$ trains $s_\theta$ to match the score of the noisy data distribution $q_\sigma$.

Intuition

For Gaussian noise $\tilde{x} = x + \sigma \epsilon$, the conditional score is $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) = -(\tilde{x} - x) / \sigma^2 = -\epsilon / \sigma$, which is just the negative noise direction scaled by $1/\sigma$. The DSM loss reduces to $\mathbb{E}\,\|s_\theta(\tilde{x}) + \epsilon / \sigma\|^2$ (up to constants), which is plain least-squares regression of the model output onto the noise that was added. The Jacobian trace disappears entirely; you only need a forward pass and a regression target.
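In code the whole objective is a few lines; a sketch under the same assumed interfaces (PyTorch, a hypothetical `score_model`):

```python
import torch

def dsm_loss(score_model, x, sigma):
    """Denoising score matching at one noise scale: regress s(x_tilde)
    onto -eps/sigma, the conditional score grad log q_sigma(x_tilde|x)."""
    eps = torch.randn_like(x)                 # eps ~ N(0, I)
    x_tilde = x + sigma * eps                 # x_tilde = x + sigma*eps
    target = -eps / sigma                     # the conditional score
    return 0.5 * ((score_model(x_tilde) - target) ** 2).sum(dim=1).mean()
```

Note that no autograd passes through the input are needed; this is why the objective is architecturally indistinguishable from supervised regression.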

Proof Sketch

Expand the explicit score matching loss against $q_\sigma(\tilde{x})$ and substitute $q_\sigma(\tilde{x}) = \mathbb{E}_{x \sim p_{\text{data}}} [q_\sigma(\tilde{x} \mid x)]$. The cross term $\mathbb{E}_{q_\sigma}[s_\theta \cdot \nabla \log q_\sigma]$ becomes $\mathbb{E}_{x, \tilde{x}}[s_\theta(\tilde{x}) \cdot \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x)]$ via $\nabla_{\tilde{x}} q_\sigma(\tilde{x}) = \mathbb{E}_x[\nabla_{\tilde{x}} q_\sigma(\tilde{x} \mid x)]$ (differentiating under the integral sign) followed by Bayes' rule. Algebraic completion of the square then converts the expanded explicit loss into the DSM loss plus a $\theta$-independent constant.

Why It Matters

Denoising score matching scales to image, audio, and video models because its loss is a single regression target per sample: no Jacobians, no divergences, no second-order quantities. This is why diffusion training is architecturally identical to supervised regression: input is a noisy image, target is the noise vector, loss is MSE. Every pixel-space and latent-space diffusion model in production trains exactly this objective across many noise scales. Without Vincent's identity, score-based generative modeling at scale would not exist.

Failure Mode

DSM trains the score of the noisy density $q_\sigma$, not the original $p_{\text{data}}$. For small $\sigma$, the noisy score approximates the data score, but only away from the data support, since the noisy density smooths out the manifold structure of the data. Single-noise-level DSM therefore cannot generate clean samples; you need a schedule of noise levels (NCSN) or a continuous-time SDE noising process (Score-SDE) and a score model conditioned on $\sigma$. The single-scale version is provably a bad sampler in high dimensions because Langevin transitions across disconnected high-density modes are exponentially slow when $\sigma$ is small.

Tweedie's Formula and the Posterior-Mean Denoiser

The score matching framework gives you a vector field, but it does not immediately tell you what that field means for a downstream task like denoising or sampling. Tweedie's formula (Robbins 1956; Efron 2011 for the modern statement) supplies the operational link.

Theorem

Tweedie's Formula

Statement

Under the Gaussian-noising model above,

$$\mathbb{E}[\,x \mid \tilde{x}\,] = \tilde{x} + \sigma^2\, \nabla_{\tilde{x}} \log q_\sigma(\tilde{x}).$$

The posterior mean of the clean signal given the noisy observation equals the noisy observation plus $\sigma^2$ times the score of the noisy density.

Intuition

The score of the noisy density points from low-probability noisy regions toward high-probability noisy regions. Heuristically, the score points "toward where the clean data probably lives" given what was observed. The $\sigma^2$ scaling is exactly the correction needed so that the shift recovers the Bayes-optimal denoiser, not just a direction.

Proof Sketch

Differentiate $q_\sigma(\tilde{x}) = \int q_\sigma(\tilde{x} \mid x)\, p_{\text{data}}(x)\, dx$ in $\tilde{x}$: $\nabla_{\tilde{x}} q_\sigma(\tilde{x}) = \int \nabla_{\tilde{x}} q_\sigma(\tilde{x} \mid x)\, p_{\text{data}}(x)\, dx$. For Gaussian $q_\sigma(\tilde{x} \mid x)$, $\nabla_{\tilde{x}} q_\sigma(\tilde{x} \mid x) = -\frac{\tilde{x} - x}{\sigma^2}\, q_\sigma(\tilde{x} \mid x)$. Substitute and divide by $q_\sigma(\tilde{x})$: $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x}) = -\frac{1}{\sigma^2}\, \mathbb{E}[\tilde{x} - x \mid \tilde{x}] = -\frac{\tilde{x} - \mathbb{E}[x \mid \tilde{x}]}{\sigma^2}$. Rearranging gives the formula.

Why It Matters

Tweedie identifies the trained score network with a posterior-mean denoiser: $\hat{x}_\theta(\tilde{x}, \sigma) = \tilde{x} + \sigma^2\, s_\theta(\tilde{x}, \sigma)$. This is why "diffusion" and "denoising" are interchangeable framings; they both train the same network, parameterized differently. Karras et al. (2022) make this explicit in EDM: the network outputs a denoised estimate $D_\theta$, and the score is recovered as $s_\theta = (D_\theta - \tilde{x}) / \sigma^2$. The same identity also underpins diffusion posterior sampling methods (DPS, Chung et al. 2023) that reuse a pre-trained score network for inverse problems via Tweedie's mean as a plug-in posterior estimate.
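As a code sketch (same assumed PyTorch interfaces as above, with a noise-conditional `score_model(x, sigma)`), the identity is a one-liner:

```python
def tweedie_denoise(score_model, x_tilde, sigma):
    """Posterior-mean denoiser from a trained score network (Tweedie):
    E[x | x_tilde] = x_tilde + sigma^2 * s_theta(x_tilde, sigma)."""
    return x_tilde + sigma ** 2 * score_model(x_tilde, sigma)
```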

Failure Mode

Tweedie's mean is a first-moment statement; it does not characterize the posterior shape. For multimodal posteriors (e.g., a noisy image consistent with several plausible clean images), the mean is the average of the modes and is not itself a sample from the posterior. Diffusion samplers do not use Tweedie as a one-shot denoiser for this reason; they walk the posterior across noise scales and the posterior mean is informative only locally. Also, Tweedie holds exactly for Gaussian noise; for other noise families, the generalized Tweedie correction involves higher-order derivatives and is rarely as clean.

The DDPM $\epsilon$-prediction parameterization (Ho-Jain-Abbeel 2020) is one specific choice in the family Tweedie's formula opens up. Karras et al. (2022) catalog the practical alternatives (score prediction $s_\theta$, $\epsilon$-prediction, denoiser prediction $D_\theta$, and $v$-prediction from Salimans-Ho 2022) and show they differ only in how loss-weighting interacts with the noise schedule. Across the family, Tweedie's identity is the algebraic glue.

One Target, Several Parameterizations

Under Gaussian noising, the same learned object can be written in several different heads. This is why diffusion papers can talk past one another while still training essentially the same model family.

| Network output | Relation to the noisy score | What the loss looks like |
| --- | --- | --- |
| score head $s_\theta(\tilde{x}, \sigma)$ | directly approximates $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x})$ | explicit score regression |
| noise head $\epsilon_\theta(\tilde{x}, \sigma)$ | $s_\theta(\tilde{x}, \sigma) = -\epsilon_\theta(\tilde{x}, \sigma) / \sigma$ | DDPM-style MSE on added noise |
| denoiser head $D_\theta(\tilde{x}, \sigma)$ | $s_\theta(\tilde{x}, \sigma) = (D_\theta(\tilde{x}, \sigma) - \tilde{x}) / \sigma^2$ | posterior-mean denoiser via Tweedie |

The object that changes least across papers is the noisy score field. What changes is the algebraic parameterization, the noise schedule, and the weighting of the per-scale loss. The popular $v$-prediction parameterization is another schedule-dependent linear rewrite of the same family, mainly used to improve optimization and numerical stability in VP-style diffusion models.
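The table's relations are mechanical; a sketch of the conversions (function names are illustrative, not from any particular library):

```python
def score_from_eps(eps_pred, sigma):
    """Noise head -> score: s = -eps_hat / sigma."""
    return -eps_pred / sigma

def score_from_denoiser(d_pred, x_tilde, sigma):
    """Denoiser head -> score (Tweedie): s = (D - x_tilde) / sigma^2."""
    return (d_pred - x_tilde) / sigma ** 2

def denoiser_from_eps(eps_pred, x_tilde, sigma):
    """Noise head -> denoiser: D = x_tilde - sigma * eps_hat,
    consistent with both identities above."""
    return x_tilde - sigma * eps_pred
```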

Consistency and Statistical Properties

Theorem

Asymptotic Consistency of the Score Matching Estimator

Statement

Let $\hat{\theta}_n = \arg\min_\theta \hat{J}_{\text{ISM}}(\theta)$ be the sample minimizer of the implicit score matching objective on $n$ iid samples from $p_{\text{data}}$. Assume the model is well-specified ($p_{\text{data}} = p_{\theta^*}$ for an identifiable $\theta^*$) and the regularity conditions of the implicit score matching theorem hold. Then $\hat{\theta}_n$ is consistent: $\hat{\theta}_n \to \theta^*$ in probability as $n \to \infty$. Furthermore, $\hat{\theta}_n$ is asymptotically normal with a covariance matrix that is in general strictly larger (in PSD order) than the maximum-likelihood Cramér-Rao bound.

Intuition

The population $J_{\text{ISM}}$ is uniquely minimized at $\theta^*$ by the identification assumption: $J_{\text{ISM}}(\theta^*) = J_{\text{ESM}}(\theta^*) - C = -C$ achieves its minimum. Standard M-estimator theory transfers consistency from the population minimizer to the sample minimizer. The efficiency loss relative to MLE is the price you pay for sidestepping the partition function: you replace the score function of the likelihood with a different (but still unbiased) estimating equation.

Proof Sketch

Hyvärinen (2005, Theorem 2) proves consistency under the regularity conditions above using standard M-estimator arguments. For asymptotic normality and the efficiency comparison to MLE, Hyvärinen's 2005 analysis and follow-up work on structured exponential-family settings make the same point: score matching is a valid M-estimator, but not an efficient replacement for maximum likelihood. The result generalizes to denoising and sliced variants under matching regularity, with the efficiency gap depending on noise scale and projection variance respectively.

Why It Matters

Consistency is what justifies score matching as a real estimator, not just a heuristic. In the well-specified setting it converges to the truth with infinite data, and the rate is parametric ($n^{-1/2}$). The efficiency loss relative to MLE matters in low-data regimes but is irrelevant for diffusion models trained on tens of millions of images. In the misspecified setting (the model family does not contain the truth), score matching converges to the $\theta$ that minimizes the score-matching divergence rather than KL, and the two minima can be substantially different. This is one reason score-matched models sometimes look qualitatively different from MLE-fit models even on toy problems.

Failure Mode

The consistency result requires the boundary terms in the integration-by-parts derivation to vanish, which fails for densities with bounded support or sharp cutoffs. It also requires the model to be twice differentiable in $\theta$, which is a non-issue for neural networks at typical points but a real concern at activation kinks. Identifiability can fail silently for overparameterized models; in that case score matching converges to a set, not a point, and the asymptotic-normality statement does not apply.

NCSN and the Multi-Noise Trick

A single noise scale leaves an irreducible bias-variance tradeoff. Small $\sigma$ keeps the noisy score close to the data score but leaves the disconnected-modes problem; large $\sigma$ smooths across modes but trains a score that has been pulled far from the data manifold. Song and Ermon (2019) resolved this with the Noise-Conditional Score Network (NCSN): train a single network $s_\theta(x, \sigma)$ on a geometric schedule of noise scales $\sigma_1 > \sigma_2 > \cdots > \sigma_L$, weighting each scale's loss by $\lambda(\sigma) = \sigma^2$ to balance the contributions. At sample time, run annealed Langevin dynamics: start from pure noise at scale $\sigma_1$, run several Langevin steps at each scale, then anneal to the next.
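A sketch of both halves under the usual assumptions (PyTorch; `score_model(x, sigma)` is a hypothetical noise-conditional interface; `sigmas` is a descending 1-D tensor; `x` is flattened to shape `(batch, d)`):

```python
import torch

def ncsn_loss(score_model, x, sigmas):
    """Multi-scale DSM: sample one scale per example, weight by
    lambda(sigma) = sigma^2. Algebraically,
    sigma^2 * ||s + eps/sigma||^2 = ||sigma*s + eps||^2."""
    idx = torch.randint(len(sigmas), (x.shape[0],), device=x.device)
    sigma = sigmas[idx].view(-1, 1)
    eps = torch.randn_like(x)
    x_tilde = x + sigma * eps
    s = score_model(x_tilde, sigma)
    return 0.5 * ((sigma * s + eps) ** 2).sum(dim=1).mean()

@torch.no_grad()
def annealed_langevin(score_model, shape, sigmas, steps=100, eps0=2e-5):
    """Annealed Langevin sampling: Langevin steps at each scale, with
    step size alpha scaled by (sigma / sigma_L)^2 as in Song & Ermon."""
    x = torch.randn(shape) * sigmas[0]        # init at the largest scale
    for sigma in sigmas:                      # sigma_1 > ... > sigma_L
        alpha = eps0 * (sigma / sigmas[-1]) ** 2
        for _ in range(steps):
            z = torch.randn_like(x)
            x = x + 0.5 * alpha * score_model(x, sigma) + alpha ** 0.5 * z
    return x
```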

The Score-SDE framework (Song et al. 2021) takes the continuous-time limit: the discrete schedule becomes a forward SDE $dx = f(x, t)\, dt + g(t)\, dW$, the per-scale objectives become a single time-integrated DSM loss

$$\mathcal{L}(\theta) = \mathbb{E}_t\, \mathbb{E}_{x_0, x_t}\big[\, \lambda(t)\, \|s_\theta(x_t, t) - \nabla_{x_t} \log q_t(x_t \mid x_0)\|^2 \,\big],$$

and Anderson's time-reversal theorem supplies the reverse SDE that turns the trained score into a sampler. DDPM falls out as the variance-preserving (VP) SDE; NCSN falls out as the variance-exploding (VE) SDE; latent diffusion (Stable Diffusion) is the same loss applied to the output of a pretrained VAE encoder.
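For the continuous-time view, a minimal Euler-Maruyama sketch of the reverse-time SDE in the VE case (zero forward drift, $f = 0$); the diffusion coefficient `g(t)` and the prior scale `sigma_max` are left as assumptions of the schedule:

```python
import torch

@torch.no_grad()
def reverse_ve_sde(score_model, shape, g, sigma_max, T=1.0, n_steps=1000):
    """Integrate Anderson's reverse SDE,
    dx = -g(t)^2 s_theta(x, t) dt + g(t) dW_bar,
    from t = T down to 0 (VE case: f = 0)."""
    dt = T / n_steps
    x = torch.randn(shape) * sigma_max        # x_T approx N(0, sigma_max^2 I)
    for i in range(n_steps):
        t = T - i * dt                        # walk time backward
        z = torch.randn_like(x)
        # backward Euler-Maruyama step of size dt
        x = x + (g(t) ** 2) * score_model(x, t) * dt + g(t) * dt ** 0.5 * z
    return x
```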

Worked Example: Gaussian Score Matching

Take $p_{\text{data}} = \mathcal{N}(\mu, \Sigma)$ and a linear score model $s_\theta(x) = A(x - b)$ for $A \in \mathbb{R}^{d \times d}$, $b \in \mathbb{R}^d$. The true score is $\nabla \log p(x) = -\Sigma^{-1}(x - \mu)$, so the optimum is $A^* = -\Sigma^{-1}$, $b^* = \mu$.

Compute the implicit score matching loss: $\tfrac{1}{2}\|A(x - b)\|^2 + \mathrm{tr}(A) = \tfrac{1}{2}(x - b)^\top A^\top A\, (x - b) + \mathrm{tr}(A)$. Take the expectation under $p_{\text{data}}$: $\mathbb{E}[\tfrac{1}{2}(x - b)^\top A^\top A (x - b)] = \tfrac{1}{2}\mathrm{tr}(A^\top A\, \Sigma) + \tfrac{1}{2}(\mu - b)^\top A^\top A (\mu - b)$.

Setting the gradient in $b$ to zero gives $b^* = \mu$. Setting the gradient in $A$ to zero gives $A^* \Sigma + I = 0$, i.e., $A^* = -\Sigma^{-1}$. The implicit loss recovers the exact maximum-likelihood Gaussian fit, with no direct access to the density. This is the cleanest demonstration of the Hyvärinen identity in closed form, and it confirms the consistency result constructively for the Gaussian family.
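A numerical check of this example (a sketch; PyTorch with SGD in place of the closed form): for a linear model no autograd trace is needed, since $\mathrm{tr}(\nabla s_\theta) = \mathrm{tr}(A)$.

```python
import torch

torch.manual_seed(0)
d = 3
mu = torch.randn(d)
L = torch.eye(d) + 0.3 * torch.randn(d, d)
Sigma = L @ L.T                               # ground-truth covariance
x = mu + torch.randn(20000, d) @ L.T          # samples from N(mu, Sigma)

A = torch.zeros(d, d, requires_grad=True)
b = torch.zeros(d, requires_grad=True)
opt = torch.optim.Adam([A, b], lr=0.05)
for _ in range(3000):
    opt.zero_grad()
    s = (x - b) @ A.T                         # s_theta(x) = A(x - b)
    # ISM loss: 0.5*E||s||^2 + tr(A), no density access anywhere
    loss = 0.5 * (s ** 2).sum(dim=1).mean() + torch.trace(A)
    loss.backward()
    opt.step()

# Should be small: A converges to -Sigma^{-1} and b to mu,
# up to sampling error in the 20k-sample batch.
print((A + torch.linalg.inv(Sigma)).abs().max().item(),
      (b - mu).abs().max().item())
```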

Common Confusions

Watch Out

This score is a gradient in x, not in θ

In classical statistics, the word score often means the likelihood score $\nabla_\theta \log p_\theta(x)$, a derivative with respect to model parameters. In score matching, the score is $\nabla_x \log p(x)$, a vector field over data space. These are different objects with different dimensions, different roles, and different theorems. Score matching trains the data-space score field; it is not solving the likelihood score equation in parameter space.

Watch Out

Score matching is not minimizing KL to the data distribution

Score matching minimizes the squared $L^2$ distance between score fields, not the KL divergence between distributions. The two objectives have different gradients and different optima in general. They agree at the global minimum (both achieve zero only when $p_\theta = p_{\text{data}}$), but the geometry of the optimization landscape is different. In particular, score matching is consistent under model well-specification, but the bias-variance tradeoff in finite samples differs from MLE, and on misspecified models the two converge to different projections of $p_{\text{data}}$ onto the model family.

Watch Out

The Jacobian trace is not the same as the model output's norm

Beginners sometimes confuse $\mathrm{tr}(\nabla s_\theta)$ with $\|s_\theta\|^2$ or with the Jacobian's Frobenius norm. The trace is the divergence of the score field, $\sum_i \partial_{x_i} (s_\theta)_i$, which can be negative, zero, or positive independently of the score's magnitude. The implicit score matching loss has both terms, and the trace term is what makes the loss work; without it, the optimum would be $s_\theta \equiv 0$.

Watch Out

Denoising score matching does not train the data score directly

DSM trains $s_\theta(\tilde{x}) \approx \nabla_{\tilde{x}} \log q_\sigma(\tilde{x})$, the score of the noisy data density $q_\sigma$. As $\sigma \to 0$, $q_\sigma \to p_{\text{data}}$ and the DSM target approaches the data score, but for any fixed $\sigma > 0$ they are different. To learn the data score itself, you need a schedule of $\sigma$ values approaching zero, or a continuous-time SDE formulation where $\sigma(t)$ is part of the model.

Watch Out

Score matching does not require samples from the model

Unlike contrastive divergence and other EBM training methods, score matching needs no MCMC sampling from $p_\theta$ during training. The only data it consumes are samples from $p_{\text{data}}$. This is what makes it practically attractive for high-dimensional EBMs where MCMC mixing is the bottleneck. The price is the Jacobian trace (implicit), the projection variance (sliced), or the smoothing bias (denoising), but never an expectation over the model.

Watch Out

Tweedie's formula gives a posterior mean, not a sample

$\mathbb{E}[x \mid \tilde{x}] = \tilde{x} + \sigma^2\, \nabla \log q_\sigma$ is a mean, not a draw. For a noisy image consistent with several plausible clean images, the Tweedie estimate is an average of those modes, often a blurry interpolation that is not itself a plausible clean image. Diffusion samplers walk the posterior across noise scales precisely because the posterior is multimodal and a one-shot mean is not a generative sample.

Summary

  • Score matching learns a data-space score field, not the parameter-space likelihood score.
  • Hyvärinen's theorem removes the unknown data score by trading it for the divergence of the model score field.
  • Sliced score matching keeps the same target but estimates the divergence stochastically with random projections.
  • Denoising score matching changes the target to the score of a noisy marginal, turning training into a regression problem.
  • Tweedie's formula identifies that noisy score with a posterior-mean denoiser under Gaussian corruption.
  • Modern diffusion models are multi-scale denoising score matching plus a reverse-time sampler.

Exercises

Exercise (Core)

Problem

Derive the denoising score matching target for Gaussian noise: $\tilde{x} = x + \sigma \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$ and $x \sim p_{\text{data}}$. Show that the DSM loss reduces, up to multiplicative and additive constants in $\theta$, to $\mathbb{E}_{x, \epsilon}\,\|\sigma\, s_\theta(\tilde{x}) + \epsilon\|^2$.

Exercise (Advanced)

Problem

Prove Tweedie's formula: for $\tilde{x} = x + \sigma \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$ independent of $x \sim p_{\text{data}}$, and $q_\sigma(\tilde{x}) = \int q_\sigma(\tilde{x} \mid x)\, p_{\text{data}}(x)\, dx$,

$$\mathbb{E}[x \mid \tilde{x}] = \tilde{x} + \sigma^2\, \nabla_{\tilde{x}} \log q_\sigma(\tilde{x}).$$
Exercise (Advanced)

Problem

Show that the sliced score matching estimator with $v \sim \mathcal{N}(0, I_d)$ is unbiased for $J_{\text{ISM}}$. Compute the variance of its single-direction trace estimator $v^\top \nabla s_\theta\, v$ and explain why averaging over $K$ iid directions reduces the variance by a factor of $K$.

Exercise (Advanced)

Problem

Consider score matching applied to an energy-based model $p_\theta(x) = e^{-E_\theta(x)} / Z(\theta)$ with $s_\theta(x) = -\nabla_x E_\theta(x)$. Show that the implicit score matching loss $J_{\text{ISM}}(\theta)$ depends on $E_\theta$ only through its first and second derivatives, and explicitly never through $Z(\theta)$.

Exercise (Research)

Problem

Single-noise-level DSM is provably a poor sampler in high dimensions because of disconnected-modes mixing problems for Langevin dynamics. NCSN solves this with a geometric noise schedule and annealed Langevin. Sketch why a linear (rather than geometric) schedule of $\sigma$ values is suboptimal for sampling, and give an intuition based on how the data manifold's log-density curvature scales with noise.

References

Canonical:

  • Hyvärinen, Estimation of non-normalized statistical models by score matching (Journal of Machine Learning Research 6, 2005). The original paper. Introduces implicit score matching, the integration-by-parts identity, and the consistency theorem.
  • Vincent, A connection between score matching and denoising autoencoders (Neural Computation 23, 2011). Establishes the equivalence between DSM and explicit score matching against the noisy marginal $q_\sigma$.
  • Song, Garg, Shi, and Ermon, Sliced score matching: a scalable approach to density and score estimation (UAI 2020), arXiv:1905.07088. Hutchinson's-trick estimator that replaces the $O(d)$ Jacobian trace with one Jacobian-vector product.
  • Robbins, An empirical Bayes approach to statistics (Berkeley Symposium 1956), Section 4. The original derivation of Tweedie's formula in the empirical Bayes setting; the score-as-denoiser identity predates score matching by 50 years.
  • Efron, Tweedie's formula and selection bias (Journal of the American Statistical Association 106, 2011). Modern restatement, generalizations to non-Gaussian noise, and extensive discussion of statistical applications.

Current:

  • Song and Ermon, Generative modeling by estimating gradients of the data distribution (NeurIPS 2019), arXiv:1907.05600. The NCSN paper. Multi-scale score matching with annealed Langevin sampling.
  • Song, Sohl-Dickstein, Kingma, Kumar, Ermon, and Poole, Score-based generative modeling through stochastic differential equations (ICLR 2021), arXiv:2011.13456. Continuous-time SDE formulation that unifies NCSN, DDPM, and probability-flow ODEs under one score-matching framework.
  • Ho, Jain, and Abbeel, Denoising diffusion probabilistic models (NeurIPS 2020), arXiv:2006.11239. Reparameterizes DSM as $\epsilon$-prediction; the loss form used in virtually all modern image diffusion models.
  • Karras, Aittala, Aila, and Laine, Elucidating the design space of diffusion-based generative models (NeurIPS 2022), arXiv:2206.00364. Treats the score network as a posterior-mean denoiser via Tweedie's formula and analyzes loss-weighting across noise scales.
  • Salimans and Ho, Progressive distillation for fast sampling of diffusion models (ICLR 2022), arXiv:2202.00512. Introduces the $v$-prediction parameterization and shows it is more numerically stable than $\epsilon$-prediction at large $\sigma$.

Extensions:

  • Hyvärinen, Some extensions of score matching (Computational Statistics and Data Analysis 51, 2007). Extends the framework to non-negative data and bounded supports via reflection / transformation tricks.
  • Lyu, Interpretation and generalization of score matching (UAI 2009). Connects score matching to a family of generalized scoring rules and discusses regularization.
  • Chung, Kim, McCann, Klasky, and Ye, Diffusion posterior sampling for general noisy inverse problems (ICLR 2023), arXiv:2209.14687. Uses Tweedie's mean as a plug-in posterior estimator inside a guided diffusion sampler.
  • Song and Kingma, How to train your energy-based models (arXiv:2101.03288, 2021). Reviews score matching, contrastive divergence, and noise-contrastive estimation side by side as EBM training methods.

Next Topics

  • Diffusion Models: the generative modeling framework built on multi-scale denoising score matching with Anderson's reverse-time SDE.
  • Time Reversal of SDEs: Anderson's theorem that turns the learned score into a generative sampler.
  • Langevin Dynamics: the SDE that the learned score field drives at sampling time.
  • Energy-Based Models: the unnormalized-density framework that motivates score matching in the first place.
  • Fokker-Planck Equation: the PDE that governs the noisy marginals $q_\sigma(\tilde{x})$ along the noising schedule.
  • Stochastic Differential Equations: the continuous-time framework for the forward noising process.

Last reviewed: April 19, 2026
