ML Methods
Score Matching
Hyvärinen 2005: train a model to estimate the score (gradient of log density) without computing the normalization constant. Integration by parts converts the intractable density-matching loss into a tractable gradient-based objective. Sliced score matching makes the Jacobian-trace term scale to high dimensions, denoising score matching reparameterizes the loss as ε-regression, and Tweedie's formula identifies the score with a posterior-mean denoiser. Together these are the training half of every modern diffusion model and energy-based model.
Why This Matters
Most expressive probability models cannot be trained by maximum likelihood because the normalization constant is intractable. Energy-based models, score-based diffusion, and most generative samplers all face this obstacle. Score matching (Hyvärinen 2005) is the workaround: instead of fitting the density directly, fit the score $s(x) = \nabla_x \log p(x)$. The score is invariant to the partition function: if $p_\theta(x) = \tilde{p}_\theta(x)/Z_\theta$, then $\nabla_x \log p_\theta(x) = \nabla_x \log \tilde{p}_\theta(x)$, so matching scores never requires knowing $Z_\theta$.
Hyvärinen turned this idea into a tractable loss via integration by parts: the squared-distance objective between model and data scores reduces, up to a $\theta$-independent constant, to a quantity computable from samples plus a single Jacobian-trace term. Vincent (2011) replaced the Jacobian trace with a denoising target, giving the denoising score matching loss that scales to high dimensions. Song, Garg, Shi, and Ermon (2020) gave the sliced variant that estimates the trace stochastically, recovering tractability without any noise injection. Robbins (1956) and Efron (2011) supply the third piece, Tweedie's formula, which identifies the score of a Gaussian-noised density with a posterior-mean denoiser, fixing the operational meaning of what a trained score network actually represents.
Song and Ermon (2019) added multiple noise scales and turned the loss into the noise-conditional score network (NCSN); Song et al. (2021) wrapped the whole setup inside a forward noising SDE and used Anderson's time-reversal theorem to turn the learned score into a generative sampler. The thread from Hyvärinen 2005 to Stable Diffusion is straight: every score-based generative model is doing denoising score matching at multiple noise levels, then plugging the learned score into a reverse SDE or probability-flow ODE. If you understand score matching, you understand the training half of diffusion models.
Mental Model
The score $\nabla_x \log p(x)$ is a vector field on $\mathbb{R}^d$ that points from low-density regions toward high-density regions. It is the direction of steepest log-density increase and the gradient field that Langevin dynamics follows to sample from $p$: the diffusion $dx_t = \nabla_x \log p(x_t)\,dt + \sqrt{2}\,dW_t$ has $p$ as its stationary distribution.
Score matching trains a parametric vector field $s_\theta(x)$ to approximate this field. The natural loss is the squared error between data and model scores, weighted by the data density:

$$J_{\mathrm{ESM}}(\theta) = \tfrac{1}{2}\,\mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\left\lVert s_\theta(x) - \nabla_x \log p_{\mathrm{data}}(x)\right\rVert^2\right].$$
This is the explicit score matching loss. It is intractable because we do not know $\nabla_x \log p_{\mathrm{data}}$. The story of score matching is the story of progressively more practical surrogates. The first three (ESM, ISM, SSM) all target the same population minimizer — the score of the clean data density $p_{\mathrm{data}}$. Denoising score matching (DSM) is different: it targets the score of the noisy density $p_\sigma = p_{\mathrm{data}} * \mathcal{N}(0, \sigma^2 I)$, and only recovers the clean score in the limit $\sigma \to 0$. So DSM is not an unbiased estimator of the ESM loss.
| Estimator | Cost per sample | Targets the score of | Where to use |
|---|---|---|---|
| Explicit (ESM) | requires $\nabla_x \log p_{\mathrm{data}}$ | $p_{\mathrm{data}}$ | infeasible (oracle only) |
| Implicit (Hyvärinen) | full Jacobian trace, $d$ backward passes | $p_{\mathrm{data}}$ | low-dim models, EBMs, energy parameterization |
| Sliced (Song et al.) | one Jacobian-vector product, $O(1)$ in $d$ | $p_{\mathrm{data}}$ | medium-dim, no noise allowed |
| Denoising (Vincent) | one forward pass, MSE regression | $p_\sigma$ (noisy marginal) | high-dim image/audio/video; the standard |
The first three rows are equivalent at the population level (modulo constants); they estimate the same target score. DSM trades that equivalence for a tractable conditional regression and learns a smoothed score; this is exactly what diffusion sampling needs, but it is a different estimand from ESM, not a cheaper unbiased estimator of it.
Score matching keeps the vector field and swaps the supervision
The object is always the same: a field that points from low-density regions toward high-density ones. The methods differ only in how they estimate or re-express the target without knowing the partition function.
Explicit score matching asks for the inaccessible object directly
This form needs $\nabla_x \log p_{\mathrm{data}}(x)$ as supervision. It is conceptually clean but useless in practice because the data score is exactly the thing we do not know.
Same geometry, cheaper supervision
Each variant still tries to learn the vector field that points toward high-density regions. What changes is only the training signal used to estimate that field.
Why diffusion uses the denoising row
High-dimensional image models cannot afford an exact Jacobian trace. Denoising score matching turns the problem into ordinary regression against a noisy target, which scales to modern diffusion training.
Sampling connection
Once the score field is learned, Langevin dynamics, reverse SDEs, or probability-flow ODEs can use that field to move samples back toward the data manifold.
Use the diagram as a map of the three objectives. The left panel keeps the same target geometry in view, while the right panel switches only the training signal: explicit score matching asks for the unknown data score, implicit score matching swaps that for a divergence term via integration by parts, and denoising score matching regresses on a noisy conditional score.
Hyvärinen's Implicit Score Matching
Score Function
For a differentiable density $p$ on $\mathbb{R}^d$, the score (or Stein score) is the gradient of its log density, $s(x) = \nabla_x \log p(x)$. It is invariant to multiplicative constants: if $p(x) = \tilde{p}(x)/Z$ then $\nabla_x \log p(x) = \nabla_x \log \tilde{p}(x)$. This invariance is what makes score matching a viable training signal for unnormalized models.
Implicit Score Matching Loss
The implicit score matching loss for a score model $s_\theta : \mathbb{R}^d \to \mathbb{R}^d$ is

$$J_{\mathrm{ISM}}(\theta) = \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\tfrac{1}{2}\left\lVert s_\theta(x)\right\rVert^2 + \operatorname{tr}\!\big(\nabla_x s_\theta(x)\big)\right],$$

where $\nabla_x s_\theta(x)$ is the Jacobian of $s_\theta$ at $x$ and $\operatorname{tr}(\nabla_x s_\theta(x))$ is its trace, the divergence of the score field. The loss is computable from samples alone. There is no reference to $p_{\mathrm{data}}$'s score, no normalization constant.
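As a concrete sketch, the empirical ISM loss can be evaluated directly when the model's Jacobian is available in closed form. The snippet below (an illustrative sketch, not from the original paper) uses a linear score model $s(x) = Wx + b$, whose Jacobian is the constant matrix $W$, on standard-normal data; the true score field $s(x) = -x$ should achieve a lower empirical loss than a mis-scaled one.

```python
import numpy as np

def ism_loss(samples, W, b):
    """Empirical implicit score matching loss for the linear score model
    s(x) = W x + b: mean of 0.5*||s(x)||^2 plus tr(Jacobian of s).
    For a linear model the Jacobian is the constant matrix W."""
    s = samples @ W.T + b                 # model scores, shape (n, d)
    return np.mean(0.5 * np.sum(s**2, axis=1)) + np.trace(W)

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 2))         # samples from N(0, I), d = 2

# True score of N(0, I) is s(x) = -x, i.e. W = -I, b = 0; a mis-scaled
# field should score worse under the same empirical loss.
loss_true = ism_loss(X, -np.eye(2), np.zeros(2))
loss_wrong = ism_loss(X, -2.0 * np.eye(2), np.zeros(2))
```

For a neural score model the trace term is what requires $d$ backward passes; the linear model sidesteps that cost, which is exactly why it is only a pedagogical sketch.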
Hyvärinen's Implicit Score Matching Theorem
Statement
Under the assumptions above,

$$J_{\mathrm{ESM}}(\theta) = J_{\mathrm{ISM}}(\theta) + C, \qquad C = \tfrac{1}{2}\,\mathbb{E}_{p_{\mathrm{data}}}\!\left[\left\lVert \nabla_x \log p_{\mathrm{data}}(x)\right\rVert^2\right],$$

where $C$ is half the data Fisher information, independent of $\theta$. Minimizing $J_{\mathrm{ISM}}$ is therefore equivalent to minimizing $J_{\mathrm{ESM}}$.
Intuition
The cross term $-\mathbb{E}\big[s_\theta(x)^\top \nabla_x \log p_{\mathrm{data}}(x)\big]$ in the explicit loss is what we cannot evaluate. But $\nabla_x \log p_{\mathrm{data}} = \nabla_x p_{\mathrm{data}} / p_{\mathrm{data}}$, so $\mathbb{E}_{p_{\mathrm{data}}}\big[s_\theta^\top \nabla_x \log p_{\mathrm{data}}\big] = \int s_\theta(x)^\top \nabla_x p_{\mathrm{data}}(x)\,dx$. Integration by parts converts this into $-\mathbb{E}\big[\operatorname{tr}(\nabla_x s_\theta(x))\big]$. The intractable density gradient becomes a tractable divergence of the model itself.
Proof Sketch
Expand the squared-distance loss: $J_{\mathrm{ESM}} = \mathbb{E}\big[\tfrac{1}{2}\lVert s_\theta\rVert^2\big] - \mathbb{E}\big[s_\theta^\top \nabla_x \log p_{\mathrm{data}}\big] + \mathbb{E}\big[\tfrac{1}{2}\lVert \nabla_x \log p_{\mathrm{data}}\rVert^2\big]$. The third term integrates to $C$, independent of $\theta$. For the cross term, use $p_{\mathrm{data}}\,\nabla_x \log p_{\mathrm{data}} = \nabla_x p_{\mathrm{data}}$ to write $\mathbb{E}\big[s_\theta^\top \nabla_x \log p_{\mathrm{data}}\big] = \int s_\theta(x)^\top \nabla_x p_{\mathrm{data}}(x)\,dx$. Apply integration by parts component-wise: $\int s_{\theta,i}(x)\,\partial_i p_{\mathrm{data}}(x)\,dx = -\int p_{\mathrm{data}}(x)\,\partial_i s_{\theta,i}(x)\,dx$ (boundary term vanishes by the decay assumption). Summing over $i$ gives $-\mathbb{E}\big[\operatorname{tr}(\nabla_x s_\theta)\big]$. Substituting back gives $J_{\mathrm{ESM}} = \mathbb{E}\big[\tfrac{1}{2}\lVert s_\theta\rVert^2 + \operatorname{tr}(\nabla_x s_\theta)\big] + C$, which is $J_{\mathrm{ISM}} + C$.
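The identity can be checked numerically in one dimension by quadrature. The sketch below (assumed setup: standard-normal data and a linear score model $s(x) = ax + c$, chosen for illustration) computes both losses on a grid and confirms their gap equals the constant $\tfrac{1}{2}\mathbb{E}\big[(\tfrac{d}{dx}\log p)^2\big] = \tfrac{1}{2}$ regardless of $(a, c)$.

```python
import numpy as np

# Check J_ESM = J_ISM + C by quadrature for 1-D N(0, 1) data and a
# linear score model s(x) = a*x + c (illustrative choice of model family).
x = np.linspace(-10.0, 10.0, 200_001)
dx = x[1] - x[0]
p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # data density
data_score = -x                               # d/dx log p(x)

def esm(a, c):
    s = a * x + c
    return np.sum(p * 0.5 * (s - data_score)**2) * dx

def ism(a, c):
    s = a * x + c
    return np.sum(p * (0.5 * s**2 + a)) * dx  # s'(x) = a is the divergence term

# The gap should be 0.5 * E[(d/dx log p)^2] = 0.5 for every (a, c).
gaps = [esm(a, c) - ism(a, c) for a, c in [(-1.0, 0.0), (2.0, 1.5), (0.3, -0.7)]]
```

The $\theta$-independence of the gap is the whole theorem: the two objectives have identical gradients in $(a, c)$.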
Why It Matters
This single integration-by-parts identity lets you train a model of the score without ever knowing the data density, without normalization constants, and without samples from the model. It is the foundation that makes energy-based models trainable. For an energy parametrization $s_\theta(x) = -\nabla_x E_\theta(x)$, the loss depends on $E_\theta$ only through its first and second derivatives, never through $\log Z_\theta$. Every later score-matching method (denoising, sliced, multi-noise) is a way of estimating this objective, or a smoothed version of it, more efficiently.
Failure Mode
The Jacobian trace requires $d$ backward passes to compute exactly (one per coordinate of $x$). For $d = 10^4$ to $10^6$ (images, video, audio) this is prohibitive. The implicit loss is theoretically clean but computationally infeasible in high dimensions. The decay assumption also fails for distributions on bounded support (e.g., images in $[0, 1]^d$), creating uncontrolled boundary terms; Hyvärinen (2007) extends the framework with reflection or transformation tricks for non-negative data, but the standard Gaussian-tail assumption is genuinely restrictive.
Sliced Score Matching
Sliced Score Matching (Song-Garg-Shi-Ermon 2020)
Statement
Define the sliced score matching loss

$$J_{\mathrm{SSM}}(\theta) = \mathbb{E}_{x \sim p_{\mathrm{data}}}\,\mathbb{E}_{v}\!\left[\tfrac{1}{2}\big(v^\top s_\theta(x)\big)^2 + v^\top \nabla_x s_\theta(x)\, v\right],$$

with $v$ drawn independently of $x$ from a distribution satisfying $\mathbb{E}[v v^\top] = I$ (e.g., Rademacher or standard Gaussian). Then $J_{\mathrm{SSM}}$ is an unbiased estimator of $J_{\mathrm{ISM}}$, and hence of $J_{\mathrm{ESM}}$ up to a $\theta$-independent constant. In particular, $\mathbb{E}_v\big[v^\top \nabla_x s_\theta(x)\, v\big] = \operatorname{tr}\big(\nabla_x s_\theta(x)\big)$ (Hutchinson's trace estimator) and $\mathbb{E}_v\big[(v^\top s_\theta(x))^2\big] = \lVert s_\theta(x)\rVert^2$.
Intuition
The Jacobian trace is $\sum_i \partial_{x_i} s_{\theta,i}(x)$, which costs $d$ backward passes. Hutchinson's trick replaces the deterministic trace with the expectation over a random direction $v$ of the bilinear form $v^\top \nabla_x s_\theta(x)\, v$, which costs one Jacobian-vector product (a single extra backward pass through $v^\top s_\theta(x)$). The payoff is that you no longer need to enumerate coordinates: high dimension stops being an obstacle.
Proof Sketch
For any matrix $A$ and random $v$ with $\mathbb{E}[v v^\top] = I$, $\mathbb{E}[v^\top A v] = \operatorname{tr}(A \,\mathbb{E}[v v^\top]) = \operatorname{tr}(A)$. Apply with $A = \nabla_x s_\theta(x)$ for the trace term. Similarly, $\mathbb{E}\big[(v^\top s_\theta)^2\big] = s_\theta^\top \mathbb{E}[v v^\top]\, s_\theta = \lVert s_\theta \rVert^2$, recovering the squared-norm term. Substituting both into $J_{\mathrm{ISM}}$ yields $J_{\mathrm{SSM}}$.
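Hutchinson's identity is easy to verify empirically. A minimal sketch with a random matrix standing in for the score Jacobian:

```python
import numpy as np

# Hutchinson's trick: for Rademacher v (entries ±1, E[v v^T] = I),
# E[v^T A v] = tr(A), so one matrix-vector product per draw replaces
# a full diagonal evaluation. A is a stand-in for a score Jacobian.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 50))
V = rng.choice([-1.0, 1.0], size=(100_000, 50))   # random slicing directions
estimates = np.sum((V @ A) * V, axis=1)            # v^T A v for each draw
```

In SSM the matrix-vector product $\nabla_x s_\theta(x)\, v$ is computed by automatic differentiation as a Jacobian-vector product, never by materializing the Jacobian.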
Why It Matters
Sliced score matching preserves the noise-free training signal of implicit score matching while eliminating the $O(d)$ scaling. It is the right estimator when noise injection is undesirable: discrete data after relaxation, structured data with manifold support, or settings where you care about the score of $p_{\mathrm{data}}$ directly rather than a noisy $p_\sigma$. Variants (sliced score matching with variance reduction, SSM-VR) further cut the variance via control variates.
Failure Mode
The single-direction estimator has high variance per sample. Practitioners often average over several random projections per data point, which gives back part of the savings relative to full trace evaluation. For very high-dimensional data (megapixel images), denoising score matching still wins on cost per fixed-quality gradient because its variance comes from data sampling rather than from projection sampling, and a noise-conditional model is what downstream samplers actually need.
Denoising Score Matching
Vincent's Denoising Score Matching Identity
Statement
Define the denoising score matching loss

$$J_{\mathrm{DSM}}(\theta) = \mathbb{E}_{x \sim p_{\mathrm{data}}}\,\mathbb{E}_{\tilde{x} \sim q_\sigma(\cdot \mid x)}\!\left[\tfrac{1}{2}\left\lVert s_\theta(\tilde{x}) - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x)\right\rVert^2\right].$$

Then $J_{\mathrm{DSM}}(\theta) = J_{\mathrm{ESM}_\sigma}(\theta) + C$, where $J_{\mathrm{ESM}_\sigma}$ is the explicit score matching loss against the noisy marginal $p_\sigma(\tilde{x}) = \int q_\sigma(\tilde{x} \mid x)\, p_{\mathrm{data}}(x)\, dx$ and $C$ is independent of $\theta$. Minimizing $J_{\mathrm{DSM}}$ trains $s_\theta$ to match the score of the noisy data distribution $p_\sigma$.
Intuition
For Gaussian noise $\tilde{x} = x + \sigma\varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$, the conditional score is $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) = -(\tilde{x} - x)/\sigma^2 = -\varepsilon/\sigma$, which is just the negative noise direction scaled by $1/\sigma$. The DSM loss reduces to $\mathbb{E}\big[\lVert \sigma\, s_\theta(\tilde{x}) + \varepsilon \rVert^2\big]$ (up to constants), which is plain least-squares regression of the model output onto the noise that was added. The Jacobian trace disappears entirely; you only need a forward pass and a regression target.
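A minimal numerical sketch of DSM as ε-regression, on 1-D standard-normal data so that the score of the noisy marginal is known in closed form; the analytic score functions below stand in for a trained network, and all constants are illustrative.

```python
import numpy as np

# DSM as noise regression: the target for s(x_noisy) is
# -(x_noisy - x_clean)/sigma^2 = -eps/sigma. With N(0, 1) data the noisy
# marginal is N(0, 1 + sigma^2), so its score beats the clean-data score
# under the DSM loss — the loss selects the score of p_sigma, not p_data.
rng = np.random.default_rng(0)
sigma = 0.5
x0 = rng.normal(size=100_000)            # clean data
eps = rng.normal(size=100_000)
xt = x0 + sigma * eps                    # noisy samples

def dsm_loss(score_fn):
    target = -eps / sigma                # conditional score of q(xt | x0)
    return np.mean((score_fn(xt) - target) ** 2)

noisy_marginal_score = lambda x: -x / (1.0 + sigma**2)   # score of p_sigma
clean_score = lambda x: -x                               # score of p_data
```

That the noisy-marginal score wins here is exactly the "different estimand" point from the taxonomy table: DSM's population minimizer is the score of $p_\sigma$.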
Proof Sketch
Expand the explicit score matching loss against $p_\sigma$ and substitute $p_\sigma(\tilde{x}) = \int q_\sigma(\tilde{x} \mid x)\, p_{\mathrm{data}}(x)\, dx$. The cross term becomes $\mathbb{E}_{p_\sigma}\big[s_\theta(\tilde{x})^\top \nabla_{\tilde{x}} \log p_\sigma(\tilde{x})\big] = \mathbb{E}_{x, \tilde{x}}\big[s_\theta(\tilde{x})^\top \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x)\big]$ via differentiating under the integral sign followed by Bayes' rule. Algebraic completion of the square then converts the expanded explicit loss into the DSM loss plus a $\theta$-independent constant.
Why It Matters
Denoising score matching scales to image, audio, and video models because its loss is a single regression target per sample: no Jacobians, no divergences, no second-order quantities. This is why diffusion training is architecturally identical to supervised regression: input is a noisy image, target is the noise vector, loss is MSE. Every pixel-space and latent-space diffusion model in production trains exactly this objective across many noise scales. Without Vincent's identity, score-based generative modeling at scale would not exist.
Failure Mode
DSM trains the score of the noisy density $p_\sigma$, not the original $p_{\mathrm{data}}$. For small $\sigma$, the noisy score approximates the data score, but only away from the data support, since the noisy density smooths out the manifold structure of the data. Single-noise-level DSM therefore cannot generate clean samples; you need a schedule of noise levels (NCSN) or a continuous-time SDE noising process (Score-SDE) and a score model conditioned on $\sigma$ or $t$. The single-scale version is provably a bad sampler in high dimensions because Langevin transitions across disconnected high-density modes are exponentially slow when $\sigma$ is small.
Tweedie's Formula and the Posterior-Mean Denoiser
The score matching framework gives you a vector field, but it does not immediately tell you what that field means for a downstream task like denoising or sampling. Tweedie's formula (Robbins 1956; Efron 2011 for the modern statement) supplies the operational link.
Tweedie's Formula
Statement
Under the Gaussian-noising model above ($\tilde{x} = x + \sigma\varepsilon$, $\varepsilon \sim \mathcal{N}(0, I)$),

$$\mathbb{E}[x \mid \tilde{x}] = \tilde{x} + \sigma^2\, \nabla_{\tilde{x}} \log p_\sigma(\tilde{x}).$$

The posterior mean of the clean signal given the noisy observation equals the noisy observation plus $\sigma^2$ times the score of the noisy density.
Intuition
The score of the noisy density points from low-probability noisy regions toward high-probability noisy regions. Heuristically, the score points "toward where the clean data probably lives" given what was observed. The $\sigma^2$ scaling is exactly the correction needed so that the shift recovers the Bayes-optimal denoiser, not just a direction.
Proof Sketch
Differentiate $p_\sigma(\tilde{x}) = \int q_\sigma(\tilde{x} \mid x)\, p_{\mathrm{data}}(x)\, dx$ in $\tilde{x}$: $\nabla_{\tilde{x}} p_\sigma(\tilde{x}) = \int \nabla_{\tilde{x}} q_\sigma(\tilde{x} \mid x)\, p_{\mathrm{data}}(x)\, dx$. For Gaussian $q_\sigma$, $\nabla_{\tilde{x}} q_\sigma(\tilde{x} \mid x) = \frac{x - \tilde{x}}{\sigma^2}\, q_\sigma(\tilde{x} \mid x)$. Substitute and divide by $p_\sigma(\tilde{x})$: $\nabla_{\tilde{x}} \log p_\sigma(\tilde{x}) = \frac{1}{\sigma^2}\big(\mathbb{E}[x \mid \tilde{x}] - \tilde{x}\big)$. Rearranging gives the formula.
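In the fully Gaussian case both sides of Tweedie's formula have closed forms, so the identity can be checked directly. A sketch with assumed (illustrative) prior and noise parameters:

```python
import numpy as np

# Tweedie check in the conjugate-Gaussian case: prior x0 ~ N(m, t^2),
# observation y = x0 + sigma*eps. The marginal is N(m, t^2 + sigma^2),
# so its score is available in closed form; so is the posterior mean.
m, t, sigma = 1.0, 2.0, 0.5
y = 3.7                                              # arbitrary observation

marginal_var = t**2 + sigma**2
score = -(y - m) / marginal_var                      # d/dy log p_sigma(y)
tweedie_mean = y + sigma**2 * score                  # Tweedie's formula

# Standard conjugate-Gaussian posterior mean, for comparison.
conjugate_mean = (t**2 * y + sigma**2 * m) / (t**2 + sigma**2)
```

The two expressions agree algebraically: $y(t^2 + \sigma^2) - \sigma^2(y - m) = t^2 y + \sigma^2 m$.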
Why It Matters
Tweedie identifies the trained score network with a posterior-mean denoiser: $\hat{x}(\tilde{x}) = \tilde{x} + \sigma^2 s_\theta(\tilde{x}) \approx \mathbb{E}[x \mid \tilde{x}]$. This is why "diffusion" and "denoising" are interchangeable framings; they both train the same network, parameterized differently. Karras et al. (2022) make this explicit in EDM: the network outputs a denoised estimate $D_\theta(\tilde{x}; \sigma)$, and the score is recovered as $s_\theta(\tilde{x}) = \big(D_\theta(\tilde{x}; \sigma) - \tilde{x}\big)/\sigma^2$. The same identity also underpins diffusion posterior sampling methods (DPS, Chung et al. 2023) that reuse a pre-trained score network for inverse problems via Tweedie's mean as a plug-in posterior estimate.
Failure Mode
Tweedie's mean is a first-moment statement; it does not characterize the posterior shape. For multimodal posteriors (e.g., a noisy image consistent with several plausible clean images), the mean is the average of the modes and is not itself a sample from the posterior. Diffusion samplers do not use Tweedie as a one-shot denoiser for this reason; they walk the posterior across noise scales and the posterior mean is informative only locally. Also, Tweedie holds exactly for Gaussian noise; for other noise families, the generalized Tweedie correction involves higher-order derivatives and is rarely as clean.
The DDPM ε-prediction parameterization (Ho-Jain-Abbeel 2020) is one specific choice in the family Tweedie's formula opens up. Karras et al. (2022) catalog the practical alternatives (score prediction $s_\theta$, $\varepsilon$-prediction, denoiser prediction $D_\theta$, and $v$-prediction from Salimans-Ho 2022) and show they differ only in how loss-weighting interacts with the noise schedule. Across the family, Tweedie's identity is the algebraic glue.
One Target, Several Parameterizations
Under Gaussian noising, the same learned object can be written in several different heads. This is why diffusion papers can talk past one another while still training essentially the same model family.
| Network output | Relation to the noisy score | What the loss looks like |
|---|---|---|
| score head $s_\theta(\tilde{x})$ | directly approximates $\nabla_{\tilde{x}} \log p_\sigma(\tilde{x})$ | explicit score regression |
| noise head $\varepsilon_\theta(\tilde{x})$ | $s_\theta = -\varepsilon_\theta / \sigma$ | DDPM-style MSE on added noise |
| denoiser head $D_\theta(\tilde{x})$ | $s_\theta = (D_\theta - \tilde{x})/\sigma^2$ | posterior-mean denoiser via Tweedie |
The object that changes least across papers is the noisy score field. What changes is the algebraic parameterization, the noise schedule, and the weighting of the per-scale loss. The popular $v$-prediction parameterization is another schedule-dependent linear rewrite of the same family, mainly used to improve optimization and numerical stability in VP-style diffusion models.
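The equivalences among the heads follow directly from the Gaussian corruption model $\tilde{x} = x_0 + \sigma\varepsilon$; a sketch (variable names are illustrative):

```python
import numpy as np

# One batch of noisy samples x = x0 + sigma*eps; the three training
# targets are linear rewrites of one another:
#   score target    s   = -eps / sigma
#   noise target    eps = -sigma * s
#   denoiser target x0  = x + sigma^2 * s      (Tweedie)
rng = np.random.default_rng(0)
sigma = 0.3
x0 = rng.normal(size=1000)
eps = rng.normal(size=1000)
x = x0 + sigma * eps

score_target = -eps / sigma
eps_from_score = -sigma * score_target       # recover the DDPM target
x0_from_score = x + sigma**2 * score_target  # recover the denoiser target
```

Because the rewrites are exact at the level of targets, switching heads changes only the implicit per-scale loss weighting, not the object being learned.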
Consistency and Statistical Properties
Asymptotic Consistency of the Score Matching Estimator
Statement
Let $\hat{\theta}_n$ be the sample minimizer of the implicit score matching objective on $n$ iid samples from $p_{\theta^*}$ (a well-specified model family). Under the assumptions above, $\hat{\theta}_n$ is consistent: $\hat{\theta}_n \to \theta^*$ in probability as $n \to \infty$. Furthermore, $\hat{\theta}_n$ is asymptotically normal with a covariance matrix that is in general strictly larger (in PSD order) than the maximum-likelihood Cramér-Rao bound.
Intuition
The population loss $J_{\mathrm{ISM}}$ is uniquely minimized at $\theta^*$ by the identification assumption: only $\theta = \theta^*$ achieves its minimum. Standard M-estimator theory transfers consistency from the population minimizer to the sample minimizer. The efficiency loss relative to MLE is the price you pay for sidestepping the partition function: you replace the score function of the likelihood with a different (but still unbiased) estimating equation.
Proof Sketch
Hyvärinen (2005, Theorem 2) proves consistency under the regularity conditions above using standard M-estimator arguments. For asymptotic normality and the efficiency comparison to MLE, Hyvärinen's 2005 analysis and follow-up work on structured exponential-family settings make the same point: score matching is a valid M-estimator, but not an efficient replacement for maximum likelihood. The result generalizes to denoising and sliced variants under matching regularity, with the efficiency gap depending on noise scale and projection variance respectively.
Why It Matters
Consistency is what justifies score matching as a real estimator, not just a heuristic. In the well-specified setting it converges to the truth with infinite data, and the rate is parametric ($O(n^{-1/2})$). The efficiency loss relative to MLE matters in low-data regimes but is irrelevant for diffusion models trained on tens of millions of images. In the misspecified setting (model family does not contain the truth), score matching converges to the $\theta$ that minimizes the score-matching divergence rather than KL, and the two minima can be substantially different. This is one reason score-matched models sometimes look qualitatively different from MLE-fit models even on toy problems.
Failure Mode
The consistency result requires the boundary terms in the integration-by-parts derivation to vanish, which fails for densities with bounded support or sharp cutoffs. It also requires the model to be twice differentiable in $x$, which is a non-issue for neural networks at typical points but a real concern at activation kinks. Identifiability can fail silently for overparameterized models; in that case score matching converges to a set, not a point, and the asymptotic-normality statement does not apply.
NCSN and the Multi-Noise Trick
A single noise scale leaves an irreducible bias-variance tradeoff. Small $\sigma$ keeps the noisy score close to the data score but leaves the disconnected-modes problem; large $\sigma$ smooths across modes but trains a score that has been pulled far from the data manifold. Song and Ermon (2019) resolved this with the Noise-Conditional Score Network (NCSN): train a single network $s_\theta(x, \sigma)$ on a geometric schedule of noise scales $\sigma_1 > \sigma_2 > \cdots > \sigma_L$, weighting each scale's loss by $\sigma_i^2$ to balance the contribution. At sample time, run annealed Langevin dynamics: start from pure noise at scale $\sigma_1$, run several Langevin steps at each scale, then anneal to the next.
The Score-SDE framework (Song et al. 2021) takes the continuous-time limit: the discrete schedule becomes a forward SDE $dx = f(x, t)\,dt + g(t)\,dW_t$, the per-scale objectives become a single time-integrated DSM loss

$$J(\theta) = \mathbb{E}_{t}\,\lambda(t)\,\mathbb{E}_{x_0 \sim p_{\mathrm{data}}}\,\mathbb{E}_{x_t \sim q_t(\cdot \mid x_0)}\!\left[\big\lVert s_\theta(x_t, t) - \nabla_{x_t} \log q_t(x_t \mid x_0)\big\rVert^2\right],$$
and Anderson's time-reversal theorem supplies the reverse SDE $dx = \big[f(x, t) - g(t)^2\, s_\theta(x, t)\big]\,dt + g(t)\,d\bar{W}_t$ that turns the trained score into a sampler. DDPM falls out as the variance-preserving (VP) SDE; NCSN falls out as the variance-exploding (VE) SDE; latent diffusion (Stable Diffusion) is the same loss applied to the output of a pretrained VAE encoder.
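A minimal annealed-Langevin sketch in the NCSN style, on 1-D $\mathcal{N}(0, 1)$ data, with the analytic noisy score $s(x, \sigma) = -x/(1+\sigma^2)$ standing in for a trained network; the geometric schedule and the $\sigma_i^2/\sigma_L^2$ step-size rule follow the recipe above, while the specific constants are illustrative choices.

```python
import numpy as np

# Annealed Langevin dynamics on 1-D N(0, 1) data. The analytic score of
# the noisy marginal N(0, 1 + sigma^2) replaces a trained s_theta(x, sigma).
rng = np.random.default_rng(0)
sigmas = np.geomspace(10.0, 0.01, 20)        # geometric noise schedule
eps0, steps = 1e-5, 100                      # base step size, steps per scale

x = sigmas[0] * rng.normal(size=5000)        # initialize at the widest scale
for sigma in sigmas:
    alpha = eps0 * (sigma / sigmas[-1])**2   # step size ~ sigma_i^2 / sigma_L^2
    for _ in range(steps):
        score = -x / (1.0 + sigma**2)        # would be s_theta(x, sigma)
        x = x + alpha * score + np.sqrt(2 * alpha) * rng.normal(size=x.shape)
# x should now be approximately distributed as the data, N(0, 1).
```

Scaling the step size with $\sigma_i^2$ keeps the signal-to-noise ratio of each Langevin step roughly constant across scales, which is the point of the geometric schedule.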
Worked Example: Gaussian Score Matching
Take $p_{\mathrm{data}} = \mathcal{N}(\mu, \Sigma)$ on $\mathbb{R}^d$ and a linear score model $s_\theta(x) = Wx + b$ for $\theta = (W, b)$, with $W$ symmetric. The true score is $\nabla_x \log p_{\mathrm{data}}(x) = -\Sigma^{-1}(x - \mu)$, so the optimum is $W^* = -\Sigma^{-1}$, $b^* = \Sigma^{-1}\mu$.
Compute the implicit score matching loss: $J_{\mathrm{ISM}}(W, b) = \mathbb{E}\big[\tfrac{1}{2}\lVert Wx + b\rVert^2 + \operatorname{tr}(W)\big]$. Take expectation under $\mathcal{N}(\mu, \Sigma)$: $J_{\mathrm{ISM}} = \tfrac{1}{2}\operatorname{tr}(W \Sigma W^\top) + \tfrac{1}{2}\lVert W\mu + b\rVert^2 + \operatorname{tr}(W)$.
Setting the gradient in $b$ to zero gives $b = -W\mu$. Setting the gradient in $W$ to zero (with $b = -W\mu$ substituted) gives $W\Sigma + I = 0$, i.e., $W = -\Sigma^{-1}$ and hence $b = \Sigma^{-1}\mu$. The implicit loss recovers the exact maximum-likelihood Gaussian fit, with no direct access to the density. This is the cleanest demonstration of the Hyvärinen identity in closed form, and it confirms the consistency result constructively for the Gaussian family.
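The closed-form minimizer can be confirmed empirically: on samples, the ISM minimizer for the linear model is $W = -\hat{\Sigma}^{-1}$, $b = \hat{\Sigma}^{-1}\hat{\mu}$ in terms of sample moments, which approaches the true Gaussian parameters. A sketch with an assumed (illustrative) ground-truth Gaussian:

```python
import numpy as np

# Empirical check of the worked example: the sample-ISM minimizer for the
# linear score model is W = -inv(sample covariance), b = -W @ sample mean,
# i.e. the score of the Gaussian MLE fit. Ground truth below is arbitrary.
rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
L = np.array([[1.0, 0.0], [0.5, 0.8]])          # Sigma = L @ L.T
X = rng.normal(size=(200_000, 2)) @ L.T + mu    # samples from N(mu, Sigma)

mean = X.mean(axis=0)
cov = np.cov(X, rowvar=False)
W_hat = -np.linalg.inv(cov)                     # closed-form ISM minimizer
b_hat = -W_hat @ mean
```

With more samples, $W_\mathrm{hat}$ and $b_\mathrm{hat}$ converge to $-\Sigma^{-1}$ and $\Sigma^{-1}\mu$ at the parametric rate, illustrating the consistency result.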
Common Confusions
This score is a gradient in x, not in θ
In classical statistics, the word score often means the likelihood score $\nabla_\theta \log p_\theta(x)$, a derivative with respect to model parameters. In score matching, the score is $\nabla_x \log p(x)$, a vector field over data space. These are different objects with different dimensions, different roles, and different theorems. Score matching trains the data-space score field; it is not solving the likelihood score equation in parameter space.
Score matching is not minimizing KL to the data distribution
Score matching minimizes the squared distance between score fields, not the KL divergence between distributions. The two objectives have different gradients and different optima in general. They agree at the global minimum (both achieve zero only when $p_\theta = p_{\mathrm{data}}$), but the geometry of the optimization landscape is different. In particular, score matching is consistent under model well-specification, but the bias-variance tradeoff in finite samples differs from MLE, and on misspecified models the two converge to different projections of $p_{\mathrm{data}}$ onto the model family.
The Jacobian trace is not the same as the model output's norm
Beginners sometimes confuse $\operatorname{tr}(\nabla_x s_\theta(x))$ with $\lVert s_\theta(x)\rVert^2$ or with the Jacobian's Frobenius norm. The trace is the divergence of the score field, $\sum_i \partial_{x_i} s_{\theta,i}(x)$, which can be negative, zero, or positive independently of the score's magnitude. The implicit score matching loss has both terms, and the trace term is what makes the loss work; without it, the optimum would be $s_\theta \equiv 0$.
Denoising score matching does not train the data score directly
DSM trains $\nabla_{\tilde{x}} \log p_\sigma(\tilde{x})$, the score of the noisy data density $p_\sigma$. As $\sigma \to 0$, $p_\sigma \to p_{\mathrm{data}}$ and the DSM target approaches the data score, but for any fixed $\sigma > 0$ they are different. To learn the data score itself, you need a schedule of $\sigma$ values approaching zero, or a continuous-time SDE formulation where $\sigma(t)$ is part of the model.
Score matching does not require samples from the model
Unlike contrastive divergence and other EBM training methods, score matching needs no MCMC sampling from $p_\theta$ during training. The only data it consumes are samples from $p_{\mathrm{data}}$. This is what makes it practically attractive for high-dimensional EBMs where MCMC mixing is the bottleneck. The price is the Jacobian trace (implicit), the projection variance (sliced), or the smoothing bias (denoising), but never an expectation over the model.
Tweedie's formula gives a posterior mean, not a sample
$\mathbb{E}[x \mid \tilde{x}]$ is a mean, not a draw. For a noisy image consistent with several plausible clean images, the Tweedie estimate is an average of those modes, often a blurry interpolation that is not itself a plausible clean image. Diffusion samplers walk the posterior across noise scales precisely because the posterior is multimodal and a one-shot mean is not a generative sample.
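A two-point prior makes the failure concrete: with clean data at $\pm 2$ and Gaussian noise, the posterior mean at an ambiguous observation $y = 0$ is the midpoint $0$, which is not a plausible clean point. A sketch with illustrative parameters:

```python
import numpy as np

# Two-point prior at ±2 observed through Gaussian noise: the posterior
# mean (equivalently, Tweedie's estimate from the mixture marginal score)
# averages the modes instead of picking one.
sigma = 1.0
modes = np.array([-2.0, 2.0])

def tweedie_mean(y):
    # Posterior weights of the two modes under y = x0 + sigma * eps.
    logw = -(y - modes) ** 2 / (2 * sigma**2)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    return float(w @ modes)                  # E[x0 | y]
```

At $y = 0$ the weights are equal and the estimate is $0$, between the modes; only for unambiguous observations (e.g. $y$ near $2$) does the mean collapse onto a single plausible clean point.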
Summary
- Score matching learns a data-space score field, not the parameter-space likelihood score.
- Hyvärinen's theorem removes the unknown data score by trading it for the divergence of the model score field.
- Sliced score matching keeps the same target but estimates the divergence stochastically with random projections.
- Denoising score matching changes the target to the score of a noisy marginal, turning training into a regression problem.
- Tweedie's formula identifies that noisy score with a posterior-mean denoiser under Gaussian corruption.
- Modern diffusion models are multi-scale denoising score matching plus a reverse-time sampler.
Exercises
Problem
Derive the denoising score matching target for Gaussian noise: $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) = -(\tilde{x} - x)/\sigma^2$ with $\tilde{x} = x + \sigma\varepsilon$ and $\varepsilon \sim \mathcal{N}(0, I)$. Show that the DSM loss reduces, up to multiplicative and additive constants in $\theta$, to $\mathbb{E}\big[\lVert \sigma\, s_\theta(x + \sigma\varepsilon) + \varepsilon \rVert^2\big]$.
Problem
Prove Tweedie's formula: for $\tilde{x} = x + \sigma\varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$ independent of $x$ and $x \sim p_{\mathrm{data}}$,

$$\mathbb{E}[x \mid \tilde{x}] = \tilde{x} + \sigma^2\, \nabla_{\tilde{x}} \log p_\sigma(\tilde{x}).$$
Problem
Show that the sliced score matching estimator with $\mathbb{E}[v v^\top] = I$ is unbiased for $J_{\mathrm{ISM}}$. Compute the variance of its single-direction trace estimator and explain why averaging over $m$ directions reduces variance by a factor of $m$.
Problem
Consider score matching applied to an energy-based model with $p_\theta(x) = \exp(-E_\theta(x))/Z_\theta$. Show that the implicit score matching loss depends on $E_\theta$ only through its first and second derivatives, and explicitly never through $Z_\theta$.
Problem
Single-noise-level DSM is provably a poor sampler in high dimensions because of disconnected-modes mixing problems for Langevin dynamics. NCSN solves this with a geometric noise schedule and annealed Langevin. Sketch why a linear (rather than geometric) schedule of $\sigma$ values is suboptimal for sampling, and give an intuition based on how the data manifold's log-density curvature scales with noise.
References
Canonical:
- Hyvärinen, Estimation of non-normalized statistical models by score matching (Journal of Machine Learning Research 6, 2005). The original paper. Introduces implicit score matching, the integration-by-parts identity, and the consistency theorem.
- Vincent, A connection between score matching and denoising autoencoders (Neural Computation 23, 2011). Establishes the equivalence between DSM and explicit score matching against the noisy marginal $p_\sigma$.
- Song, Garg, Shi, and Ermon, Sliced score matching: a scalable approach to density and score estimation (UAI 2020), arXiv:1905.07088. Hutchinson's-trick estimator that replaces the Jacobian trace with one Jacobian-vector product.
- Robbins, An empirical Bayes approach to statistics (Berkeley Symposium 1956), Section 4. The original derivation of Tweedie's formula in the empirical Bayes setting; the score-as-denoiser identity predates score matching by 50 years.
- Efron, Tweedie's formula and selection bias (Journal of the American Statistical Association 106, 2011). Modern restatement, generalizations to non-Gaussian noise, and extensive discussion of statistical applications.
Current:
- Song and Ermon, Generative modeling by estimating gradients of the data distribution (NeurIPS 2019), arXiv:1907.05600. The NCSN paper. Multi-scale score matching with annealed Langevin sampling.
- Song, Sohl-Dickstein, Kingma, Kumar, Ermon, and Poole, Score-based generative modeling through stochastic differential equations (ICLR 2021), arXiv:2011.13456. Continuous-time SDE formulation that unifies NCSN, DDPM, and probability-flow ODEs under one score-matching framework.
- Ho, Jain, and Abbeel, Denoising diffusion probabilistic models (NeurIPS 2020), arXiv:2006.11239. Reparameterizes DSM as $\varepsilon$-prediction; the loss form used in virtually all modern image diffusion models.
- Karras, Aittala, Aila, and Laine, Elucidating the design space of diffusion-based generative models (NeurIPS 2022), arXiv:2206.00364. Treats the score network as a posterior-mean denoiser via Tweedie's formula and analyzes loss-weighting across noise scales.
- Salimans and Ho, Progressive distillation for fast sampling of diffusion models (ICLR 2022), arXiv:2202.00512. Introduces the $v$-prediction parameterization and shows it is more numerically stable than $\varepsilon$-prediction at high noise levels.
Extensions:
- Hyvärinen, Some extensions of score matching (Computational Statistics and Data Analysis 51, 2007). Extends the framework to non-negative data and bounded supports via reflection / transformation tricks.
- Lyu, Interpretation and generalization of score matching (UAI 2009). Connects score matching to a family of generalized scoring rules and discusses regularization.
- Chung, Kim, McCann, Klasky, and Ye, Diffusion posterior sampling for general noisy inverse problems (ICLR 2023), arXiv:2209.14687. Uses Tweedie's mean as a plug-in posterior estimator inside a guided diffusion sampler.
- Song and Kingma, How to train your energy-based models (arXiv:2101.03288, 2021). Reviews score matching, contrastive divergence, and noise-contrastive estimation side by side as EBM training methods.
Next Topics
- Diffusion Models: the generative modeling framework built on multi-scale denoising score matching with Anderson's reverse-time SDE.
- Time Reversal of SDEs: Anderson's theorem that turns the learned score into a generative sampler.
- Langevin Dynamics: the SDE that the learned score field drives at sampling time.
- Energy-Based Models: the unnormalized-density framework that motivates score matching in the first place.
- Fokker-Planck Equation: the PDE that governs the noisy marginals along the noising schedule.
- Stochastic Differential Equations: the continuous-time framework for the forward noising process.
Last reviewed: April 19, 2026