
ML Methods

Score Matching

Hyvärinen 2005: train a model to estimate the score (gradient of log density) without computing the normalization constant. Integration by parts converts the intractable density-matching loss into a tractable gradient-based objective. Sliced score matching makes the Jacobian-trace term scale, denoising score matching reparameterizes the loss as ε-regression, and Tweedie's formula identifies the score with a posterior-mean denoiser. Together these are the training half of every modern diffusion model and energy-based model.

Advanced · Tier 1 · Stable · Core spine · ~80 min

Why This Matters

Most expressive probability models cannot be trained by maximum likelihood because the normalization constant $Z(\theta) = \int e^{-E_\theta(x)}\,dx$ is intractable. Energy-based models, score-based diffusion, and most generative samplers all face this obstacle. Score matching (Hyvärinen 2005) is the workaround: instead of fitting the density $p_\theta(x)$ directly, fit the score $\nabla_x \log p(x)$. The score is invariant to the partition function: $\nabla_x \log Z(\theta) = 0$, so matching scores never requires knowing $Z$.

Hyvärinen turned this idea into a tractable loss via integration by parts: the squared-distance objective between model and data scores reduces, up to a $\theta$-independent constant, to a quantity computable from samples plus a single Jacobian-trace term. Vincent (2011) replaced the Jacobian trace with a denoising target, giving the denoising score matching loss that scales to high dimensions. Song, Garg, Shi, and Ermon (2020) gave the sliced variant that estimates the trace stochastically, recovering tractability without any noise injection. Robbins (1956) and Efron (2011) supply the third piece, Tweedie's formula, which identifies the score of a Gaussian-noised density with a posterior-mean denoiser, fixing the operational meaning of what a trained score network actually represents.

Song and Ermon (2019) added multiple noise scales and turned the loss into the noise-conditional score network (NCSN); Song et al. (2021) wrapped the whole setup inside a forward noising SDE and used Anderson's time-reversal theorem to turn the learned score into a generative sampler. The thread from Hyvärinen 2005 to Stable Diffusion is straight: every score-based generative model is doing denoising score matching at multiple noise levels, then plugging the learned score into a reverse SDE or probability-flow ODE. If you understand score matching, you understand the training half of diffusion models.

Mental Model

The score $s(x) := \nabla_x \log p(x)$ is a vector field on $\mathbb{R}^d$ that points from low-density regions toward high-density regions. It is the direction of steepest log-density increase and the gradient field that Langevin dynamics follows to sample from $p$: $dX_t = s(X_t)\,dt + \sqrt{2}\,dW_t$ has $p$ as its stationary distribution.

Score matching trains a parametric vector field $s_\theta(x) \in \mathbb{R}^d$ to approximate this field. The natural loss is the squared $L^2$ error between data and model scores, weighted by the data density:

$$J_{\text{ESM}}(\theta) = \tfrac{1}{2}\, \mathbb{E}_{x \sim p_{\text{data}}} \big[\, \|s_\theta(x) - \nabla_x \log p_{\text{data}}(x)\|^2 \,\big].$$

This is the explicit score matching loss. It is intractable because we do not know $\nabla_x \log p_{\text{data}}$. The story of score matching is the story of progressively more practical surrogates. The first three (ESM, ISM, SSM) all target the same population minimizer: the score of the clean data density $p_{\text{data}}$. Denoising score matching (DSM) is different: it targets the score of the noisy density $q_\sigma = p_{\text{data}} \ast \mathcal{N}(0, \sigma^2 I)$, and only recovers the clean score in the limit $\sigma \to 0$. So DSM is not an unbiased estimator of the ESM loss.

| Estimator | Cost per sample | Targets the score of | Where to use |
| --- | --- | --- | --- |
| Explicit (ESM) | requires $\nabla \log p_{\text{data}}$ | $p_{\text{data}}$ | infeasible (oracle only) |
| Implicit (Hyvärinen) | full Jacobian trace, $O(d)$ backward passes | $p_{\text{data}}$ | low-dim models, EBMs, energy parameterization |
| Sliced (Song et al.) | one Jacobian-vector product, $O(1)$ in $d$ | $p_{\text{data}}$ | medium-dim, no noise allowed |
| Denoising (Vincent) | one forward pass, MSE regression | $q_\sigma = p_{\text{data}} \ast \mathcal{N}(0, \sigma^2 I)$ | high-dim image/audio/video; the standard |

The first three rows are equivalent at the population level (modulo constants); they estimate the same target score. DSM trades that equivalence for a tractable conditional regression and learns a smoothed score; this is exactly what diffusion sampling needs, but it is a different estimand from ESM, not a cheaper unbiased estimator of it.

Score matching keeps the vector field and swaps the supervision

The object is always the same: a field that points from low-density regions toward high-density ones. The methods differ only in how they estimate or re-express the target without knowing the partition function.

[Figure: two panels. Left panel: the shared target, the vector field pointing toward high-density regions. Right panel, three rows: explicit target (the unknown data score $\nabla \log p_{\text{data}}(x)$), implicit objective (replace the cross term with $\operatorname{div}\, s_\theta(x)$), denoising objective (regress the score of a Gaussian-corrupted density and reuse it at many noise scales). Modern diffusion training lives on the bottom row because it scales.]

Explicit score matching asks for the inaccessible object directly

This form needs ∇ log p_data(x) as supervision. It is conceptually clean but useless in practice because the data score is exactly the thing we do not know.

Same geometry, cheaper supervision

Each variant still tries to learn the vector field that points toward high-density regions. What changes is only the training signal used to estimate that field.

Why diffusion uses the denoising row

High-dimensional image models cannot afford an exact Jacobian trace. Denoising score matching turns the problem into ordinary regression against a noisy target, which scales to modern diffusion training.

Sampling connection

Once the score field is learned, Langevin dynamics, reverse SDEs, or probability-flow ODEs can use that field to move samples back toward the data manifold.

Use the diagram as a map of the three objectives. The left panel keeps the same target geometry in view, while the right panel switches only the training signal: explicit score matching asks for the unknown data score, implicit score matching swaps that for a divergence term via integration by parts, and denoising score matching regresses on a noisy conditional score.

Hyvärinen's Implicit Score Matching

Definition

Score Function

For a differentiable density $p$ on $\mathbb{R}^d$, the score (or Stein score) is the gradient of its log density, $s(x) = \nabla_x \log p(x)$. It is invariant to multiplicative constants: if $p(x) = \tilde p(x) / Z$ then $\nabla_x \log p(x) = \nabla_x \log \tilde p(x)$. This invariance is what makes score matching a viable training signal for unnormalized models.

Definition

Implicit Score Matching Loss

The implicit score matching loss for a score model $s_\theta : \mathbb{R}^d \to \mathbb{R}^d$ is

$$J_{\text{ISM}}(\theta) = \mathbb{E}_{x \sim p_{\text{data}}} \Big[\, \tfrac{1}{2}\, \|s_\theta(x)\|^2 + \mathrm{tr}\!\left(\nabla_x s_\theta(x)\right) \,\Big],$$

where $\nabla_x s_\theta(x) \in \mathbb{R}^{d \times d}$ is the Jacobian of $s_\theta$ at $x$ and $\mathrm{tr}(\nabla_x s_\theta) = \sum_{i=1}^d \partial_{x_i} (s_\theta)_i$ is its trace, the divergence of the score field. The loss is computable from samples alone. There is no reference to $\nabla_x \log p_{\text{data}}$, no normalization constant.
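To make the loss concrete, here is a minimal PyTorch sketch (the framework choice and the `score_model` callable are illustrative assumptions, not from the original): it computes the batch ISM loss with an exact Jacobian trace, one backward pass per coordinate, which is exactly the $O(d)$ cost discussed below.

```python
import torch

def ism_loss(score_model, x):
    """Implicit score matching: E[ 0.5*||s(x)||^2 + tr(grad_x s(x)) ].
    Exact trace costs one backward pass per data dimension."""
    x = x.detach().requires_grad_(True)
    s = score_model(x)                        # shape (batch, d)
    sq_norm = 0.5 * (s ** 2).sum(dim=1)       # 0.5 * ||s_theta(x)||^2
    trace = torch.zeros(x.shape[0], device=x.device)
    for i in range(x.shape[1]):               # d backward passes
        # per-sample d s_i / d x; assumes score_model maps each row
        # independently (no cross-batch coupling such as batch norm)
        grad_i = torch.autograd.grad(s[:, i].sum(), x, create_graph=True)[0]
        trace = trace + grad_i[:, i]          # pick out d s_i / d x_i
    return (sq_norm + trace).mean()
```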

Theorem

Hyvärinen's Implicit Score Matching Theorem

Statement

Under standard regularity conditions (differentiability of $p_{\text{data}}$ and $s_\theta$, finite expectations of both squared score norms, and decay of $p_{\text{data}}(x)\, s_\theta(x)$ to zero as $\|x\| \to \infty$),

$$\tfrac{1}{2}\, \mathbb{E}_{p_{\text{data}}}\big[\|s_\theta(x) - \nabla_x \log p_{\text{data}}(x)\|^2\big] = \mathbb{E}_{p_{\text{data}}}\Big[\tfrac{1}{2} \|s_\theta(x)\|^2 + \mathrm{tr}(\nabla_x s_\theta(x))\Big] + C(p_{\text{data}}),$$

where $C(p_{\text{data}}) = \tfrac{1}{2}\, \mathbb{E}_{p_{\text{data}}} [\|\nabla_x \log p_{\text{data}}\|^2]$ is half the Fisher information of the data distribution, independent of $\theta$. Minimizing $J_{\text{ISM}}$ is therefore equivalent to minimizing $J_{\text{ESM}}$.

Intuition

The cross term $\mathbb{E}_{p_{\text{data}}}[s_\theta(x) \cdot \nabla_x \log p_{\text{data}}(x)]$ in the explicit loss is what we cannot evaluate. But $s_\theta \cdot \nabla \log p = s_\theta \cdot (\nabla p / p)$, so $\int s_\theta \cdot (\nabla p / p)\, p\, dx = \int s_\theta \cdot \nabla p\, dx$. Integration by parts converts $\int s_\theta \cdot \nabla p\, dx$ into $-\int (\nabla \cdot s_\theta)\, p\, dx = -\mathbb{E}_{p_{\text{data}}}[\mathrm{tr}(\nabla s_\theta)]$. The intractable density gradient becomes a tractable divergence of the model itself.

Proof Sketch

Expand the squared-distance loss: $\|s_\theta - \nabla \log p\|^2 = \|s_\theta\|^2 - 2\, s_\theta \cdot \nabla \log p + \|\nabla \log p\|^2$. The third term integrates to $2 C(p_{\text{data}})$, independent of $\theta$. For the cross term, use $\nabla \log p = \nabla p / p$ to write $\mathbb{E}_p[s_\theta \cdot \nabla \log p] = \int s_\theta \cdot \nabla p\, dx$. Apply integration by parts component-wise: $\int (s_\theta)_i\, \partial_i p\, dx = [(s_\theta)_i\, p]_{-\infty}^{\infty} - \int p\, \partial_i (s_\theta)_i\, dx = -\int p\, \partial_i (s_\theta)_i\, dx$ (the boundary term vanishes by the decay assumption). Summing over $i$ gives $-\mathbb{E}_p[\mathrm{tr}(\nabla s_\theta)]$. Substituting back gives $J_{\text{ESM}} = \tfrac{1}{2}\mathbb{E}_p\|s_\theta\|^2 + \mathbb{E}_p[\mathrm{tr}(\nabla s_\theta)] + C$, which is $J_{\text{ISM}} + C$.

Why It Matters

This single integration-by-parts identity lets you train a model of the score without ever knowing the data density, without normalization constants, and without samples from the model. It is the foundation that makes energy-based models trainable: for a parametrization $s_\theta(x) = -\nabla_x E_\theta(x)$, $J_{\text{ISM}}$ depends on $E_\theta$ only through its gradients, never through $Z(\theta)$. Every later score-matching method (sliced, denoising, multi-noise) exists to estimate this loss more cheaply or to replace it with a tractable surrogate.

Failure Mode

The Jacobian trace $\mathrm{tr}(\nabla_x s_\theta)$ requires $d$ backward passes to compute exactly (one per coordinate of $s_\theta$). For $d \sim 10^4$ to $10^6$ (images, video, audio) this is prohibitive. The implicit loss is theoretically clean but computationally infeasible in high dimensions. The decay assumption also fails for distributions on bounded support (e.g., images in $[0,1]^d$), creating uncontrolled boundary terms; Hyvärinen (2007) extends the framework with reflection or transformation tricks for non-negative data, but the standard Gaussian-tail assumption is genuinely restrictive.

Sliced Score Matching

Theorem

Sliced Score Matching (Song-Garg-Shi-Ermon 2020)

Statement

Define the sliced score matching loss

$$J_{\text{SSM}}(\theta) = \mathbb{E}_{x \sim p_{\text{data}}}\, \mathbb{E}_{v \sim p_v} \Big[\,\tfrac{1}{2}\, (v^\top s_\theta(x))^2 + v^\top \nabla_x s_\theta(x)\, v \,\Big].$$

Then $J_{\text{SSM}}$ is an unbiased estimator of $J_{\text{ISM}}$ up to a $\theta$-independent constant. In particular, $\mathbb{E}_v[v^\top \nabla_x s_\theta(x)\, v] = \mathrm{tr}(\nabla_x s_\theta(x))$ (Hutchinson's trace estimator) and $\mathbb{E}_v[(v^\top s_\theta(x))^2] = \|s_\theta(x)\|^2$.

Intuition

The Jacobian trace is $\sum_i \partial_i (s_\theta)_i$, which costs $d$ backward passes. Hutchinson's trick replaces the deterministic trace with the expectation over a random direction $v$ of the bilinear form $v^\top \nabla s_\theta\, v$, which costs one Jacobian-vector product (a single extra backward pass through $v^\top s_\theta$). The payoff is that you no longer need to enumerate coordinates: high dimension stops being an obstacle, at the price of extra estimator variance.

Proof Sketch

For any matrix $A \in \mathbb{R}^{d \times d}$ and random $v$ with $\mathbb{E}[vv^\top] = I_d$, $\mathbb{E}[v^\top A v] = \mathbb{E}[\mathrm{tr}(A vv^\top)] = \mathrm{tr}(A\, \mathbb{E}[vv^\top]) = \mathrm{tr}(A)$. Apply this with $A = \nabla_x s_\theta(x)$ for the trace term. Similarly, $\mathbb{E}[(v^\top u)^2] = u^\top \mathbb{E}[vv^\top]\, u = \|u\|^2$ for $u = s_\theta(x)$, recovering the squared-norm term. Substituting into $J_{\text{ISM}}$ yields $J_{\text{SSM}}$.
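A sketch of the sliced estimator under the same assumptions as before (PyTorch, a per-sample `score_model`): a single Jacobian-vector product replaces the coordinate loop of the implicit loss.

```python
import torch

def ssm_loss(score_model, x, n_projections=1):
    """Sliced score matching: E_v[ 0.5*(v.s)^2 + v.(grad s).v ],
    with E[v v^T] = I. One extra backward pass per projection,
    regardless of dimension d."""
    x = x.detach().requires_grad_(True)
    s = score_model(x)                        # (batch, d)
    total = torch.zeros(x.shape[0], device=x.device)
    for _ in range(n_projections):
        v = torch.randn_like(x)               # Gaussian directions
        vs = (v * s).sum()                    # sum over batch of v.s
        # per-sample J^T v; v.(J^T v) = v.(J v) because it is a scalar
        jv = torch.autograd.grad(vs, x, create_graph=True)[0]
        total = total + 0.5 * (v * s).sum(dim=1) ** 2 + (jv * v).sum(dim=1)
    return (total / n_projections).mean()
```

The variance-reduced variant (SSM-VR) would replace `0.5 * (v * s).sum(dim=1) ** 2` with the exact `0.5 * (s ** 2).sum(dim=1)`, since the squared-norm term needs no projection.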

Why It Matters

Sliced score matching preserves the noise-free training signal of implicit score matching while eliminating the $O(d)$ scaling. It is the right estimator when noise injection is undesirable: discrete data after relaxation, structured data with manifold support, or settings where you care about the score of $p_{\text{data}}$ directly rather than a noisy $q_\sigma$. The variance-reduced variant (SSM-VR) further cuts the variance by computing the squared-norm term exactly instead of through projections.

Failure Mode

The single-direction estimator has high variance per sample. Practitioners often average over $K$ random projections per data point, recovering some of the cost of full trace evaluation. For very high-dimensional data (megapixel images), denoising score matching still wins on cost per fixed-quality gradient because its variance comes from data sampling rather than from projection sampling, and a noise-conditional model is what downstream samplers actually need.

Denoising Score Matching

Theorem

Vincent's Denoising Score Matching Identity

Statement

Define the denoising score matching loss

$$J_{\text{DSM}}(\theta) = \tfrac{1}{2}\, \mathbb{E}_{x \sim p_{\text{data}},\; \tilde{x} \sim q_\sigma(\cdot \mid x)} \big[\, \|s_\theta(\tilde{x}) - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x)\|^2 \,\big].$$

Then $J_{\text{DSM}}(\theta) = J_{\text{ESM}}^{q_\sigma}(\theta) + C(q_\sigma)$, where $J_{\text{ESM}}^{q_\sigma}$ is the explicit score matching loss against the noisy marginal $q_\sigma(\tilde{x})$ and $C$ is independent of $\theta$. Minimizing $J_{\text{DSM}}$ trains $s_\theta$ to match the score of the noisy data distribution $q_\sigma$.

Intuition

For Gaussian noise $\tilde{x} = x + \sigma \epsilon$, the conditional score is $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) = -(\tilde{x} - x) / \sigma^2 = -\epsilon / \sigma$, which is just the negative noise direction scaled by $1/\sigma$. The DSM loss reduces to $\mathbb{E}\,\|s_\theta(\tilde{x}) + \epsilon / \sigma\|^2$ (up to constants), which is plain least-squares regression of the model output onto the noise that was added. The Jacobian trace disappears entirely; you only need a forward pass and a regression target.
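In code the whole objective is a few lines; a sketch under the same assumed interfaces (PyTorch, a hypothetical `score_model`):

```python
import torch

def dsm_loss(score_model, x, sigma):
    """Denoising score matching at one noise scale: regress s(x_tilde)
    onto -eps/sigma, the conditional score grad log q_sigma(x_tilde|x)."""
    eps = torch.randn_like(x)                 # eps ~ N(0, I)
    x_tilde = x + sigma * eps                 # x_tilde = x + sigma*eps
    target = -eps / sigma                     # the conditional score
    return 0.5 * ((score_model(x_tilde) - target) ** 2).sum(dim=1).mean()
```

Note that no autograd passes through the input are needed; this is why the objective is architecturally indistinguishable from supervised regression.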

Proof Sketch

Expand the explicit score matching loss against $q_\sigma(\tilde{x})$ and substitute $q_\sigma(\tilde{x}) = \mathbb{E}_{x \sim p_{\text{data}}} [q_\sigma(\tilde{x} \mid x)]$. The cross term $\mathbb{E}_{q_\sigma}[s_\theta \cdot \nabla \log q_\sigma]$ becomes $\mathbb{E}_{x, \tilde{x}}[s_\theta(\tilde{x}) \cdot \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x)]$ via $\nabla_{\tilde{x}} q_\sigma(\tilde{x}) = \mathbb{E}_x[\nabla_{\tilde{x}} q_\sigma(\tilde{x} \mid x)]$ (differentiating under the integral sign) followed by Bayes' rule. Algebraic completion of the square then converts the expanded explicit loss into the DSM loss plus a $\theta$-independent constant.

Why It Matters

Denoising score matching scales to image, audio, and video models because its loss is a single regression target per sample: no Jacobians, no divergences, no second-order quantities. This is why diffusion training is architecturally identical to supervised regression: input is a noisy image, target is the noise vector, loss is MSE. Every pixel-space and latent-space diffusion model in production trains exactly this objective across many noise scales. Without Vincent's identity, score-based generative modeling at scale would not exist.

Failure Mode

DSM trains the score of the noisy density $q_\sigma$, not the original $p_{\text{data}}$. For small $\sigma$, the noisy score approximates the data score, but only away from the data support, since the noisy density smooths out the manifold structure of the data. Single-noise-level DSM therefore cannot generate clean samples; you need a schedule of noise levels (NCSN) or a continuous-time SDE noising process (Score-SDE) and a score model conditioned on $\sigma$. The single-scale version is provably a bad sampler in high dimensions because Langevin transitions across disconnected high-density modes are exponentially slow when $\sigma$ is small.

Tweedie's Formula and the Posterior-Mean Denoiser

The score matching framework gives you a vector field, but it does not immediately tell you what that field means for a downstream task like denoising or sampling. Tweedie's formula (Robbins 1956; Efron 2011 for the modern statement) supplies the operational link.

Theorem

Tweedie's Formula

Statement

Under the Gaussian-noising model above,

$$\mathbb{E}[\,x \mid \tilde{x}\,] = \tilde{x} + \sigma^2\, \nabla_{\tilde{x}} \log q_\sigma(\tilde{x}).$$

The posterior mean of the clean signal given the noisy observation equals the noisy observation plus $\sigma^2$ times the score of the noisy density.

Intuition

The score of the noisy density points from low-probability noisy regions toward high-probability noisy regions. Heuristically, the score points "toward where the clean data probably lives" given what was observed. The $\sigma^2$ scaling is exactly the correction needed so that the shift recovers the Bayes-optimal denoiser, not just a direction.

Proof Sketch

Differentiate $q_\sigma(\tilde{x}) = \int q_\sigma(\tilde{x} \mid x)\, p_{\text{data}}(x)\, dx$ in $\tilde{x}$: $\nabla_{\tilde{x}} q_\sigma(\tilde{x}) = \int \nabla_{\tilde{x}} q_\sigma(\tilde{x} \mid x)\, p_{\text{data}}(x)\, dx$. For Gaussian $q_\sigma(\tilde{x} \mid x)$, $\nabla_{\tilde{x}} q_\sigma(\tilde{x} \mid x) = -\frac{\tilde{x} - x}{\sigma^2}\, q_\sigma(\tilde{x} \mid x)$. Substitute and divide by $q_\sigma(\tilde{x})$: $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x}) = -\frac{1}{\sigma^2}\, \mathbb{E}[\tilde{x} - x \mid \tilde{x}] = -\frac{\tilde{x} - \mathbb{E}[x \mid \tilde{x}]}{\sigma^2}$. Rearranging gives the formula.

Why It Matters

Tweedie identifies the trained score network with a posterior-mean denoiser: $\hat{x}_\theta(\tilde{x}, \sigma) = \tilde{x} + \sigma^2\, s_\theta(\tilde{x}, \sigma)$. This is why "diffusion" and "denoising" are interchangeable framings; they both train the same network, parameterized differently. Karras et al. (2022) make this explicit in EDM: the network outputs a denoised estimate $D_\theta$, and the score is recovered as $s_\theta = (D_\theta - \tilde{x}) / \sigma^2$. The same identity also underpins diffusion posterior sampling methods (DPS, Chung et al. 2023) that reuse a pre-trained score network for inverse problems via Tweedie's mean as a plug-in posterior estimate.
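As a code sketch (same assumed PyTorch interfaces as above, with a noise-conditional `score_model(x, sigma)`), the identity is a one-liner:

```python
def tweedie_denoise(score_model, x_tilde, sigma):
    """Posterior-mean denoiser from a trained score network (Tweedie):
    E[x | x_tilde] = x_tilde + sigma^2 * s_theta(x_tilde, sigma)."""
    return x_tilde + sigma ** 2 * score_model(x_tilde, sigma)
```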

Failure Mode

Tweedie's mean is a first-moment statement; it does not characterize the posterior shape. For multimodal posteriors (e.g., a noisy image consistent with several plausible clean images), the mean is the average of the modes and is not itself a sample from the posterior. Diffusion samplers do not use Tweedie as a one-shot denoiser for this reason; they walk the posterior across noise scales and the posterior mean is informative only locally. Also, Tweedie holds exactly for Gaussian noise; for other noise families, the generalized Tweedie correction involves higher-order derivatives and is rarely as clean.

The DDPM $\epsilon$-prediction parameterization (Ho-Jain-Abbeel 2020) is one specific choice in the family Tweedie's formula opens up. Karras et al. (2022) catalog the practical alternatives (score prediction $s_\theta$, $\epsilon$-prediction, denoiser prediction $D_\theta$, and $v$-prediction from Salimans-Ho 2022) and show they differ only in how loss-weighting interacts with the noise schedule. Across the family, Tweedie's identity is the algebraic glue.

One Target, Several Parameterizations

Under Gaussian noising, the same learned object can be written in several different heads. This is why diffusion papers can talk past one another while still training essentially the same model family.

| Network output | Relation to the noisy score | What the loss looks like |
| --- | --- | --- |
| score head $s_\theta(\tilde{x}, \sigma)$ | directly approximates $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x})$ | explicit score regression |
| noise head $\epsilon_\theta(\tilde{x}, \sigma)$ | $s_\theta(\tilde{x}, \sigma) = -\epsilon_\theta(\tilde{x}, \sigma) / \sigma$ | DDPM-style MSE on added noise |
| denoiser head $D_\theta(\tilde{x}, \sigma)$ | $s_\theta(\tilde{x}, \sigma) = (D_\theta(\tilde{x}, \sigma) - \tilde{x}) / \sigma^2$ | posterior-mean denoiser via Tweedie |

The object that changes least across papers is the noisy score field. What changes is the algebraic parameterization, the noise schedule, and the weighting of the per-scale loss. The popular $v$-prediction parameterization is another schedule-dependent linear rewrite of the same family, mainly used to improve optimization and numerical stability in VP-style diffusion models.
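The table's relations are mechanical; a sketch of the conversions (function names are illustrative, not from any particular library):

```python
def score_from_eps(eps_pred, sigma):
    """Noise head -> score: s = -eps_hat / sigma."""
    return -eps_pred / sigma

def score_from_denoiser(d_pred, x_tilde, sigma):
    """Denoiser head -> score (Tweedie): s = (D - x_tilde) / sigma^2."""
    return (d_pred - x_tilde) / sigma ** 2

def denoiser_from_eps(eps_pred, x_tilde, sigma):
    """Noise head -> denoiser: D = x_tilde - sigma * eps_hat,
    consistent with both identities above."""
    return x_tilde - sigma * eps_pred
```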

Consistency and Statistical Properties

Theorem

Asymptotic Consistency of the Score Matching Estimator

Statement

Let $\hat{\theta}_n = \arg\min_\theta \hat{J}_{\text{ISM}}(\theta)$ be the sample minimizer of the implicit score matching objective on $n$ iid samples from $p_{\text{data}}$. Assume the model is well-specified ($p_{\text{data}} = p_{\theta^*}$ for an identifiable $\theta^*$) and the regularity conditions of the implicit score matching theorem hold. Then $\hat{\theta}_n$ is consistent: $\hat{\theta}_n \to \theta^*$ in probability as $n \to \infty$. Furthermore, $\hat{\theta}_n$ is asymptotically normal with a covariance matrix that is in general strictly larger (in PSD order) than the maximum-likelihood Cramér-Rao bound.

Intuition

The population $J_{\text{ISM}}$ is uniquely minimized at $\theta^*$ by the identification assumption: $J_{\text{ISM}}(\theta^*) = J_{\text{ESM}}(\theta^*) - C = -C$ achieves its minimum. Standard M-estimator theory transfers consistency from the population minimizer to the sample minimizer. The efficiency loss relative to MLE is the price you pay for sidestepping the partition function: you replace the score function of the likelihood with a different (but still unbiased) estimating equation.

Proof Sketch

Hyvärinen (2005, Theorem 2) proves consistency under the regularity conditions above using standard M-estimator arguments. For asymptotic normality and the efficiency comparison to MLE, Hyvärinen's 2005 analysis and follow-up work on structured exponential-family settings make the same point: score matching is a valid M-estimator, but not an efficient replacement for maximum likelihood. The result generalizes to denoising and sliced variants under matching regularity, with the efficiency gap depending on noise scale and projection variance respectively.

Why It Matters

Consistency is what justifies score matching as a real estimator, not just a heuristic. In the well-specified setting it converges to the truth with infinite data, and the rate is parametric ($n^{-1/2}$). The efficiency loss relative to MLE matters in low-data regimes but is irrelevant for diffusion models trained on tens of millions of images. In the misspecified setting (the model family does not contain the truth), score matching converges to the $\theta$ that minimizes the score-matching divergence rather than KL, and the two minima can be substantially different. This is one reason score-matched models sometimes look qualitatively different from MLE-fit models even on toy problems.

Failure Mode

The consistency result requires the boundary terms in the integration-by-parts derivation to vanish, which fails for densities with bounded support or sharp cutoffs. It also requires the model to be twice differentiable in $\theta$, which is a non-issue for neural networks at typical points but a real concern at activation kinks. Identifiability can fail silently for overparameterized models; in that case score matching converges to a set, not a point, and the asymptotic-normality statement does not apply.

NCSN and the Multi-Noise Trick

A single noise scale leaves an irreducible bias-variance tradeoff. Small $\sigma$ keeps the noisy score close to the data score but leaves the disconnected-modes problem; large $\sigma$ smooths across modes but trains a score that has been pulled far from the data manifold. Song and Ermon (2019) resolved this with the Noise-Conditional Score Network (NCSN): train a single network $s_\theta(x, \sigma)$ on a geometric schedule of noise scales $\sigma_1 > \sigma_2 > \cdots > \sigma_L$, weighting each scale's loss by $\lambda(\sigma) = \sigma^2$ to balance the contributions. At sample time, run annealed Langevin dynamics: start from pure noise at scale $\sigma_1$, run several Langevin steps at each scale, then anneal to the next.
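A sketch of both halves under the usual assumptions (PyTorch; `score_model(x, sigma)` is a hypothetical noise-conditional interface; `sigmas` is a descending 1-D tensor; `x` is flattened to shape `(batch, d)`):

```python
import torch

def ncsn_loss(score_model, x, sigmas):
    """Multi-scale DSM: sample one scale per example, weight by
    lambda(sigma) = sigma^2. Algebraically,
    sigma^2 * ||s + eps/sigma||^2 = ||sigma*s + eps||^2."""
    idx = torch.randint(len(sigmas), (x.shape[0],), device=x.device)
    sigma = sigmas[idx].view(-1, 1)
    eps = torch.randn_like(x)
    x_tilde = x + sigma * eps
    s = score_model(x_tilde, sigma)
    return 0.5 * ((sigma * s + eps) ** 2).sum(dim=1).mean()

@torch.no_grad()
def annealed_langevin(score_model, shape, sigmas, steps=100, eps0=2e-5):
    """Annealed Langevin sampling: Langevin steps at each scale, with
    step size alpha scaled by (sigma / sigma_L)^2 as in Song & Ermon."""
    x = torch.randn(shape) * sigmas[0]        # init at the largest scale
    for sigma in sigmas:                      # sigma_1 > ... > sigma_L
        alpha = eps0 * (sigma / sigmas[-1]) ** 2
        for _ in range(steps):
            z = torch.randn_like(x)
            x = x + 0.5 * alpha * score_model(x, sigma) + alpha ** 0.5 * z
    return x
```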

The Score-SDE framework (Song et al. 2021) takes the continuous-time limit: the discrete schedule becomes a forward SDE $dx = f(x, t)\, dt + g(t)\, dW$, the per-scale objectives become a single time-integrated DSM loss

$$\mathcal{L}(\theta) = \mathbb{E}_t\, \mathbb{E}_{x_0, x_t}\big[\, \lambda(t)\, \|s_\theta(x_t, t) - \nabla_{x_t} \log q_t(x_t \mid x_0)\|^2 \,\big],$$

and Anderson's time-reversal theorem supplies the reverse SDE that turns the trained score into a sampler. DDPM falls out as the variance-preserving (VP) SDE; NCSN falls out as the variance-exploding (VE) SDE; latent diffusion (Stable Diffusion) is the same loss applied to the output of a pretrained VAE encoder.
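For the continuous-time view, a minimal Euler-Maruyama sketch of the reverse-time SDE in the VE case (zero forward drift, $f = 0$); the diffusion coefficient `g(t)` and the prior scale `sigma_max` are left as assumptions of the schedule:

```python
import torch

@torch.no_grad()
def reverse_ve_sde(score_model, shape, g, sigma_max, T=1.0, n_steps=1000):
    """Integrate Anderson's reverse SDE,
    dx = -g(t)^2 s_theta(x, t) dt + g(t) dW_bar,
    from t = T down to 0 (VE case: f = 0)."""
    dt = T / n_steps
    x = torch.randn(shape) * sigma_max        # x_T approx N(0, sigma_max^2 I)
    for i in range(n_steps):
        t = T - i * dt                        # walk time backward
        z = torch.randn_like(x)
        # backward Euler-Maruyama step of size dt
        x = x + (g(t) ** 2) * score_model(x, t) * dt + g(t) * dt ** 0.5 * z
    return x
```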

Worked Example: Gaussian Score Matching

Take $p_{\text{data}} = \mathcal{N}(\mu, \Sigma)$ and a linear score model $s_\theta(x) = A(x - b)$ for $A \in \mathbb{R}^{d \times d}$, $b \in \mathbb{R}^d$. The true score is $\nabla \log p(x) = -\Sigma^{-1}(x - \mu)$, so the optimum is $A^* = -\Sigma^{-1}$, $b^* = \mu$.

Compute the implicit score matching loss: $\tfrac{1}{2}\|A(x - b)\|^2 + \mathrm{tr}(A) = \tfrac{1}{2}(x - b)^\top A^\top A\, (x - b) + \mathrm{tr}(A)$. Take the expectation under $p_{\text{data}}$: $\mathbb{E}[\tfrac{1}{2}(x - b)^\top A^\top A (x - b)] = \tfrac{1}{2}\mathrm{tr}(A^\top A\, \Sigma) + \tfrac{1}{2}(\mu - b)^\top A^\top A (\mu - b)$.

Setting the gradient in $b$ to zero gives $b^* = \mu$. Setting the gradient in $A$ to zero gives $A^* \Sigma + I = 0$, i.e., $A^* = -\Sigma^{-1}$. The implicit loss recovers the exact maximum-likelihood Gaussian fit, with no direct access to the density. This is the cleanest demonstration of the Hyvärinen identity in closed form, and it confirms the consistency result constructively for the Gaussian family.
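A numerical check of this example (a sketch; PyTorch with SGD in place of the closed form): for a linear model no autograd trace is needed, since $\mathrm{tr}(\nabla s_\theta) = \mathrm{tr}(A)$.

```python
import torch

torch.manual_seed(0)
d = 3
mu = torch.randn(d)
L = torch.eye(d) + 0.3 * torch.randn(d, d)
Sigma = L @ L.T                               # ground-truth covariance
x = mu + torch.randn(20000, d) @ L.T          # samples from N(mu, Sigma)

A = torch.zeros(d, d, requires_grad=True)
b = torch.zeros(d, requires_grad=True)
opt = torch.optim.Adam([A, b], lr=0.05)
for _ in range(3000):
    opt.zero_grad()
    s = (x - b) @ A.T                         # s_theta(x) = A(x - b)
    # ISM loss: 0.5*E||s||^2 + tr(A), no density access anywhere
    loss = 0.5 * (s ** 2).sum(dim=1).mean() + torch.trace(A)
    loss.backward()
    opt.step()

# Should be small: A converges to -Sigma^{-1} and b to mu,
# up to sampling error in the 20k-sample batch.
print((A + torch.linalg.inv(Sigma)).abs().max().item(),
      (b - mu).abs().max().item())
```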

Common Confusions

Watch Out

This score is a gradient in x, not in θ

In classical statistics, the word score often means the likelihood score $\nabla_\theta \log p_\theta(x)$, a derivative with respect to model parameters. In score matching, the score is $\nabla_x \log p(x)$, a vector field over data space. These are different objects with different dimensions, different roles, and different theorems. Score matching trains the data-space score field; it is not solving the likelihood score equation in parameter space.

Watch Out

Score matching is not minimizing KL to the data distribution

Score matching minimizes the squared $L^2$ distance between score fields, not the KL divergence between distributions. The two objectives have different gradients and different optima in general. They agree at the global minimum (both achieve zero only when $p_\theta = p_{\text{data}}$), but the geometry of the optimization landscape is different. In particular, score matching is consistent under model well-specification, but the bias-variance tradeoff in finite samples differs from MLE, and on misspecified models the two converge to different projections of $p_{\text{data}}$ onto the model family.

Watch Out

The Jacobian trace is not the same as the model output's norm

Beginners sometimes confuse $\mathrm{tr}(\nabla s_\theta)$ with $\|s_\theta\|^2$ or with the Jacobian's Frobenius norm. The trace is the divergence of the score field, $\sum_i \partial_{x_i} (s_\theta)_i$, which can be negative, zero, or positive independently of the score's magnitude. The implicit score matching loss has both terms, and the trace term is what makes the loss work; without it, the optimum would be $s_\theta \equiv 0$.

Watch Out

Denoising score matching does not train the data score directly

DSM trains $s_\theta(\tilde{x}) \approx \nabla_{\tilde{x}} \log q_\sigma(\tilde{x})$, the score of the noisy data density $q_\sigma$. As $\sigma \to 0$, $q_\sigma \to p_{\text{data}}$ and the DSM target approaches the data score, but for any fixed $\sigma > 0$ they are different. To learn the data score itself, you need a schedule of $\sigma$ values approaching zero, or a continuous-time SDE formulation where $\sigma(t)$ is part of the model.

Watch Out

Score matching does not require samples from the model

Unlike contrastive divergence and other EBM training methods, score matching needs no MCMC sampling from $p_\theta$ during training. The only data it consumes are samples from $p_{\text{data}}$. This is what makes it practically attractive for high-dimensional EBMs where MCMC mixing is the bottleneck. The price is the Jacobian trace (implicit), the projection variance (sliced), or the smoothing bias (denoising), but never an expectation over the model.

Watch Out

Tweedie's formula gives a posterior mean, not a sample

$\mathbb{E}[x \mid \tilde{x}] = \tilde{x} + \sigma^2\, \nabla \log q_\sigma$ is a mean, not a draw. For a noisy image consistent with several plausible clean images, the Tweedie estimate is an average of those modes, often a blurry interpolation that is not itself a plausible clean image. Diffusion samplers walk the posterior across noise scales precisely because the posterior is multimodal and a one-shot mean is not a generative sample.

Summary

  • Score matching learns a data-space score field, not the parameter-space likelihood score.
  • Hyvärinen's theorem removes the unknown data score by trading it for the divergence of the model score field.
  • Sliced score matching keeps the same target but estimates the divergence stochastically with random projections.
  • Denoising score matching changes the target to the score of a noisy marginal, turning training into a regression problem.
  • Tweedie's formula identifies that noisy score with a posterior-mean denoiser under Gaussian corruption.
  • Modern diffusion models are multi-scale denoising score matching plus a reverse-time sampler.

Exercises

Exercise (Core)

Problem

Derive the denoising score matching target for Gaussian noise: $\tilde{x} = x + \sigma \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$ and $x \sim p_{\text{data}}$. Show that the DSM loss reduces, up to multiplicative and additive constants in $\theta$, to $\mathbb{E}_{x, \epsilon}\,\|\sigma\, s_\theta(\tilde{x}) + \epsilon\|^2$.

Exercise (Advanced)

Problem

Prove Tweedie's formula: for $\tilde{x} = x + \sigma \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$ independent of $x \sim p_{\text{data}}$, and $q_\sigma(\tilde{x}) = \int q_\sigma(\tilde{x} \mid x)\, p_{\text{data}}(x)\, dx$,

$$\mathbb{E}[x \mid \tilde{x}] = \tilde{x} + \sigma^2\, \nabla_{\tilde{x}} \log q_\sigma(\tilde{x}).$$
Exercise (Advanced)

Problem

Show that the sliced score matching estimator with $v \sim \mathcal{N}(0, I_d)$ is unbiased for $J_{\text{ISM}}$. Compute the variance of its single-direction trace estimator $v^\top \nabla s_\theta\, v$ and explain why averaging over $K$ iid directions reduces the variance by a factor of $K$.

Exercise (Advanced)

Problem

Consider score matching applied to an energy-based model $p_\theta(x) = e^{-E_\theta(x)} / Z(\theta)$ with $s_\theta(x) = -\nabla_x E_\theta(x)$. Show that the implicit score matching loss $J_{\text{ISM}}(\theta)$ depends on $E_\theta$ only through its first and second derivatives, and explicitly never through $Z(\theta)$.

Exercise (Research)

Problem

Single-noise-level DSM is provably a poor sampler in high dimensions because of disconnected-modes mixing problems for Langevin dynamics. NCSN solves this with a geometric noise schedule and annealed Langevin. Sketch why a linear (rather than geometric) schedule of $\sigma$ values is suboptimal for sampling, and give an intuition based on how the data manifold's log-density curvature scales with noise.

References

Canonical:

  • Hyvärinen, Estimation of non-normalized statistical models by score matching (Journal of Machine Learning Research 6, 2005). The original paper. Introduces implicit score matching, the integration-by-parts identity, and the consistency theorem.
  • Vincent, A connection between score matching and denoising autoencoders (Neural Computation 23, 2011). Establishes the equivalence between DSM and explicit score matching against the noisy marginal $q_\sigma$.
  • Song, Garg, Shi, and Ermon, Sliced score matching: a scalable approach to density and score estimation (UAI 2020), arXiv:1905.07088. Hutchinson's-trick estimator that replaces the $O(d)$ Jacobian trace with one Jacobian-vector product.
  • Robbins, An empirical Bayes approach to statistics (Berkeley Symposium 1956), Section 4. The original derivation of Tweedie's formula in the empirical Bayes setting; the score-as-denoiser identity predates score matching by 50 years.
  • Efron, Tweedie's formula and selection bias (Journal of the American Statistical Association 106, 2011). Modern restatement, generalizations to non-Gaussian noise, and extensive discussion of statistical applications.

Current:

  • Song and Ermon, Generative modeling by estimating gradients of the data distribution (NeurIPS 2019), arXiv:1907.05600. The NCSN paper. Multi-scale score matching with annealed Langevin sampling.
  • Song, Sohl-Dickstein, Kingma, Kumar, Ermon, and Poole, Score-based generative modeling through stochastic differential equations (ICLR 2021), arXiv:2011.13456. Continuous-time SDE formulation that unifies NCSN, DDPM, and probability-flow ODEs under one score-matching framework.
  • Ho, Jain, and Abbeel, Denoising diffusion probabilistic models (NeurIPS 2020), arXiv:2006.11239. Reparameterizes DSM as $\epsilon$-prediction; the loss form used in virtually all modern image diffusion models.
  • Karras, Aittala, Aila, and Laine, Elucidating the design space of diffusion-based generative models (NeurIPS 2022), arXiv:2206.00364. Treats the score network as a posterior-mean denoiser via Tweedie's formula and analyzes loss-weighting across noise scales.
  • Salimans and Ho, Progressive distillation for fast sampling of diffusion models (ICLR 2022), arXiv:2202.00512. Introduces the $v$-prediction parameterization and shows it is more numerically stable than $\epsilon$-prediction at large $\sigma$.

Extensions:

  • Hyvärinen, Some extensions of score matching (Computational Statistics and Data Analysis 51, 2007). Extends the framework to non-negative data and bounded supports via reflection / transformation tricks.
  • Lyu, Interpretation and generalization of score matching (UAI 2009). Connects score matching to a family of generalized scoring rules and discusses regularization.
  • Chung, Kim, McCann, Klasky, and Ye, Diffusion posterior sampling for general noisy inverse problems (ICLR 2023), arXiv:2209.14687. Uses Tweedie's mean as a plug-in posterior estimator inside a guided diffusion sampler.
  • Song and Kingma, How to train your energy-based models (arXiv:2101.03288, 2021). Reviews score matching, contrastive divergence, and noise-contrastive estimation side by side as EBM training methods.

Next Topics

  • Diffusion Models: the generative modeling framework built on multi-scale denoising score matching with Anderson's reverse-time SDE.
  • Time Reversal of SDEs: Anderson's theorem that turns the learned score into a generative sampler.
  • Langevin Dynamics: the SDE that the learned score field drives at sampling time.
  • Energy-Based Models: the unnormalized-density framework that motivates score matching in the first place.
  • Fokker-Planck Equation: the PDE that governs the noisy marginals $q_\sigma(\tilde{x})$ along the noising schedule.
  • Stochastic Differential Equations: the continuous-time framework for the forward noising process.

Last reviewed: April 19, 2026
