Beyond LLMs
Diffusion Models
Generative models that learn to reverse a noise-adding process: forward and reverse SDEs, score matching, the DDPM ELBO, deterministic ODE sampling, classifier-free guidance, latent diffusion, and the bridge to flow matching.
Why This Matters
Diffusion keeps the corruption process fixed and spends all learnable capacity on reversing it
Training is cheap because any noisy state can be sampled in closed form. Sampling is expensive because the model has to walk a noisy prior back toward structure one denoising move at a time.
Forward process (known Gaussian corruption): clean sample → closed-form noisy states → high noise → prior.
Reverse process (learned denoising path): noise prior → denoising field → structure returns → sample.
The upper lane is analytic: sample a timestep and corrupt in one shot. The lower lane is what the network earns: a denoising field that keeps telling the sampler how to turn noise back into signal.
Training target
Sample $t \sim \mathrm{Uniform}\{1,\dots,T\}$, draw $\epsilon \sim \mathcal{N}(0, I)$, then form $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ without simulating the whole chain.
Equivalent parameterizations
Predicting $\epsilon$, the score $\nabla_{x_t}\log q_t(x_t)$, or a velocity-style target $v$ all describe the same denoising direction once you rescale by the known noise level.
What the sampler actually changes
DDPM keeps stochasticity, DDIM makes the path deterministic, and newer ODE solvers buy quality at lower step counts. The model is the same; the numerical path is what changes.
Why guidance helps and hurts
Classifier-free guidance extrapolates with $\tilde\epsilon = \epsilon_\theta(x_t,\varnothing) + w\,[\epsilon_\theta(x_t,c) - \epsilon_\theta(x_t,\varnothing)]$. Moderate $w$ sharpens the prompt; large $w$ pushes the sampler off the density it actually learned.
Diffusion is the dominant generative architecture for images, video, audio, 3D geometry, molecules, and robot policies as of 2025-2026. The frontier image systems (Stable Diffusion 3.5, Imagen 3, Flux.1, DALL-E 3, Midjourney v6) and the frontier video systems (Sora, Veo 3, Kling, Runway Gen-3) are all either diffusion models or their close relative, flow matching. Outside generative media, diffusion appears in protein-structure design, robotics action policies (Chi et al. 2023), molecular conformer generation (Xu et al. 2022), and decision-making.
Three properties separate diffusion from earlier generative families:
- Stable training. Unlike GANs, diffusion has no minimax game. The loss is a weighted regression against a known noise target. Training scales monotonically with compute and data.
- Mode coverage. Unlike VAEs and GANs, diffusion does not aggressively collapse modes. The score-matching objective averages over the data distribution rather than fighting it.
- Likelihood-grade theory. The discrete formulation gives a variational ELBO upper bound on negative log-likelihood (Sohl-Dickstein et al. 2015, Ho et al. 2020). The objective most diffusion models actually train on, $L_{\text{simple}}$, drops the per-timestep weights and is not the tight ELBO; it is a reweighted surrogate that improves sample quality at the cost of bound tightness. The continuous-time formulation gives exact likelihoods through an ODE change-of-variables (Song et al. 2021), independent of how the training loss was weighted.
The price is iterative sampling: a single image requires 4-50 forward passes through a large model. Most of the engineering progress since 2020 has been about cutting that cost.
If you want to watch timestep, guidance scale, and sampler choice change that tradeoff directly, open the Diffusion Lab.
Mental Model
Take a clean image, gradually add Gaussian noise over many steps until the image is pure noise, and then learn to undo each step. The forward process is fixed and analytically tractable; only the reverse direction is learned. At sampling time, draw a Gaussian noise vector and integrate the learned reverse dynamics back to a clean image.
There are several common parameterizations for what the network actually predicts: the noise $\epsilon$ that was added (DDPM, Ho et al. 2020), the score $\nabla_{x_t}\log q_t(x_t)$ of the perturbed data (NCSN, Song & Ermon 2019), the clean image $x_0$, or the linear combination $v = \sqrt{\bar\alpha_t}\,\epsilon - \sqrt{1-\bar\alpha_t}\,x_0$ (Salimans & Ho 2022). At a fixed noise level these targets are linear functions of each other, so the algebraic relationship is exact (Vincent 2011 gives the denoising-score identity). What the trained network ends up computing is not identical across choices: the implicit per-timestep loss weighting differs, which changes which noise levels the model is optimized for, and that interacts with the choice of sampler at inference time. The parameterization is a real design choice, not just a notational rewrite: schedule, weighting convention, and sampler all matter.
Formal Setup
Forward Process (Discrete-Time DDPM)
The forward process is a Markov chain that adds Gaussian noise over $T$ steps. Starting from $x_0 \sim q(x_0)$:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big),$$
where $\beta_1, \dots, \beta_T$ is a fixed variance schedule. With $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{s=1}^{t}\alpha_s$, the marginal at any step has a closed form:
$$q(x_t \mid x_0) = \mathcal{N}\!\big(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\,I\big),$$
so $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. As $T \to \infty$ and the schedule is chosen so that $\bar\alpha_T \to 0$, the prior converges to $q(x_T) \approx \mathcal{N}(0, I)$.
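A small numerical sketch (not from the text) checking that the one-shot closed-form marginal matches simulating the forward chain step by step, on toy 1-D "data":

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)      # linear DDPM schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

x0 = 3.0 * rng.standard_normal(10_000)  # toy 1-D "data" with variance 9
t = 600                                 # an arbitrary timestep

# Step-by-step chain: x_s = sqrt(1 - beta_s) * x_{s-1} + sqrt(beta_s) * eps_s
x = x0.copy()
for s in range(t):
    x = np.sqrt(1.0 - betas[s]) * x + np.sqrt(betas[s]) * rng.standard_normal(x.shape)

# One-shot closed form: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
x_direct = np.sqrt(alpha_bar[t - 1]) * x0 + np.sqrt(1.0 - alpha_bar[t - 1]) * rng.standard_normal(x0.shape)

# Both variances ~= 9 * alpha_bar_t + (1 - alpha_bar_t); no chain simulation needed for training.
print(np.var(x), np.var(x_direct))
```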
Reverse Process (Learned Markov Kernel)
The reverse process is a parametric Markov chain trained to undo the forward process:
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big), \qquad p_\theta(x_{0:T}) = p(x_T)\prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t).$$
In Ho et al. (2020) the variance is fixed to $\Sigma_\theta = \sigma_t^2 I$ (with $\sigma_t^2 = \beta_t$ or $\sigma_t^2 = \tilde\beta_t = \tfrac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\beta_t$), and the mean is reparameterized via a noise predictor $\epsilon_\theta$:
$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right).$$
Nichol & Dhariwal (2021) introduced learned $\Sigma_\theta(x_t, t)$ for tighter likelihood bounds.
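A minimal sketch of one ancestral reverse step under this parameterization; `eps_model` is a stand-in for the trained noise predictor, not a real network:

```python
import numpy as np

def ddpm_step(x_t, t, eps_model, betas, alpha_bar, rng):
    """One ancestral DDPM step x_t -> x_{t-1}, Ho et al. (2020) parameterization."""
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t
    eps = eps_model(x_t, t)                                           # predicted noise
    mean = (x_t - beta_t / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha_t)
    if t == 0:
        return mean                                                   # no noise added at the final step
    sigma_t = np.sqrt(beta_t)                                         # fixed variance choice sigma_t^2 = beta_t
    return mean + sigma_t * rng.standard_normal(x_t.shape)
```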
Score Function
The score of a density $p(x)$ is
$$s(x) = \nabla_x \log p(x).$$
A score model $s_\theta(x, \sigma)$ targets the score of the perturbed distribution $q_\sigma(x)$ at noise level $\sigma$. The score does not require knowing the normalizing constant of $p$, which is what makes it tractable to learn from samples alone (Hyvärinen 2005).
Forward SDE (Continuous-Time)
A continuous-time forward process is an Itô SDE on $t \in [0, T]$:
$$dx = f(x, t)\,dt + g(t)\,dw,$$
with drift $f(x, t)$, scalar diffusion coefficient $g(t)$, and standard Wiener process $w$. The two canonical choices are Variance Preserving (VP-SDE), $f(x, t) = -\tfrac{1}{2}\beta(t)\,x$ and $g(t) = \sqrt{\beta(t)}$, the continuous limit of DDPM, and Variance Exploding (VE-SDE), $f(x, t) = 0$ and $g(t) = \sqrt{\tfrac{d[\sigma^2(t)]}{dt}}$, the continuous limit of NCSN (Song et al. 2021).
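An illustrative numerical check (assuming the common $\beta(t)$ rising linearly from 0.1 to 20, not a value stated in the text) that the VP-SDE keeps total variance near 1 when the data starts at unit variance:

```python
import numpy as np

rng = np.random.default_rng(0)

def beta(t):
    return 0.1 + 19.9 * t                # beta(t) on t in [0, 1], VP convention

x = rng.standard_normal(50_000)          # unit-variance "data"
n_steps = 1000
dt = 1.0 / n_steps

# Euler-Maruyama on dx = -0.5 * beta(t) * x dt + sqrt(beta(t)) dw
for i in range(n_steps):
    t = i * dt
    x = x - 0.5 * beta(t) * x * dt + np.sqrt(beta(t) * dt) * rng.standard_normal(x.shape)

print(np.var(x))                         # stays close to 1: variance preserving
```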
Main Theorems
DDPM Variational Lower Bound
Statement
The negative log-likelihood admits the variational upper bound
$$\mathbb{E}\big[-\log p_\theta(x_0)\big] \;\le\; \mathbb{E}_q\!\left[-\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right] \;=\; L_{\text{VLB}},$$
with
$$L_{\text{VLB}} = \underbrace{D_{\mathrm{KL}}\!\big(q(x_T \mid x_0)\,\big\|\,p(x_T)\big)}_{L_T} \;+\; \sum_{t=2}^{T}\underbrace{D_{\mathrm{KL}}\!\big(q(x_{t-1} \mid x_t, x_0)\,\big\|\,p_\theta(x_{t-1} \mid x_t)\big)}_{L_{t-1}} \;+\; \underbrace{\mathbb{E}_q\big[-\log p_\theta(x_0 \mid x_1)\big]}_{L_0}.$$
When the reverse mean is reparameterized via a noise predictor $\epsilon_\theta$ and the variance is fixed, each $L_{t-1}$ reduces to a weighted squared error. Dropping the timestep weights gives the simplified DDPM loss of Ho et al. (2020):
$$L_{\text{simple}} = \mathbb{E}_{t,\,x_0,\,\epsilon}\Big[\big\lVert \epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\ t\big)\big\rVert^2\Big].$$
Intuition
Training reduces to: sample a clean image $x_0$, sample a timestep $t$ and a noise vector $\epsilon$, form the noisy view $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ in closed form, and regress the network output $\epsilon_\theta(x_t, t)$ against the noise. No sampling of long chains during training, no GAN-style adversary, no per-step backpropagation through time.
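The whole training step, as a hedged numpy sketch; `denoiser` stands in for the network $\epsilon_\theta$ and `x0_batch` for a data batch (neither is defined in the text):

```python
import numpy as np

def ddpm_training_loss(x0_batch, denoiser, alpha_bar, rng):
    """One evaluation of L_simple: sample t and eps, form x_t in closed form, regress on eps."""
    B = x0_batch.shape[0]
    t = rng.integers(0, len(alpha_bar), size=B)                      # uniform timestep per example
    eps = rng.standard_normal(x0_batch.shape)
    ab = alpha_bar[t].reshape(B, *([1] * (x0_batch.ndim - 1)))       # broadcast alpha_bar_t to data shape
    x_t = np.sqrt(ab) * x0_batch + np.sqrt(1.0 - ab) * eps           # closed-form corruption
    eps_pred = denoiser(x_t, t)                                      # stand-in for eps_theta(x_t, t)
    return np.mean((eps_pred - eps) ** 2)                            # simplified DDPM loss
```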
Proof Sketch
Apply the standard ELBO decomposition to $-\log p_\theta(x_0)$:
- Write $-\log p_\theta(x_0) \le \mathbb{E}_q\big[-\log \tfrac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\big]$ by Jensen.
- Use Bayes to rewrite the forward chain as the time-reversed posterior $q(x_{t-1} \mid x_t, x_0)$, which is a closed-form Gaussian because the forward kernel is Gaussian with known means.
- Group terms by timestep: a prior-matching KL at $t = T$, a KL between the true and learned reverse kernel for $2 \le t \le T$, and a reconstruction term at $t = 1$.
- Each KL between two Gaussians with the same fixed variance reduces to a squared-mean difference. Reparameterizing the mean as $\mu_\theta(x_t, t) = \tfrac{1}{\sqrt{\alpha_t}}\big(x_t - \tfrac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t)\big)$ folds the $t$-dependent constants into a single timestep weight $w_t$.
- The simplified loss sets $w_t = 1$. Karras et al. (2022) note that uniform timestep sampling implicitly weights by SNR-dependent factors that happen to align with perceptual quality, which is why $L_{\text{simple}}$ outperforms the tight ELBO for sample quality despite being looser as a likelihood bound.
Why It Matters
This is why diffusion training is engineering-friendly. The loss is a single mean-squared error over noise predictions, parallelized across timesteps and data points. The variational bound certifies that minimizing the per-step errors pushes down a valid upper bound on $-\log p_\theta(x_0)$, even though the reweighted $L_{\text{simple}}$ leaves that bound loose.
Failure Mode
$L_{\text{simple}}$ discards the ELBO's per-timestep weights, so it underweights the low-noise steps that dominate the likelihood. If the goal is reporting NLL or BPD (e.g., density estimation benchmarks), use the variance-weighted variational loss or the EDM parameterization (Karras et al. 2022). For sample quality on standard benchmarks, $L_{\text{simple}}$ is the strong baseline.
Denoising Score Matching Equivalence
Statement
For perturbation kernel $q_\sigma(\tilde x \mid x) = \mathcal{N}(\tilde x;\ x,\ \sigma^2 I)$ and the induced marginal $q_\sigma(\tilde x) = \int q_\sigma(\tilde x \mid x)\,p_{\text{data}}(x)\,dx$, the denoising score matching objective
$$J_{\text{DSM}}(\theta) = \mathbb{E}_{x \sim p_{\text{data}},\ \tilde x \sim q_\sigma(\cdot \mid x)}\Big[\big\lVert s_\theta(\tilde x) - \nabla_{\tilde x}\log q_\sigma(\tilde x \mid x)\big\rVert^2\Big]$$
is equal, up to a $\theta$-independent constant, to the marginal score matching objective $\mathbb{E}_{\tilde x \sim q_\sigma}\big[\lVert s_\theta(\tilde x) - \nabla_{\tilde x}\log q_\sigma(\tilde x)\rVert^2\big]$.
Equivalently, the noise-prediction model $\epsilon_\theta$ and the score model $s_\theta$ are linked by
$$s_\theta(x_t, t) = -\,\frac{\epsilon_\theta(x_t, t)}{\sqrt{1-\bar\alpha_t}},$$
so denoising and score learning are the same algorithm.
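In code the equivalence is a one-line rescaling; a sketch under the DDPM convention $\sigma_t = \sqrt{1-\bar\alpha_t}$, with `eps_pred` standing in for the network output:

```python
import numpy as np

def score_from_eps(eps_pred, alpha_bar_t):
    """Read a trained noise predictor as a score estimate for the perturbed marginal q_t."""
    # Conditional score of the Gaussian kernel: -(x_t - sqrt(ab)*x0) / (1 - ab) = -eps / sqrt(1 - ab)
    return -eps_pred / np.sqrt(1.0 - alpha_bar_t)
```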
Intuition
You cannot evaluate $\nabla_x \log p_{\text{data}}(x)$ directly because $p_{\text{data}}$ is unknown. But after corrupting with known Gaussian noise, the conditional score $\nabla_{\tilde x}\log q_\sigma(\tilde x \mid x) = -(\tilde x - x)/\sigma^2$ is just the negative noise direction, which the network can learn from (clean, corrupted) pairs. The marginal score follows by an integration-by-parts argument that shifts the expectation onto the network output.
Why It Matters
This is the bridge between the DDPM view (denoise pixels) and the score-SDE view (estimate $\nabla_x \log p_t(x)$). They are not two methods; they are one method with two parameterizations. The same trained network drives both discrete-time DDPM samplers and continuous-time SDE/ODE solvers.
Failure Mode
The equivalence assumes a Gaussian perturbation kernel. With non-Gaussian corruption (e.g., cold diffusion or masking), the conditional-score formula changes and the simple "predict the noise" target no longer applies (Bansal et al. 2022).
Anderson Reverse-Time SDE
Statement
The forward SDE $dx = f(x, t)\,dt + g(t)\,dw$ on $t \in [0, T]$ has a reverse-time SDE with the same marginal distributions:
$$dx = \big[f(x, t) - g(t)^2\,\nabla_x \log p_t(x)\big]\,dt + g(t)\,d\bar w,$$
where $\bar w$ is a standard Wiener process running backward in time and the SDE is integrated from $t = T$ down to $t = 0$.
Intuition
Forward dynamics push samples from data to noise; the reverse correction term is the "uphill" force that sends them back to high-density regions. The score is the only data-dependent quantity, and the diffusion coefficient is unchanged.
Proof Sketch
Anderson (1982) proves this by writing the joint density of the process at two nearby times, conditioning to obtain the time-reversed transition density, and verifying it satisfies the reverse-time Fokker-Planck equation. Modern proofs (Haussmann & Pardoux 1986, Föllmer 1985) start from the time-reversal of the SDE directly. The intuition is that the forward Fokker-Planck equation for $p_t(x)$ involves a drift term and a Laplacian term; reversing time flips the sign of the drift, and the Laplacian becomes a drift correction whose magnitude is exactly $g(t)^2\,\nabla_x \log p_t(x)$.
Why It Matters
This is the foundational result that makes score-based generative modeling work. Train a neural network to approximate $\nabla_x \log p_t(x)$ at every $t$, plug it into the reverse SDE, and integrate from $t = T$ to $t = 0$ starting from Gaussian noise. Score-SDE (Song et al. 2021) unifies DDPM and NCSN as discrete solvers for VP- and VE-flavored versions of this same equation.
Failure Mode
The score model is least accurate near $t = 0$ (the data manifold has near-zero density off the manifold and the score blows up) and near $t = T$ (the noisy distribution is nearly Gaussian and the score is small but easy to mispredict in direction). EDM (Karras et al. 2022) handles both ends with explicit preconditioning of the network input/output.
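A minimal Euler-Maruyama integrator for the reverse VP-SDE; `score_model` is a stand-in for the trained score network, and the $\beta(t)$ schedule is the same illustrative choice as in the forward-SDE sketch above:

```python
import numpy as np

def reverse_sde_sample(score_model, shape, n_steps=1000, rng=None):
    """Euler-Maruyama on the reverse VP-SDE, integrating t from 1 down to 0."""
    if rng is None:
        rng = np.random.default_rng()

    def beta(t):
        return 0.1 + 19.9 * t

    dt = 1.0 / n_steps
    x = rng.standard_normal(shape)                          # start from the Gaussian prior
    for i in reversed(range(n_steps)):
        t = (i + 1) * dt
        f = -0.5 * beta(t) * x                              # forward drift
        g2 = beta(t)                                        # g(t)^2
        drift = f - g2 * score_model(x, t)                  # reverse drift (Anderson 1982)
        x = x - drift * dt + np.sqrt(g2 * dt) * rng.standard_normal(shape)
    return x
```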
Probability Flow ODE
Statement
The marginal densities $p_t(x)$ of the forward SDE are also the marginals of the deterministic ODE
$$\frac{dx}{dt} = f(x, t) - \tfrac{1}{2}\,g(t)^2\,\nabla_x \log p_t(x).$$
Integrating this ODE backward in time from a prior sample $x_T$ produces a sample $x_0$ without any injected noise. The ODE is a deterministic, invertible bijection between the prior and the data distribution at the level of single trajectories.
Intuition
Halving the score correction and dropping the Wiener increment preserves the family of marginal densities while removing all randomness from sample paths. Stochastic and deterministic samplers see different per-trajectory dynamics but the same marginal distribution at every $t$.
Why It Matters
Three consequences land here. First, deterministic sampling: DDIM (Song, Meng, Ermon 2020) is a discretization of this ODE for the VP-SDE, which is why it samples in 10-50 steps instead of 1000. Second, exact likelihood evaluation: integrating the ODE forward gives a continuous normalizing flow, and the change-of-variables formula yields exact log-likelihoods (Song et al. 2021). Third, interpolation and editing: noise-to-sample is bijective along each trajectory, so editing in noise space corresponds to consistent edits in sample space, the basis of techniques like SDEdit (Meng et al. 2022).
Failure Mode
The ODE shares the score model's error budget but loses the regularizing stochasticity. In regions where the score is mispredicted, deterministic trajectories accumulate the error monotonically, while the SDE injects fresh noise that can correct course. EDM and DPM-Solver compensate with higher-order solvers (Heun, multistep) rather than more steps.
Classifier-Free Guidance
Statement
Let $p_t(x_t \mid c)$ be the noise-perturbed conditional and $p_t(x_t)$ the unconditional marginal. Bayes' rule implies
$$\nabla_{x_t}\log p_t(c \mid x_t) = \nabla_{x_t}\log p_t(x_t \mid c) - \nabla_{x_t}\log p_t(x_t).$$
Sampling from a sharpened conditional $\tilde p_t(x_t \mid c) \propto p_t(x_t)\,p_t(c \mid x_t)^{w}$ (more weight on the implicit classifier) corresponds to using the guided score
$$\tilde s(x_t, c) = \nabla_{x_t}\log p_t(x_t) + w\,\big[\nabla_{x_t}\log p_t(x_t \mid c) - \nabla_{x_t}\log p_t(x_t)\big],$$
or equivalently in noise-prediction form
$$\tilde\epsilon_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w\,\big[\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\big].$$
The conditional and unconditional predictions are produced by the same network, trained with the conditioning signal randomly dropped to a null token $\varnothing$ at a fixed rate (Ho & Salimans 2022).
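The guided prediction is two forward passes and a linear combination; a sketch in which `eps_model` stands in for the trained network and `null_cond` for the null token:

```python
def guided_eps(eps_model, x_t, t, cond, null_cond, w):
    """Classifier-free guidance: extrapolate from the unconditional toward the conditional prediction."""
    eps_uncond = eps_model(x_t, t, null_cond)          # conditioning dropped
    eps_cond = eps_model(x_t, t, cond)                 # full conditioning
    return eps_uncond + w * (eps_cond - eps_uncond)    # w = 1 recovers the plain conditional model
```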
Intuition
The model's own conditional and unconditional predictions act as an implicit classifier gradient $\nabla_{x_t}\log p_t(c \mid x_t)$. Extrapolating in that direction sharpens the conditional, biasing samples toward higher implicit-classifier confidence in $c$.
Why It Matters
Every text-to-image and text-to-video system in production uses classifier-free guidance. It replaced the earlier classifier guidance of Dhariwal & Nichol (2021), which required training a separate noise-conditioned classifier. CFG needs only a single dropout rate and one extra forward pass per sampling step.
Failure Mode
At high guidance scales (above the usual operating range for image models), samples become saturated, high-contrast, and lose diversity. At very high $w$ they fall off the data manifold entirely, producing the characteristic "burnt" look of overguided diffusion. The cause is that the implicit-classifier extrapolation pushes samples into low-density regions where the score model was never trained accurately. CFG-zero, CFG rescale (Lin et al. 2024), and dynamic thresholding (Imagen) are workarounds that cap or rescale the extrapolation magnitude.
Discrete-Time Samplers
The reverse SDE and probability flow ODE both need a numerical integrator. Sampler choice trades quality for steps. Three families dominate practice.
Ancestral / DDPM sampling. A first-order Euler-Maruyama discretization of the reverse VP-SDE. Use many steps (1000 in the original DDPM recipe) for high quality, fewer for fast research baselines. Stochastic (injects fresh Gaussian noise per step), so it self-corrects but needs many steps.
DDIM. A first-order discretization of the probability flow ODE for the VP-SDE (Song, Meng, Ermon 2020). Deterministic, with a single hyperparameter $\eta$ that interpolates between fully deterministic ($\eta = 0$) and ancestral ($\eta = 1$) sampling. Hits competitive FID at 25-50 steps; a minimal sketch of the $\eta = 0$ update appears after the sampler families below.
EDM Heun & DPM-Solver. Higher-order ODE solvers tailored to the diffusion ODE. EDM (Karras et al. 2022) uses Heun's second-order method on a $\sigma$-parameterized ODE with the Karras schedule; DPM-Solver (Lu et al. 2022) and DPM-Solver++ exploit the semi-linear structure of the diffusion ODE to reach FID-competitive samples in 10-20 steps. UniPC (Zhao et al. 2023) generalizes both with a multistep predictor-corrector.
Distillation-based fast samplers. Progressive distillation (Salimans & Ho 2022), consistency models (Song et al. 2023), and consistency distillation (Luo et al. 2023) train student networks that take large jumps in noise level, reaching 1-4 step generation at the cost of an extra training stage.
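The DDIM update referenced above, as a hedged sketch in the DDPM $\bar\alpha$ notation; `eps_model` is a stand-in for the trained network:

```python
import numpy as np

def ddim_step(x_t, t, t_prev, eps_model, alpha_bar):
    """One deterministic DDIM step (eta = 0): jump from timestep t to an earlier timestep t_prev."""
    eps = eps_model(x_t, t)
    ab_t = alpha_bar[t]
    ab_prev = alpha_bar[t_prev] if t_prev >= 0 else 1.0
    x0_pred = (x_t - np.sqrt(1.0 - ab_t) * eps) / np.sqrt(ab_t)       # predicted clean sample
    return np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps  # re-noise to the earlier level
```

Because the step jumps between arbitrary noise levels, it can be run over a short strided subset of timesteps (for example 25-50 evenly spaced values of $t$) instead of the full training schedule.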
Latent Diffusion
Pixel-space diffusion at high resolution is computationally heavy because every denoising step must process every pixel. Latent Diffusion Models (LDM, Rombach et al. 2022) move the diffusion process into the latent space of a pretrained autoencoder.
The pipeline has three stages:
- A VAE encoder maps an image $x$ to a latent $z = E(x)$ that is spatially downsampled by a factor $f$ between 4 and 16 (8 in Stable Diffusion).
- A diffusion model is trained on the latents with the standard DDPM or score-SDE objective.
- A VAE decoder maps generated latents back to pixels.
The autoencoder is trained once with a perceptual loss and a small KL regularizer; it is held fixed during diffusion training. The compute saving is roughly $f^2$ per denoising step (about 64× at $f = 8$), which is what made Stable Diffusion practical on consumer GPUs. Stable Diffusion 1.x/2.x, SDXL, SD3, and SD3.5 are all latent diffusion. Imagen and Flux.1 use related cascaded or higher-resolution-latent variants.
The trade-off is fidelity at fine scales. The autoencoder bottleneck loses high-frequency detail; tiny text and faces are the canonical failure mode of latent diffusion. SDXL added a refiner stage, and SD3 increased the latent channel count from 4 to 16, both targeting this.
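The three stages at inference time, as a hedched pseudocode sketch; `sample_prior`, `denoise_latent`, and `vae_decode` are hypothetical stand-ins, not any particular library's API:

```python
def generate_latent_diffusion(prompt_embedding, sample_prior, denoise_latent, vae_decode, n_steps=50):
    """Sketch of the LDM inference pipeline: diffuse in latent space, decode to pixels once at the end."""
    z = sample_prior()                                 # Gaussian noise in the latent space (e.g. 64x64x4)
    for t in reversed(range(n_steps)):
        z = denoise_latent(z, t, prompt_embedding)     # one reverse step of the latent diffusion model
    return vae_decode(z)                               # single decoder pass back to pixel space
```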
Flow Matching
Conditional Flow Matching
Statement
Let $x_t = (1 - t)\,x_0 + t\,x_1$ for $t \in [0, 1]$, with $x_0 \sim \mathcal{N}(0, I)$ and $x_1 \sim p_{\text{data}}$. The conditional flow matching loss
$$\mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1}\Big[\big\lVert v_\theta(x_t, t) - (x_1 - x_0)\big\rVert^2\Big]$$
trains a velocity field $v_\theta$ whose ODE $\dot x = v_\theta(x, t)$ transports the source $\mathcal{N}(0, I)$ to $p_{\text{data}}$ along straight lines from each sampled noise vector to its paired data point (Lipman et al. 2023).
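The training target in code, as a small sketch; `velocity_model` stands in for $v_\theta$ and `x1_batch` for a data batch:

```python
import numpy as np

def cfm_loss(x1_batch, velocity_model, rng):
    """Conditional flow matching: regress the velocity field onto the straight-line target x1 - x0."""
    x0 = rng.standard_normal(x1_batch.shape)                           # source noise
    t = rng.uniform(0.0, 1.0, size=(x1_batch.shape[0],))
    t_b = t.reshape(-1, *([1] * (x1_batch.ndim - 1)))                  # broadcast t to the data shape
    x_t = (1.0 - t_b) * x0 + t_b * x1_batch                            # point on the straight path
    target = x1_batch - x0                                             # constant velocity along the path
    v_pred = velocity_model(x_t, t)
    return np.mean((v_pred - target) ** 2)
```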
Intuition
Diffusion takes curved paths from data to noise and back through a stochastic SDE; flow matching takes straight lines through a deterministic ODE. The training target is no longer "the noise that was added" but "the constant velocity that interpolates between this noise and this data point."
Why It Matters
Flow matching is a strict generalization of the probability flow ODE for diffusion: choosing a Gaussian conditional path with the variance-preserving diffusion schedule recovers the diffusion ODE; the straight-line path $x_t = (1 - t)\,x_0 + t\,x_1$ is rectified flow (Liu et al. 2023), which yields straighter trajectories and faster sampling. Stable Diffusion 3 (Esser et al. 2024) and most newer image/video systems are flow-matching trained because the straighter paths permit larger ODE step sizes.
Failure Mode
The linearity is in the latent (or pixel) space, not in the data manifold's intrinsic geometry. For data on highly curved manifolds (molecular conformations, 3D rotations, group-valued data), straight pixel-space paths cross low-density regions and the velocity field has to make large, hard-to-fit corrections. Riemannian flow matching (Chen & Lipman 2024) replaces straight lines with manifold geodesics for these cases.
Failure Modes and Engineering Realities
Schedule sensitivity. The variance schedule is not a hyperparameter you can sweep casually. The cosine schedule of Nichol & Dhariwal (2021) outperforms the linear DDPM schedule across image scales because it allocates more steps to the noise levels where the model learns the most. EDM (Karras et al. 2022) reframes the problem in terms of the noise level $\sigma$ rather than the timestep $t$ and shows that the implied distribution over $\sigma$ during training is the right object to design.
Exposure bias. During training the network sees true noisy samples $x_t \sim q(x_t \mid x_0)$. During sampling it sees its own previous predictions, which carry error. The mismatch compounds across steps; Ning et al. (2024) analyze it and propose epsilon scaling, a training-free correction that improves FID across DDIM, EDM, and LDM.
Truncation at the endpoints. The score blows up as $t \to 0$ (the data manifold has near-zero off-manifold density) and the model is poorly calibrated as $t \to T$. Karras et al. (2022) address this with explicit input/output preconditioning that keeps the network's effective inputs and targets in a fixed numerical range across all noise levels.
Mode coverage versus prompt fidelity. High classifier-free guidance sharpens prompt adherence at the cost of mode coverage; the resulting samples look prototypical and lose the long-tail diversity present at low guidance. There is no free lunch; the practical recipe is to use moderate guidance for SD-style image models and rely on prompting and conditioning channels (ControlNet, IP-Adapter) for fine control rather than pushing $w$ higher.
Sampler-architecture mismatch. A network trained with the simplified $\epsilon$-prediction loss and DDPM ancestral sampling will not necessarily transfer cleanly to a DPM-Solver++ ODE solver at low step counts. EDM Heun + EDM-trained networks form one matched stack; rectified-flow Euler + flow-matching networks form another. Mixing across stacks can leave sample quality on the table.
Where This Shows Up
- Image generation: Stable Diffusion 3.5, SDXL, DALL-E 3, Imagen 3, Midjourney v6, Flux.1, Ideogram
- Video generation: Sora (OpenAI), Veo 3 (Google), Kling 2 (Kuaishou), Runway Gen-3, Pika, video world models
- Audio synthesis: AudioLDM 2, Stable Audio 2, Diff-A-Riff, music generation systems
- 3D generation: DreamFusion (SDS-based text-to-3D), MVDream multi-view diffusion, point-cloud diffusion, and Gaussian splatting generators
- Molecular and protein design: RFdiffusion (RF-diffusion for protein structure), GeoDiff for molecular conformers, Boltz-1 for structure prediction with diffusion components
- Robotics: Diffusion Policy (Chi et al. 2023) for visuomotor control, 3D Diffuser Actor for manipulation
- Decision making and planning: Janner et al. 2022 trajectory diffusion; world-model planners that diffuse over future states
Diffusion vs. GANs in One Table
If someone asks "why are GANs not diffusion models?" the clean answer is: diffusion learns how to reverse a fixed corruption process, while GANs learn to fool a discriminator. Both may start from noise at sampling time, but the objective, the intermediate states, and the failure modes are different.
| Question | Diffusion | GAN |
|---|---|---|
| What is the training target? | Predict noise, denoised sample, or score at many noise levels | Fool a discriminator or critic |
| Is there a fixed forward process? | Yes: add Gaussian noise by design | No: latent prior is sampled directly |
| What is optimized? | Regression or ELBO-style objective | Minimax game |
| How is a sample generated? | Iterative reverse denoising from $x_T$ to $x_0$ | One generator pass from $z \sim \mathcal{N}(0, I)$ to $x$ |
| Typical weakness | Slow sampling | Mode collapse and unstable training |
| Typical strength | Stable training, strong mode coverage, flexible conditioning | Very fast inference once trained |
Common Confusions
DDPM and score-based models are not different methods
The DDPM derivation starts from a variational bound on a Markov chain; the NCSN/score-SDE derivation starts from score matching plus the reverse-time SDE. They produce the same training loss up to per-timestep weighting and the same network architecture. The two papers (Ho et al. 2020, Song & Ermon 2019) arrived at the same algorithm from different angles within months of each other, and Song et al. (2021) made the equivalence rigorous via the SDE view. Treat them as one framework with two parameterizations.
Probability flow ODE is not flow matching
Both produce deterministic ODEs that transport noise to data, and both are called "ODE samplers" in casual usage. They differ in what is held fixed:
- Probability flow ODE is derived from a forward stochastic process; the velocity field equals $f(x, t) - \tfrac{1}{2}\,g(t)^2\,\nabla_x \log p_t(x)$ and is determined once the SDE is chosen.
- Flow matching starts from a chosen probability path and trains a velocity field directly via regression. There is no underlying SDE.
Diffusion's probability flow ODE is a special case of flow matching, but flow matching is the more general construction. Rectified flow (Liu et al. 2023) is one specific flow-matching path that is not a diffusion probability flow ODE.
More steps is not always better
DDPM originally used $T = 1000$ steps because the discretization error of ancestral sampling on the reverse VP-SDE is large at coarse step counts. Modern higher-order solvers on the probability flow ODE (DDIM, DPM-Solver++, EDM Heun) close most of the quality gap at 10-25 steps. Step count is a sampler choice, not a model property; a single trained model can be sampled at any step count from 1 (with distillation) to 1000.
Guidance scale is not a quality dial
Increasing the classifier-free guidance scale does not monotonically improve quality. Past a sweet spot the samples saturate, lose color diversity, and pick up high-frequency artifacts. The mechanism is that the implicit-classifier extrapolation pushes the score into low-density regions where it was never trained. CFG rescale and dynamic thresholding are practical workarounds, but trading sample diversity for prompt fidelity at high $w$ is a structural feature of the method, not a bug to be tuned away.
Latent diffusion is not a separate generative model family
Stable Diffusion is a latent diffusion model: an autoencoder plus a standard diffusion process on the latents. The diffusion math is unchanged; only the operating space differs. Failure modes of latent diffusion (small text, faces) are autoencoder failures, not diffusion failures.
Summary
- Forward process adds Gaussian noise on a fixed schedule; the closed-form marginal makes training trivially parallel.
- Reverse process is a learned Markov kernel (DDPM) or, equivalently, a learned score plugged into the reverse-time SDE (score-SDE). The two views train the same network with the same loss.
- DDPM ELBO decomposes into per-step KL terms that reduce to a weighted noise-prediction MSE; the simplified $L_{\text{simple}}$ drops the weights and trains best in practice.
- Reverse-time SDE (Anderson 1982) is the foundational result: for any forward SDE, the reverse drift equals the forward drift minus $g(t)^2\,\nabla_x \log p_t(x)$, plus a backward Wiener increment.
- Probability flow ODE removes the stochasticity while preserving marginal densities. It enables deterministic sampling (DDIM), exact likelihoods, and invertible noise-to-sample bijections.
- Classifier-free guidance sharpens conditional samples by extrapolating in the direction of the implicit classifier gradient $\nabla_{x_t}\log p_t(c \mid x_t)$. It is the mechanism behind every modern text-to-image system.
- Latent diffusion moves the diffusion process into a pretrained autoencoder's latent space, cutting per-step compute by roughly 64× and making consumer-GPU image synthesis practical.
- Flow matching generalizes the probability flow ODE by training a velocity field directly against any chosen probability path; rectified flow is the linear-path special case used by SD3 and Flux.
Exercises
Problem
Given $\bar\alpha_t = 0.01$ (i.e., 99% of the variance is noise), write $x_t$ in terms of $x_0$ and $\epsilon$. What fraction of the original signal amplitude remains, and what is the signal-to-noise ratio?
Problem
Show that the DDPM noise-prediction objective and the score model are linked by $s_\theta(x_t, t) = -\epsilon_\theta(x_t, t)/\sqrt{1-\bar\alpha_t}$. What does this imply about training a single network and using it both as a DDPM denoiser and as a score model in the reverse-time SDE?
Problem
Derive the probability flow ODE from the forward SDE by writing the Fokker-Planck equation for $p_t(x)$ and re-expressing the diffusion term as a drift correction.
Problem
A practitioner trains a diffusion model with $L_{\text{simple}}$ and reports excellent FID but poor NLL relative to a normalizing-flow baseline. Is this a contradiction? Explain how $L_{\text{simple}}$ relates to a tight likelihood bound and what to change if NLL is the priority.
Problem
An interviewer says: "Both GANs and diffusion start from noise and end at an image. Why are they different?" Give a mechanism-level answer.
Problem
Latent diffusion processes images through a fixed VAE bottleneck. Suppose you want to fine-tune Stable Diffusion 1.5 (4-channel latent, 8× downsampling) to render legible small text. The diffusion training loss decreases but generated text remains garbled. Where is the bottleneck, and what would you change?
Frequently Asked Questions
- How is a diffusion model different from a VAE?
- A VAE has an encoder that learns a compressed latent and a decoder that maps back. A diffusion model has no learned encoder; the forward process is a fixed Gaussian noise schedule, and only the reverse denoiser is trained. Latent diffusion combines both: a VAE provides the latent space, and a diffusion model operates inside it.
- Why does DDIM sample faster than DDPM?
- DDPM is the stochastic reverse SDE discretized at every training timestep, typically 1000 steps. DDIM uses the deterministic probability-flow ODE with the same trained score, so the sampler is non-Markovian and can take much larger steps. With 50 steps DDIM produces samples comparable to 1000-step DDPM with the same model.
- What is classifier-free guidance?
- A way to steer samples toward a condition $c$ without a separate classifier. Train one model on both conditional and unconditional inputs (with $c$ randomly dropped). At sampling time, combine the conditional and unconditional predictions: $\tilde\epsilon = \epsilon_\theta(x_t, \varnothing) + w\,[\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)]$. Higher $w$ trades sample diversity for prompt fidelity.
- Why use latent diffusion (Stable Diffusion) instead of pixel diffusion?
- Pixel-space diffusion at high resolution is computationally heavy, and the diffusion model wastes capacity on perceptual details a learned VAE can recover. Latent diffusion runs the U-Net on an 8× downsampled VAE latent, cuts compute roughly 64×, and concentrates the diffusion model's modeling power on the semantic structure of the image. This is the design choice that made high-resolution image generation tractable.
- Are diffusion and flow matching the same thing?
- Mathematically very close. The probability-flow ODE of a diffusion process IS a flow. Flow matching (Lipman et al. 2023) directly learns the velocity field rather than the score; the training objective has lower variance and the resulting flow is straighter, enabling fewer-step sampling. Rectified flow takes this further by iteratively rectifying the flow toward straight-line trajectories.
References
Foundations:
- Hyvärinen, "Estimation of Non-Normalized Statistical Models by Score Matching", JMLR 6:695-709 (2005). The original score-matching objective with the integration-by-parts trick that avoids the normalizing constant.
- Vincent, "A Connection Between Score Matching and Denoising Autoencoders", Neural Computation 23(7):1661-1674 (2011). Establishes the denoising-score-matching equivalence that licenses learning by predicting noise.
- Anderson, "Reverse-time diffusion equation models", Stochastic Processes and their Applications 12(3):313-326 (1982). The reverse-time SDE result that powers all score-based sampling.
- Sohl-Dickstein, Weiss, Maheswaranathan, Ganguli, "Deep Unsupervised Learning using Nonequilibrium Thermodynamics" (ICML 2015), arXiv:1503.03585. The original diffusion-model paper, five years before DDPM scaled it.
Canonical (2019-2021):
- Song & Ermon, "Generative Modeling by Estimating Gradients of the Data Distribution" (NCSN, NeurIPS 2019), arXiv:1907.05600. The score-based-generation paper rediscovering and scaling Vincent (2011).
- Ho, Jain, Abbeel, "Denoising Diffusion Probabilistic Models" (NeurIPS 2020), arXiv:2006.11239. The paper that made diffusion competitive with GANs on images. Section 3 derives $L_{\text{simple}}$.
- Song, Sohl-Dickstein, Kingma, Kumar, Ermon, Poole, "Score-Based Generative Modeling through Stochastic Differential Equations" (ICLR 2021), arXiv:2011.13456. Unifies DDPM and NCSN under the continuous-time SDE framework; introduces the probability flow ODE.
- Song, Meng, Ermon, "Denoising Diffusion Implicit Models" (DDIM, ICLR 2021), arXiv:2010.02502. Deterministic sampling with 10-50 steps via the probability flow ODE perspective.
- Nichol & Dhariwal, "Improved Denoising Diffusion Probabilistic Models" (ICML 2021), arXiv:2102.09672. Cosine schedule, learned variances, log-likelihood improvements.
Conditioning and guidance:
- Dhariwal & Nichol, "Diffusion Models Beat GANs on Image Synthesis" (NeurIPS 2021), arXiv:2105.05233. Classifier guidance with a separately trained noise-conditioned classifier; established the FID parity claim.
- Ho & Salimans, "Classifier-Free Diffusion Guidance" (NeurIPS Workshop 2021, full version 2022), arXiv:2207.12598. The single-network conditioning trick behind every production text-to-image system.
Practical training and sampling (2022-2023):
- Karras, Aittala, Aila, Laine, "Elucidating the Design Space of Diffusion-Based Generative Models" (EDM, NeurIPS 2022), arXiv:2206.00364. The reference parameterization analysis with second-order Heun sampler; reframes everything in terms of the noise level $\sigma$.
- Lu, Zhou, Bao, Chen, Li, Zhu, "DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling" (NeurIPS 2022), arXiv:2206.00927; and the DPM-Solver++ followup. High-order ODE solvers tailored to the diffusion ODE.
- Salimans & Ho, "Progressive Distillation for Fast Sampling of Diffusion Models" (ICLR 2022), arXiv:2202.00512. Distill an $N$-step teacher into an $N/2$-step student, iterated down to 1-4 steps.
- Song, Dhariwal, Chen, Sutskever, "Consistency Models" (ICML 2023), arXiv:2303.01469. Single-step generation by training a network to map any point on the ODE trajectory to its origin.
- Lipman, Chen, Ben-Hamu, Nickel, Le, "Flow Matching for Generative Modeling" (ICLR 2023), arXiv:2210.02747. Continuous normalizing flows recast as regression against velocity fields; foundation for SD3 and Flux.
- Liu, Gong, Liu, "Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow" (ICLR 2023), arXiv:2209.03003. The straight-line conditional path that yields the simplest flow-matching objective.
Latent and frontier image systems:
- Rombach, Blattmann, Lorenz, Esser, Ommer, "High-Resolution Image Synthesis with Latent Diffusion Models" (LDM/Stable Diffusion, CVPR 2022), arXiv:2112.10752. The autoencoder-plus-diffusion architecture behind SD 1.x/2.x/SDXL.
- Esser, Kulal, Blattmann et al., "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis" (Stable Diffusion 3, ICML 2024), arXiv:2403.03206. Flow-matching MMDiT architecture; widely deployed open-weights system as of 2024.
- Saharia et al., "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding" (Imagen, NeurIPS 2022), arXiv:2205.11487. Cascaded pixel-space diffusion with T5-XXL text conditioning.
Failure modes and corrections:
- Ning, Li, Su, Salah, Ertugrul, "Elucidating the Exposure Bias in Diffusion Models" (ICLR 2024), arXiv:2308.15321. The training-vs-sampling input-distribution mismatch; introduces epsilon-scaling.
- Lin, Liu, Liu, "Common Diffusion Noise Schedules and Sample Steps are Flawed" (WACV 2024), arXiv:2305.08891. Zero-SNR schedule fix and CFG rescale; explains the saturated-color failure mode of high-CFG sampling.
Selected applications:
- Chi, Feng, Du et al., "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion" (RSS 2023), arXiv:2303.04137. Diffusion as a robotics action prior.
- Xu, Yu, Song, Shi, Ermon, Tang, "GeoDiff: a Geometric Diffusion Model for Molecular Conformation Generation" (ICLR 2022), arXiv:2203.02923. Diffusion on molecular geometry.
- Watson, Juergens, Bennett et al., "De novo design of protein structure and function with RFdiffusion" (Nature 2023). RoseTTAFold + diffusion for protein backbone design.
Next Topics
The natural next steps from diffusion models:
- Flow matching: the generalization that drives Stable Diffusion 3, Flux, and most recent video models.
- Score matching: the underlying objective with the full Hyvärinen-Vincent derivation and its extensions.
- Video world models: how Sora, Veo, and Kling extend image diffusion to spatiotemporal data with simulation-grade physical fidelity claims.
- Gaussian splatting: the dominant 3D representation that diffusion models increasingly target.
Last reviewed: April 19, 2026