
Variational Autoencoders

Deriving the ELBO, the reparameterization trick for backpropagation through sampling, and how VAEs turn autoencoders into principled generative models via amortized variational inference.

Advanced · Tier 1 · Stable · Supporting · ~65 min

Why This Matters

[Figure: VAE architecture. An input $x$ (e.g., an image) passes through the encoder $q(z|x)$, a neural network outputting $\mu$ and $\sigma$. With noise $\varepsilon \sim \mathcal{N}(0, I)$, the reparameterization $z = \mu + \sigma \cdot \varepsilon$ produces the latent code $z$, which the decoder $p(x|z)$ maps to a reconstruction. The loss is $\|x - \hat{x}\|^2 + D_{\text{KL}}(q(z|x) \,\|\, p(z))$; the KL term pushes $q$ toward $\mathcal{N}(0, I)$. The reparameterization trick makes sampling differentiable: gradients flow through $\mu$ and $\sigma$, not through the sample.]

The VAE is where deep learning meets probabilistic modeling. It solves a fundamental problem: how to learn a generative model $p(x)$ when the data involves latent variables $z$ that make the marginal likelihood $p(x) = \int p(x|z)\,p(z)\,dz$ intractable. The solution, the evidence lower bound (ELBO) and amortized inference, relies on KL divergence to measure the gap between the approximate and true posterior. This is one of the most load-bearing constructions in modern ML and underpins much of generative AI.

Mental Model

You want to learn a generative model: sample $z$ from a simple prior (Gaussian), then decode $z$ into data $x$. The problem is inference: given an observed $x$, what $z$ likely generated it? The true posterior $p(z|x)$ is intractable. The VAE learns an approximate posterior $q_\phi(z|x)$ (the encoder) jointly with the generative model $p_\theta(x|z)$ (the decoder) by maximizing a lower bound on the log-likelihood.

The Generative Model

Definition

VAE Generative Model

The VAE defines a latent variable model:

  1. Prior: $z \sim p(z) = \mathcal{N}(0, I)$
  2. Likelihood (decoder): $x \sim p_\theta(x|z)$, parameterized by a neural network that maps $z$ to the parameters of a distribution over $x$

The marginal likelihood (evidence) is:

$$p_\theta(x) = \int p_\theta(x|z)\, p(z) \, dz$$

This integral is intractable for nonlinear decoders because it requires integrating over all possible latent codes.
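To make the intractability concrete, here is a hedged numerical sketch (all numbers are hypothetical) using a one-dimensional linear-Gaussian toy model, chosen precisely because its marginal *is* available in closed form, so a naive Monte Carlo estimate of the integral can be checked against the truth. In one dimension the estimate works; with a high-dimensional $z$ and a nonlinear decoder, almost all prior samples would miss the posterior mass and the same estimator becomes useless.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model where the marginal IS tractable, so we can check the estimate:
#   prior    z ~ N(0, 1)
#   decoder  x | z ~ N(z, sigma_x^2)    (a "linear decoder")
# => exact marginal: x ~ N(0, 1 + sigma_x^2)
sigma_x = 0.5
x = 1.3  # an observed data point

def log_lik(x, z, sigma):
    # log N(x; z, sigma^2)
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - z) ** 2 / (2 * sigma**2)

# Naive Monte Carlo estimate of p(x) = E_{z ~ p(z)}[p(x|z)]
z = rng.standard_normal(200_000)
log_p_mc = np.log(np.mean(np.exp(log_lik(x, z, sigma_x))))

# Exact marginal for comparison
var = 1 + sigma_x**2
log_p_exact = -0.5 * np.log(2 * np.pi * var) - x**2 / (2 * var)

print(log_p_mc, log_p_exact)  # close, for this 1-D toy
```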

Deriving the ELBO

The key insight: since we cannot compute $\log p_\theta(x)$ directly, we derive a tractable lower bound.

Theorem

Evidence Lower Bound (ELBO)

Statement

For any distribution $q_\phi(z|x)$, the log marginal likelihood satisfies:

logpθ(x)Eqϕ(zx)[logpθ(xz)]reconstructionDKL(qϕ(zx)p(z))regularization=L(θ,ϕ;x)\log p_\theta(x) \geq \underbrace{\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]}_{\text{reconstruction}} - \underbrace{D_{\text{KL}}(q_\phi(z|x) \| p(z))}_{\text{regularization}} = \mathcal{L}(\theta, \phi; x)

This lower bound is called the ELBO (Evidence Lower Bound). Equality holds when $q_\phi(z|x) = p_\theta(z|x)$, the true posterior.

Intuition

The ELBO has two terms pulling in opposite directions:

  • Reconstruction term: encourages the decoder to reconstruct $x$ from codes sampled via the encoder. Wants the encoder to be informative.
  • KL term: encourages the encoder distribution $q_\phi(z|x)$ to stay close to the prior $p(z) = \mathcal{N}(0, I)$. Wants the latent space to be structured and smooth.

The tension between these terms is the VAE tradeoff: be informative enough to reconstruct, but regular enough that the latent space has meaningful structure for generation.

Proof Sketch

Start with the log-evidence and introduce $q$:

$$\log p_\theta(x) = \log \int p_\theta(x, z) \, dz = \log \int \frac{p_\theta(x, z)}{q_\phi(z|x)}\, q_\phi(z|x) \, dz$$

Apply Jensen's inequality ($\log$ is concave):

$$\geq \int q_\phi(z|x) \log \frac{p_\theta(x, z)}{q_\phi(z|x)} \, dz = \mathbb{E}_q[\log p_\theta(x, z)] - \mathbb{E}_q[\log q_\phi(z|x)]$$

Expand $p_\theta(x, z) = p_\theta(x|z)\,p(z)$:

=Eq[logpθ(xz)]+Eq[logp(z)]Eq[logqϕ(zx)]= \mathbb{E}_q[\log p_\theta(x|z)] + \mathbb{E}_q[\log p(z)] - \mathbb{E}_q[\log q_\phi(z|x)]

=Eq[logpθ(xz)]DKL(qϕ(zx)p(z))= \mathbb{E}_q[\log p_\theta(x|z)] - D_{\text{KL}}(q_\phi(z|x) \| p(z))

Why It Matters

The ELBO transforms an intractable maximum likelihood problem into a tractable optimization. The gap between $\log p(x)$ and the ELBO is exactly $D_{\text{KL}}(q_\phi(z|x) \,\|\, p_\theta(z|x))$, the approximation quality of the encoder. Maximizing the ELBO simultaneously fits the generative model and improves the approximate posterior.

Failure Mode

The ELBO can be loose if $q_\phi$ is too simple to approximate the true posterior (e.g., a diagonal Gaussian when the true posterior is multimodal). A related failure mode is posterior collapse: the model ignores the latent variables ($q \approx p(z)$, KL $\approx 0$) and relies entirely on a powerful decoder.

An equivalent derivation shows the gap directly:

$$\log p_\theta(x) = \mathcal{L}(\theta, \phi; x) + D_{\text{KL}}(q_\phi(z|x) \,\|\, p_\theta(z|x))$$

Since KL divergence is non-negative, $\mathcal{L} \leq \log p_\theta(x)$.
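The decomposition above can be verified numerically. In a conjugate linear-Gaussian toy model (hypothetical numbers, not a real VAE), the ELBO, the exact log-evidence, and the KL gap to the true posterior are all available in closed form, so the identity $\log p(x) = \mathcal{L} + D_{\text{KL}}(q \,\|\, p(z|x))$ can be checked to machine precision for any choice of $q$:

```python
import numpy as np

# Conjugate toy model where every quantity is closed-form:
#   prior    z ~ N(0, 1)
#   decoder  x | z ~ N(z, s2_x)
# Exact posterior: z | x ~ N(x / (1 + s2_x), s2_x / (1 + s2_x))
s2_x = 0.25
x = 1.3

# Pick an arbitrary (suboptimal) approximate posterior q = N(m, s2)
m, s2 = 0.5, 0.3

def kl_gauss(m1, v1, m2, v2):
    # KL( N(m1, v1) || N(m2, v2) )
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1)

# ELBO = E_q[log p(x|z)] - KL(q || prior); the expectation is analytic:
# E_q[(x - z)^2] = (x - m)^2 + s2
recon = -0.5 * np.log(2 * np.pi * s2_x) - ((x - m) ** 2 + s2) / (2 * s2_x)
elbo = recon - kl_gauss(m, s2, 0.0, 1.0)

# Exact log-evidence and exact posterior
log_px = -0.5 * np.log(2 * np.pi * (1 + s2_x)) - x**2 / (2 * (1 + s2_x))
m_post, v_post = x / (1 + s2_x), s2_x / (1 + s2_x)

gap = kl_gauss(m, s2, m_post, v_post)
print(log_px, elbo + gap)  # identical up to float error
```

Changing `m` and `s2` changes how the total splits between ELBO and gap, but their sum is always $\log p(x)$: improving $q$ can only tighten the bound, never move the evidence.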

The Reparameterization Trick

Definition

Reparameterization Trick

The reconstruction term $\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$ requires sampling $z \sim q_\phi(z|x)$, but we cannot backpropagate through a sampling operation.

The reparameterization trick expresses the sample as a deterministic function of the parameters and an independent noise variable:

$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, I)$$

where $\mu_\phi(x)$ and $\sigma_\phi(x)$ are the mean and standard deviation output by the encoder network.

Now the randomness is in $\varepsilon$ (which does not depend on $\phi$), and $z$ is a differentiable function of $\phi$. Standard backpropagation works.

Without reparameterization, you would need high-variance score function estimators (REINFORCE). The reparameterization trick gives low-variance gradient estimates, making VAE training practical.
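The pathwise estimator can be demonstrated without any autograd framework. A hedged sketch (test function and parameter values are arbitrary): for $f(z) = z^2$ with $z \sim \mathcal{N}(\mu, \sigma^2)$, the exact gradients of $\mathbb{E}[f(z)] = \mu^2 + \sigma^2$ are $2\mu$ and $2\sigma$, and applying the chain rule through $z = \mu + \sigma\varepsilon$ recovers them from samples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pathwise (reparameterized) gradient estimator for f(z) = z^2.
# E_{z ~ N(mu, sigma^2)}[f(z)] = mu^2 + sigma^2, so the exact
# gradients are d/dmu = 2*mu and d/dsigma = 2*sigma.
mu, sigma = 1.5, 0.8
eps = rng.standard_normal(100_000)
z = mu + sigma * eps            # differentiable in mu and sigma

# Chain rule through z: f'(z) * dz/dmu and f'(z) * dz/dsigma
grad_mu = np.mean(2 * z * 1.0)     # dz/dmu = 1
grad_sigma = np.mean(2 * z * eps)  # dz/dsigma = eps

print(grad_mu, 2 * mu)        # estimate vs exact 3.0
print(grad_sigma, 2 * sigma)  # estimate vs exact 1.6
```

In frameworks like PyTorch this is what `rsample()` does under the hood: the noise is drawn once, and the rest is ordinary differentiable arithmetic.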

The KL Term in Detail

For the standard VAE with Gaussian encoder $q_\phi(z|x) = \mathcal{N}(\mu, \text{diag}(\sigma^2))$ and standard normal prior $p(z) = \mathcal{N}(0, I)$, the KL divergence has a closed form:

DKL(qp)=12j=1d(μj2+σj2logσj21)D_{\text{KL}}(q \| p) = \frac{1}{2}\sum_{j=1}^{d}\left(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\right)

This is computed analytically, with no sampling needed. Each latent dimension contributes independently, making it easy to monitor which dimensions are active (significantly different from the prior).
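A quick cross-check of the closed form (toy parameter values): estimate the same KL by Monte Carlo as $\mathbb{E}_q[\log q(z) - \log p(z)]$ and compare.

```python
import numpy as np

rng = np.random.default_rng(0)

# Closed-form Gaussian KL vs. a Monte Carlo estimate (toy numbers).
mu = np.array([0.5, -1.0, 0.0])
log_var = np.array([0.2, -0.5, 0.0])
var = np.exp(log_var)

# Analytic: 0.5 * sum(mu^2 + sigma^2 - log sigma^2 - 1)
kl_closed = 0.5 * np.sum(mu**2 + var - log_var - 1)

# Monte Carlo: E_q[log q(z) - log p(z)] with z ~ q
z = mu + np.sqrt(var) * rng.standard_normal((500_000, 3))
log_q = -0.5 * np.sum(np.log(2 * np.pi * var) + (z - mu) ** 2 / var, axis=1)
log_p = -0.5 * np.sum(np.log(2 * np.pi) + z**2, axis=1)
kl_mc = np.mean(log_q - log_p)

print(kl_closed, kl_mc)  # agree to a few decimal places
```

Note that the third dimension ($\mu = 0$, $\log\sigma^2 = 0$) contributes exactly zero: it matches the prior and carries no information, which is what per-dimension KL monitoring detects.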

Connection to EM

Definition

Amortized Variational EM

The classical EM algorithm for latent variable models alternates:

  • E-step: compute the posterior $p(z|x; \theta)$ for each data point
  • M-step: maximize the expected complete log-likelihood

The VAE can be viewed as amortized variational EM:

  • The encoder $q_\phi(z|x)$ replaces the E-step. It amortizes inference by learning a single network that works for all $x$, rather than running separate optimization for each data point
  • The decoder $p_\theta(x|z)$ corresponds to the M-step
  • Both are optimized jointly via gradient descent on the ELBO

Classical variational inference computes a separate $q(z)$ for each observation (expensive). Amortization is what makes VAEs scalable: one forward pass through the encoder gives the approximate posterior for any $x$.

Modern VAE Variants

The original Kingma-Welling VAE is the simplest member of a large family. Three extensions matter most in current practice.

VQ-VAE: Discrete Latents

The vector-quantized VAE (van den Oord, Vinyals, Kavukcuoglu 2017) replaces the continuous Gaussian latent with a discrete code book. The encoder maps $x$ to a continuous vector $z_e(x)$, which is then snapped to the nearest entry $e_k$ in a learned code book $\{e_1, \ldots, e_K\}$:

$$z_q(x) = e_{k^*}, \quad k^* = \arg\min_k \|z_e(x) - e_k\|_2$$

The decoder receives $z_q$. Because $\arg\min$ is non-differentiable, gradients are passed through via the straight-through estimator: $\nabla_{z_e} = \nabla_{z_q}$. The training loss combines reconstruction with a code-book commitment loss $\|z_e(x) - \text{sg}[e_{k^*}]\|^2$ and a code-book update loss $\|\text{sg}[z_e(x)] - e_{k^*}\|^2$ (where $\text{sg}$ is stop-gradient).
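The quantization step itself is a nearest-neighbor lookup. A minimal numpy sketch (code book size, dimensions, and tensors are all arbitrary toy values; stop-gradient placement only matters inside an autograd framework, so here both auxiliary losses reduce to the same scalar):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch of VQ-VAE nearest-neighbor quantization on random toy tensors.
K, D = 8, 4                          # code book size, code dimension
codebook = rng.standard_normal((K, D))
z_e = rng.standard_normal((16, D))   # encoder outputs for a batch of 16

# Snap each encoder output to its nearest code book entry
dists = np.linalg.norm(z_e[:, None, :] - codebook[None, :, :], axis=-1)
k_star = np.argmin(dists, axis=1)
z_q = codebook[k_star]               # what the decoder receives

# The two auxiliary losses differ only in where sg[.] is applied;
# numerically they are the same squared distance.
commitment_loss = np.mean(np.sum((z_e - z_q) ** 2, axis=1))

# Straight-through estimator as written in autograd frameworks:
#   z_q_st = z_e + sg[z_q - z_e]
# Forward pass yields z_q, but gradients w.r.t. the encoder flow
# through z_e as if quantization never happened.
print(k_star[:5], commitment_loss)
```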

VQ-VAE eliminates posterior collapse (the discrete latent must be used) and produces sharp samples. It is the dominant tokenizer architecture for image, audio, and video generation: Stable Diffusion's image tokenizer, DALL-E's discrete VAE, EnCodec for audio, and most modern multimodal models all use VQ-VAE-style quantization. VQ-VAE-2 (Razavi, van den Oord, Vinyals 2019) extends this to a hierarchy of code books at multiple resolutions.

Hierarchical VAEs and the Bridge to Diffusion

Stacking latent variables in a hierarchy $z_1 \to z_2 \to \cdots \to z_L \to x$ gives more expressive priors and posteriors. NVAE (Vahdat & Kautz, NeurIPS 2020) showed that with careful normalization and residual connections, hierarchical VAEs can match GAN-quality image generation.

The deepest connection is to diffusion models. Kingma et al. 2021 (Variational Diffusion Models) showed that a diffusion model is exactly a hierarchical VAE with a fixed (non-learned) Gaussian encoder and an infinitely deep latent hierarchy. The ELBO of a diffusion model is the limit of the hierarchical VAE ELBO as the number of timesteps $T \to \infty$. This unifies the two dominant generative-modeling paradigms: VAEs and diffusion are the same family, differing only in encoder choice and depth.

Latent Diffusion Models (Rombach et al., CVPR 2022) make this explicit: train a VAE to compress images to a low-dimensional latent space, then train a diffusion model in that space. Stable Diffusion is precisely this construction.

Conditional VAE

The Conditional VAE (Sohn, Lee, Yan, NeurIPS 2015) introduces a condition $c$ (label, attribute, or context) into both encoder and decoder: $q_\phi(z | x, c)$ and $p_\theta(x | z, c)$. The ELBO becomes

L=Eqϕ(zx,c)[logpθ(xz,c)]DKL(qϕ(zx,c)p(zc))\mathcal{L} = \mathbb{E}_{q_\phi(z|x,c)}[\log p_\theta(x|z, c)] - D_{\text{KL}}(q_\phi(z|x,c) \| p(z|c))

This is the standard formulation for class-conditional or attribute-conditional generation, and the foundation for any controllable VAE-style generative model.

Posterior Collapse and Its Mitigations

Posterior collapse (the encoder learning $q(z|x) = p(z)$, KL $\to 0$) is the central failure mode of VAEs with powerful decoders. Standard mitigations:

  • KL annealing (Bowman et al., CoNLL 2016): start with KL weight $\beta = 0$ and ramp to $\beta = 1$ over training. Lets the encoder establish informative representations before the regularizer activates.
  • Free bits (Kingma et al., NeurIPS 2016): clamp the per-dimension KL at a minimum value $\lambda$, so dimensions with KL below $\lambda$ are not penalized. Forces a minimum amount of information through the latent.
  • Constrain the decoder: weaken the autoregressive decoder so it cannot model $p(x)$ alone. Forces the latent to carry information.
  • β-VAE with $\beta < 1$: trade tightness of the bound for a more informative latent. Loses the strict ELBO interpretation.
  • IWAE bound (Burda, Grosse, Salakhutdinov, ICLR 2016): use $K$ importance samples $z^{(1)}, \ldots, z^{(K)} \sim q$ to construct a tighter bound $\mathcal{L}_K = \mathbb{E}\left[\log \frac{1}{K}\sum_k \frac{p(x, z^{(k)})}{q(z^{(k)}|x)}\right]$. $\mathcal{L}_K \uparrow \log p(x)$ as $K \to \infty$.
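The first two mitigations are nearly one-liners in practice. A hedged sketch (the warmup length, $\lambda$, and per-dimension KL values are arbitrary illustrative numbers):

```python
import numpy as np

# Sketch of two posterior-collapse mitigations (hyperparameters arbitrary).

def kl_weight(step, warmup_steps=10_000):
    # Linear KL annealing: beta ramps 0 -> 1 over the warmup period,
    # then stays at 1 so the objective ends as the true ELBO.
    return min(1.0, step / warmup_steps)

def free_bits_kl(kl_per_dim, lam=0.25):
    # Free bits: dimensions whose KL is already below lam contribute lam
    # instead, so the optimizer gains nothing by squeezing them to zero.
    return np.sum(np.maximum(kl_per_dim, lam))

kl_per_dim = np.array([0.01, 0.40, 0.02, 1.30])  # per-dimension KL values
print(kl_weight(0), kl_weight(5_000), kl_weight(20_000))  # 0.0 0.5 1.0
print(free_bits_kl(kl_per_dim))  # 0.25 + 0.40 + 0.25 + 1.30 = 2.20
```

The total loss during training would then be `recon_nll + kl_weight(step) * free_bits_kl(kl_per_dim)`; note that with either modification active, the objective is no longer a valid lower bound on $\log p(x)$.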

Common Confusions

Watch Out

The KL term is not just a penalty. It has a precise information-theoretic meaning

A common misunderstanding is that the KL term is a "regularizer" added for convenience, like weight decay. It is not. The KL term arises necessarily from the ELBO derivation. It measures how much information the encoder extracts about the specific input $x$ beyond what the prior already provides. Re-weighting the KL term (as in β-VAE; Higgins et al., ICLR 2017) changes the probabilistic semantics: with $\beta \geq 1$ the objective remains a valid (just looser) lower bound on $\log p(x)$, since increasing the weight of a non-negative KL only decreases the bound. With $\beta < 1$ the objective is no longer guaranteed to lower-bound $\log p(x)$ and you lose the ELBO interpretation entirely.

Watch Out

VAEs do not optimize reconstruction plus a penalty

The ELBO looks like "reconstruction - KL", which tempts people to treat it as a penalized autoencoder. But the correct interpretation is: the ELBO is a lower bound on the log-evidence, derived from first principles. The reconstruction and KL terms are not independent objectives. They are two parts of a single variational inference procedure. Changing their relative weight changes the probabilistic semantics.

Watch Out

Posterior collapse is not a bug in the ELBO

When a powerful autoregressive decoder can model $p(x)$ without using $z$, the optimal solution sets $q(z|x) = p(z)$ (zero KL) and ignores the latent variables. This is actually the correct ELBO optimum. The model has discovered that latent variables are unnecessary. Whether this is desirable depends on whether you want meaningful latent representations (often yes) or just good $p(x)$ (then it is fine).

Canonical Examples

Example

VAE on MNIST

Encoder: two-layer MLP mapping $784 \to 512 \to 2d$ (outputting $\mu$ and $\log\sigma^2$, each $d$-dimensional). Decoder: MLP mapping $d \to 512 \to 784$ with sigmoid output (Bernoulli likelihood). With $d = 2$, the latent space can be directly visualized: different digit classes cluster in different regions, and interpolating between two latent codes produces smooth morphing between digits.
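The shapes above can be wired up end to end without a framework. The following is a hedged numpy sketch of a single untrained forward pass (random weights, a random fake "image", tanh activations chosen for simplicity), showing how the two ELBO terms are assembled into the training loss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Untrained forward pass matching the MNIST example's shapes.
d = 2                                    # latent dimension
W_enc = rng.standard_normal((784, 512)) * 0.01
W_mu = rng.standard_normal((512, d)) * 0.01
W_logvar = rng.standard_normal((512, d)) * 0.01
W_dec1 = rng.standard_normal((d, 512)) * 0.01
W_dec2 = rng.standard_normal((512, 784)) * 0.01

x = rng.uniform(size=784).round()        # a fake binarized image

# Encoder: x -> (mu, log sigma^2)
h = np.tanh(x @ W_enc)
mu, log_var = h @ W_mu, h @ W_logvar

# Reparameterize: z = mu + sigma * eps
eps = rng.standard_normal(d)
z = mu + np.exp(0.5 * log_var) * eps

# Decoder: z -> Bernoulli means
x_hat = 1 / (1 + np.exp(-(np.tanh(z @ W_dec1) @ W_dec2)))

# Negative ELBO = Bernoulli reconstruction NLL + closed-form KL
recon_nll = -np.sum(x * np.log(x_hat + 1e-9) + (1 - x) * np.log(1 - x_hat + 1e-9))
kl = 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1)
loss = recon_nll + kl
print(loss)  # ~784*log(2): an untrained sigmoid decoder outputs ~0.5 everywhere
```

Training would minimize this loss over a dataset with gradient descent; the only piece numpy cannot provide is backpropagation itself.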

Summary

  • The ELBO: $\log p(x) \geq \mathbb{E}_q[\log p(x|z)] - D_{\text{KL}}(q(z|x) \,\|\, p(z))$
  • Gap between $\log p(x)$ and ELBO is $D_{\text{KL}}(q(z|x) \,\|\, p(z|x))$
  • Reparameterization trick: $z = \mu + \sigma \odot \varepsilon$ enables backprop through sampling
  • KL term has closed form for Gaussian $q$ and Gaussian prior
  • VAE = amortized variational EM: encoder amortizes the E-step
  • The KL term is not a regularizer; it is part of the variational bound

Exercises

ExerciseCore

Problem

Derive the closed-form KL divergence between $q = \mathcal{N}(\mu, \sigma^2)$ (univariate) and $p = \mathcal{N}(0, 1)$.

ExerciseAdvanced

Problem

Show that $\log p(x) = \mathcal{L}(\theta, \phi; x) + D_{\text{KL}}(q_\phi(z|x) \,\|\, p_\theta(z|x))$. Use this to explain why maximizing the ELBO tightens the bound.

ExerciseResearch

Problem

In the β-VAE, the objective is $\mathbb{E}_q[\log p(x|z)] - \beta \cdot D_{\text{KL}}(q(z|x) \,\|\, p(z))$ with $\beta > 1$. This is still a valid lower bound on $\log p(x)$ (just strictly looser than the ELBO, since increasing the weight of a non-negative KL term can only decrease the bound). What is the β-VAE actually optimizing from an information-theoretic perspective, and why does $\beta > 1$ nevertheless change the learned representation?


References

Canonical:

  • Kingma & Welling, "Auto-Encoding Variational Bayes" (ICLR 2014; arXiv:1312.6114). The original VAE paper.
  • Rezende, Mohamed, Wierstra, "Stochastic Backpropagation and Approximate Inference in Deep Generative Models" (ICML 2014; arXiv:1401.4082). Independent contemporaneous derivation.
  • Doersch, "Tutorial on Variational Autoencoders" (2016; arXiv:1606.05908). The standard introductory tutorial.
  • Kingma & Welling, "An Introduction to Variational Autoencoders" (Foundations and Trends in ML, 2019; arXiv:1906.02691). Comprehensive tutorial monograph.

Discrete latents and tokenizers:

  • van den Oord, Vinyals, Kavukcuoglu, "Neural Discrete Representation Learning" (NeurIPS 2017; arXiv:1711.00937). The VQ-VAE paper. Most-deployed VAE variant.
  • Razavi, van den Oord, Vinyals, "Generating Diverse High-Fidelity Images with VQ-VAE-2" (NeurIPS 2019; arXiv:1906.00446). Hierarchical VQ-VAE.
  • Esser, Rombach, Ommer, "Taming Transformers for High-Resolution Image Synthesis" (CVPR 2021; arXiv:2012.09841). VQGAN: VQ-VAE with adversarial and perceptual losses.

Disentanglement and rate-distortion:

  • Higgins et al., "β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework" (ICLR 2017). The β-VAE paper.
  • Burda, Grosse, Salakhutdinov, "Importance Weighted Autoencoders" (ICLR 2016; arXiv:1509.00519). The IWAE bound.
  • Locatello et al., "Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations" (ICML 2019 best paper; arXiv:1811.12359). Impossibility result for unsupervised disentanglement.
  • Alemi et al., "Fixing a Broken ELBO" (ICML 2018; arXiv:1711.00464). Rate-distortion analysis of the ELBO.
  • Hoffman & Johnson, "ELBO Surgery: Yet Another Way to Carve Up the Variational Evidence Lower Bound" (NeurIPS 2016 Workshop). Decomposes the ELBO into mutual-information and marginal-KL terms.
  • Tolstikhin, Bousquet, Gelly, Schölkopf, "Wasserstein Auto-Encoders" (ICLR 2018; arXiv:1711.01558). Replaces KL with optimal-transport divergence.

Conditional and hierarchical:

  • Sohn, Lee, Yan, "Learning Structured Output Representation using Deep Conditional Generative Models" (NeurIPS 2015). The Conditional VAE.
  • Vahdat & Kautz, "NVAE: A Deep Hierarchical Variational Autoencoder" (NeurIPS 2020; arXiv:2007.03898). State-of-the-art hierarchical VAE.
  • Sønderby, Raiko, Maaløe, Sønderby, Winther, "Ladder Variational Autoencoders" (NeurIPS 2016; arXiv:1602.02282). Earlier hierarchical VAE with bottom-up + top-down inference.

Posterior collapse:

  • Bowman et al., "Generating Sentences from a Continuous Space" (CoNLL 2016; arXiv:1511.06349). KL annealing.
  • Kingma, Salimans, Jozefowicz, Chen, Sutskever, Welling, "Improving Variational Inference with Inverse Autoregressive Flow" (NeurIPS 2016; arXiv:1606.04934). Free bits.
  • He, Spokoyny, Neubig, Berg-Kirkpatrick, "Lagging Inference Networks and Posterior Collapse in Variational Autoencoders" (ICLR 2019; arXiv:1901.05534). Diagnosis and fix.

Bridge to diffusion:

  • Kingma, Salimans, Poole, Ho, "Variational Diffusion Models" (NeurIPS 2021; arXiv:2107.00630). Diffusion as the infinite-depth limit of hierarchical VAEs.
  • Rombach, Blattmann, Lorenz, Esser, Ommer, "High-Resolution Image Synthesis with Latent Diffusion Models" (CVPR 2022; arXiv:2112.10752). Stable Diffusion: VAE + diffusion.

Next Topics

The natural next steps from VAEs:

  • Diffusion models: a different approach to tractable generative modeling
  • Normalizing flows: exact likelihood via invertible transformations
  • Variational inference: the general framework behind the ELBO

Last reviewed: April 26, 2026
