
Variational Autoencoders

Deriving the ELBO, the reparameterization trick for backpropagation through sampling, and how VAEs turn autoencoders into principled generative models via amortized variational inference.

Advanced · Tier 1 · Stable · Supporting · ~65 min

Why This Matters

[Figure: VAE architecture. An input $x$ (e.g., an image) passes through the encoder $q(z|x)$, a neural network outputting $\mu$ and $\sigma$. With noise $\varepsilon \sim \mathcal{N}(0, I)$, the reparameterization $z = \mu + \sigma \cdot \varepsilon$ produces the latent code $z$, which the decoder $p(x|z)$ maps to a reconstruction. The loss is $\|x - \hat{x}\|^2 + D_{\text{KL}}(q(z|x) \,\|\, p(z))$; the KL term pushes $q$ toward $\mathcal{N}(0, I)$. The reparameterization trick makes sampling differentiable: gradients flow through $\mu$ and $\sigma$, not through the sample.]

The VAE is where deep learning meets probabilistic modeling. It solves a fundamental problem: how to learn a generative model $p(x)$ when the data involves latent variables $z$ that make the marginal likelihood $p(x) = \int p(x|z)\,p(z)\,dz$ intractable. The solution, the evidence lower bound (ELBO) and amortized inference, relies on KL divergence to measure the gap between the approximate and true posterior. This is one of the most load-bearing constructions in modern ML and underpins much of generative AI.

Mental Model

You want to learn a generative model: sample $z$ from a simple prior (Gaussian), then decode $z$ into data $x$. The problem is inference: given an observed $x$, what $z$ likely generated it? The true posterior $p(z|x)$ is intractable. The VAE learns an approximate posterior $q_\phi(z|x)$ (the encoder) jointly with the generative model $p_\theta(x|z)$ (the decoder) by maximizing a lower bound on the log-likelihood.

The Generative Model

Definition

VAE Generative Model

The VAE defines a latent variable model:

  1. Prior: $z \sim p(z) = \mathcal{N}(0, I)$
  2. Likelihood (decoder): $x \sim p_\theta(x|z)$, parameterized by a neural network that maps $z$ to the parameters of a distribution over $x$

The marginal likelihood (evidence) is:

$$p_\theta(x) = \int p_\theta(x|z)\, p(z) \, dz$$

This integral is intractable for nonlinear decoders because it requires integrating over all possible latent codes.
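To make the intractability concrete, here is a hedged numerical sketch (all numbers are hypothetical) using a one-dimensional linear-Gaussian toy model, chosen precisely because its marginal *is* available in closed form, so a naive Monte Carlo estimate of the integral can be checked against the truth. In one dimension the estimate works; with a high-dimensional $z$ and a nonlinear decoder, almost all prior samples would miss the posterior mass and the same estimator becomes useless.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model where the marginal IS tractable, so we can check the estimate:
#   prior    z ~ N(0, 1)
#   decoder  x | z ~ N(z, sigma_x^2)    (a "linear decoder")
# => exact marginal: x ~ N(0, 1 + sigma_x^2)
sigma_x = 0.5
x = 1.3  # an observed data point

def log_lik(x, z, sigma):
    # log N(x; z, sigma^2)
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - z) ** 2 / (2 * sigma**2)

# Naive Monte Carlo estimate of p(x) = E_{z ~ p(z)}[p(x|z)]
z = rng.standard_normal(200_000)
log_p_mc = np.log(np.mean(np.exp(log_lik(x, z, sigma_x))))

# Exact marginal for comparison
var = 1 + sigma_x**2
log_p_exact = -0.5 * np.log(2 * np.pi * var) - x**2 / (2 * var)

print(log_p_mc, log_p_exact)  # close, for this 1-D toy
```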

Deriving the ELBO

The key insight: since we cannot compute $\log p_\theta(x)$ directly, we derive a tractable lower bound.

Theorem

Evidence Lower Bound (ELBO)

Statement

For any distribution $q_\phi(z|x)$, the log marginal likelihood satisfies:

logpθ(x)Eqϕ(zx)[logpθ(xz)]reconstructionDKL(qϕ(zx)p(z))regularization=L(θ,ϕ;x)\log p_\theta(x) \geq \underbrace{\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]}_{\text{reconstruction}} - \underbrace{D_{\text{KL}}(q_\phi(z|x) \| p(z))}_{\text{regularization}} = \mathcal{L}(\theta, \phi; x)

This lower bound is called the ELBO (Evidence Lower Bound). Equality holds when $q_\phi(z|x) = p_\theta(z|x)$, the true posterior.

Intuition

The ELBO has two terms pulling in opposite directions:

  • Reconstruction term: encourages the decoder to reconstruct $x$ from codes sampled via the encoder. Wants the encoder to be informative.
  • KL term: encourages the encoder distribution $q_\phi(z|x)$ to stay close to the prior $p(z) = \mathcal{N}(0, I)$. Wants the latent space to be structured and smooth.

The tension between these terms is the VAE tradeoff: be informative enough to reconstruct, but regular enough that the latent space has meaningful structure for generation.

Proof Sketch

Start with the log-evidence and introduce $q$:

$$\log p_\theta(x) = \log \int p_\theta(x, z) \, dz = \log \int \frac{p_\theta(x, z)}{q_\phi(z|x)}\, q_\phi(z|x) \, dz$$

Apply Jensen's inequality ($\log$ is concave):

$$\geq \int q_\phi(z|x) \log \frac{p_\theta(x, z)}{q_\phi(z|x)} \, dz = \mathbb{E}_q[\log p_\theta(x, z)] - \mathbb{E}_q[\log q_\phi(z|x)]$$

Expand $p_\theta(x, z) = p_\theta(x|z)\,p(z)$:

=Eq[logpθ(xz)]+Eq[logp(z)]Eq[logqϕ(zx)]= \mathbb{E}_q[\log p_\theta(x|z)] + \mathbb{E}_q[\log p(z)] - \mathbb{E}_q[\log q_\phi(z|x)]

=Eq[logpθ(xz)]DKL(qϕ(zx)p(z))= \mathbb{E}_q[\log p_\theta(x|z)] - D_{\text{KL}}(q_\phi(z|x) \| p(z))

Why It Matters

The ELBO transforms an intractable maximum likelihood problem into a tractable optimization. The gap between $\log p(x)$ and the ELBO is exactly $D_{\text{KL}}(q_\phi(z|x) \,\|\, p_\theta(z|x))$, the approximation quality of the encoder. Maximizing the ELBO simultaneously fits the generative model and improves the approximate posterior.

Failure Mode

The ELBO can be loose if $q_\phi$ is too simple to approximate the true posterior (e.g., a diagonal Gaussian when the true posterior is multimodal). A related failure mode is posterior collapse: the model ignores the latent variables ($q \approx p(z)$, KL $\approx 0$) and relies entirely on a powerful decoder.

An equivalent derivation shows the gap directly:

$$\log p_\theta(x) = \mathcal{L}(\theta, \phi; x) + D_{\text{KL}}(q_\phi(z|x) \,\|\, p_\theta(z|x))$$

Since KL divergence is non-negative, $\mathcal{L} \leq \log p_\theta(x)$.
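The decomposition above can be verified numerically. In a conjugate linear-Gaussian toy model (hypothetical numbers, not a real VAE), the ELBO, the exact log-evidence, and the KL gap to the true posterior are all available in closed form, so the identity $\log p(x) = \mathcal{L} + D_{\text{KL}}(q \,\|\, p(z|x))$ can be checked to machine precision for any choice of $q$:

```python
import numpy as np

# Conjugate toy model where every quantity is closed-form:
#   prior    z ~ N(0, 1)
#   decoder  x | z ~ N(z, s2_x)
# Exact posterior: z | x ~ N(x / (1 + s2_x), s2_x / (1 + s2_x))
s2_x = 0.25
x = 1.3

# Pick an arbitrary (suboptimal) approximate posterior q = N(m, s2)
m, s2 = 0.5, 0.3

def kl_gauss(m1, v1, m2, v2):
    # KL( N(m1, v1) || N(m2, v2) )
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1)

# ELBO = E_q[log p(x|z)] - KL(q || prior); the expectation is analytic:
# E_q[(x - z)^2] = (x - m)^2 + s2
recon = -0.5 * np.log(2 * np.pi * s2_x) - ((x - m) ** 2 + s2) / (2 * s2_x)
elbo = recon - kl_gauss(m, s2, 0.0, 1.0)

# Exact log-evidence and exact posterior
log_px = -0.5 * np.log(2 * np.pi * (1 + s2_x)) - x**2 / (2 * (1 + s2_x))
m_post, v_post = x / (1 + s2_x), s2_x / (1 + s2_x)

gap = kl_gauss(m, s2, m_post, v_post)
print(log_px, elbo + gap)  # identical up to float error
```

Changing `m` and `s2` changes how the total splits between ELBO and gap, but their sum is always $\log p(x)$: improving $q$ can only tighten the bound, never move the evidence.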

The Reparameterization Trick

Definition

Reparameterization Trick

The reconstruction term $\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$ requires sampling $z \sim q_\phi(z|x)$, but we cannot backpropagate through a sampling operation.

The reparameterization trick expresses the sample as a deterministic function of the parameters and an independent noise variable:

$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, I)$$

where $\mu_\phi(x)$ and $\sigma_\phi(x)$ are the mean and standard deviation output by the encoder network.

Now the randomness is in $\varepsilon$ (which does not depend on $\phi$), and $z$ is a differentiable function of $\phi$. Standard backpropagation works.

Without reparameterization, you would need high-variance score function estimators (REINFORCE). The reparameterization trick gives low-variance gradient estimates, making VAE training practical.
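The pathwise estimator can be demonstrated without any autograd framework. A hedged sketch (test function and parameter values are arbitrary): for $f(z) = z^2$ with $z \sim \mathcal{N}(\mu, \sigma^2)$, the exact gradients of $\mathbb{E}[f(z)] = \mu^2 + \sigma^2$ are $2\mu$ and $2\sigma$, and applying the chain rule through $z = \mu + \sigma\varepsilon$ recovers them from samples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pathwise (reparameterized) gradient estimator for f(z) = z^2.
# E_{z ~ N(mu, sigma^2)}[f(z)] = mu^2 + sigma^2, so the exact
# gradients are d/dmu = 2*mu and d/dsigma = 2*sigma.
mu, sigma = 1.5, 0.8
eps = rng.standard_normal(100_000)
z = mu + sigma * eps            # differentiable in mu and sigma

# Chain rule through z: f'(z) * dz/dmu and f'(z) * dz/dsigma
grad_mu = np.mean(2 * z * 1.0)     # dz/dmu = 1
grad_sigma = np.mean(2 * z * eps)  # dz/dsigma = eps

print(grad_mu, 2 * mu)        # estimate vs exact 3.0
print(grad_sigma, 2 * sigma)  # estimate vs exact 1.6
```

In frameworks like PyTorch this is what `rsample()` does under the hood: the noise is drawn once, and the rest is ordinary differentiable arithmetic.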

The KL Term in Detail

For the standard VAE with Gaussian encoder $q_\phi(z|x) = \mathcal{N}(\mu, \text{diag}(\sigma^2))$ and standard normal prior $p(z) = \mathcal{N}(0, I)$, the KL divergence has a closed form:

DKL(qp)=12j=1d(μj2+σj2logσj21)D_{\text{KL}}(q \| p) = \frac{1}{2}\sum_{j=1}^{d}\left(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\right)

This is computed analytically, with no sampling needed. Each latent dimension contributes independently, making it easy to monitor which dimensions are active (significantly different from the prior).
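A quick cross-check of the closed form (toy parameter values): estimate the same KL by Monte Carlo as $\mathbb{E}_q[\log q(z) - \log p(z)]$ and compare.

```python
import numpy as np

rng = np.random.default_rng(0)

# Closed-form Gaussian KL vs. a Monte Carlo estimate (toy numbers).
mu = np.array([0.5, -1.0, 0.0])
log_var = np.array([0.2, -0.5, 0.0])
var = np.exp(log_var)

# Analytic: 0.5 * sum(mu^2 + sigma^2 - log sigma^2 - 1)
kl_closed = 0.5 * np.sum(mu**2 + var - log_var - 1)

# Monte Carlo: E_q[log q(z) - log p(z)] with z ~ q
z = mu + np.sqrt(var) * rng.standard_normal((500_000, 3))
log_q = -0.5 * np.sum(np.log(2 * np.pi * var) + (z - mu) ** 2 / var, axis=1)
log_p = -0.5 * np.sum(np.log(2 * np.pi) + z**2, axis=1)
kl_mc = np.mean(log_q - log_p)

print(kl_closed, kl_mc)  # agree to a few decimal places
```

Note that the third dimension ($\mu = 0$, $\log\sigma^2 = 0$) contributes exactly zero: it matches the prior and carries no information, which is what per-dimension KL monitoring detects.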

Connection to EM

Definition

Amortized Variational EM

The classical EM algorithm for latent variable models alternates:

  • E-step: compute the posterior $p(z|x; \theta)$ for each data point
  • M-step: maximize the expected complete log-likelihood

The VAE can be viewed as amortized variational EM:

  • The encoder $q_\phi(z|x)$ replaces the E-step. It amortizes inference by learning a single network that works for all $x$, rather than running separate optimization for each data point
  • The decoder $p_\theta(x|z)$ corresponds to the M-step
  • Both are optimized jointly via gradient descent on the ELBO

Classical variational inference computes a separate $q(z)$ for each observation (expensive). Amortization is what makes VAEs scalable: one forward pass through the encoder gives the approximate posterior for any $x$.

Modern VAE Variants

The original Kingma-Welling VAE is the simplest member of a large family. Three extensions matter most in current practice.

VQ-VAE: Discrete Latents

The vector-quantized VAE (van den Oord, Vinyals, Kavukcuoglu 2017) replaces the continuous Gaussian latent with a discrete code book. The encoder maps $x$ to a continuous vector $z_e(x)$, which is then snapped to the nearest entry $e_k$ in a learned code book $\{e_1, \ldots, e_K\}$:

$$z_q(x) = e_{k^*}, \quad k^* = \arg\min_k \|z_e(x) - e_k\|_2$$

The decoder receives $z_q$. Because $\arg\min$ is non-differentiable, gradients are passed through via the straight-through estimator: $\nabla_{z_e} = \nabla_{z_q}$. The training loss combines reconstruction with a code-book commitment loss $\|z_e(x) - \text{sg}[e_{k^*}]\|^2$ and a code-book update loss $\|\text{sg}[z_e(x)] - e_{k^*}\|^2$ (where $\text{sg}$ is stop-gradient).
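The quantization step itself is a nearest-neighbor lookup. A minimal numpy sketch (code book size, dimensions, and tensors are all arbitrary toy values; stop-gradient placement only matters inside an autograd framework, so here both auxiliary losses reduce to the same scalar):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch of VQ-VAE nearest-neighbor quantization on random toy tensors.
K, D = 8, 4                          # code book size, code dimension
codebook = rng.standard_normal((K, D))
z_e = rng.standard_normal((16, D))   # encoder outputs for a batch of 16

# Snap each encoder output to its nearest code book entry
dists = np.linalg.norm(z_e[:, None, :] - codebook[None, :, :], axis=-1)
k_star = np.argmin(dists, axis=1)
z_q = codebook[k_star]               # what the decoder receives

# The two auxiliary losses differ only in where sg[.] is applied;
# numerically they are the same squared distance.
commitment_loss = np.mean(np.sum((z_e - z_q) ** 2, axis=1))

# Straight-through estimator as written in autograd frameworks:
#   z_q_st = z_e + sg[z_q - z_e]
# Forward pass yields z_q, but gradients w.r.t. the encoder flow
# through z_e as if quantization never happened.
print(k_star[:5], commitment_loss)
```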

VQ-VAE eliminates posterior collapse (the discrete latent must be used) and produces sharp samples. It is the dominant tokenizer architecture for image, audio, and video generation: Stable Diffusion's image tokenizer, DALL-E's discrete VAE, EnCodec for audio, and most modern multimodal models all use VQ-VAE-style quantization. VQ-VAE-2 (Razavi, van den Oord, Vinyals 2019) extends this to a hierarchy of code books at multiple resolutions.

Hierarchical VAEs and the Bridge to Diffusion

Stacking latent variables in a hierarchy $z_1 \to z_2 \to \cdots \to z_L \to x$ gives more expressive priors and posteriors. NVAE (Vahdat & Kautz, NeurIPS 2020) showed that with careful normalization and residual connections, hierarchical VAEs can match GAN-quality image generation.

The deepest connection is to diffusion models. Kingma et al. 2021 (Variational Diffusion Models) showed that a diffusion model is exactly a hierarchical VAE with a fixed (non-learned) Gaussian encoder and an infinitely deep latent hierarchy. The ELBO of a diffusion model is the limit of the hierarchical VAE ELBO as the number of timesteps $T \to \infty$. This unifies the two dominant generative-modeling paradigms: VAEs and diffusion are the same family, differing only in encoder choice and depth.

Latent Diffusion Models (Rombach et al., CVPR 2022) make this explicit: train a VAE to compress images to a low-dimensional latent space, then train a diffusion model in that space. Stable Diffusion is precisely this construction.

Conditional VAE

The Conditional VAE (Sohn, Lee, Yan, NeurIPS 2015) introduces a condition $c$ (label, attribute, or context) into both encoder and decoder: $q_\phi(z | x, c)$ and $p_\theta(x | z, c)$. The ELBO becomes

L=Eqϕ(zx,c)[logpθ(xz,c)]DKL(qϕ(zx,c)p(zc))\mathcal{L} = \mathbb{E}_{q_\phi(z|x,c)}[\log p_\theta(x|z, c)] - D_{\text{KL}}(q_\phi(z|x,c) \| p(z|c))

This is the standard formulation for class-conditional or attribute-conditional generation, and the foundation for any controllable VAE-style generative model.

Posterior Collapse and Its Mitigations

Posterior collapse (the encoder learning $q(z|x) = p(z)$, KL $\to 0$) is the central failure mode of VAEs with powerful decoders. Standard mitigations:

  • KL annealing (Bowman et al., CoNLL 2016): start with KL weight $\beta = 0$ and ramp to $\beta = 1$ over training. Lets the encoder establish informative representations before the regularizer activates.
  • Free bits (Kingma et al., NeurIPS 2016): clamp the per-dimension KL at a minimum value $\lambda$, so dimensions with KL below $\lambda$ are not penalized. Forces a minimum amount of information through the latent.
  • Constrain the decoder: weaken the autoregressive decoder so it cannot model $p(x)$ alone. Forces the latent to carry information.
  • β-VAE with $\beta < 1$: trade tightness of the bound for a more informative latent. Loses the strict ELBO interpretation.
  • IWAE bound (Burda, Grosse, Salakhutdinov, ICLR 2016): use $K$ importance samples $z^{(1)}, \ldots, z^{(K)} \sim q$ to construct a tighter bound $\mathcal{L}_K = \mathbb{E}\left[\log \frac{1}{K}\sum_k \frac{p(x, z^{(k)})}{q(z^{(k)}|x)}\right]$. $\mathcal{L}_K \uparrow \log p(x)$ as $K \to \infty$.
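The first two mitigations are nearly one-liners in practice. A hedged sketch (the warmup length, $\lambda$, and per-dimension KL values are arbitrary illustrative numbers):

```python
import numpy as np

# Sketch of two posterior-collapse mitigations (hyperparameters arbitrary).

def kl_weight(step, warmup_steps=10_000):
    # Linear KL annealing: beta ramps 0 -> 1 over the warmup period,
    # then stays at 1 so the objective ends as the true ELBO.
    return min(1.0, step / warmup_steps)

def free_bits_kl(kl_per_dim, lam=0.25):
    # Free bits: dimensions whose KL is already below lam contribute lam
    # instead, so the optimizer gains nothing by squeezing them to zero.
    return np.sum(np.maximum(kl_per_dim, lam))

kl_per_dim = np.array([0.01, 0.40, 0.02, 1.30])  # per-dimension KL values
print(kl_weight(0), kl_weight(5_000), kl_weight(20_000))  # 0.0 0.5 1.0
print(free_bits_kl(kl_per_dim))  # 0.25 + 0.40 + 0.25 + 1.30 = 2.20
```

The total loss during training would then be `recon_nll + kl_weight(step) * free_bits_kl(kl_per_dim)`; note that with either modification active, the objective is no longer a valid lower bound on $\log p(x)$.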

Common Confusions

Watch Out

The KL term is not just a penalty. It has a precise information-theoretic meaning

A common misunderstanding is that the KL term is a "regularizer" added for convenience, like weight decay. It is not. The KL term arises necessarily from the ELBO derivation. It measures how much information the encoder extracts about the specific input $x$ beyond what the prior already provides. Re-weighting the KL term (as in β-VAE; Higgins et al., ICLR 2017) changes the probabilistic semantics: with $\beta \geq 1$ the objective remains a valid (just looser) lower bound on $\log p(x)$, since increasing the weight of a non-negative KL only decreases the bound. With $\beta < 1$ the objective is no longer guaranteed to lower-bound $\log p(x)$ and you lose the ELBO interpretation entirely.

Watch Out

VAEs do not optimize reconstruction plus a penalty

The ELBO looks like "reconstruction - KL", which tempts people to treat it as a penalized autoencoder. But the correct interpretation is: the ELBO is a lower bound on the log-evidence, derived from first principles. The reconstruction and KL terms are not independent objectives. They are two parts of a single variational inference procedure. Changing their relative weight changes the probabilistic semantics.

Watch Out

Posterior collapse is not a bug in the ELBO

When a powerful autoregressive decoder can model $p(x)$ without using $z$, the optimal solution sets $q(z|x) = p(z)$ (zero KL) and ignores the latent variables. This is actually the correct ELBO optimum. The model has discovered that latent variables are unnecessary. Whether this is desirable depends on whether you want meaningful latent representations (often yes) or just good $p(x)$ (then it is fine).

Canonical Examples

Example

VAE on MNIST

Encoder: two-layer MLP mapping $784 \to 512 \to 2d$ (outputting $\mu$ and $\log\sigma^2$, each $d$-dimensional). Decoder: MLP mapping $d \to 512 \to 784$ with sigmoid output (Bernoulli likelihood). With $d = 2$, the latent space can be directly visualized: different digit classes cluster in different regions, and interpolating between two latent codes produces smooth morphing between digits.
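The shapes above can be wired up end to end without a framework. The following is a hedged numpy sketch of a single untrained forward pass (random weights, a random fake "image", tanh activations chosen for simplicity), showing how the two ELBO terms are assembled into the training loss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Untrained forward pass matching the MNIST example's shapes.
d = 2                                    # latent dimension
W_enc = rng.standard_normal((784, 512)) * 0.01
W_mu = rng.standard_normal((512, d)) * 0.01
W_logvar = rng.standard_normal((512, d)) * 0.01
W_dec1 = rng.standard_normal((d, 512)) * 0.01
W_dec2 = rng.standard_normal((512, 784)) * 0.01

x = rng.uniform(size=784).round()        # a fake binarized image

# Encoder: x -> (mu, log sigma^2)
h = np.tanh(x @ W_enc)
mu, log_var = h @ W_mu, h @ W_logvar

# Reparameterize: z = mu + sigma * eps
eps = rng.standard_normal(d)
z = mu + np.exp(0.5 * log_var) * eps

# Decoder: z -> Bernoulli means
x_hat = 1 / (1 + np.exp(-(np.tanh(z @ W_dec1) @ W_dec2)))

# Negative ELBO = Bernoulli reconstruction NLL + closed-form KL
recon_nll = -np.sum(x * np.log(x_hat + 1e-9) + (1 - x) * np.log(1 - x_hat + 1e-9))
kl = 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1)
loss = recon_nll + kl
print(loss)  # ~784*log(2): an untrained sigmoid decoder outputs ~0.5 everywhere
```

Training would minimize this loss over a dataset with gradient descent; the only piece numpy cannot provide is backpropagation itself.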

Summary

  • The ELBO: $\log p(x) \geq \mathbb{E}_q[\log p(x|z)] - D_{\text{KL}}(q(z|x) \,\|\, p(z))$
  • Gap between $\log p(x)$ and ELBO is $D_{\text{KL}}(q(z|x) \,\|\, p(z|x))$
  • Reparameterization trick: $z = \mu + \sigma \odot \varepsilon$ enables backprop through sampling
  • KL term has closed form for Gaussian $q$ and Gaussian prior
  • VAE = amortized variational EM: encoder amortizes the E-step
  • The KL term is not a regularizer; it is part of the variational bound

Exercises

ExerciseCore

Problem

Derive the closed-form KL divergence between $q = \mathcal{N}(\mu, \sigma^2)$ (univariate) and $p = \mathcal{N}(0, 1)$.

ExerciseAdvanced

Problem

Show that $\log p(x) = \mathcal{L}(\theta, \phi; x) + D_{\text{KL}}(q_\phi(z|x) \,\|\, p_\theta(z|x))$. Use this to explain why maximizing the ELBO tightens the bound.

ExerciseResearch

Problem

In the β-VAE, the objective is $\mathbb{E}_q[\log p(x|z)] - \beta \cdot D_{\text{KL}}(q(z|x) \,\|\, p(z))$ with $\beta > 1$. This is still a valid lower bound on $\log p(x)$ (just strictly looser than the ELBO, since increasing the weight of a non-negative KL term can only decrease the bound). What is the β-VAE actually optimizing from an information-theoretic perspective, and why does $\beta > 1$ nevertheless change the learned representation?


References

Canonical:

  • Kingma & Welling, "Auto-Encoding Variational Bayes" (ICLR 2014; arXiv:1312.6114). The original VAE paper.
  • Rezende, Mohamed, Wierstra, "Stochastic Backpropagation and Approximate Inference in Deep Generative Models" (ICML 2014; arXiv:1401.4082). Independent contemporaneous derivation.
  • Doersch, "Tutorial on Variational Autoencoders" (2016; arXiv:1606.05908). The standard introductory tutorial.
  • Kingma & Welling, "An Introduction to Variational Autoencoders" (Foundations and Trends in ML, 2019; arXiv:1906.02691). Comprehensive tutorial monograph.

Discrete latents and tokenizers:

  • van den Oord, Vinyals, Kavukcuoglu, "Neural Discrete Representation Learning" (NeurIPS 2017; arXiv:1711.00937). The VQ-VAE paper. Most-deployed VAE variant.
  • Razavi, van den Oord, Vinyals, "Generating Diverse High-Fidelity Images with VQ-VAE-2" (NeurIPS 2019; arXiv:1906.00446). Hierarchical VQ-VAE.
  • Esser, Rombach, Ommer, "Taming Transformers for High-Resolution Image Synthesis" (CVPR 2021; arXiv:2012.09841). VQGAN: VQ-VAE with adversarial and perceptual losses.

Disentanglement and rate-distortion:

  • Higgins et al., "β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework" (ICLR 2017). The β-VAE paper.
  • Burda, Grosse, Salakhutdinov, "Importance Weighted Autoencoders" (ICLR 2016; arXiv:1509.00519). The IWAE bound.
  • Locatello et al., "Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations" (ICML 2019 best paper; arXiv:1811.12359). Impossibility result for unsupervised disentanglement.
  • Alemi et al., "Fixing a Broken ELBO" (ICML 2018; arXiv:1711.00464). Rate-distortion analysis of the ELBO.
  • Hoffman & Johnson, "ELBO Surgery: Yet Another Way to Carve Up the Variational Evidence Lower Bound" (NeurIPS 2016 Workshop). Decomposes the ELBO into mutual-information and marginal-KL terms.
  • Tolstikhin, Bousquet, Gelly, Schölkopf, "Wasserstein Auto-Encoders" (ICLR 2018; arXiv:1711.01558). Replaces KL with optimal-transport divergence.

Conditional and hierarchical:

  • Sohn, Lee, Yan, "Learning Structured Output Representation using Deep Conditional Generative Models" (NeurIPS 2015). The Conditional VAE.
  • Vahdat & Kautz, "NVAE: A Deep Hierarchical Variational Autoencoder" (NeurIPS 2020; arXiv:2007.03898). State-of-the-art hierarchical VAE.
  • Sønderby, Raiko, Maaløe, Sønderby, Winther, "Ladder Variational Autoencoders" (NeurIPS 2016; arXiv:1602.02282). Earlier hierarchical VAE with bottom-up + top-down inference.

Posterior collapse:

  • Bowman et al., "Generating Sentences from a Continuous Space" (CoNLL 2016; arXiv:1511.06349). KL annealing.
  • Kingma, Salimans, Jozefowicz, Chen, Sutskever, Welling, "Improving Variational Inference with Inverse Autoregressive Flow" (NeurIPS 2016; arXiv:1606.04934). Free bits.
  • He, Spokoyny, Neubig, Berg-Kirkpatrick, "Lagging Inference Networks and Posterior Collapse in Variational Autoencoders" (ICLR 2019; arXiv:1901.05534). Diagnosis and fix.

Bridge to diffusion:

  • Kingma, Salimans, Poole, Ho, "Variational Diffusion Models" (NeurIPS 2021; arXiv:2107.00630). Diffusion as the infinite-depth limit of hierarchical VAEs.
  • Rombach, Blattmann, Lorenz, Esser, Ommer, "High-Resolution Image Synthesis with Latent Diffusion Models" (CVPR 2022; arXiv:2112.10752). Stable Diffusion: VAE + diffusion.

Next Topics

The natural next steps from VAEs:

  • Diffusion models: a different approach to tractable generative modeling
  • Normalizing flows: exact likelihood via invertible transformations
  • Variational inference: the general framework behind the ELBO

Last reviewed: April 26, 2026
