
ML Methods

Normalizing Flows

Generative models that transform a simple base distribution through invertible mappings, enabling exact log-likelihood computation via the change of variables formula.

Advanced · Tier 3 · Stable · Supporting · ~50 min

Why This Matters

Normalizing flows give exact, tractable log-likelihoods via the change-of-variables formula, with no variational lower bounds or adversarial losses. They are not unique in admitting exact likelihoods: autoregressive models (PixelCNN, autoregressive transformers) also factor \log p(x) = \sum_i \log p(x_i \mid x_{<i}) exactly, and VAEs with small discrete latent spaces can compute the log-likelihood exactly by marginalizing over the latents. What is special about flows is that they give exact likelihoods and exact one-shot sampling in the same model: invertibility makes both directions of the change-of-variables formula tractable. They have largely been displaced by diffusion models for image generation because the invertibility requirement constrains architecture.

Understanding flows is still valuable: they clarify what you gain and lose by requiring invertibility, and the change-of-variables formula underlies many other methods including continuous normalizing flows and flow matching.

The Core Idea

Start with a simple base distribution p_Z(z) = \mathcal{N}(z; 0, I). Apply a sequence of invertible, differentiable transformations f = f_K \circ f_{K-1} \circ \cdots \circ f_1 to get x = f(z). The density of x is determined by the change of variables formula.

Definition

Normalizing Flow

A normalizing flow is a sequence of invertible transformations f_1, f_2, \ldots, f_K mapping a base distribution p_Z to a target distribution p_X. "Normalizing" refers to the change of variables that ensures the transformed density integrates to 1. "Flow" refers to the successive transformations that warp the density.

The Change of Variables Formula

Theorem

Change of Variables for Normalizing Flows

Statement

If x = f(z) where f: \mathbb{R}^d \to \mathbb{R}^d is a diffeomorphism and z \sim p_Z, then:

\log p_X(x) = \log p_Z(f^{-1}(x)) - \log \left|\det \frac{\partial f}{\partial z}\bigg|_{z=f^{-1}(x)}\right|

For a composition f = f_K \circ \cdots \circ f_1:

\log p_X(x) = \log p_Z(z_0) - \sum_{k=1}^{K} \log \left|\det \frac{\partial f_k}{\partial z_{k-1}}\right|

where z_0 = f^{-1}(x) and z_k = f_k(z_{k-1}).

Intuition

The Jacobian determinant measures how much f locally stretches or compresses volume. If f expands a region by a factor of 10, the density in that region must decrease by a factor of 10 to keep the total probability at 1. The log-determinant accounts for this volume change.

Proof Sketch

Start from the requirement \int p_X(x)\, dx = 1. Substitute x = f(z), so dx = |\det J_f|\, dz. Then p_X(f(z)) |\det J_f| = p_Z(z), giving p_X(x) = p_Z(f^{-1}(x)) / |\det J_f|. Take logs.
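The formula can also be checked numerically for a one-dimensional flow. The sketch below (assuming NumPy and SciPy are available) pushes a standard Gaussian through x = \exp(z); the resulting density must match the standard log-normal, whose log-pdf SciPy provides in closed form:

```python
import numpy as np
from scipy.stats import norm, lognorm

# Flow: x = f(z) = exp(z), z ~ N(0, 1).  Then f^{-1}(x) = log(x),
# and df/dz = exp(z), so log|det J| evaluated at z = log(x) is log(x).
def flow_log_density(x):
    z = np.log(x)                       # inverse pass
    log_det = z                         # log|f'(z)| = z = log(x)
    return norm.logpdf(z) - log_det     # change-of-variables formula

x = np.array([0.5, 1.0, 2.0, 5.0])
ours = flow_log_density(x)
ref = lognorm(s=1.0).logpdf(x)          # known closed form (log-normal)
assert np.allclose(ours, ref)
```

The assertion passes because the pushforward of \mathcal{N}(0, 1) through \exp is exactly the standard log-normal distribution.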

Why It Matters

This is the entire basis of normalizing flows. Unlike VAEs (which optimize a lower bound) or GANs (which use adversarial training), flows optimize the exact log-likelihood directly. No approximation, no mode collapse, no posterior gap. The cost is that you must design f so that both f^{-1} and \det J_f are tractable to compute.

Failure Mode

Computing \det J_f for a general d \times d matrix costs O(d^3). For high-dimensional data (images with d > 10^4), this is prohibitive unless the Jacobian has special structure (triangular, block-diagonal, etc.). This architectural constraint is the central limitation of flows.

Architectural Solutions

Coupling Layers (RealNVP)

Proposition

Coupling Layer Jacobian is Triangular

Statement

For a coupling layer that splits z = (z_a, z_b) and computes:

x_a = z_a, \quad x_b = z_b \odot \exp(s(z_a)) + t(z_a)

where s and t are arbitrary neural networks, the Jacobian is lower triangular with determinant:

\det J = \prod_j \exp(s(z_a)_j) = \exp\left(\sum_j s(z_a)_j\right)

This costs O(d) to compute, not O(d^3).

Intuition

Since x_a = z_a (identity), the top-left block of the Jacobian is I, and since x_a does not depend on z_b, the top-right block is zero: the Jacobian is block lower triangular. The determinant of a triangular matrix is the product of its diagonal entries.

Proof Sketch

Write the full Jacobian in block form: J = \begin{pmatrix} I & 0 \\ \partial x_b / \partial z_a & \text{diag}(\exp(s(z_a))) \end{pmatrix}. The determinant of a block-triangular matrix is the product of the determinants of the diagonal blocks: \det(I) \cdot \det(\text{diag}(\exp(s))) = \exp\left(\sum_j s_j\right).

Why It Matters

This is the key architectural trick that makes flows practical. The networks s and t can be arbitrarily complex (deep ResNets, attention layers) without affecting the cost of the log-determinant computation. Expressiveness comes from stacking many coupling layers with alternating partitions.
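A minimal NumPy sketch of a single coupling layer illustrates both properties: the inverse is exact, and the O(d) log-determinant agrees with a brute-force Jacobian. The small random affine maps standing in for s and t are an assumption for illustration; in practice they are deep networks:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
half = d // 2
# Stand-in "networks" for s and t (hypothetical; only their outputs matter).
Ws = rng.normal(size=(half, half)) * 0.3
Wt = rng.normal(size=(half, half)) * 0.3
s = lambda za: np.tanh(za @ Ws)      # bounded scales for stability
t = lambda za: za @ Wt

def coupling_forward(z):
    za, zb = z[:half], z[half:]
    xb = zb * np.exp(s(za)) + t(za)
    return np.concatenate([za, xb]), np.sum(s(za))   # output, log|det J|

def coupling_inverse(x):
    xa, xb = x[:half], x[half:]
    zb = (xb - t(xa)) * np.exp(-s(xa))
    return np.concatenate([xa, zb])

z = rng.normal(size=d)
x, log_det = coupling_forward(z)
assert np.allclose(coupling_inverse(x), z)           # exact invertibility

# Cross-check the O(d) log-determinant against a finite-difference Jacobian.
eps = 1e-6
J = np.stack([(coupling_forward(z + eps * np.eye(d)[i])[0] - x) / eps
              for i in range(d)], axis=1)
assert np.isclose(np.linalg.slogdet(J)[1], log_det, atol=1e-4)
```

Note that the brute-force check costs O(d^2) function evaluations plus an O(d^3) determinant, while the flow itself pays only a sum over the scale outputs.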

Failure Mode

A single coupling layer leaves half the dimensions unchanged. You need to alternate which dimensions are "active" across layers. With poor alternation patterns, some dimensions may never interact, limiting expressiveness.

Autoregressive Flows

Autoregressive flows (MAF, IAF) use the autoregressive property: dimension x_i depends only on x_1, \ldots, x_{i-1}. The Jacobian is triangular by construction.

MAF (Masked Autoregressive Flow): fast density evaluation (parallel), slow sampling (sequential, one dimension at a time).

IAF (Inverse Autoregressive Flow): fast sampling (parallel), slow density evaluation. The inverse of MAF.

The tradeoff between MAF and IAF is a direct consequence of the asymmetry between forward and inverse passes in autoregressive models.
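The asymmetry can be made concrete with a toy affine MAF, using strictly lower-triangular linear conditioners as stand-ins for the masked networks (an assumption for illustration): the density-direction pass is one vectorized operation, while the sampling pass must loop over dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
# Strictly lower-triangular weights: entry (i, j) lets x_j (j < i)
# influence the shift/scale for dimension i.
Wmu = np.tril(rng.normal(size=(d, d)) * 0.3, k=-1)
Wa  = np.tril(rng.normal(size=(d, d)) * 0.1, k=-1)

def maf_density_pass(x):
    """Density direction: all dimensions at once (parallel)."""
    mu, alpha = Wmu @ x, Wa @ x
    z = (x - mu) * np.exp(-alpha)
    return z, -np.sum(alpha)           # z and log|det dz/dx|

def maf_sample_pass(z):
    """Sampling direction: one dimension at a time (sequential)."""
    x = np.zeros(d)
    for i in range(d):                 # x_i needs x_{<i} already computed
        mu_i, alpha_i = Wmu[i] @ x, Wa[i] @ x
        x[i] = z[i] * np.exp(alpha_i) + mu_i
    return x

z = rng.normal(size=d)
x = maf_sample_pass(z)                 # d sequential steps
z_back, _ = maf_density_pass(x)        # one matrix multiply
assert np.allclose(z_back, z)          # the two passes invert each other
```

Swapping which pass is parallel and which is sequential gives exactly the IAF tradeoff.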

Why Flows Lost to Diffusion

Flows require exact invertibility, which constrains architecture: input and output must have the same dimensionality, and every layer must be invertible. This rules out off-the-shelf architectures (U-Nets, unconstrained ResNets). Diffusion models avoid this by learning a denoising process that does not require invertibility, allowing more expressive architectures. The result: diffusion models achieve better sample quality on images with simpler training procedures.

Flows remain useful for density estimation, variational inference (as flexible posterior approximations), and physics simulations where exact likelihood matters.

Common Confusions

Watch Out

Flows are not just fancy coordinate transforms

While each layer is a coordinate transformation, the composition of many layers with learned parameters can represent highly complex distributions. The universal approximation results for flows (Huang et al., 2018) show that sufficiently deep flows can approximate any target density.

Watch Out

The base distribution choice matters less than you think

A standard Gaussian base is used in nearly all flow models. The flow layers are expressive enough to warp any unimodal base into a complex multimodal target. Using a more complex base distribution rarely helps in practice.

Summary

  • Flows compute exact log-likelihoods via the change of variables formula
  • The computational bottleneck is the Jacobian determinant, which costs O(d^3) in general but O(d) with coupling or autoregressive structure
  • Coupling layers (RealNVP) let s and t be arbitrary networks while keeping the determinant tractable
  • MAF is fast for density evaluation; IAF is fast for sampling
  • Diffusion models displaced flows for image generation because invertibility constrains architecture, but flows remain valuable where exact likelihood is needed

Exercises

ExerciseCore

Problem

Write the change of variables formula for a 1D normalizing flow x = f(z) = z^3 where z \sim \mathcal{N}(0,1). What is p_X(x)?

ExerciseAdvanced

Problem

A coupling layer splits z \in \mathbb{R}^4 as z_a = (z_1, z_2) and z_b = (z_3, z_4). The scale network outputs s(z_a) = (1.0, -0.5) and the translation network outputs t(z_a) = (0.3, 0.7). Compute the output x and the log-determinant of the Jacobian when z = (1, 2, 3, 4).
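After working the exercise by hand, one way to check the answer is to run the coupling-layer equations directly (a NumPy sketch using the numbers from the problem):

```python
import numpy as np

z = np.array([1.0, 2.0, 3.0, 4.0])
za, zb = z[:2], z[2:]
s = np.array([1.0, -0.5])    # scale-network output from the problem
t = np.array([0.3, 0.7])     # translation-network output from the problem

xb = zb * np.exp(s) + t      # active half: elementwise scale and shift
x = np.concatenate([za, xb]) # passive half is copied unchanged
log_det = np.sum(s)          # log|det J| = sum of scale outputs

print(x, log_det)
```

The passive half (x_1, x_2) should match z_a exactly, and the log-determinant is just the sum of the two scale outputs.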

References

Canonical:

  • Dinh et al., "Density estimation using Real-NVP" (2017), Section 3
  • Rezende & Mohamed, "Variational Inference with Normalizing Flows" (2015), Section 3

Current:

  • Papamakarios et al., "Normalizing Flows for Probabilistic Modeling and Inference" (2021), Chapters 3-4
  • Kobyzev et al., "Normalizing Flows: An Introduction and Review" (2020)

Next Topics

  • Diffusion models: the generative paradigm that displaced flows
  • Energy-based models: density modeling without normalization

Last reviewed: April 26, 2026
