
ML Methods

Normalizing Flows

Generative models that transform a simple base distribution through invertible mappings, enabling exact log-likelihood computation via the change of variables formula.

Advanced · Tier 3 · Stable · Supporting · ~50 min

Why This Matters

Normalizing flows give exact, tractable log-likelihoods via the change-of-variables formula, with no variational lower bounds or adversarial losses. They are not unique in admitting exact likelihoods: autoregressive models (PixelCNN, autoregressive transformers) also factor \log p(x) = \sum_i \log p(x_i \mid x_{<i}) exactly, and VAEs with small discrete latent spaces can compute the log-likelihood exactly by marginalizing over the latents. What is special about flows is that they give exact likelihoods and exact one-shot sampling in the same model: invertibility makes both directions of the change-of-variables formula tractable. They have largely been displaced by diffusion models for image generation because the invertibility requirement constrains architecture.

Understanding flows is still valuable: they clarify what you gain and lose by requiring invertibility, and the change-of-variables formula underlies many other methods including continuous normalizing flows and flow matching.

The Core Idea

Start with a simple base distribution p_Z(z) = \mathcal{N}(z; 0, I). Apply a sequence of invertible, differentiable transformations f = f_K \circ f_{K-1} \circ \cdots \circ f_1 to get x = f(z). The density of x is determined by the change of variables formula.

Definition

Normalizing Flow

A normalizing flow is a sequence of invertible transformations f_1, f_2, \ldots, f_K mapping a base distribution p_Z to a target distribution p_X. "Normalizing" refers to the change of variables that ensures the transformed density integrates to 1. "Flow" refers to the successive transformations that warp the density.

The Change of Variables Formula

Theorem

Change of Variables for Normalizing Flows

Statement

If x = f(z) where f: \mathbb{R}^d \to \mathbb{R}^d is a diffeomorphism and z \sim p_Z, then:

\log p_X(x) = \log p_Z(f^{-1}(x)) - \log \left|\det \frac{\partial f}{\partial z}\bigg|_{z=f^{-1}(x)}\right|

For a composition f = f_K \circ \cdots \circ f_1:

\log p_X(x) = \log p_Z(z_0) - \sum_{k=1}^{K} \log \left|\det \frac{\partial f_k}{\partial z_{k-1}}\right|

where z_0 = f^{-1}(x) and z_k = f_k(z_{k-1}).

Intuition

The Jacobian determinant measures how much f locally stretches or compresses volume. If f expands a region by a factor of 10, the density in that region must decrease by a factor of 10 to keep the total probability at 1. The log-determinant accounts for this volume change.

Proof Sketch

Start from the requirement \int p_X(x)\, dx = 1. Substitute x = f(z), so dx = |\det J_f|\, dz. Then p_X(f(z)) |\det J_f| = p_Z(z), giving p_X(x) = p_Z(f^{-1}(x)) / |\det J_f|. Take logs.
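The formula can also be checked numerically for a one-dimensional flow. The sketch below (assuming NumPy and SciPy are available) pushes a standard Gaussian through x = \exp(z); the resulting density must match the standard log-normal, whose log-pdf SciPy provides in closed form:

```python
import numpy as np
from scipy.stats import norm, lognorm

# Flow: x = f(z) = exp(z), z ~ N(0, 1).  Then f^{-1}(x) = log(x),
# and df/dz = exp(z), so log|det J| evaluated at z = log(x) is log(x).
def flow_log_density(x):
    z = np.log(x)                       # inverse pass
    log_det = z                         # log|f'(z)| = z = log(x)
    return norm.logpdf(z) - log_det     # change-of-variables formula

x = np.array([0.5, 1.0, 2.0, 5.0])
ours = flow_log_density(x)
ref = lognorm(s=1.0).logpdf(x)          # known closed form (log-normal)
assert np.allclose(ours, ref)
```

The assertion passes because the pushforward of \mathcal{N}(0, 1) through \exp is exactly the standard log-normal distribution.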

Why It Matters

This is the entire basis of normalizing flows. Unlike VAEs (which optimize a lower bound) or GANs (which use adversarial training), flows optimize the exact log-likelihood directly. No approximation, no mode collapse, no posterior gap. The cost is that you must design f so that both f^{-1} and \det J_f are tractable to compute.

Failure Mode

Computing \det J_f for a general d \times d matrix costs O(d^3). For high-dimensional data (images with d > 10^4), this is prohibitive unless the Jacobian has special structure (triangular, block-diagonal, etc.). This architectural constraint is the central limitation of flows.

Architectural Solutions

Coupling Layers (RealNVP)

Proposition

Coupling Layer Jacobian is Triangular

Statement

For a coupling layer that splits z = (z_a, z_b) and computes:

x_a = z_a, \quad x_b = z_b \odot \exp(s(z_a)) + t(z_a)

where s and t are arbitrary neural networks, the Jacobian is lower triangular with determinant:

\det J = \prod_j \exp(s(z_a)_j) = \exp\left(\sum_j s(z_a)_j\right)

This costs O(d) to compute, not O(d^3).

Intuition

Since x_a = z_a (identity), the top-left block of the Jacobian is I, and since x_a does not depend on z_b, the top-right block is zero: the Jacobian is block lower triangular. The determinant of a triangular matrix is the product of its diagonal entries.

Proof Sketch

Write the full Jacobian in block form: J = \begin{pmatrix} I & 0 \\ \partial x_b / \partial z_a & \text{diag}(\exp(s(z_a))) \end{pmatrix}. The determinant of a block-triangular matrix is the product of the determinants of the diagonal blocks: \det(I) \cdot \det(\text{diag}(\exp(s))) = \exp\left(\sum_j s_j\right).

Why It Matters

This is the key architectural trick that makes flows practical. The networks s and t can be arbitrarily complex (deep ResNets, attention layers) without affecting the cost of the log-determinant computation. Expressiveness comes from stacking many coupling layers with alternating partitions.
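A minimal NumPy sketch of a single coupling layer illustrates both properties: the inverse is exact, and the O(d) log-determinant agrees with a brute-force Jacobian. The small random affine maps standing in for s and t are an assumption for illustration; in practice they are deep networks:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
half = d // 2
# Stand-in "networks" for s and t (hypothetical; only their outputs matter).
Ws = rng.normal(size=(half, half)) * 0.3
Wt = rng.normal(size=(half, half)) * 0.3
s = lambda za: np.tanh(za @ Ws)      # bounded scales for stability
t = lambda za: za @ Wt

def coupling_forward(z):
    za, zb = z[:half], z[half:]
    xb = zb * np.exp(s(za)) + t(za)
    return np.concatenate([za, xb]), np.sum(s(za))   # output, log|det J|

def coupling_inverse(x):
    xa, xb = x[:half], x[half:]
    zb = (xb - t(xa)) * np.exp(-s(xa))
    return np.concatenate([xa, zb])

z = rng.normal(size=d)
x, log_det = coupling_forward(z)
assert np.allclose(coupling_inverse(x), z)           # exact invertibility

# Cross-check the O(d) log-determinant against a finite-difference Jacobian.
eps = 1e-6
J = np.stack([(coupling_forward(z + eps * np.eye(d)[i])[0] - x) / eps
              for i in range(d)], axis=1)
assert np.isclose(np.linalg.slogdet(J)[1], log_det, atol=1e-4)
```

Note that the brute-force check costs O(d^2) function evaluations plus an O(d^3) determinant, while the flow itself pays only a sum over the scale outputs.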

Failure Mode

A single coupling layer leaves half the dimensions unchanged. You need to alternate which dimensions are "active" across layers. With poor alternation patterns, some dimensions may never interact, limiting expressiveness.

Autoregressive Flows

Autoregressive flows (MAF, IAF) use the autoregressive property: dimension x_i depends only on x_1, \ldots, x_{i-1}. The Jacobian is triangular by construction.

MAF (Masked Autoregressive Flow): fast density evaluation (parallel), slow sampling (sequential, one dimension at a time).

IAF (Inverse Autoregressive Flow): fast sampling (parallel), slow density evaluation. The inverse of MAF.

The tradeoff between MAF and IAF is a direct consequence of the asymmetry between forward and inverse passes in autoregressive models.
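The asymmetry can be made concrete with a toy affine MAF, using strictly lower-triangular linear conditioners as stand-ins for the masked networks (an assumption for illustration): the density-direction pass is one vectorized operation, while the sampling pass must loop over dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
# Strictly lower-triangular weights: entry (i, j) lets x_j (j < i)
# influence the shift/scale for dimension i.
Wmu = np.tril(rng.normal(size=(d, d)) * 0.3, k=-1)
Wa  = np.tril(rng.normal(size=(d, d)) * 0.1, k=-1)

def maf_density_pass(x):
    """Density direction: all dimensions at once (parallel)."""
    mu, alpha = Wmu @ x, Wa @ x
    z = (x - mu) * np.exp(-alpha)
    return z, -np.sum(alpha)           # z and log|det dz/dx|

def maf_sample_pass(z):
    """Sampling direction: one dimension at a time (sequential)."""
    x = np.zeros(d)
    for i in range(d):                 # x_i needs x_{<i} already computed
        mu_i, alpha_i = Wmu[i] @ x, Wa[i] @ x
        x[i] = z[i] * np.exp(alpha_i) + mu_i
    return x

z = rng.normal(size=d)
x = maf_sample_pass(z)                 # d sequential steps
z_back, _ = maf_density_pass(x)        # one matrix multiply
assert np.allclose(z_back, z)          # the two passes invert each other
```

Swapping which pass is parallel and which is sequential gives exactly the IAF tradeoff.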

Why Flows Lost to Diffusion

Flows require exact invertibility, which constrains architecture: input and output must have the same dimensionality, and every layer must be invertible. This rules out off-the-shelf architectures (U-Nets, unconstrained ResNets). Diffusion models avoid this by learning a denoising process that does not require invertibility, allowing more expressive architectures. The result: diffusion models achieve better sample quality on images with simpler training procedures.

Flows remain useful for density estimation, variational inference (as flexible posterior approximations), and physics simulations where exact likelihood matters.

Common Confusions

Watch Out

Flows are not just fancy coordinate transforms

While each layer is a coordinate transformation, the composition of many layers with learned parameters can represent highly complex distributions. The universal approximation results for flows (Huang et al., 2018) show that sufficiently deep flows can approximate any target density.

Watch Out

The base distribution choice matters less than you think

A standard Gaussian base is used in nearly all flow models. The flow layers are expressive enough to warp any unimodal base into a complex multimodal target. Using a more complex base distribution rarely helps in practice.

Summary

  • Flows compute exact log-likelihoods via the change of variables formula
  • The computational bottleneck is the Jacobian determinant, which costs O(d^3) in general but O(d) with coupling or autoregressive structure
  • Coupling layers (RealNVP) let s and t be arbitrary networks while keeping the determinant tractable
  • MAF is fast for density evaluation; IAF is fast for sampling
  • Diffusion models displaced flows for image generation because invertibility constrains architecture, but flows remain valuable where exact likelihood is needed

Exercises

ExerciseCore

Problem

Write the change of variables formula for a 1D normalizing flow x = f(z) = z^3 where z \sim \mathcal{N}(0,1). What is p_X(x)?

ExerciseAdvanced

Problem

A coupling layer splits z \in \mathbb{R}^4 as z_a = (z_1, z_2) and z_b = (z_3, z_4). The scale network outputs s(z_a) = (1.0, -0.5) and the translation network outputs t(z_a) = (0.3, 0.7). Compute the output x and the log-determinant of the Jacobian when z = (1, 2, 3, 4).
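After working the exercise by hand, one way to check the answer is to run the coupling-layer equations directly (a NumPy sketch using the numbers from the problem):

```python
import numpy as np

z = np.array([1.0, 2.0, 3.0, 4.0])
za, zb = z[:2], z[2:]
s = np.array([1.0, -0.5])    # scale-network output from the problem
t = np.array([0.3, 0.7])     # translation-network output from the problem

xb = zb * np.exp(s) + t      # active half: elementwise scale and shift
x = np.concatenate([za, xb]) # passive half is copied unchanged
log_det = np.sum(s)          # log|det J| = sum of scale outputs

print(x, log_det)
```

The passive half (x_1, x_2) should match z_a exactly, and the log-determinant is just the sum of the two scale outputs.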

References

Canonical:

  • Dinh et al., "Density estimation using Real-NVP" (2017), Section 3
  • Rezende & Mohamed, "Variational Inference with Normalizing Flows" (2015), Section 3

Current:

  • Papamakarios et al., "Normalizing Flows for Probabilistic Modeling and Inference" (2021), Chapters 3-4
  • Kobyzev et al., "Normalizing Flows: An Introduction and Review" (2020)

Next Topics

  • Diffusion models: the generative paradigm that displaced flows
  • Energy-based models: density modeling without normalization

Last reviewed: April 26, 2026
