Mathematical Infrastructure
Information Geometry
Riemannian geometry on the space of probability distributions: the Fisher information metric, natural gradient descent, exponential families as dually flat manifolds, and the connection to mirror descent.
Prerequisites
- Fisher Information: Curvature, KL Geometry, and the Natural Gradient
- Convex Duality
- Non-Euclidean and Hyperbolic Geometry
- Whitening and Decorrelation
Why This Matters
Standard gradient descent treats parameter space as Euclidean: a step of size $\epsilon$ in $\theta$ always means the same thing regardless of where you are. But a small change in $\theta$ can produce a huge change in the distribution $p_\theta$ in one region of parameter space and a negligible change in another. The Fisher information metric measures distances in distribution space, not parameter space.
The natural gradient (the gradient with respect to the Fisher metric) corrects for this. It is invariant to reparameterization: the update is the same whether you parameterize a Gaussian by $(\mu, \sigma)$, $(\mu, \sigma^2)$, or $(\mu, \log\sigma)$. This invariance is why natural gradient methods often converge faster than vanilla SGD.
Formal Setup
Statistical Manifold
A statistical manifold is a family $\mathcal{M} = \{p_\theta : \theta \in \Theta \subseteq \mathbb{R}^d\}$ of probability distributions parameterized by $\theta$, where the map $\theta \mapsto p_\theta$ is smooth and injective.
Fisher Information Metric
The Fisher information metric on $\mathcal{M}$ is the Riemannian metric tensor:
$$g_{ij}(\theta) = \mathbb{E}_{x \sim p_\theta}\big[\partial_i \log p_\theta(x)\, \partial_j \log p_\theta(x)\big],$$
where $F(\theta) = [g_{ij}(\theta)]$ is the Fisher information matrix.
The Fisher metric is the unique (up to scale) Riemannian metric on $\mathcal{M}$ that is invariant under sufficient statistics. This is the Chentsov theorem, originally proved for finite sample spaces (Markov morphisms between finite probability simplices); the extension to smooth families on continuous sample spaces is due to Ay, Jost, Lê, and Schwachhöfer (2015).
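As a sanity check on the definition, here is a minimal sketch (assuming NumPy; the choice of a Gaussian family parameterized by mean and log-standard-deviation is ours) that estimates $F(\theta)$ by Monte Carlo averaging of score outer products and compares it to the known closed form $\mathrm{diag}(1/\sigma^2,\ 2)$.

```python
import numpy as np

def score_gaussian(x, mu, s):
    """Score (gradient of log-density) of N(mu, exp(2s)) with respect to (mu, s)."""
    sigma2 = np.exp(2 * s)
    d_mu = (x - mu) / sigma2
    d_s = -1.0 + (x - mu) ** 2 / sigma2
    return np.stack([d_mu, d_s], axis=-1)

rng = np.random.default_rng(0)
mu, s = 1.0, 0.3
x = rng.normal(mu, np.exp(s), size=200_000)

scores = score_gaussian(x, mu, s)            # (N, 2) score samples
F_mc = scores.T @ scores / len(x)            # E[score score^T]
F_exact = np.diag([np.exp(-2 * s), 2.0])     # closed form for this parameterization

print(np.round(F_mc, 3))                     # close to F_exact
print(np.round(F_exact, 3))
```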
Main Theorems
Fisher Metric as KL Hessian
Statement
The Fisher information metric equals the Hessian of the KL divergence at $\theta' = \theta$:
$$F(\theta) = \nabla^2_{\theta'}\, D_{\mathrm{KL}}(p_\theta \,\|\, p_{\theta'})\big|_{\theta' = \theta}.$$
Equivalently, for nearby parameters:
$$D_{\mathrm{KL}}(p_\theta \,\|\, p_{\theta + \delta}) = \tfrac{1}{2}\, \delta^\top F(\theta)\, \delta + O(\|\delta\|^3).$$
Intuition
The KL divergence is the natural "distance" between distributions (though not symmetric). Its local quadratic approximation is the Fisher metric. So the Fisher metric measures how fast distributions change in the KL sense as you move in parameter space.
Proof Sketch
Expand $D_{\mathrm{KL}}(p_\theta \,\|\, p_{\theta+\delta}) = \mathbb{E}_{p_\theta}\big[\log p_\theta(x) - \log p_{\theta+\delta}(x)\big]$ around $\delta = 0$ to second order. The first-order term vanishes because the score $\nabla_\theta \log p_\theta(x)$ has zero mean under $p_\theta$. The second-order term gives $\tfrac{1}{2}\,\delta^\top \mathbb{E}_{p_\theta}\big[-\nabla^2_\theta \log p_\theta(x)\big]\,\delta$, and the Hessian of the negative log-likelihood equals the Fisher information under standard regularity conditions.
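A quick numerical check of the quadratic approximation, assuming NumPy and using a Bernoulli family (our choice for concreteness): the ratio of the exact KL to $\tfrac{1}{2} F \delta^2$ approaches 1 as $\delta \to 0$.

```python
import numpy as np

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

p = 0.3
F = 1.0 / (p * (1 - p))                # Fisher information of Bernoulli(p)

for delta in [0.1, 0.01, 0.001]:
    exact = kl_bernoulli(p, p + delta)
    quad = 0.5 * F * delta ** 2        # (1/2) delta^T F delta
    print(f"delta={delta:>6}: KL={exact:.3e}  quad={quad:.3e}  ratio={exact/quad:.4f}")
```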
Why It Matters
This result connects three perspectives: the Riemannian metric on the statistical manifold, the curvature of the KL divergence, and the Fisher information from estimation theory. It is the foundation for natural gradient descent.
Failure Mode
The approximation is only valid for small $\delta$. For large parameter changes, the KL divergence and the quadratic form $\tfrac{1}{2}\delta^\top F(\theta)\delta$ can differ substantially. This is why trust region methods (like TRPO) enforce explicit constraints on the step size.
Natural Gradient Invariance
Statement
The natural gradient of a loss $L(\theta)$ is:
$$\tilde{\nabla} L(\theta) = F(\theta)^{-1} \nabla_\theta L(\theta).$$
If $\phi = h(\theta)$ is a smooth invertible reparameterization, then the natural gradient update in $\phi$-coordinates produces the same distribution update as in $\theta$-coordinates. That is, the trajectory in distribution space is invariant to the choice of parameterization.
Intuition
Euclidean gradient descent moves in the direction of steepest descent measured in the Euclidean norm on parameter space. Natural gradient descent moves in the direction of steepest descent measured in KL divergence on distribution space. Since KL divergence is a property of distributions (not parameters), the result does not depend on how you parameterize the family.
Proof Sketch
Under the reparameterization $\phi = h(\theta)$, the Fisher metric transforms as $F_\phi = J^{-\top} F_\theta\, J^{-1}$, where $J = \partial\phi/\partial\theta$. The gradient transforms as $\nabla_\phi L = J^{-\top} \nabla_\theta L$. Therefore $F_\phi^{-1} \nabla_\phi L = J\, F_\theta^{-1} \nabla_\theta L$, which is exactly the Jacobian-transformed version of the natural gradient in $\theta$-coordinates.
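The transformation law can be checked numerically. A minimal sketch assuming NumPy, using the Bernoulli family in its mean and logit parameterizations and a toy quadratic loss (all choices ours):

```python
import numpy as np

# Bernoulli family in two parameterizations:
#   theta = p (mean parameter), phi = logit(p) (natural parameter).
p = 0.3
J = 1.0 / (p * (1 - p))          # d phi / d p

F_p = 1.0 / (p * (1 - p))        # Fisher in p-coordinates
F_phi = p * (1 - p)              # Fisher in phi-coordinates (= J^{-T} F_p J^{-1})

# Toy loss L(p) = (p - 0.9)^2, expressed in both coordinate systems.
grad_p = 2 * (p - 0.9)
grad_phi = grad_p / J            # chain rule: dL/dphi = dL/dp * dp/dphi

nat_p = grad_p / F_p             # natural gradient in p-coordinates
nat_phi = grad_phi / F_phi       # natural gradient in phi-coordinates

# Invariance: the phi-space natural gradient is the pushforward J * nat_p.
print(nat_phi, J * nat_p)        # identical up to floating point
```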
Why It Matters
Parameterization invariance means you do not need to worry about choosing the "right" parameterization. For neural networks, this is significant because the loss landscape depends heavily on the parameterization. Adam and other adaptive methods are often motivated as diagonal approximations to $F(\theta)$, though the correspondence is loose (see Common Confusions below).
Failure Mode
Computing $F(\theta)^{-1}$ is $O(d^3)$ for $d$ parameters, which is intractable for neural networks with millions of parameters. Practical approximations (K-FAC, diagonal Fisher, empirical Fisher) trade off accuracy for computational feasibility, and these approximations break the invariance property.
Exponential Families and Dual Flatness
Exponential families have a special place in information geometry. An exponential family $p_\theta(x) = h(x)\exp\big(\theta^\top T(x) - \psi(\theta)\big)$, with sufficient statistic $T$ and log-partition function $\psi$, admits two natural coordinate systems:
- Natural parameters $\theta$ (the canonical coordinates)
- Expectation parameters $\mu = \mathbb{E}_\theta[T(x)]$
These are related by the Legendre transform: $\mu = \nabla\psi(\theta)$ and $\theta = \nabla\psi^*(\mu)$, where $\psi^*$ is the convex conjugate of $\psi$. The manifold is dually flat: it is flat in both coordinate systems simultaneously, with a pair of dual affine connections.
The KL divergence takes a simple form called the Bregman divergence:
$$D_{\mathrm{KL}}(p_\theta \,\|\, p_{\theta'}) = B_{\psi^*}(\mu \,\|\, \mu') = \psi^*(\mu) - \psi^*(\mu') - \nabla\psi^*(\mu')^\top(\mu - \mu').$$
Here $\theta$ and $\mu$ denote the same distribution written in the two coordinate systems, linked by the Legendre relation $\mu = \nabla\psi(\theta)$ (and likewise $\theta'$ and $\mu'$). The switch from $\theta$ on the left to $\mu$ on the right is purely a change of coordinates: the KL divergence is a function of distributions, and can be written equivalently in natural or expectation parameters (in natural parameters it is the dual Bregman divergence $B_\psi(\theta' \,\|\, \theta)$).
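The sketch below verifies these identities for the Bernoulli family (a choice of ours for concreteness), assuming NumPy: the Bregman divergence built from the log-partition function reproduces the KL divergence, and the Hessian of $\psi$ reproduces the Fisher information.

```python
import numpy as np

psi = lambda t: np.log1p(np.exp(t))        # log-partition function of Bernoulli
mu_of = lambda t: 1 / (1 + np.exp(-t))     # mean map  mu = psi'(theta)

def kl_bern(m1, m2):
    return m1 * np.log(m1 / m2) + (1 - m1) * np.log((1 - m1) / (1 - m2))

theta, theta2 = 0.4, -1.1
mu, mu2 = mu_of(theta), mu_of(theta2)

# Natural-parameter Bregman form B_psi(theta' || theta) of the same divergence.
bregman = psi(theta2) - psi(theta) - mu * (theta2 - theta)
print(kl_bern(mu, mu2), bregman)           # equal

# Fisher information = Hessian of psi = mu(1 - mu), checked by finite differences.
h = 1e-4
hess = (psi(theta + h) - 2 * psi(theta) + psi(theta - h)) / h**2
print(hess, mu * (1 - mu))
```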
Connection to Mirror Descent
Mirror descent with the log-partition function $\psi$ as the mirror map is equivalent to natural gradient descent on an exponential family. The defining feature of mirror descent is that the linear gradient step is applied in the dual coordinates $\mu = \nabla\psi(\theta)$, not in the primal $\theta$:
$$\mu_{t+1} = \mu_t - \eta\, \nabla_\theta L(\theta_t), \qquad \theta_{t+1} = \nabla\psi^*(\mu_{t+1}),$$
equivalently $\theta_{t+1} = \arg\min_\theta \big\{ \eta\,\langle \nabla_\theta L(\theta_t), \theta\rangle + B_\psi(\theta \,\|\, \theta_t)\big\}$. A naive update $\theta_{t+1} = \theta_t - \eta\,\nabla_\theta L(\theta_t)$ in natural parameters is not mirror descent and does not coincide with natural gradient on the exponential family; it is ordinary Euclidean gradient descent in $\theta$. The dual-coordinate update is what makes the step a Bregman-proximal step with divergence $B_\psi$, and what reproduces the natural-gradient direction for exponential families (since the Fisher information is $F(\theta) = \nabla^2\psi(\theta)$).
This unifies two important optimization frameworks:
- Natural gradient: use the Fisher metric to precondition gradients
- Mirror descent: use a Bregman divergence to define the proximity term
For exponential families with mirror map $\psi$, these are the same algorithm.
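A minimal sketch of this equivalence for the Bernoulli family, assuming NumPy and a toy maximum-likelihood loss with data mean 0.8 (both choices ours): the mirror-descent step in the dual coordinate $\mu$ and the natural-gradient step in $\mu$-coordinates produce identical trajectories.

```python
import numpy as np

# Bernoulli negative log-likelihood with data mean 0.8:
#   L(theta) = psi(theta) - 0.8 * theta, so dL/dtheta = mu(theta) - 0.8.
grad_theta = lambda mu: mu - 0.8

eta, steps = 0.5, 5
mu_md = mu_ng = 0.2                           # same initial distribution

for _ in range(steps):
    # Mirror descent with mirror map psi: gradient step taken in the dual coords mu.
    mu_md = mu_md - eta * grad_theta(mu_md)

    # Natural gradient descent in mu-coordinates:
    #   F_mu = 1/(mu(1-mu)),  grad_mu L = (mu - 0.8)/(mu(1-mu)).
    F_mu = 1 / (mu_ng * (1 - mu_ng))
    grad_mu = grad_theta(mu_ng) / (mu_ng * (1 - mu_ng))
    mu_ng = mu_ng - eta * grad_mu / F_mu      # the mu(1-mu) factors cancel exactly

print(mu_md, mu_ng)                            # identical trajectories
```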
α-Divergences and α-Connections
Amari's $\alpha$-divergences form a one-parameter family
$$D_\alpha(p \,\|\, q) = \frac{4}{1 - \alpha^2}\left(1 - \int p(x)^{\frac{1-\alpha}{2}}\, q(x)^{\frac{1+\alpha}{2}}\, dx\right), \qquad \alpha \neq \pm 1,$$
that interpolates between standard divergences. Taking the limits via L'Hôpital's rule on Amari's formula: $\alpha \to -1$ recovers the KL divergence $D_{\mathrm{KL}}(p \,\|\, q)$, $\alpha \to +1$ recovers the reverse KL $D_{\mathrm{KL}}(q \,\|\, p)$, and $\alpha = 0$ gives (twice) the squared Hellinger distance. Each $\alpha$ induces an affine connection $\nabla^{(\alpha)}$ on the statistical manifold. The $\alpha = +1$ connection (the e-connection) makes exponential families flat, while the $\alpha = -1$ connection (the m-connection) makes mixture families flat. The Fisher metric together with the dual pair $(\nabla^{(+1)}, \nabla^{(-1)})$ is the core object of dually flat information geometry. Conventions for the sign of $\alpha$ vary across the literature (Amari's original $\alpha$ is the negative of Chentsov's); we follow Amari, Information Geometry and Its Applications (2016), Chapter 3.
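The limiting behavior is easy to check numerically for discrete distributions. A minimal sketch assuming NumPy (the two example distributions are ours):

```python
import numpy as np

def alpha_div(p, q, a):
    """Amari alpha-divergence for discrete distributions (a != +-1)."""
    return 4.0 / (1 - a**2) * (1 - np.sum(p**((1 - a) / 2) * q**((1 + a) / 2)))

def kl(p, q):
    return np.sum(p * np.log(p / q))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

print(alpha_div(p, q, -0.999), kl(p, q))   # alpha -> -1: KL(p || q)
print(alpha_div(p, q, +0.999), kl(q, p))   # alpha -> +1: reverse KL(q || p)
# alpha = 0: twice the squared Hellinger distance, with H^2 = sum (sqrt p - sqrt q)^2
print(alpha_div(p, q, 0.0), 2 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
```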
Pythagorean Theorem for e- and m-Projections
Dually flat manifolds satisfy a Pythagorean identity for the KL divergence. Let $\mathcal{S}_m$ be an m-flat submanifold (flat in the mixture connection) and $\mathcal{S}_e$ an e-flat submanifold (flat in the exponential connection), intersecting at a single point $q$. For any $p \in \mathcal{S}_m$ and $r \in \mathcal{S}_e$ with $\mathcal{S}_m$ and $\mathcal{S}_e$ orthogonal at $q$ under the Fisher metric,
$$D_{\mathrm{KL}}(p \,\|\, r) = D_{\mathrm{KL}}(p \,\|\, q) + D_{\mathrm{KL}}(q \,\|\, r).$$
The point $q$ is simultaneously the m-projection of $p$ onto $\mathcal{S}_e$ and the e-projection of $r$ onto $\mathcal{S}_m$. This is foundational for EM-style alternating projection algorithms and for understanding variational inference as projection onto an e-flat family. See Amari, Information Geometry and Its Applications (2016), §2.8, or Amari and Nagaoka, Methods of Information Geometry (AMS, 2000).
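A concrete instance of the identity, in a sketch of our own construction assuming NumPy: on a four-outcome sample space, project $p$ onto a one-parameter e-flat exponential family by moment matching (the m-projection), then check that the KL divergence to any member of the family decomposes exactly.

```python
import numpy as np

T = np.array([0.0, 1.0, 2.0, 3.0])           # sufficient statistic on 4 outcomes
q0 = np.full(4, 0.25)                         # base measure (uniform)

def q_theta(t):
    w = q0 * np.exp(t * T)
    return w / w.sum()                        # e-flat one-parameter family

def kl(a, b):
    return np.sum(a * np.log(a / b))

p = np.array([0.1, 0.2, 0.3, 0.4])            # point to project

# m-projection of p onto the family = moment matching E_q[T] = E_p[T];
# solve for theta* by bisection (E_q[T] is increasing in theta).
target = p @ T
lo, hi = -10.0, 10.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if q_theta(mid) @ T < target else (lo, mid)
q_star = q_theta(0.5 * (lo + hi))

# Pythagorean identity: KL(p||r) = KL(p||q*) + KL(q*||r) for any r in the family.
for t in [-1.0, 0.5, 2.0]:
    r = q_theta(t)
    print(kl(p, r), kl(p, q_star) + kl(q_star, r))
```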
Natural Gradient in Reinforcement Learning
Natural gradient methods are central to modern policy optimization. Kakade's Natural Policy Gradient (NPG) preconditions the policy gradient by the Fisher information of the policy distribution, giving parameterization-invariant updates on the policy manifold (Kakade, NeurIPS 2001). Trust Region Policy Optimization (TRPO) turns this into a practical algorithm by imposing a KL trust region and solving the resulting constrained problem via a conjugate-gradient approximation to the Fisher inverse (Schulman, Levine, Abbeel, Jordan, and Moritz, arXiv:1502.05477, 2015). Proximal Policy Optimization (PPO) replaces the hard KL constraint with a clipped surrogate objective that is cheaper to optimize while retaining the trust-region intuition (Schulman, Wolski, Dhariwal, Radford, and Klimov, arXiv:1707.06347, 2017).
K-FAC and Shampoo as Practical Natural-Gradient Approximations
K-FAC (Kronecker-Factored Approximate Curvature) approximates the Fisher information of a neural network with a block-diagonal matrix whose per-layer blocks factor as a Kronecker product of input and gradient covariance matrices (Martens and Grosse, arXiv:1503.05671, 2015). This makes inversion tractable and yields a natural-gradient-style preconditioner. Shampoo maintains per-layer left and right preconditioners $L_t$ and $R_t$ computed from accumulated gradient statistics, applied as $W_{t+1} = W_t - \eta\, L_t^{-1/4} G_t R_t^{-1/4}$, and can be viewed as a structured approximation to full-matrix Adagrad closely related to block-diagonal natural gradient (Gupta, Koren, and Singer, arXiv:1802.09568, 2018). Both methods trade exact Fisher structure for computational feasibility and have been deployed in large-scale training.
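A minimal sketch of a Shampoo-style step for a single weight matrix, assuming NumPy and a toy quadratic loss (our choices); real implementations add damping, update intervals, and learning-rate grafting.

```python
import numpy as np

def inv_fourth_root(M, eps=1e-6):
    """M^{-1/4} via eigendecomposition (M symmetric PSD), with small damping."""
    vals, vecs = np.linalg.eigh(M)
    return (vecs * (vals + eps) ** -0.25) @ vecs.T

def shampoo_step(W, G, L, R, lr=0.1):
    """One Shampoo-style update for weight matrix W with gradient G."""
    L += G @ G.T                              # left (row) statistics
    R += G.T @ G                              # right (column) statistics
    W -= lr * inv_fourth_root(L) @ G @ inv_fourth_root(R)
    return W, L, R

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
L, R = np.zeros((4, 4)), np.zeros((3, 3))
for _ in range(10):
    G = 2 * W                                 # gradient of the toy loss ||W||_F^2
    W, L, R = shampoo_step(W, G, L, R)
print(np.linalg.norm(W))                      # Frobenius norm decreases
```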
Wasserstein Information Geometry
An alternative Riemannian structure on the space of probability measures comes from optimal transport: the 2-Wasserstein distance induces a formal Riemannian metric whose tangent space at $\rho$ consists of gradient vector fields $v = \nabla\phi$ with inner product $\langle v_1, v_2\rangle_\rho = \int \langle \nabla\phi_1, \nabla\phi_2\rangle\, d\rho$ (the Otto calculus). Under this metric, the gradient flow of the KL divergence $D_{\mathrm{KL}}(\rho \,\|\, \pi)$ with respect to a reference measure $\pi$ is the Fokker-Planck equation, which connects optimal transport to diffusions and to entropic regularization. The Wasserstein inner product is distinct from the Fisher-Rao inner product: Fisher-Rao measures distances in distribution space using score statistics, while $W_2$ measures distances by how mass must be moved in the underlying sample space. See Otto, "The geometry of dissipative evolution equations: the porous medium equation" (Comm. PDE, 2001), and Villani, Optimal Transport: Old and New (Springer, 2009).
Common Confusions
The empirical Fisher is not the true Fisher
In practice, many implementations compute the "Fisher information matrix" using per-example gradient outer products $\nabla_\theta \log p_\theta(y_i \mid x_i)\, \nabla_\theta \log p_\theta(y_i \mid x_i)^\top$ averaged over training data. This is the empirical Fisher, which equals the true Fisher only when the model is correctly specified (i.e., the true distribution is in the family). The empirical Fisher is still a PSD outer-product average, so it defines a (possibly degenerate) Riemannian form; the real issue is that under misspecification it converges to the wrong limit and is no longer invariant under sufficient statistics, so it is not the information-theoretically correct Fisher metric for the model family. See Kunstner, Hennig, and Balles (2019) and Martens (2020, §5).
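A small numerical illustration of the gap under misspecification, assuming NumPy (the fixed-variance Gaussian model and the data-generating variance are our choices): the true Fisher is an expectation under the model, the empirical Fisher is an average over the data, and they disagree when the model is wrong.

```python
import numpy as np

rng = np.random.default_rng(0)

# Model: y ~ N(theta, 1) with unit variance assumed. The data actually has
# variance 4, so the model is misspecified.
y = rng.normal(loc=2.0, scale=2.0, size=100_000)
theta = y.mean()                       # MLE of the mean

# True Fisher: E_{y ~ model}[(d/dtheta log p)^2] = E_model[(y - theta)^2] = 1.
true_fisher = 1.0

# Empirical Fisher: the same squared score averaged over the DATA instead.
empirical_fisher = np.mean((y - theta) ** 2)

print(true_fisher, empirical_fisher)   # 1.0 vs ~4.0 under misspecification
```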
Adam is not the natural gradient
Adam uses $\sqrt{\hat v_t}$, an exponential moving average of squared loss gradients, as a diagonal preconditioner. This resembles a diagonal approximation to $F(\theta)$, but Adam uses squared gradients of the loss, not squared score functions. The two coincide only for specific loss functions and model classes.
Canonical Examples
Natural gradient for a Gaussian mean
Consider the family $p_\mu = \mathcal{N}(\mu, \sigma^2)$ with known variance $\sigma^2$. The Fisher information is $F(\mu) = 1/\sigma^2$. The natural gradient of a loss $L(\mu)$ is $\tilde\nabla L = \sigma^2\, \nabla_\mu L$. This rescales the gradient by the variance, taking larger steps when the distribution is broad (uncertain) and smaller steps when it is narrow (confident). Now reparameterize as $\mu = h(\nu)$ for a smooth invertible $h$. Standard gradient descent in $\nu$ gives a different trajectory in distribution space, but natural gradient gives the same trajectory.
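A minimal sketch of this example, assuming NumPy and choosing $h(\nu) = e^\nu$ and the toy loss $L(\mu) = (\mu - 2)^2/2$ (both choices ours): vanilla gradient descent lands at different means under the two parameterizations, while the natural-gradient trajectories agree to first order in the step size.

```python
import numpy as np

sigma2 = 4.0                                  # known variance of the Gaussian family
grad_mu = lambda mu: mu - 2.0                 # gradient of L(mu) = (mu - 2)^2 / 2

def optimize(natural, reparam, eta=0.01, steps=50, mu0=0.5):
    """Descend L in mu directly, or in nu with mu = exp(nu); return the final mu."""
    x = np.log(mu0) if reparam else mu0
    for _ in range(steps):
        mu = np.exp(x) if reparam else x
        dmu_dx = np.exp(x) if reparam else 1.0          # chain-rule factor
        g = grad_mu(mu) * dmu_dx                        # Euclidean gradient w.r.t. x
        F = dmu_dx ** 2 / sigma2                        # Fisher metric in x-coordinates
        x -= eta * (g / F if natural else g)
    return np.exp(x) if reparam else x

# Vanilla gradient descent: the two parameterizations reach different distributions.
print(optimize(natural=False, reparam=False), optimize(natural=False, reparam=True))
# Natural gradient: the endpoints agree to first order in the step size.
print(optimize(natural=True, reparam=False), optimize(natural=True, reparam=True))
```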
Summary
- The Fisher information metric is the Hessian of KL divergence at zero separation
- Natural gradient is invariant to reparameterization
- Exponential families are dually flat: natural parameters and expectation parameters are Legendre dual
- Mirror descent with the log-partition function equals natural gradient on exponential families
- Computing the exact natural gradient is $O(d^3)$; practical methods use approximations
Exercises
Problem
For a Bernoulli distribution with parameter $p \in (0,1)$, compute the Fisher information $F(p)$ and write down the natural gradient update for minimizing a loss $L(p)$.
Problem
Show that for an exponential family with log-partition function $\psi(\theta)$, the Fisher information matrix equals the Hessian of $\psi$: $F(\theta) = \nabla^2\psi(\theta)$. Use this to prove the following equivalence: the ordinary gradient with respect to the expectation parameters $\mu$ equals the natural gradient with respect to the natural parameters $\theta$, i.e. $\nabla_\mu L = F(\theta)^{-1}\nabla_\theta L$, so the natural-gradient direction in $\theta$ is the ordinary gradient computed with respect to $\mu$.
Further directions
- Fisher-Rao distance and geodesics
- EM algorithm as alternating e- and m-projections
- Stein variational gradient descent (Liu-Wang 2016)
- Information geometry of mixture models (m-flat) vs exponential families (e-flat) as dual-flat structures
- Interactive diagram: natural gradient vs ordinary gradient on a parametric family
References
Canonical:
- Amari, Information Geometry and Its Applications (2016), Chapters 1-4
- Amari, "Natural Gradient Works Efficiently in Learning" (Neural Computation 1998)
- Chentsov, Statistical Decision Rules and Optimal Inference (1982 English translation), AMS Translations Vol. 53 (finite sample space uniqueness of the Fisher metric)
Current:
- Martens, "New Insights and Perspectives on the Natural Gradient Method" (JMLR 2020), §5 on empirical Fisher
- Raskutti & Mukherjee, "The Information Geometry of Mirror Descent" (2015)
- Ay, Jost, Lê, and Schwachhöfer, "Information geometry and sufficient statistics" (Probability Theory and Related Fields, 2015), extending Chentsov to continuous sample spaces
- Kunstner, Hennig, and Balles, "Limitations of the Empirical Fisher Approximation for Natural Gradient Descent" (NeurIPS 2019, arXiv:1905.12558)
- Nielsen, "An Elementary Introduction to Information Geometry" (Entropy 2020, arXiv:1808.08271)
Next Topics
- Optimizer theory (SGD, Adam, Muon): practical approximations to the natural gradient
- Mean field theory: information geometry in variational inference
Last reviewed: April 26, 2026