Preconditioned Optimizers: Shampoo, K-FAC, and Natural Gradient
Optimizers that use curvature information to precondition gradients: the natural gradient via Fisher information, K-FAC's Kronecker-factored approximation of the Fisher, and Shampoo's Kronecker-factored approximation of full-matrix AdaGrad. How they connect to Riemannian optimization and why they outperform Adam on certain architectures.
Why This Matters
Adam preconditions each parameter independently using running averages of squared gradients. This is diagonal preconditioning: each weight gets its own learning rate, but correlations between weights are ignored.
Preconditioned optimizers go further. They use the full (or approximated) curvature structure of the loss landscape to transform gradients before taking a step. The result: updates that account for how parameters interact, not just how large each gradient is. In practice, this means faster convergence on problems with strong parameter correlations, which includes most neural networks.
The cost is computational: maintaining and applying a preconditioner is more expensive per step than Adam. The question is whether the improved convergence compensates. For large-scale training (where each step costs thousands of GPU-hours), the answer is increasingly yes.
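To make the difference concrete, here is a minimal NumPy sketch (ours, not from any paper) comparing a plain gradient step, a diagonal-preconditioned step, and a full-preconditioned step on a two-parameter quadratic whose ill-conditioning lives entirely in the off-diagonal correlation:

```python
import numpy as np

# Loss: L(w) = 0.5 * w^T H w with a strongly correlated Hessian.
# Diagonal preconditioning sees only H's diagonal; full preconditioning
# (the idealized curvature-aware step) removes the coupling entirely.
H = np.array([[1.0, 0.99],
              [0.99, 1.0]])   # condition number ~199
w = np.array([1.0, -1.0])     # start along the ill-conditioned direction

def loss(w):
    return 0.5 * w @ H @ w

g = H @ w                           # gradient at w
step_gd   = g                       # plain gradient direction
step_diag = g / np.diag(H)          # diagonal (Adam-like) preconditioning
step_full = np.linalg.solve(H, g)   # full preconditioning

for name, d in [("gd", step_gd), ("diag", step_diag), ("full", step_full)]:
    print(name, loss(w - d))        # only the full step reaches the minimum
```

Because $H$'s diagonal is the identity here, the diagonal step is identical to plain gradient descent: all of the usable curvature information sits in the correlations that diagonal methods ignore.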
The Natural Gradient
Natural Gradient Descent
Statement
The natural gradient of a loss $L(\theta)$ is:

$$\tilde{\nabla} L(\theta) = F(\theta)^{-1} \nabla_\theta L(\theta),$$

where $F(\theta) = \mathbb{E}_{x \sim p_\theta}\!\left[\nabla_\theta \log p_\theta(x)\, \nabla_\theta \log p_\theta(x)^\top\right]$ is the Fisher information matrix.

Note: the definition above is for a generative model $p_\theta(x)$. For supervised learning, the relevant object is the conditional-predictive Fisher $F = \mathbb{E}_{x \sim p_{\text{data}}}\, \mathbb{E}_{y \sim p_\theta(y \mid x)}\!\left[\nabla_\theta \log p_\theta(y \mid x)\, \nabla_\theta \log p_\theta(y \mid x)^\top\right]$, where the outer expectation is over the data distribution and the inner over the model's predictive distribution. K-FAC is explicitly designed as a Kronecker-factored approximation to this Fisher (or the closely related generalized Gauss-Newton matrix). Shampoo is not derived from the Fisher; it is built from accumulated outer products of the observed gradients, more in the spirit of full-matrix AdaGrad. The two are both block-Kronecker preconditioners but capture different second-order objects.

The natural gradient update is:

$$\theta_{t+1} = \theta_t - \eta\, F(\theta_t)^{-1} \nabla_\theta L(\theta_t).$$

This is steepest descent in the KL divergence geometry: it finds the direction that decreases the loss most per unit of distributional change, measured by $\mathrm{KL}\!\left(p_\theta \,\|\, p_{\theta + d\theta}\right)$.
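As a concrete instance, the sketch below (illustrative; the setup is ours) computes a damped natural-gradient step for a categorical model $p_\theta = \mathrm{softmax}(\theta)$ with cross-entropy loss to a target $q$, using the closed-form Fisher $F = \mathrm{diag}(p) - p p^\top$ for softmax logits:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.array([2.0, 0.0, -1.0])
q = np.array([0.1, 0.6, 0.3])            # target distribution

p = softmax(theta)
grad = p - q                             # Euclidean gradient of CE(q, softmax(theta))
F = np.diag(p) - np.outer(p, p)          # Fisher for softmax logits
# F is singular (logits are shift-invariant), so damp before solving.
nat_grad = np.linalg.solve(F + 1e-6 * np.eye(3), grad)

print(grad)      # small entries where p is already small
print(nat_grad)  # up to an additive constant this is -q/p: coordinates
                 # where the model badly underweights q get large corrections
```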
Intuition
Euclidean gradient descent treats all parameter directions equally: a step of size $\epsilon$ in weight 1 is the same as a step of size $\epsilon$ in weight 1000. But in a neural network, some parameters have much larger effects on the output distribution than others. The Fisher information matrix captures these sensitivities.
Multiplying by $F^{-1}$ rescales the gradient so that the update produces equal change in the output distribution in all directions. This is the same idea as Newton's method (which uses the Hessian instead of the Fisher), but adapted for probabilistic models.
Why It Matters
The natural gradient is parameterization-invariant: it gives the same update regardless of how you parameterize the model. Reparameterizing a neural network (e.g., scaling a weight matrix by a constant and dividing the next layer by the same constant) changes the Euclidean gradient but not the natural gradient. This is why natural gradient methods can be more robust to architecture choices.
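A quick numerical check of this invariance (a toy setup of ours): a Gaussian-mean model $x \sim \mathcal{N}(\mu, 1)$ has Fisher $1$ in $\mu$; under the reparameterization $\mu = c\,\phi$, the Euclidean step changes by a factor of $c^2$ when mapped back to $\mu$, while the natural step does not:

```python
c, eta = 10.0, 0.1
mu, x_bar = 3.0, 0.0

grad_mu = mu - x_bar            # dNLL/dmu for N(mu, 1) with sample mean x_bar
fisher_mu = 1.0

phi = mu / c
grad_phi = c * grad_mu          # chain rule
fisher_phi = c**2 * fisher_mu   # Fisher transforms with the squared Jacobian

# Euclidean steps disagree once mapped back to mu:
print(mu - eta * grad_mu)                        # 2.7
print(c * (phi - eta * grad_phi))                # -27.0: c^2 times too large

# Natural steps agree:
print(mu - eta * grad_mu / fisher_mu)            # 2.7
print(c * (phi - eta * grad_phi / fisher_phi))   # 2.7
```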
Practically, natural gradient methods converge in fewer steps than Adam on problems with ill-conditioned Fisher matrices (which is most neural networks). The challenge is computing $F^{-1}$, which costs $O(n^3)$ for $n$ parameters.
Failure Mode
The Fisher matrix has $n^2$ entries for $n$ parameters. For a model with 100M parameters, storing $F$ would require $10^{16}$ entries (roughly 40 petabytes in fp32), which is impossible. Every practical preconditioned optimizer therefore replaces the full matrix with cheaper structure: diagonal (Adam), block-diagonal Kronecker (K-FAC), Kronecker-factored gradient statistics (Shampoo), or low-rank.
K-FAC: Kronecker-Factored Approximate Curvature
K-FAC (Martens & Grosse, 2015) approximates the Fisher matrix for each layer using a Kronecker product factorization.
For a fully-connected layer with input $a \in \mathbb{R}^{n}$ and output pre-activation $s = W a$, the Fisher block for that layer's weights is approximated as:

$$F_W \approx A \otimes G,$$

where $A = \mathbb{E}[a a^\top]$ is the input covariance and $G = \mathbb{E}\!\left[\nabla_s \log p\, (\nabla_s \log p)^\top\right]$ is the pre-activation gradient covariance. The Kronecker structure means:

$$(A \otimes G)^{-1} = A^{-1} \otimes G^{-1}.$$

Inverting the full Fisher block for a single layer with $W \in \mathbb{R}^{m \times n}$ would cost $O(m^3 n^3)$. K-FAC reduces this to $O(m^3 + n^3)$ by inverting the two Kronecker factors separately. Storage drops from $O(m^2 n^2)$ for the full Fisher block to $O(m^2 + n^2)$ for the factors, a secondary but still significant benefit.
The Kronecker factorization is exact when activations and pre-activation gradients are statistically independent (Martens & Grosse, 2015). This independence does not hold in general, even for linear networks, so K-FAC is an approximation in most settings; empirically it is a good one for most feedforward architectures.
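A minimal sketch of the resulting per-layer computation, with batch estimates of the two factors and hypothetical shapes (real K-FAC adds damping adaptation, momentum, and amortized inversions):

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n_in, n_out = 64, 20, 10

a = rng.normal(size=(batch, n_in))    # layer inputs
g = rng.normal(size=(batch, n_out))   # backpropagated pre-activation gradients
dW = g.T @ a / batch                  # loss gradient w.r.t. W (n_out x n_in)

A = a.T @ a / batch                   # Kronecker factor: input covariance
G = g.T @ g / batch                   # Kronecker factor: gradient covariance
lam = 1e-3                            # damping, required in practice

# (A kron G)^{-1} vec(dW) in matrix form is G^{-1} dW A^{-1}:
# two small inversions instead of one (n_in*n_out)^2 inversion.
precond_dW = np.linalg.solve(G + lam * np.eye(n_out),
                             np.linalg.solve(A + lam * np.eye(n_in), dW.T).T)
print(precond_dW.shape)               # (10, 20), same shape as dW
```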
Shampoo
Shampoo Update Rule
Statement
Shampoo (Gupta, Koren, Singer, 2018) maintains left and right preconditioners for each weight matrix $W \in \mathbb{R}^{m \times n}$ with gradient $G_t$:

$$L_t = L_{t-1} + G_t G_t^\top, \qquad R_t = R_{t-1} + G_t^\top G_t.$$

The preconditioned update is:

$$W_{t+1} = W_t - \eta\, L_t^{-1/4}\, G_t\, R_t^{-1/4}.$$
The exponent $-1/4$ comes from a Löwner-order (matrix AM-GM) bound, not from any vec-vs-matrix conversion. If $H_t = \sum_{s \le t} \mathrm{vec}(G_s)\,\mathrm{vec}(G_s)^\top$ is the full-matrix AdaGrad preconditioner on the vectorized gradient, and $L_t = \sum_{s \le t} G_s G_s^\top$, $R_t = \sum_{s \le t} G_s^\top G_s$ are obtained from partial traces of the $\mathrm{vec}(G_s)\,\mathrm{vec}(G_s)^\top$ terms, then Gupta, Koren, and Singer (2018) show $H_t \preceq r\, L_t^{1/2} \otimes R_t^{1/2}$ in the Löwner order, where $r$ bounds the rank of the gradients. Because $x \mapsto x^{-1/2}$ is operator-monotone decreasing on positive definite matrices, this gives $H_t^{-1/2} \succeq r^{-1/2}\left(L_t^{1/2} \otimes R_t^{1/2}\right)^{-1/2}$. Using $\left(L_t^{1/2} \otimes R_t^{1/2}\right)^{-1/2} = L_t^{-1/4} \otimes R_t^{-1/4}$ and the Kronecker-vec identity to apply this as a matrix-shaped update yields $L_t^{-1/4}\, G_t\, R_t^{-1/4}$, with the total exponent $-1/2$ symmetrized between the two factors. The Kronecker vec identity preserves exponents; there is no square-root introduction from switching between matrix and vectorized form.
Scope caveat: the power $-1/4$ is a design choice, not uniquely determined. The Löwner bound only establishes that $L_t^{1/2} \otimes R_t^{1/2}$ upper-bounds the AdaGrad preconditioner; splitting the resulting inverse square root symmetrically across the two factors gives $-1/4$ on each, but other exponents are defensible. Recent work (Morwani, Shah, Cohen, Mhaskar, 2024; Anil, Gupta, Koren, Regan, Singer, 2020's Distributed Shampoo; SOAP) often prefers $-1/2$ per factor, which has shown comparable or better results in large-scale deployments. Empirically, intermediate exponents trade off preconditioning strength against stability.
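A minimal single-matrix Shampoo step in NumPy, following the update above (variable names are ours; a production implementation amortizes the inverse roots, as discussed under Failure Mode):

```python
import numpy as np

def inv_root(M, p, eps=1e-8):
    """M^(-1/p) for a symmetric PSD matrix M, via eigendecomposition."""
    vals, vecs = np.linalg.eigh(M)
    return (vecs * np.maximum(vals, eps) ** (-1.0 / p)) @ vecs.T

rng = np.random.default_rng(0)
m, n, eta = 8, 5, 0.1
W = rng.normal(size=(m, n))
L = 1e-4 * np.eye(m)                  # left preconditioner statistics
R = 1e-4 * np.eye(n)                  # right preconditioner statistics

for step in range(10):
    G = rng.normal(size=(m, n))       # stand-in for a stochastic gradient
    L += G @ G.T                      # accumulate left statistics
    R += G.T @ G                      # accumulate right statistics
    W -= eta * inv_root(L, 4) @ G @ inv_root(R, 4)   # W - eta L^{-1/4} G R^{-1/4}
```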
Intuition
Shampoo is a full-matrix AdaGrad approximation, not a Fisher approximation. It directly accumulates $L_t = \sum_s G_s G_s^\top$ (left) and $R_t = \sum_s G_s^\top G_s$ (right) from the observed gradients themselves. The left preconditioner captures correlations between output rows of the weight matrix; the right captures correlations between input columns. The relevant object being approximated is the AdaGrad full-matrix preconditioner on the vectorized gradient, upper-bounded in the Löwner order by $L_t^{1/2} \otimes R_t^{1/2}$. K-FAC, by contrast, targets the Fisher / Gauss-Newton matrix using a different factorization ($A \otimes G$: input-activation covariance Kronecker pre-activation-gradient covariance). Both are block-Kronecker preconditioners but they approximate different second-order quantities.
The matrix fourth root arises from the operator-inequality chain described in the statement: the symmetric split of the total exponent $-1/2$ across the two factors. Empirically, intermediate exponents trade off preconditioning strength against stability, with $-1/4$ and $-1/2$ both in active use.
Why It Matters
Shampoo has shown strong empirical results on transformer training, sometimes matching or exceeding Adam with fewer steps (though each step is more expensive due to the matrix power computation). Google's Distributed Shampoo (Anil et al., 2020) is the production variant: preconditioner statistics $L_t$ and $R_t$ are sharded across data-parallel workers, the matrix powers $L_t^{-1/4}$ and $R_t^{-1/4}$ are recomputed every $K$ steps (typically on the order of $100$ to $1000$) on dedicated CPU or GPU workers in parallel with training, and only the small matrix powers (not the full statistics) are communicated back. This amortizes the eigendecomposition cost across many training steps and hides it behind the forward-backward pass. The connection to Riemannian optimization is direct: Shampoo's update can be interpreted as Riemannian gradient descent on the space of weight matrices with a specific metric.
Failure Mode
Computing $L_t^{-1/4}$ requires an eigendecomposition of $L_t$, costing $O(m^3)$ per step ($O(n^3)$ for $R_t^{-1/4}$). For large layers ($m, n \gtrsim 10^4$), this is expensive. Practical implementations amortize this cost by recomputing the preconditioner every $K$ steps (e.g., $K = 100$). The matrix power can also be approximated using Newton-Schulz iterations (the same technique as Muon). Memory cost is $O(m^2 + n^2)$ per weight matrix for the preconditioners.
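A sketch of that amortization pattern (ours; $K = 100$ is a hypothetical setting): statistics are updated on every step, but the two eigendecompositions run only when the cached roots are refreshed:

```python
import numpy as np

def inv_root(M, p, eps=1e-8):
    vals, vecs = np.linalg.eigh(M)
    return (vecs * np.maximum(vals, eps) ** (-1.0 / p)) @ vecs.T

rng = np.random.default_rng(0)
m, n, eta, K = 64, 32, 0.1, 100
W = rng.normal(size=(m, n))
L, R = 1e-4 * np.eye(m), 1e-4 * np.eye(n)
Linv = Rinv = None

for step in range(1000):
    G = rng.normal(size=(m, n))       # stand-in for a stochastic gradient
    L += G @ G.T
    R += G.T @ G
    if step % K == 0:                 # eigendecompositions only every K steps
        Linv, Rinv = inv_root(L, 4), inv_root(R, 4)
    W -= eta * Linv @ G @ Rinv        # cheap matrix multiplies on every step
```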
Comparison Table
| Optimizer | Preconditioner | Cost per step | Memory overhead | Convergence | Best for |
|---|---|---|---|---|---|
| SGD | None ($I$) | $O(n)$ | 0 | Slow on ill-conditioned problems | Convex, well-conditioned problems |
| Adam | Diagonal ($\mathrm{diag}(v_t)^{-1/2}$) | $O(n)$ | $2n$ (moments) | Good general-purpose | Default choice |
| K-FAC | Kronecker ($A^{-1} \otimes G^{-1}$) | $O(m^3 + n^3)$, amortized | $O(m^2 + n^2)$ per layer | Fast on FC layers | Large FC networks |
| Shampoo | Kronecker-factored full-matrix AdaGrad ($L^{-1/4},\, R^{-1/4}$) | $O(m^3 + n^3)$, amortized | $O(m^2 + n^2)$ per matrix | Fast, especially on matrix-shaped parameters | Transformers, large models |
| Muon | Newton-Schulz approx. of orthogonal polar factor of $G$ | A few matrix multiplies (Newton-Schulz) | $O(mn)$ (momentum) | Very fast on matrix weights | Transformer weights |
| Parallel Muon / Turbo Muon | Muon with sharded Newton-Schulz iterations across DP/TP groups | Muon's cost divided across workers (wall-clock) | $O(mn)$ (momentum) | Comparable to Muon, lower wall-clock per step | Large-scale distributed training |
| Natural gradient | Full Fisher ($F^{-1}$) | $O(n^3)$ | $O(n^2)$ | Optimal in the local quadratic / KL-geometry sense for exponential families; no general non-convex guarantee | Infeasible for large $n$ |
Common Confusions
Shampoo is not just Adam with bigger matrices
Adam uses diagonal preconditioning: each parameter gets its own adaptive learning rate based on its own squared gradient history. Shampoo uses full-matrix preconditioning: the update to parameter $W_{ij}$ depends on the gradient history of all other parameters in the same weight matrix. This captures cross-parameter correlations that Adam misses entirely. The difference matters most when parameters are highly correlated (which they are in attention weight matrices).
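A small numerical demonstration (ours): two single-step gradient matrices that differ only at entry $(2, 2)$ leave an Adam-style update for entry $(0, 0)$ unchanged but shift the Shampoo update for that same entry, because the preconditioners mix statistics across the whole matrix:

```python
import numpy as np

def inv_root(M, p, eps=1e-8):
    vals, vecs = np.linalg.eigh(M)
    return (vecs * np.maximum(vals, eps) ** (-1.0 / p)) @ vecs.T

rng = np.random.default_rng(0)
G1 = rng.normal(size=(3, 3))
G2 = G1.copy()
G2[2, 2] += 5.0                       # perturb a *different* parameter's gradient

for G in (G1, G2):
    adam_00 = G[0, 0] / np.sqrt(G[0, 0] ** 2 + 1e-8)   # diagonal: sees only (0,0)
    L, R = G @ G.T, G.T @ G
    shampoo_00 = (inv_root(L, 4) @ G @ inv_root(R, 4))[0, 0]
    print(adam_00, shampoo_00)        # adam_00 identical across rows; shampoo_00 not
```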
The matrix fourth root comes from a Löwner bound, not from vec-vs-matrix form
A common incorrect story: "the exponent $-1/4$ appears because vectorizing a matrix update introduces an extra square root." This is wrong. The Kronecker vec identity preserves exponents exactly; switching between matrix and vectorized form introduces nothing.
The actual argument (Gupta, Koren, Singer, 2018) is a Löwner-order bound. The Kronecker approximation upper-bounds the full AdaGrad preconditioner in the positive-definite order, $H_t \preceq r\, L_t^{1/2} \otimes R_t^{1/2}$, so $H_t^{-1/2} \succeq r^{-1/2}\, L_t^{-1/4} \otimes R_t^{-1/4}$. Splitting the total inverse square root symmetrically across the two factors yields the exponent $-1/4$ on each.
The exponent is also not uniquely determined. The Löwner bound fixes the total of $-1/2$, but the symmetric split to $-1/4$ on each factor is a design choice. Much of the recent literature (Distributed Shampoo, SOAP, Morwani et al. 2024) uses $-1/2$ on each factor, with comparable or better results at scale. Treat the per-factor exponent as a tunable hyperparameter in the range $[-1/2, -1/4]$, not as a uniquely derived constant.
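The bound itself is easy to check numerically. The sketch below (ours) verifies $H \preceq r\, L^{1/2} \otimes R^{1/2}$ in the positive semidefinite order for random gradients, with $r = \min(m, n)$ bounding each gradient's rank:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, T = 4, 3, 10
r = min(m, n)                         # rank bound for each gradient

H = np.zeros((m * n, m * n))          # full-matrix AdaGrad statistic
L, R = np.zeros((m, m)), np.zeros((n, n))
for _ in range(T):
    G = rng.normal(size=(m, n))
    v = G.reshape(-1)                 # row-major vec
    H += np.outer(v, v)
    L += G @ G.T
    R += G.T @ G

def sqrtm(M):
    vals, vecs = np.linalg.eigh(M)
    return (vecs * np.sqrt(np.maximum(vals, 0))) @ vecs.T

K = np.kron(sqrtm(L), sqrtm(R))       # Kronecker order matches row-major vec
print(np.linalg.eigvalsh(r * K - H).min())   # >= 0 (up to roundoff)
```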
More preconditioning is not always better
Full natural gradient ($F^{-1} \nabla L$) is theoretically optimal per step, but each step is vastly more expensive. The wall-clock time to reach a target loss depends on both the number of steps and the cost per step. Adam is cheap per step. Shampoo is expensive per step but takes fewer steps. The crossover depends on model size, hardware, and how ill-conditioned the problem is. For small models, Adam wins on wall-clock. For large transformers with many GPU-hours per step, Shampoo can win.
Exercises
Problem
Adam maintains running averages $v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$, where the square $g_t^2$ is elementwise. Explain why this is equivalent to a diagonal approximation of the Fisher matrix and why it misses cross-parameter correlations.
Problem
For a weight matrix $W \in \mathbb{R}^{m \times n}$, compute the memory cost of Shampoo's preconditioners ($L \in \mathbb{R}^{m \times m}$ and $R \in \mathbb{R}^{n \times n}$) compared to Adam's two moment buffers ($2mn$ entries). At what matrix shape does Shampoo's memory overhead exceed Adam's by more than 2x?
References
Canonical:
- Amari, "Natural Gradient Works Efficiently in Learning" (Neural Computation, 1998). The foundational paper on natural gradient.
- Martens & Grosse, "Optimizing Neural Networks with Kronecker-factored Approximate Curvature" (ICML 2015). K-FAC.
- Gupta, Koren, Singer, "Shampoo: Preconditioned Stochastic Tensor Optimization" (ICML 2018). The original Shampoo paper.
Current:
- Anil, Gupta, Koren, Regan, Singer, "Scalable Second Order Optimization for Deep Learning" (2020/2021). Distributed Shampoo; often uses $-1/2$ on each factor.
- Morwani, Shah, Cohen, Mhaskar, "A New Perspective on Shampoo's Preconditioner" (2024). Re-derives Shampoo-style updates and argues for $-1/2$ over $-1/4$ exponents.
- Vyas et al., "SOAP: Improving and Stabilizing Shampoo using Adam" (2024). Hybrid preconditioner using exponent $-1/2$ on each factor.
- Bernstein & Newhouse, "Old Optimizer, New Norm: An Anthology" (2024). Muon and the Newton-Schulz polar-factor connection.
- Liu et al., "Parallel Muon for Scalable Distributed Pretraining" (2025), arXiv:2502.16982. Shards the Newton-Schulz iterations of Muon across data- and tensor-parallel groups so that the per-step optimizer cost matches Adam at large scale.
- "Turbo Muon: Accelerating Muon via Polynomial-Iterate Reuse" (2025), arXiv:2512.04632. Reduces the constant factor of the Newton-Schulz polar-factor approximation by reusing intermediate iterates across consecutive optimizer steps.
- Boyd & Vandenberghe, Convex Optimization (2004), Chapters 2-5.
Next Topics
- Riemannian optimization: the geometric framework underlying manifold-constrained updates
- Optimizer theory (SGD, Adam, Muon): how Muon approximates the orthogonal polar factor of the gradient via Newton-Schulz
Last reviewed: April 26, 2026
Required prerequisites
- The Hessian Matrix
- Fisher Information: Curvature, KL Geometry, and the Natural Gradient
- Convex Optimization Basics
- Conjugate Gradient Methods
Derived topics
- Optimizer Theory: SGD, Adam, and Muon
- Riemannian Optimization and Manifold Constraints