

Preconditioned Optimizers: Shampoo, K-FAC, and Natural Gradient

Optimizers that use curvature information to precondition gradients: the natural gradient via Fisher information, K-FAC's Kronecker approximation, and Shampoo's full-matrix preconditioning. How they connect to Riemannian optimization and why they outperform Adam on certain architectures.

Advanced · Tier 2 · Frontier · Supporting · ~55 min

Why This Matters

Adam preconditions each parameter independently using running averages of squared gradients. This is diagonal preconditioning: each weight gets its own learning rate, but correlations between weights are ignored.

Preconditioned optimizers go further. They use the full (or approximated) curvature structure of the loss landscape to transform gradients before taking a step. The result: updates that account for how parameters interact, not just how large each gradient is. In practice, this means faster convergence on problems with strong parameter correlations, which includes most neural networks.

The cost is computational: maintaining and applying a preconditioner is more expensive per step than Adam. The question is whether the improved convergence compensates. For large-scale training, where a full run consumes thousands of GPU-hours, the answer is increasingly yes.

The Natural Gradient

Theorem

Natural Gradient Descent

Statement

The natural gradient of a loss $\mathcal{L}(\theta)$ is:

$$\tilde{\nabla} \mathcal{L}(\theta) = F(\theta)^{-1} \nabla \mathcal{L}(\theta)$$

where $F(\theta) = \mathbb{E}_{x \sim p(\cdot; \theta)}[\nabla \log p(x; \theta)\, \nabla \log p(x; \theta)^\top]$ is the Fisher information matrix.

Note: the definition above is for a generative model $p(x; \theta)$. For supervised learning, the relevant object is the conditional-predictive Fisher $F(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim p(\cdot \mid x; \theta)}[\nabla_\theta \log p(y \mid x; \theta)\, \nabla_\theta \log p(y \mid x; \theta)^\top]$, where the outer expectation is over the data distribution and the inner over the model's predictive distribution. K-FAC is explicitly designed as a Kronecker-factored approximation to this Fisher (or the closely related generalized Gauss-Newton matrix). Shampoo is not derived from the Fisher; it is built from accumulated outer products of the observed gradients, more in the spirit of full-matrix AdaGrad. The two are both block-Kronecker preconditioners but capture different second-order objects.

The natural gradient update is:

$$\theta_{t+1} = \theta_t - \eta\, F(\theta_t)^{-1} \nabla \mathcal{L}(\theta_t)$$

This is steepest descent in the KL-divergence geometry: it finds the direction that decreases the loss most per unit of distributional change $D_{\text{KL}}(p_{\theta + \delta} \,\|\, p_\theta)$.
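To make the update concrete, here is a minimal numpy sketch of one natural-gradient step for logistic regression (a toy setup assumed for illustration, not taken from the sources cited below). The Fisher is estimated by Monte Carlo, sampling labels from the model's own predictive distribution to match the conditional-predictive Fisher in the note above; the damping term and all constants are illustrative stabilizers, not part of the definition.

```python
# Minimal sketch: one natural-gradient step for logistic regression (numpy only).
# The Fisher is estimated with labels sampled from the model's own predictions,
# matching the conditional-predictive Fisher in the note above. All constants
# (damping, eta, sizes) are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, d = 256, 10
X = rng.normal(size=(n, d))
y = (rng.random(n) < 0.5).astype(float)       # observed labels, used only in the loss
theta = np.zeros(d)

p = 1.0 / (1.0 + np.exp(-(X @ theta)))        # model's predictive probabilities

grad = X.T @ (p - y) / n                      # Euclidean gradient of the NLL

# Monte Carlo Fisher: sample y' ~ p(. | x; theta), average score outer products.
# (For logistic regression the exact Fisher is X^T diag(p(1-p)) X / n.)
y_model = (rng.random(n) < p).astype(float)
scores = X * (p - y_model)[:, None]           # per-example score vectors, (n, d)
F = scores.T @ scores / n

damping = 1e-3                                # Tikhonov damping so F is invertible
nat_grad = np.linalg.solve(F + damping * np.eye(d), grad)

eta = 0.5
theta = theta - eta * nat_grad                # natural-gradient update
```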

Intuition

Euclidean gradient descent treats all parameter directions equally: a step of size $\epsilon$ in weight 1 is the same as a step of size $\epsilon$ in weight 1000. But in a neural network, some parameters have much larger effects on the output distribution than others. The Fisher information matrix captures these sensitivities.

Multiplying by $F^{-1}$ rescales the gradient so that the update produces equal change in the output distribution in all directions. This is the same idea as Newton's method (which uses the Hessian instead of the Fisher), but adapted for probabilistic models.

Why It Matters

The natural gradient is parameterization-invariant: it gives the same update regardless of how you parameterize the model. Reparameterizing a neural network (e.g., scaling a weight matrix by a constant and dividing the next layer by the same constant) changes the Euclidean gradient but not the natural gradient. This is why natural gradient methods can be more robust to architecture choices.

Practically, natural gradient methods converge in fewer steps than Adam on problems with ill-conditioned Fisher matrices (which is most neural networks). The challenge is computing $F^{-1}$, which costs $O(d^3)$ for $d$ parameters.

Failure Mode

The Fisher matrix has $d^2$ entries for $d$ parameters. For a model with 100M parameters, storing $F$ would require $10^{16}$ entries, which is impossible. Every practical preconditioned optimizer approximates some second-order object with cheaper structure: diagonal (Adam), layer-wise Kronecker-factored (K-FAC, Shampoo), or low-rank.

K-FAC: Kronecker-Factored Approximate Curvature

K-FAC (Martens & Grosse, 2015) approximates the Fisher matrix for each layer using a Kronecker product factorization.

For a fully-connected layer with input $a$ and output pre-activation $s = Wa$, the Fisher block for that layer's weights is:

$$F_W \approx \mathbb{E}[a a^\top] \otimes \mathbb{E}[\nabla_s \mathcal{L}\, \nabla_s \mathcal{L}^\top] = A \otimes S$$

where $A$ is the input covariance and $S$ is the gradient covariance. The Kronecker structure means:

$$F_W^{-1}\, \mathrm{vec}(G) \approx \mathrm{vec}(S^{-1} G A^{-1})$$

Inverting the full Fisher block for a single layer would cost $O((d_{\text{in}} d_{\text{out}})^3)$. K-FAC reduces this to $O(d_{\text{in}}^3 + d_{\text{out}}^3)$ by inverting the two Kronecker factors separately. Storage drops from $O(d_{\text{in}}^2 d_{\text{out}}^2)$ for the full Fisher block to $O(d_{\text{in}}^2 + d_{\text{out}}^2)$ for the factors, a secondary but still significant benefit.
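The sketch below (numpy; single-batch statistics and a fixed damping value are illustrative simplifications, not the paper's full recipe with running averages and adaptive damping) shows how cheap the Kronecker-factored inverse is to apply: two small inverses and two matrix products.

```python
# Minimal K-FAC sketch for one dense layer (numpy only). A and S come from a
# single batch here; real K-FAC keeps running averages and tunes the damping.
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_out = 128, 64, 32

a = rng.normal(size=(batch, d_in))            # layer inputs
g_s = rng.normal(size=(batch, d_out))         # grads w.r.t. pre-activations s = W a
G = g_s.T @ a / batch                         # weight gradient, (d_out, d_in)

A = a.T @ a / batch                           # input covariance, (d_in, d_in)
S = g_s.T @ g_s / batch                       # pre-activation grad covariance, (d_out, d_out)

damping = 1e-3                                # illustrative Tikhonov damping
A_inv = np.linalg.inv(A + damping * np.eye(d_in))
S_inv = np.linalg.inv(S + damping * np.eye(d_out))

# F_W^{-1} vec(G) ~ vec(S^{-1} G A^{-1}): two small inverses, never the full Fisher.
update = S_inv @ G @ A_inv
```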

The Kronecker factorization is exact when activations and pre-activation gradients are statistically independent (Martens & Grosse, 2015). This independence does not hold in general, so K-FAC is an approximation in most settings; empirically it is a good one for most feedforward architectures.

Shampoo

Proposition

Shampoo Update Rule

Statement

Shampoo (Gupta, Koren, Singer, 2018) maintains left and right preconditioners for each weight matrix $W \in \mathbb{R}^{m \times n}$ with gradient $G_t$:

$$L_t = (1 - \beta) L_{t-1} + \beta\, G_t G_t^\top \in \mathbb{R}^{m \times m}$$

$$R_t = (1 - \beta) R_{t-1} + \beta\, G_t^\top G_t \in \mathbb{R}^{n \times n}$$

The preconditioned update is:

$$\Delta W_t = L_t^{-1/4}\, G_t\, R_t^{-1/4}$$

The $-1/4$ exponent comes from a Löwner-order (matrix AM-GM style) bound, not from any vec-vs-matrix conversion. Let $H_t = \sum_{s \le t} g_s g_s^\top$ be the full-matrix AdaGrad preconditioner on the vectorized gradients $g_s = \mathrm{vec}(G_s)$, with $L_t$ and $R_t$ the corresponding partial traces (the sums $\sum_s G_s G_s^\top$ and $\sum_s G_s^\top G_s$; the EMA form above is the weighted analogue). Gupta, Koren, and Singer (2018) show that if every $G_s$ has rank at most $r$, then $H_t \preceq r\, L_t^{1/2} \otimes R_t^{1/2}$ in the Löwner order (row-major vec convention, so that $(L^a \otimes R^a)\,\mathrm{vec}(G) = \mathrm{vec}(L^a G R^a)$). Because $X \mapsto X^{-1/2}$ is operator antitone, the surrogate satisfies $(r\, L_t^{1/2} \otimes R_t^{1/2})^{-1/2} \preceq H_t^{-1/2}$, making it a conservative replacement for the AdaGrad preconditioner. Applying it as a matrix-shaped update via $(L_t^{1/2} \otimes R_t^{1/2})^{-1/2} = L_t^{-1/4} \otimes R_t^{-1/4}$ and absorbing the $r^{-1/2}$ scale into the learning rate yields $\Delta W_t = L_t^{-1/4} G_t R_t^{-1/4}$. The Kronecker vec identity preserves exponents; there is no square-root introduction from switching between matrix and vectorized form.

Scope caveat: the $1/4$ power is tied to this particular bound, not uniquely determined. The lemma only establishes that $r\, L^{1/2} \otimes R^{1/2}$ is one valid upper bound on the AdaGrad preconditioner; other surrogates are defensible. Recent work often prefers $1/2$ on each factor: Morwani, Shah, Cohen, and Mhaskar (2024) argue that $L^{-1/2} \otimes R^{-1/2}$ is the better Kronecker approximation of $H^{-1/2}$, and Distributed Shampoo (Anil, Gupta, Koren, Regan, Singer, 2020) and SOAP have shown comparable or better results with it in large-scale deployments. Empirically, intermediate exponents $p \in (0, 1/2]$ trade off preconditioning strength against stability.
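A minimal numpy sketch of this update for a single weight matrix, assuming the EMA form of the statistics stated above; `eps` and all constants are illustrative, and the exponent `p` is exposed because both 1/4 and 1/2 are in use (see the scope caveat).

```python
# Minimal Shampoo step for one weight matrix (numpy only, illustrative constants).
import numpy as np

def inv_matrix_power(M, p, eps=1e-6):
    """(M + eps I)^{-p} via eigendecomposition; M symmetric PSD."""
    w, V = np.linalg.eigh(M)
    return (V * (np.maximum(w, 0.0) + eps) ** (-p)) @ V.T

rng = np.random.default_rng(0)
m, n = 32, 48
W, G = rng.normal(size=(m, n)), rng.normal(size=(m, n))
L, R = np.zeros((m, m)), np.zeros((n, n))
beta, eta, p = 0.1, 0.1, 0.25                 # p = 1/4 here; 1/2 is also common

L = (1 - beta) * L + beta * (G @ G.T)         # left statistic,  (m, m)
R = (1 - beta) * R + beta * (G.T @ G)         # right statistic, (n, n)
W -= eta * inv_matrix_power(L, p) @ G @ inv_matrix_power(R, p)
```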

Intuition

Shampoo is a full-matrix AdaGrad approximation, not a Fisher approximation. It directly accumulates $G G^\top$ (left) and $G^\top G$ (right) from the observed gradients themselves. The left preconditioner captures correlations between output rows of the weight matrix; the right captures correlations between input columns. The object being approximated is the AdaGrad full-matrix preconditioner $\sum_s g_s g_s^\top$ on the vectorized gradient, upper-bounded in the Löwner order by $r\, L^{1/2} \otimes R^{1/2}$. K-FAC, by contrast, targets the Fisher / Gauss-Newton matrix using a different factorization (input-activation covariance $A$ Kronecker-multiplied with pre-activation-gradient covariance $S$). Both are block-Kronecker preconditioners, but they approximate different second-order quantities.

The matrix fourth root $L^{-1/4}$ arises from the operator-inequality chain described in the statement: inverse-square-rooting the bound $H \preceq r\, L^{1/2} \otimes R^{1/2}$ puts a $-1/4$ power on each factor. Empirically, intermediate exponents $p \in (0, 1/2]$ trade off preconditioning strength against stability, with $1/4$ and $1/2$ both in active use.

Why It Matters

Shampoo has shown strong empirical results on transformer training, sometimes matching or exceeding Adam with fewer steps (though each step is more expensive due to the matrix power computation). Google's Distributed Shampoo (Anil et al., 2020) is the production variant: preconditioner statistics $L$ and $R$ are sharded across data-parallel workers, the matrix powers $L^{-p}$ and $R^{-p}$ are recomputed every $k$ steps (typically $k = 100$ to $1000$) on dedicated CPU or GPU workers in parallel with training, and only the small matrix powers (not the full statistics) are communicated back. This amortizes the $O(m^3)$ eigendecomposition cost across many training steps and hides it behind the forward-backward pass. The connection to Riemannian optimization is direct: Shampoo's update can be interpreted as Riemannian gradient descent on the space of weight matrices with a specific metric.

Failure Mode

Computing $L^{-1/4}$ requires an eigendecomposition of $L$, costing $O(m^3)$ per step. For large layers ($m = 4096$), this is expensive. Practical implementations amortize this cost by recomputing the preconditioner every $k$ steps (e.g., $k = 100$), as sketched below. The matrix power can also be approximated using Newton-Schulz-style iterations (the same family of techniques Muon uses for the polar factor). Memory cost is $O(m^2 + n^2)$ per weight matrix for the preconditioners.
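The amortization pattern looks like the following sketch (numpy; the interval, constants, and random stand-in gradients are illustrative). Statistics are updated every step, but the eigendecompositions run only every `recompute_every` steps, with stale matrix powers applied against fresh statistics in between.

```python
# Sketch of amortized preconditioning: recompute L^{-1/4} and R^{-1/4} only
# every `recompute_every` steps and reuse the cached powers in between.
# All names and constants are illustrative, not a production recipe.
import numpy as np

def inv_matrix_power(M, p, eps=1e-6):
    """(M + eps I)^{-p} via eigendecomposition; M symmetric PSD."""
    w, V = np.linalg.eigh(M)
    return (V * (np.maximum(w, 0.0) + eps) ** (-p)) @ V.T

rng = np.random.default_rng(0)
m, n = 32, 48
beta, eta, p, recompute_every = 0.1, 0.1, 0.25, 100
W = rng.normal(size=(m, n))
L, R = np.zeros((m, m)), np.zeros((n, n))
L_pow, R_pow = np.eye(m), np.eye(n)          # cached matrix powers

for step in range(1000):
    G = rng.normal(size=(m, n))              # stand-in for the true gradient
    L = (1 - beta) * L + beta * (G @ G.T)    # statistics update every step (cheap)
    R = (1 - beta) * R + beta * (G.T @ G)
    if step % recompute_every == 0:          # eigendecompositions only every k steps
        L_pow, R_pow = inv_matrix_power(L, p), inv_matrix_power(R, p)
    W -= eta * L_pow @ G @ R_pow             # preconditioned update (cheap matmuls)
```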

Comparison Table

| Optimizer | Preconditioner | Cost per step | Memory overhead | Convergence | Best for |
|---|---|---|---|---|---|
| SGD | None ($I$) | $O(d)$ | 0 | Slow on ill-conditioned problems | Convex, well-conditioned problems |
| Adam | Diagonal ($\mathrm{diag}(v_t)^{-1/2}$) | $O(d)$ | $2d$ (moments) | Good general-purpose | Default choice |
| K-FAC | Kronecker ($A^{-1} \otimes S^{-1}$) | $O(d_{\text{in}}^3 + d_{\text{out}}^3)$ | $d_{\text{in}}^2 + d_{\text{out}}^2$ | Fast on FC layers | Large FC networks |
| Shampoo | Kronecker-factored ($L^{-1/4} G R^{-1/4}$) | $O(m^3 + n^3)$ | $m^2 + n^2$ | Fast, especially on matrix weights | Transformers, large models |
| Muon | Newton-Schulz approx. of the orthogonal polar factor of $G$ | $O(mn \min(m, n))$ (a few Newton-Schulz matmuls) | $O(mn)$ | Very fast on matrix weights | Transformer weights |
| Parallel Muon / Turbo Muon | Muon with sharded Newton-Schulz iterations across DP/TP groups | $O(mn \min(m, n) / P)$ wall-clock for $P$ workers | $O(mn)$ | Comparable to Muon, lower wall-clock per step | Large-scale distributed training |
| Natural gradient | Full Fisher ($F^{-1}$) | $O(d^3)$ | $O(d^2)$ | Optimal in a local quadratic / KL-geometry sense for exponential families; no general non-convex guarantee | Infeasible for large $d$ |

Common Confusions

Watch Out

Shampoo is not just Adam with bigger matrices

Adam uses diagonal preconditioning: each parameter gets its own adaptive learning rate based on its own squared gradient history. Shampoo uses full-matrix preconditioning within each weight matrix: the update to parameter $w_{ij}$ depends on the gradient history of all other parameters in the same weight matrix. This captures cross-parameter correlations that Adam misses entirely. The difference matters most when parameters are highly correlated (which they are in attention weight matrices).
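A toy numpy illustration (all numbers invented) of what the diagonal misses: on a quadratic with strongly correlated parameters, a full inverse preconditioner reaches the minimum in one step, while a diagonal preconditioner in the spirit of Adam's per-parameter scaling barely moves, because the diagonal of the curvature says nothing about the coupling.

```python
# Diagonal vs. full preconditioning on a correlated quadratic (illustrative only).
import numpy as np

H = np.array([[1.0, 0.99],
              [0.99, 1.0]])                   # strongly correlated curvature
theta = np.array([1.0, -1.0])
grad = H @ theta                              # gradient of 0.5 * theta^T H theta

full_step = np.linalg.solve(H, grad)          # full preconditioning: exact Newton step
diag_step = grad / np.diag(H)                 # diagonal (Adam-like) preconditioning

print(theta - full_step)                      # [0, 0]: minimum reached in one step
print(theta - diag_step)                      # [0.99, -0.99]: coupling ignored
```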

Watch Out

The matrix fourth root comes from a Löwner bound, not from vec-vs-matrix form

A common incorrect story: "the $-1/4$ exponent appears because vectorizing a matrix update introduces an extra square root." This is wrong. The Kronecker vec identity $\mathrm{vec}(L^a G R^a) = (L^a \otimes R^a)\,\mathrm{vec}(G)$ (row-major vec) preserves exponents exactly; switching between matrix and vectorized form introduces nothing.

The actual argument (Gupta, Koren, Singer, 2018) is a Löwner-order bound that already carries square roots on the factors: if each gradient matrix has rank at most $r$, then $H \preceq r\, L^{1/2} \otimes R^{1/2}$ in the positive-definite order. Inverse-square-rooting the surrogate gives $(L^{1/2} \otimes R^{1/2})^{-1/2} = L^{-1/4} \otimes R^{-1/4}$, which is where the $-1/4$ on each factor comes from.

The exponent is also not uniquely determined: it depends on which Kronecker surrogate you use for $H$. Much of the recent literature (Distributed Shampoo, SOAP, Morwani et al. 2024) uses $-1/2$ on each factor, arguing that $L^{-1/2} \otimes R^{-1/2}$ is the better Kronecker approximation of $H^{-1/2}$, with comparable or better results at scale. Treat the exponent as a tunable hyperparameter in the range $p \in (0, 1/2]$, not as a uniquely derived constant.
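Both claims are easy to check numerically. The sketch below (numpy, random stand-in gradients; row-major vec so the Kronecker ordering is $L \otimes R$) verifies that the vec identity preserves exponents for an arbitrary power $a$ and that the rank-scaled bound $H \preceq r\, L^{1/2} \otimes R^{1/2}$ holds.

```python
# Numeric sanity check of the two facts above (numpy only, random data).
import numpy as np

rng = np.random.default_rng(0)
m, n, T = 4, 5, 20
Gs = [rng.normal(size=(m, n)) for _ in range(T)]
r = min(m, n)                                  # maximum possible gradient rank

vec = lambda G: G.reshape(-1)                  # row-major vec -> L (x) R ordering
H = sum(np.outer(vec(G), vec(G)) for G in Gs)  # full AdaGrad statistic
L = sum(G @ G.T for G in Gs)
R = sum(G.T @ G for G in Gs)

def powm(M, a):
    """Symmetric PSD matrix power via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * np.maximum(w, 0.0) ** a) @ V.T

# Loewner bound: r * L^{1/2} (x) R^{1/2} - H is positive semidefinite.
gap = r * np.kron(powm(L, 0.5), powm(R, 0.5)) - H
print(np.linalg.eigvalsh(gap).min() >= -1e-6)  # True: the bound holds

# Vec identity: (L^a (x) R^a) vec(G) == vec(L^a G R^a); exponents preserved.
a, G = 0.25, Gs[0]
lhs = np.kron(powm(L, a), powm(R, a)) @ vec(G)
rhs = vec(powm(L, a) @ G @ powm(R, a))
print(np.allclose(lhs, rhs))                   # True
```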

Watch Out

More preconditioning is not always better

Full natural gradient ($F^{-1} \nabla \mathcal{L}$) is theoretically optimal per step, but each step is vastly more expensive. The wall-clock time to reach a target loss depends on both the number of steps and the cost per step. Adam is cheap per step. Shampoo is expensive per step but takes fewer steps. The crossover depends on model size, hardware, and how ill-conditioned the problem is. For small models, Adam wins on wall-clock. For large transformers, where the optimizer overhead is a small fraction of each step's compute, Shampoo can win.

Exercises

Exercise (Core)

Problem

Adam maintains running averages $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$, where $g_t^2$ is elementwise. Explain why this is equivalent to a diagonal approximation of the Fisher matrix and why it misses cross-parameter correlations.

Exercise (Advanced)

Problem

For a weight matrix $W \in \mathbb{R}^{512 \times 768}$, compute the memory cost of Shampoo's preconditioners ($L$ and $R$) compared to Adam's two moment buffers. At what matrix size does Shampoo's memory overhead exceed Adam's by more than 2x?


References

Canonical:

  • Amari, "Natural Gradient Works Efficiently in Learning" (Neural Computation, 1998). The foundational paper on natural gradient.
  • Martens & Grosse, "Optimizing Neural Networks with Kronecker-factored Approximate Curvature" (ICML 2015). K-FAC.
  • Gupta, Koren, Singer, "Shampoo: Preconditioned Stochastic Tensor Optimization" (ICML 2018). The original Shampoo paper.

Current:

  • Anil, Gupta, Koren, Regan, Singer, "Scalable Second Order Optimization for Deep Learning" (2020/2021). Distributed Shampoo; often uses $p = 1/2$ on each factor.

  • Morwani, Shah, Cohen, Mhaskar, "A New Perspective on Shampoo's Preconditioner" (2024). Re-derives Shampoo-style updates and argues for $1/2$ over $1/4$ exponents.

  • Vyas et al., "SOAP: Improving and Stabilizing Shampoo using Adam" (2024). Hybrid preconditioner using the $1/2$ exponent on each factor.

  • Bernstein & Newhouse, "Old Optimizer, New Norm: An Anthology" (2024). Muon and the Newton-Schulz polar-factor connection.

  • Liu et al., "Parallel Muon for Scalable Distributed Pretraining" (2025), arXiv:2502.16982. Shards the Newton-Schulz iterations of Muon across data- and tensor-parallel groups so that the per-step optimizer cost matches Adam at large scale.

  • "Turbo Muon: Accelerating Muon via Polynomial-Iterate Reuse" (2025), arXiv:2512.04632. Reduces the constant factor of the Newton-Schulz polar-factor approximation by reusing intermediate iterates across consecutive optimizer steps.

  • Boyd & Vandenberghe, Convex Optimization (2004), Chapters 2-5

Next Topics

  • Riemannian optimization: the geometric framework underlying manifold-constrained updates
  • Optimizer theory (SGD, Adam, Muon): how Muon approximates the orthogonal polar factor of the gradient via Newton-Schulz

Last reviewed: April 26, 2026
