
LLM Construction

Optimizer Theory: SGD, Adam, and Muon

Convergence theory of SGD (convex and strongly convex), momentum methods (Polyak and Nesterov), Adam as adaptive + momentum, why SGD can generalize better, the Muon optimizer, and learning rate schedules.


Why This Matters

Optimizer Update Rules Compared

| Optimizer | Update Rule | Trajectory Shape |
| --- | --- | --- |
| SGD | θ -= η·g | zigzag |
| SGD + Momentum | v = β·v + g; θ -= η·v | smooth curve |
| Adam | m = β₁·m + (1-β₁)·g; v = β₂·v + (1-β₂)·g²; θ -= η·m̂ / (√v̂ + ε) | adaptive step |
| Muon | M = β·M + G; G̃ = orthogonalize(M) via Newton-Schulz iteration; θ -= η·G̃ | near-direct |

g = gradient, η = learning rate, β = momentum coefficient, m̂/v̂ = bias-corrected estimates

The optimizer is the algorithm that actually finds the parameters of a neural network. Every model is found by some variant of stochastic gradient descent. Understanding optimizer theory means understanding convergence rates (how fast can you reach a good solution?), the role of stochasticity (why is noise helpful?), and the deep connection between optimization and generalization (why do some optimizers find solutions that generalize better?).

This topic covers the theoretical landscape from classical SGD convergence theory through modern adaptive methods (Adam) to recent optimizers. Muon (Keller Jordan et al. 2024) is an experimental optimizer that has shown strong results on specific benchmarks (the NanoGPT speedrun, some post-training work). It is not yet a production default for large-scale LLM training. We include it here because the mathematical ideas (steepest descent under a matrix-norm geometry) are worth understanding even if the practical standard remains AdamW.

Mental Model

Gradient descent walks downhill on the loss surface. Stochastic gradient descent uses a noisy estimate of the gradient (from a mini-batch) instead of the true gradient. This noise is both a curse (convergence is slower) and a blessing (the noise helps escape sharp minima and find flatter regions that generalize better).

Momentum methods add inertia: the optimizer remembers its recent direction and continues moving that way, smoothing out the noise. Adaptive methods (Adam) additionally scale the step size per-parameter based on gradient history. The art of optimization is combining these ideas to converge fast and find solutions that generalize.

SGD Convergence Theory

Definition

Stochastic Gradient Descent

At iteration $t$, SGD updates:

$$\theta_{t+1} = \theta_t - \eta_t \hat{g}_t$$

where $\hat{g}_t = \nabla_\theta \ell(\theta_t; z_t)$ is the gradient on a random sample (or mini-batch) $z_t$, and $\eta_t$ is the learning rate.

The key assumption: $\hat{g}_t$ is an unbiased estimator of the true gradient, $\mathbb{E}[\hat{g}_t \mid \theta_t] = \nabla f(\theta_t)$, with bounded variance $\mathbb{E}[\|\hat{g}_t - \nabla f(\theta_t)\|^2] \leq \sigma^2$.
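The update rule can be sketched in a few lines. In this sketch the quadratic loss and its gradient are toy stand-ins for $\ell$, chosen so the optimum is known in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_step(theta, grad_fn, lr, sample):
    """One SGD update: theta <- theta - lr * g_hat, where g_hat is the
    gradient evaluated on a single random sample (or mini-batch)."""
    g_hat = grad_fn(theta, sample)
    return theta - lr * g_hat

# Toy problem: f(theta) = E_z[0.5 * ||theta - z||^2], minimized at E[z].
def grad_fn(theta, z):
    return theta - z  # gradient of 0.5 * ||theta - z||^2 at sample z

theta = np.zeros(3)
for _ in range(2000):
    z = rng.normal(loc=1.0, scale=0.5, size=3)  # stochastic sample
    theta = sgd_step(theta, grad_fn, lr=0.01, sample=z)
# theta now hovers in a noise ball around the true optimum E[z] = [1, 1, 1]
```

With a constant step size the iterates do not converge exactly; they settle into a noise ball around the optimum whose radius scales with the learning rate and gradient variance.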

Theorem

SGD Convergence for Convex Functions

Statement

For a convex, $L$-smooth function $f$ with stochastic gradients of variance $\sigma^2$, SGD with constant learning rate $\eta \leq 1/L$ achieves after $T$ iterations:

$$\mathbb{E}[f(\bar{\theta}_T)] - f(\theta^*) \leq \frac{\|\theta_0 - \theta^*\|^2}{2\eta T} + \frac{\eta \sigma^2}{2}$$

where $\bar{\theta}_T = \frac{1}{T}\sum_{t=1}^{T} \theta_t$ is the averaged iterate.

With the optimal constant step size $\eta = \frac{\|\theta_0 - \theta^*\|}{\sigma\sqrt{T}}$:

$$\mathbb{E}[f(\bar{\theta}_T)] - f(\theta^*) \leq \frac{\|\theta_0 - \theta^*\|\,\sigma}{\sqrt{T}} = O\!\left(\frac{1}{\sqrt{T}}\right)$$

Intuition

There are two competing forces. The first term $\|\theta_0 - \theta^*\|^2/(2\eta T)$ is the "optimization" term: it shrinks with more iterations and larger step sizes. The second term $\eta\sigma^2/2$ is the "noise" term: it grows with larger step sizes because the stochastic noise accumulates. The optimal step size balances these terms, giving the $O(1/\sqrt{T})$ rate.

This rate is optimal for convex stochastic optimization—no algorithm using only stochastic first-order information can do better in the worst case.
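The two-term tradeoff is easy to verify numerically. This sketch (with illustrative constants $R = \|\theta_0 - \theta^*\| = 10$, $\sigma = 1$, $T = 10{,}000$) evaluates the bound over a grid of step sizes and checks that the grid minimizer matches the closed-form optimum $\eta^* = R/(\sigma\sqrt{T})$:

```python
import numpy as np

def sgd_bound(eta, R, sigma, T):
    """Convex SGD bound: optimization term + noise term."""
    return R**2 / (2 * eta * T) + eta * sigma**2 / 2

R, sigma, T = 10.0, 1.0, 10_000
etas = np.linspace(1e-4, 1.0, 100_000)
best_eta = etas[np.argmin(sgd_bound(etas, R, sigma, T))]

eta_star = R / (sigma * np.sqrt(T))  # closed-form optimum: 0.1
# Bound value at eta_star is R * sigma / sqrt(T) = 0.1
```

The optimization term at $\eta^*$ equals the noise term (0.05 each), which is exactly the balance the intuition above describes.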

Proof Sketch

By the update rule and expanding the squared distance to the optimum:

$$\|\theta_{t+1} - \theta^*\|^2 = \|\theta_t - \theta^*\|^2 - 2\eta\,\hat{g}_t^\top(\theta_t - \theta^*) + \eta^2\|\hat{g}_t\|^2$$

Taking expectations and using convexity ($\nabla f(\theta_t)^\top(\theta_t - \theta^*) \geq f(\theta_t) - f(\theta^*)$):

$$\mathbb{E}[\|\theta_{t+1} - \theta^*\|^2] \leq \mathbb{E}[\|\theta_t - \theta^*\|^2] - 2\eta(f(\theta_t) - f(\theta^*)) + \eta^2(\|\nabla f(\theta_t)\|^2 + \sigma^2)$$

Using $L$-smoothness to bound $\|\nabla f\|^2$ and telescoping the sum over $T$ steps gives the result.

Why It Matters

The $O(1/\sqrt{T})$ rate explains why training neural networks requires many iterations. To halve the suboptimality, you need 4x more iterations. This is fundamental to stochastic optimization, not a limitation of SGD specifically. The rate also explains the importance of variance reduction: if you can reduce $\sigma^2$ (by using larger batches or variance-reduction techniques), you can use larger step sizes and converge faster.

Failure Mode

This bound assumes convexity. Neural network losses are non-convex with many critical points, multiple local minima, and saddle points, so the bound does not apply directly. For non-convex problems, the analogous guarantee is convergence to a stationary point ($\|\nabla f\| \leq \epsilon$), not to a global minimum. The practical success of SGD on non-convex neural networks is not fully explained by the convex theory.

Theorem

SGD Convergence for Strongly Convex Functions

Statement

For a $\mu$-strongly convex, $L$-smooth function with stochastic gradients of variance $\sigma^2$, SGD with step size $\eta_t = \frac{2}{\mu(t+1)}$ achieves:

$$\mathbb{E}[f(\theta_T)] - f(\theta^*) \leq \frac{2L\sigma^2}{\mu^2 T} = O\!\left(\frac{1}{T}\right)$$

This is a faster rate than the $O(1/\sqrt{T})$ rate for convex functions, because strong convexity provides additional curvature that helps the optimizer converge.

Intuition

Strong convexity means the function curves upward at least quadratically near the optimum. This curvature provides a "restoring force" that pulls the iterates toward $\theta^*$. Even with stochastic noise, the curvature dominates, giving linear convergence up to a noise floor. The decreasing step size $\eta_t \propto 1/t$ reduces the noise contribution over time.

Why It Matters

Regularized objectives (e.g., ridge regression) are strongly convex, and the $O(1/T)$ rate applies directly. For neural networks, local strong convexity near a minimum can partially explain why SGD converges faster in practice than the worst-case convex rate suggests.
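A quick empirical check of the $O(1/T)$ rate on a one-dimensional strongly convex quadratic (a sketch; the problem and constants are illustrative, not the theorem's worst case):

```python
import numpy as np

rng = np.random.default_rng(1)
mu = 1.0  # f(theta) = 0.5 * mu * theta^2 is mu-strongly convex (and mu-smooth)

def run_sgd(T, sigma=1.0):
    """SGD with the theorem's step size eta_t = 2 / (mu * (t + 1));
    returns the final suboptimality f(theta_T) - f*."""
    theta = 5.0
    for t in range(1, T + 1):
        g_hat = mu * theta + sigma * rng.normal()  # noisy gradient
        theta -= (2.0 / (mu * (t + 1))) * g_hat
    return 0.5 * mu * theta**2

# Quadrupling T should roughly quarter the suboptimality (O(1/T) rate);
# average over repeats to see the trend through the noise.
err_small = np.mean([run_sgd(500) for _ in range(200)])
err_large = np.mean([run_sgd(2000) for _ in range(200)])
```

The ratio `err_small / err_large` comes out near 4, consistent with $O(1/T)$ rather than the $O(1/\sqrt{T})$ convex rate.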

Mini-Batch SGD and Variance Reduction

Definition

Mini-Batch SGD

Instead of using a single sample, estimate the gradient from a mini-batch $B_t$ of size $b$:

$$\hat{g}_t = \frac{1}{b}\sum_{z \in B_t} \nabla \ell(\theta_t; z)$$

The variance of the mini-batch gradient is $\mathrm{Var}(\hat{g}_t) = \sigma^2/b$. Substituting $\sigma^2/b$ for $\sigma^2$ in the SGD convergence theorem gives:

$$\mathbb{E}[f(\bar{\theta}_T)] - f(\theta^*) \leq O\!\left(\frac{\sigma}{\sqrt{bT}}\right)$$

In terms of this bound, increasing the batch size by a factor of 4 is equivalent to running 4x more iterations. However, it does not improve the rate: the total number of gradient evaluations $bT$ is what matters.

When does batch size help? Larger batches reduce gradient variance, but the compute cost per iteration increases proportionally. The critical batch size $b^* = \sigma^2 / \|\nabla f\|^2$ is the batch size at which the gradient noise is comparable to the signal. Below $b^*$, increasing batch size improves efficiency (you waste less compute on noise). Above $b^*$, further increases yield diminishing returns because the gradient is already accurate.
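The $\mathrm{Var}(\hat{g}_t) = \sigma^2/b$ scaling is easy to confirm empirically. In this sketch the per-sample "gradient" is just a noisy scalar, which is all the variance argument needs:

```python
import numpy as np

rng = np.random.default_rng(2)

# Per-sample gradients for a toy problem: g(z) = z, with Var(z) = sigma^2.
sigma = 2.0
samples = rng.normal(loc=0.0, scale=sigma, size=100_000)

def minibatch_grad_variance(b, n_batches=20_000):
    """Empirical variance of the mean of b per-sample gradients."""
    idx = rng.integers(0, len(samples), size=(n_batches, b))
    batch_means = samples[idx].mean(axis=1)
    return batch_means.var()

v1 = minibatch_grad_variance(1)    # ~ sigma^2 = 4
v16 = minibatch_grad_variance(16)  # ~ sigma^2 / 16 = 0.25
```

Batch size 16 cuts the variance by a factor of 16, as the averaging argument predicts; it does not change the $1/\sqrt{bT}$ rate.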

Momentum Methods

Definition

Polyak Heavy Ball Momentum

SGD with heavy ball momentum maintains a velocity vector:

$$v_{t+1} = \beta v_t + \hat{g}_t, \qquad \theta_{t+1} = \theta_t - \eta v_{t+1}$$

where $\beta \in [0, 1)$ is the momentum coefficient (typically $\beta = 0.9$).

The velocity $v_t$ is an exponential moving average of past gradients. This smooths the update direction: instead of responding only to the current gradient, the optimizer continues in the direction it has been moving.
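A minimal heavy-ball implementation (the ill-conditioned quadratic is an illustrative test problem, not from the text):

```python
import numpy as np

def heavy_ball(grad_fn, theta0, lr=0.01, beta=0.9, steps=500):
    """SGD with Polyak heavy-ball momentum:
       v <- beta * v + g;  theta <- theta - lr * v."""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = beta * v + grad_fn(theta)
        theta = theta - lr * v
    return theta

# Ill-conditioned quadratic: f(x, y) = 0.5 * (100 * x^2 + y^2).
grad = lambda th: np.array([100.0, 1.0]) * th
theta = heavy_ball(grad, [1.0, 1.0])  # both coordinates driven to ~0
```

The velocity damps the oscillation along the steep axis while accumulating speed along the shallow one, which is the "smooth curve" trajectory in the comparison table.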

Theorem

Nesterov Accelerated Gradient Descent

Statement

Nesterov accelerated gradient descent evaluates the gradient at a "lookahead" point:

$$y_t = \theta_t + \frac{t-1}{t+2}(\theta_t - \theta_{t-1}), \qquad \theta_{t+1} = y_t - \frac{1}{L}\nabla f(y_t)$$

For convex, $L$-smooth functions, this achieves:

$$f(\theta_T) - f(\theta^*) \leq \frac{2L\|\theta_0 - \theta^*\|^2}{T^2} = O\!\left(\frac{1}{T^2}\right)$$

This is the optimal rate for first-order methods on convex smooth functions (by the Nemirovsky-Yudin lower bound). Standard gradient descent achieves only $O(1/T)$.

Intuition

The key idea is to evaluate the gradient not at the current point but at where you expect to be after the momentum step. This "look ahead" allows the method to correct overshooting before it happens, rather than after. The momentum coefficient $\frac{t-1}{t+2}$ grows toward 1 over time, gradually increasing the aggressiveness of the extrapolation.

Why It Matters

Nesterov acceleration demonstrates that momentum is not just a heuristic: it provably speeds up convergence for smooth convex optimization by a quadratic factor. In the stochastic setting, however, Nesterov acceleration does not improve the $O(1/\sqrt{T})$ rate because the noise dominates the acceleration benefit. This is why practical deep learning mostly uses Polyak momentum (which is simpler) rather than Nesterov momentum.

Failure Mode

The $O(1/T^2)$ rate requires exact gradients. With stochastic gradients, the benefit of acceleration is reduced because the noise variance is not affected by the momentum scheme. For stochastic optimization, Nesterov momentum offers at best a constant-factor improvement over Polyak momentum, not an asymptotic improvement.
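The deterministic speedup is easy to see on a small example. This sketch runs vanilla gradient descent and Nesterov's method, as defined above, on an ill-conditioned quadratic (the problem and iteration count are illustrative):

```python
import numpy as np

def gd(grad, L, x0, T):
    """Vanilla gradient descent with step size 1/L."""
    x = np.array(x0, dtype=float)
    for _ in range(T):
        x -= (1.0 / L) * grad(x)
    return x

def nesterov(grad, L, x0, T):
    """Accelerated gradient: gradient evaluated at the lookahead point y_t."""
    x_prev = np.array(x0, dtype=float)
    x = x_prev.copy()
    for t in range(1, T + 1):
        y = x + (t - 1) / (t + 2) * (x - x_prev)
        x_prev, x = x, y - (1.0 / L) * grad(y)
    return x

# Ill-conditioned quadratic f(x) = 0.5 * x^T diag(d) x, with L = max(d) = 1.
d = np.array([1.0, 0.001])
grad = lambda x: d * x
f = lambda x: 0.5 * np.sum(d * x**2)

x_gd = gd(grad, 1.0, [1.0, 1.0], 300)
x_nag = nesterov(grad, 1.0, [1.0, 1.0], 300)
# Nesterov reaches a lower f in the same number of gradient evaluations.
```

Vanilla GD crawls along the shallow direction (multiplier $1 - 0.001$ per step), while the accelerated iterate is guaranteed below $2L\|x_0 - x^*\|^2/T^2$.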

Adam: Adaptive + Momentum

Adam combines first-moment estimation (momentum) with second-moment estimation (per-parameter adaptive learning rates). The full algorithm is covered in the Adam optimizer topic. Here we focus on the theoretical comparison with SGD.
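For reference, one Adam step with bias correction looks like the following (a sketch using the standard defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$; the full algorithm is covered in the Adam topic):

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with bias-corrected moment estimates."""
    m = b1 * m + (1 - b1) * g        # first moment (momentum)
    v = b2 * v + (1 - b2) * g**2     # second moment (per-parameter scale)
    m_hat = m / (1 - b1**t)          # bias correction for zero init
    v_hat = v / (1 - b2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# On step 1 the bias-corrected update has magnitude ~lr regardless of the
# raw gradient scale: the adaptive denominator normalizes it away.
theta, m, v = adam_step(np.array([1.0]), np.array([10.0]),
                        np.zeros(1), np.zeros(1), t=1)
```

That scale invariance of the step is the "adaptive" part of Adam, and it is exactly what the second-moment estimate buys over plain momentum.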

Adam convergence: For convex problems, Adam with appropriate step size achieves $O(1/\sqrt{T})$ convergence, matching SGD. However, Reddi et al. (2018) showed that Adam can diverge on simple convex problems because the adaptive learning rate can increase without bound when $v_t$ shrinks.

Adam vs SGD generalization: The key empirical finding (Wilson et al., 2017; Smith et al., 2021): SGD with momentum often finds solutions that generalize better than Adam, even when Adam converges faster on the training set.

Flat minima and generalization (heuristic). One line of evidence connecting optimizer choice to generalization goes through a PAC-Bayes-style heuristic. For a minimum at $\theta^*$ with Hessian $H$, a simplified bound reads:

$$\text{generalization gap} \lesssim O\!\left(\sqrt{\frac{\mathrm{tr}(H) \cdot \|\theta^*\|^2}{n}}\right)$$

Read as a heuristic, this suggests that flatter minima (smaller Hessian trace) carry tighter generalization bounds, and that an optimizer which consistently finds flatter minima may produce models that generalize better.

This is not a canonical theorem. Dinh et al. (2017) showed that sharpness measures based on Hessian eigenvalues are reparameterization-dependent: for any minimum, one can rescale parameters to make it arbitrarily sharp without changing the function or its generalization. Rigorous PAC-Bayes bounds for neural networks (Dziugaite and Roy 2017; Jiang et al. 2020) add further terms and assumptions (a fixed data-independent prior, specific posterior families), and the tightest known bounds still do not cleanly isolate $\mathrm{tr}(H)$ as the controlling quantity. Treat the inequality above as an intuition pump, not a canonical result.

Empirically, SGD with large learning rate and small batch size injects more noise into the optimization, which is widely reported to correlate with flatter minima and better generalization on some vision benchmarks. Adam, by adapting the learning rate per parameter, often converges to sharper regions. This is one leading narrative for why SGD can beat Adam in generalization on certain tasks, but it is a narrative, not a proof.

Recent Variants

Three more recent Adam-family optimizers are worth naming. Lion (Chen et al. 2023, arXiv:2302.06675) replaces Adam's second moment with the sign of the momentum-smoothed gradient, matching AdamW at roughly half the optimizer memory. Sophia (Liu et al. 2023, arXiv:2305.14342) uses a diagonal Hessian (or Gauss-Newton) preconditioner with clipping and reports a roughly 2x wall-clock speedup on GPT-2-scale LLM pretraining. Adafactor (Shazeer and Stern 2018, arXiv:1804.04235) targets matrix-shaped parameters $W \in \mathbb{R}^{m \times n}$. Adam stores a full second-moment matrix at $O(mn)$ memory; Adafactor instead stores per-row and per-column statistics and reconstructs an approximate second moment from their outer product, cutting second-moment memory from $O(mn)$ to $O(m + n)$. This is why Adafactor is standard in T5-style training, where the embedding and feed-forward matrices dominate optimizer state.

Implicit Regularization of SGD

Smith, Dherin, Barrett, and De (2021, arXiv:2101.12176) give a concrete handle on what SGD noise does. Using backward error analysis, they show that SGD with step size $\eta$ and batch size $B$ does not track the loss $L$. At leading order it follows a modified loss:

$$\tilde{L}(\theta) = L(\theta) + \frac{\eta}{4B} \|\nabla L(\theta)\|^2$$

The extra term penalizes gradient norm, so SGD implicitly prefers regions where $\|\nabla L\|$ stays small as you move around, which lines up with the informal picture of flat minima. The noise magnitude scales as $\eta/B$: the ratio of learning rate to batch size, not either alone, controls the strength of this implicit regularizer. This explains linear scaling rules and why very large batches often need extra regularization to recover the generalization of small-batch SGD.
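The modified-loss view is easy to compute directly. This sketch evaluates the implicit penalty $\frac{\eta}{4B}\|\nabla L\|^2$ for two $(\eta, B)$ pairs that share the same ratio (the quadratic test function is illustrative):

```python
import numpy as np

def modified_loss(loss_fn, grad_fn, theta, eta, B):
    """Leading-order modified loss tracked by SGD (Smith et al. 2021):
       L_tilde(theta) = L(theta) + (eta / 4B) * ||grad L(theta)||^2."""
    g = np.asarray(grad_fn(theta))
    return loss_fn(theta) + (eta / (4 * B)) * np.sum(g**2)

# Quadratic with curvature 100, evaluated away from its minimum.
loss = lambda th: 0.5 * 100.0 * th**2
grad = lambda th: 100.0 * th
th = 0.1

pen_a = modified_loss(loss, grad, th, eta=0.1, B=32) - loss(th)
pen_b = modified_loss(loss, grad, th, eta=0.2, B=64) - loss(th)
# Same eta/B ratio -> identical implicit penalty: only the ratio matters.
```

Doubling both the learning rate and the batch size leaves the penalty unchanged, which is the linear scaling rule in miniature.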

The Muon Optimizer

Definition

Muon Optimizer

Muon (MomentUm Orthogonalized by Newton-Schulz) is an optimizer designed specifically for hidden-layer weight matrices in neural networks. For a weight matrix $W \in \mathbb{R}^{m \times n}$:

  1. Compute the gradient $G_t = \nabla_W \mathcal{L}$
  2. Apply momentum: $M_t = \beta M_{t-1} + G_t$
  3. Orthogonalize the update using Newton-Schulz iterations to approximate the matrix sign function:

$$U_t \approx M_t (M_t^\top M_t)^{-1/2}$$

This projects the update onto the Stiefel manifold (the set of matrices with orthonormal columns/rows), ensuring that the update direction has equal magnitude across all singular value directions.

Muon applies this orthogonalized update to hidden-layer weights, while using Adam for embedding and normalization parameters.

Why orthogonalization? In standard optimizers, the update direction is dominated by the top singular vectors of the gradient. Low-magnitude singular directions, which may carry important structural information, are suppressed. By orthogonalizing, Muon ensures that all directions of the gradient are treated equally, regardless of their magnitude.

Empirical results: Muon achieved state-of-the-art results in the NanoGPT speedrun benchmark, reaching the target validation loss faster than AdamW. The key advantage appears to be in the early-to-mid phase of training, where Muon's orthogonalized updates help the model learn useful representations faster.

Newton-Schulz iterations: The orthogonalization step uses 5-10 iterations of

$$X_{k+1} = \tfrac{3}{2}X_k - \tfrac{1}{2}X_k X_k^\top X_k$$

to approximate the polar decomposition $M = US$, where $U$ is orthogonal. This is cheaper than computing a full SVD and is numerically stable.
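The iteration is a few lines of NumPy. This sketch (illustrative, not the production Muon kernel, which uses a tuned variant) builds a matrix with very spread-out singular values and verifies that Newton-Schulz pushes them all toward 1:

```python
import numpy as np

def newton_schulz(M, iters=40):
    """Approximate the polar factor U = M (M^T M)^{-1/2} via the cubic
    iteration X <- 1.5*X - 0.5*X X^T X. Each singular value s is mapped
    to 1.5*s - 0.5*s^3, which converges to 1 for s in (0, 1]; hence the
    initial normalization by the spectral norm."""
    X = M / np.linalg.norm(M, ord=2)  # scale singular values into (0, 1]
    for _ in range(iters):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

rng = np.random.default_rng(0)
# Gradient-like matrix with singular values [100, 1, 0.01].
U0, _ = np.linalg.qr(rng.normal(size=(3, 3)))
V0, _ = np.linalg.qr(rng.normal(size=(3, 3)))
G = U0 @ np.diag([100.0, 1.0, 0.01]) @ V0.T

Uo = newton_schulz(G)
s = np.linalg.svd(Uo, compute_uv=False)  # all ~1: directions equalized
```

Small singular values take many cheap iterations to grow toward 1 (roughly a factor of 1.5 per step), which is why the iteration count matters for badly conditioned gradients.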

Practical Considerations

Muon is most beneficial for transformer hidden-layer weights (the large $d_{\text{model}} \times d_{\text{model}}$ matrices). It is not a drop-in replacement for Adam on every parameter shape: the original Muon recipe applies orthogonalized updates to hidden 2D weights and falls back to Adam for embeddings, normalization scales, and biases. The Newton-Schulz step is more expensive than a single Adam update, but the cost is small relative to the matmuls in a forward pass and is often outweighed by faster convergence in the early-to-mid phase of training.

Muon variants and recent results. The Muon literature is moving quickly. Four threads are worth naming:

  • Moonlight (Liu et al. 2025, arXiv:2502.16982) scales Muon to a 16B-parameter MoE LLM trained on 5.7T tokens, reporting performance matching Qwen 2.5 and LLaMA 3-class baselines at roughly 52% of the training FLOPs. It adds weight-decay and RMS-matching tweaks that make Muon behave well without retuning the learning rate versus an AdamW baseline.
  • MuonClip / Kimi K2 (Moonshot AI 2025, arXiv:2507.20534) productionizes a clipped-update variant of Muon inside Kimi K2, a 1.04T-parameter MoE trained on 15.5T tokens with no observed loss spike. The clip is QK-Clip: per-head rescaling of the query and key projection weights after every step whenever the per-head max attention logit $S^h_{\max}$ exceeds the threshold $\tau = 100$. The fix targets a Muon-specific failure mode where orthogonalized updates accelerate attention-logit growth that AdamW implicitly kept in check; without QK-Clip the run diverges. This is the first widely reported trillion-parameter pretrain using Muon on hidden weights.
  • SOAP (Vyas et al. 2024, arXiv:2409.11321) connects Shampoo-style full-matrix preconditioning to Adam. It runs Adam in the eigenbasis of a slowly updated Shampoo preconditioner, giving most of Shampoo's geometric benefit at a fraction of the wall-clock cost. SOAP is the current strongest non-Muon preconditioned optimizer in the LLM regime.
  • Theoretical grounding (Bernstein and Newhouse 2024, arXiv:2409.20325). Frames Muon's orthogonalized step as steepest descent under the spectral norm on weight matrices, and Shampoo under a Kronecker-factored norm. This gives a principled reason the two families look alike in practice and suggests that the right question is not "Muon versus Adam" but "which norm matches the geometry of the parameter space."

Learning Rate Schedules

The learning rate schedule is as important as the optimizer itself for LLM training:

Definition

Warmup + Cosine Decay Schedule

The standard LLM learning rate schedule has two phases:

Linear warmup (first $T_w$ steps): $\eta_t = \eta_{\max} \cdot t / T_w$

Cosine decay (remaining steps): $\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\pi \cdot \frac{t - T_w}{T - T_w}\right)\right)$

Typical values: $T_w \approx 0.01T$ to $0.1T$, $\eta_{\min} = 0.1\,\eta_{\max}$.

Why warmup: In the first steps, Adam's second-moment estimate is unreliable (based on very few gradient samples). Large learning rates with noisy denominator estimates cause wild parameter updates. Warmup gives the second-moment estimate time to stabilize.

Why cosine decay: As training progresses and the model approaches a minimum, the gradient noise increasingly dominates the signal. Reducing the learning rate allows the optimizer to settle into the minimum rather than bouncing around it. The cosine shape provides a smooth decay that is empirically superior to linear or step decay for language models.
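The two-phase schedule can be written as a single function (a sketch; `warmup_frac`, `lr_max`, and `lr_min_ratio` are illustrative defaults, not values from the text):

```python
import math

def lr_schedule(t, T, warmup_frac=0.01, lr_max=3e-4, lr_min_ratio=0.1):
    """Linear warmup to lr_max over the first warmup_frac*T steps,
    then cosine decay to lr_min over the remaining steps."""
    T_w = max(1, int(warmup_frac * T))
    lr_min = lr_min_ratio * lr_max
    if t < T_w:
        return lr_max * t / T_w                       # linear warmup
    progress = (t - T_w) / (T - T_w)                  # in [0, 1]
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

T = 100_000
# lr_schedule(0, T) == 0.0          (start of warmup)
# lr_schedule(1_000, T) == lr_max   (end of warmup, peak)
# lr_schedule(T, T) == lr_min       (end of cosine decay)
```

The function is pure in the step index, so it drops into any training loop that sets the optimizer's learning rate each step.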

Common Confusions

Watch Out

SGD convergence theory does not directly apply to neural networks

The $O(1/\sqrt{T})$ and $O(1/T)$ rates assume convexity (or strong convexity). Neural network loss functions are non-convex with saddle points, local minima, and plateaus. The practical success of SGD on neural networks is empirically well-established but not fully explained by classical convex theory. The theory gives useful intuition (larger batches = less noise, lower learning rate = smaller noise floor) but the quantitative bounds are vacuous for typical deep learning problems.

Watch Out

Momentum does not help asymptotically in the stochastic setting

Nesterov acceleration gives $O(1/T^2)$ vs $O(1/T)$ for deterministic convex optimization, a provable speedup. But for stochastic optimization, both momentum and no-momentum SGD achieve $O(1/\sqrt{T})$, and this is optimal. Momentum helps with the constant factors (smoother trajectory, faster initial convergence) but does not improve the asymptotic rate. The practical benefits of momentum in deep learning are real but not captured by worst-case convergence theory.

Watch Out

Adam is not universally better than SGD

Adam converges faster on the training loss for most problems. But faster training convergence does not imply better generalization. For CNNs on image classification, SGD with momentum consistently generalizes better than Adam. For transformers on language modeling, Adam (specifically AdamW) is standard because the generalization gap is small and the faster convergence matters. The optimal optimizer depends on the architecture and task.

Summary

  • SGD convex rate: $O(1/\sqrt{T})$; strongly convex: $O(1/T)$
  • Nesterov acceleration: $O(1/T^2)$ for deterministic convex (optimal), but no asymptotic help in the stochastic setting
  • Mini-batch reduces variance by $1/b$ but does not improve the fundamental rate
  • Critical batch size $b^* = \sigma^2/\|\nabla f\|^2$: above this, larger batches waste compute
  • Adam = momentum + adaptive LR; can converge to sharper minima than SGD
  • Flat minima (small Hessian eigenvalues) correlate with better generalization
  • Muon: orthogonalize gradient updates for hidden layers; all singular directions treated equally
  • Learning rate schedule: warmup (stabilize Adam's second moment) + cosine decay (settle into the minimum)
  • SGD generalizes better on vision; Adam/AdamW standard for transformers

Exercises

ExerciseCore

Problem

For SGD on a convex $L$-smooth function with gradient variance $\sigma^2 = 1$ and initial distance $\|\theta_0 - \theta^*\| = 10$, what is the optimal constant learning rate to minimize the bound after $T = 10{,}000$ iterations? What suboptimality does the bound guarantee?

ExerciseAdvanced

Problem

Prove that mini-batch SGD with batch size $b$ has gradient variance $\sigma^2/b$. Then explain why doubling the batch size and halving the number of iterations does not change the convergence bound.

ExerciseResearch

Problem

The Muon optimizer orthogonalizes the gradient update using Newton-Schulz iterations to approximate $M (M^\top M)^{-1/2}$. Why would this help compared to standard Adam updates? Consider a scenario where the gradient matrix has singular values $[100, 1, 0.01]$ and analyze what happens with Adam vs Muon.


References

Canonical:

  • Polyak, "Some methods of speeding up the convergence of iteration methods" (1964). Heavy-ball momentum.
  • Nesterov, "A method for unconstrained convex minimization problem with the rate of convergence $O(1/k^2)$" (1983).
  • Kingma and Ba, "Adam: A Method for Stochastic Optimization" (2015), arXiv:1412.6980.
  • Reddi, Kale and Kumar, "On the Convergence of Adam and Beyond" (ICLR 2018), arXiv:1904.09237. Introduces AMSGrad and the Adam divergence counterexample.
  • Shazeer and Stern, "Adafactor: Adaptive Learning Rates with Sublinear Memory Cost" (2018), arXiv:1804.04235.

Adam vs SGD and generalization:

  • Wilson, Roelofs, Stern, Srebro and Recht, "The Marginal Value of Adaptive Gradient Methods in Machine Learning" (NeurIPS 2017), arXiv:1705.08292.
  • Dinh, Pascanu, Bengio and Bengio, "Sharp Minima Can Generalize for Deep Nets" (ICML 2017), arXiv:1703.04933. Shows reparameterization sensitivity of Hessian-based sharpness.
  • Dziugaite and Roy, "Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data" (UAI 2017), arXiv:1703.11008.
  • Jiang, Neyshabur, Mobahi, Krishnan and Bengio, "Fantastic Generalization Measures and Where to Find Them" (ICLR 2020), arXiv:1912.02178.
  • Smith, Dherin, Barrett and De, "On the Origin of Implicit Regularization in Stochastic Gradient Descent" (ICLR 2021), arXiv:2101.12176.

Modern variants:

  • Chen et al., "Symbolic Discovery of Optimization Algorithms" (2023), arXiv:2302.06675. Lion optimizer.
  • Liu et al., "Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training" (2023), arXiv:2305.14342.
  • Jordan, "Muon: An optimizer for the hidden layers of neural networks" (2024). The modded-nanogpt write-up.
  • Vyas et al., "SOAP: Improving and Stabilizing Shampoo using Adam" (2024), arXiv:2409.11321.
  • Bernstein and Newhouse, "Old Optimizer, New Norm: An Anthology" (2024), arXiv:2409.20325.
  • Liu et al., "Moonlight: Scaling Muon to 16B-parameter MoE Language Models" (2025), arXiv:2502.16982.

Next Topics

Optimizer theory connects to:

  • Scaling laws: how optimizer choice and learning rate schedule affect compute-optimal training
  • Preconditioned optimizers: Shampoo, K-FAC, and the full-matrix preconditioning story
  • Riemannian optimization: the geometric framework underlying Muon's orthogonalization

Last reviewed: April 26, 2026
