LLM Construction
Optimizer Theory: SGD, Adam, and Muon
Convergence theory of SGD (convex and strongly convex), momentum methods (Polyak and Nesterov), Adam as adaptive + momentum, why SGD can generalize better, the Muon optimizer, and learning rate schedules.
Prerequisites
Why This Matters
The optimizer is the algorithm that actually finds the parameters of a neural network. Every model is found by some variant of stochastic gradient descent. Understanding optimizer theory means understanding convergence rates (how fast can you reach a good solution?), the role of stochasticity (why is noise helpful?), and the deep connection between optimization and generalization (why do some optimizers find solutions that generalize better?).
This topic covers the theoretical landscape from classical SGD convergence theory through modern adaptive methods (Adam) to recent optimizers. Muon (Keller Jordan et al. 2024) is an experimental optimizer that has shown strong results on specific benchmarks (NanoGPT speedrun, some post-training work). It is not yet the production default for large-scale LLM training. We include it here because the mathematical ideas (steepest descent under a matrix-norm geometry) are worth understanding even if the practical standard remains AdamW.
Mental Model
Gradient descent walks downhill on the loss surface. Stochastic gradient descent uses a noisy estimate of the gradient (from a mini-batch) instead of the true gradient. This noise is both a curse (convergence is slower) and a blessing (the noise helps escape sharp minima and find flatter regions that generalize better).
Momentum methods add inertia: the optimizer remembers its recent direction and continues moving that way, smoothing out the noise. Adaptive methods (Adam) additionally scale the step size per-parameter based on gradient history. The art of optimization is combining these ideas to converge fast and find solutions that generalize.
SGD Convergence Theory
Stochastic Gradient Descent
At iteration $t$, SGD updates:
$$\theta_{t+1} = \theta_t - \eta\, g_t$$
where $g_t = \nabla f_{i_t}(\theta_t)$ is the gradient on a random sample (or mini-batch) $i_t$, and $\eta$ is the learning rate.
The key assumption: $g_t$ is an unbiased estimator of the true gradient, $\mathbb{E}[g_t] = \nabla F(\theta_t)$, with bounded variance $\mathbb{E}\|g_t - \nabla F(\theta_t)\|^2 \le \sigma^2$.
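In code, the update is a single line. The sketch below (toy quadratic, hypothetical Gaussian noise model, numpy only) shows SGD shrinking the distance to the optimum down to a noise floor:

```python
import numpy as np

def sgd_step(theta, g, lr):
    """One SGD step: theta <- theta - lr * g, with g an unbiased gradient estimate."""
    return theta - lr * g

# Toy problem: F(theta) = 0.5 * ||theta||^2, whose true gradient is theta itself.
# The stochastic gradient adds Gaussian noise (hypothetical noise model).
rng = np.random.default_rng(0)
theta = np.ones(3)
for _ in range(2000):
    g = theta + 0.1 * rng.normal(size=3)  # unbiased: E[g] = grad F(theta)
    theta = sgd_step(theta, g, lr=0.05)

print(np.linalg.norm(theta))  # small, but bounded away from 0 by the noise floor
```

The final iterate hovers near the optimum rather than converging exactly: with a constant step size, the noise term keeps the iterates in a ball whose radius scales with $\eta\sigma$.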
SGD Convergence for Convex Functions
Statement
For a convex, $L$-smooth function $F$ with stochastic gradients of variance $\sigma^2$, SGD with constant learning rate $\eta$ achieves after $T$ iterations:
$$\mathbb{E}[F(\bar\theta_T)] - F(\theta^*) \;\le\; \frac{\|\theta_0 - \theta^*\|^2}{2\eta T} \;+\; \frac{\eta\sigma^2}{2}$$
where $\bar\theta_T = \frac{1}{T}\sum_{t=1}^{T}\theta_t$ is the averaged iterate.
With the optimal constant step size $\eta = \frac{\|\theta_0 - \theta^*\|}{\sigma\sqrt{T}}$:
$$\mathbb{E}[F(\bar\theta_T)] - F(\theta^*) \;\le\; \frac{\|\theta_0 - \theta^*\|\,\sigma}{\sqrt{T}} \;=\; O\!\left(\frac{1}{\sqrt{T}}\right)$$
Intuition
There are two competing forces. The first term, $\frac{\|\theta_0-\theta^*\|^2}{2\eta T}$, is the "optimization" term. It shrinks with more iterations and larger step sizes. The second term, $\frac{\eta\sigma^2}{2}$, is the "noise" term. It grows with larger step sizes because the stochastic noise accumulates. The optimal step size balances these terms, giving the $O(1/\sqrt{T})$ rate.
This $O(1/\sqrt{T})$ rate is optimal for convex stochastic optimization: no algorithm using only stochastic first-order information can do better in the worst case.
Proof Sketch
By the update rule and expanding the squared distance:
$$\|\theta_{t+1}-\theta^*\|^2 \;=\; \|\theta_t-\theta^*\|^2 \;-\; 2\eta\,\langle g_t,\, \theta_t-\theta^*\rangle \;+\; \eta^2\|g_t\|^2$$
Taking expectations and using convexity ($\langle \nabla F(\theta_t),\, \theta_t-\theta^*\rangle \ge F(\theta_t)-F(\theta^*)$):
$$\mathbb{E}\|\theta_{t+1}-\theta^*\|^2 \;\le\; \mathbb{E}\|\theta_t-\theta^*\|^2 \;-\; 2\eta\,\mathbb{E}\big[F(\theta_t)-F(\theta^*)\big] \;+\; \eta^2\,\mathbb{E}\|g_t\|^2$$
Using $L$-smoothness to bound $\mathbb{E}\|g_t\|^2$ and telescoping the sum over $T$ steps gives the result.
Why It Matters
The $O(1/\sqrt{T})$ rate explains why training neural networks requires many iterations. To halve the suboptimality, you need 4x more iterations. This is fundamental to stochastic optimization, not a limitation of SGD specifically. The rate also explains the importance of variance reduction: if you can reduce $\sigma^2$ (by using larger batches or variance-reduction techniques), you can use larger step sizes and converge faster.
Failure Mode
This bound assumes convexity. Neural network losses are non-convex with many critical points, multiple local minima, and saddle points, so the bound does not apply directly. For non-convex problems, the analogous guarantee is convergence to a stationary point ($\mathbb{E}\|\nabla F(\theta_t)\|^2 \to 0$), not to a global minimum. The practical success of SGD on non-convex neural networks is not fully explained by the convex theory.
SGD Convergence for Strongly Convex Functions
Statement
For a $\mu$-strongly convex, $L$-smooth function with stochastic gradients of variance $\sigma^2$, SGD with decreasing step size $\eta_t = \frac{1}{\mu t}$ achieves:
$$\mathbb{E}[F(\theta_T)] - F(\theta^*) \;\le\; O\!\left(\frac{\sigma^2}{\mu T}\right)$$
This $O(1/T)$ rate is faster than the $O(1/\sqrt{T})$ rate for convex functions, because strong convexity provides additional curvature that helps the optimizer converge.
Intuition
Strong convexity means the function curves upward at least quadratically near the optimum. This curvature provides a "restoring force" that pulls the iterates toward $\theta^*$. Even with stochastic noise, the curvature dominates, giving linear convergence up to a noise floor. The decreasing step size $\eta_t \propto 1/t$ reduces the noise contribution over time.
Why It Matters
Regularized objectives (e.g., ridge regression) are strongly convex, and the $O(1/T)$ rate applies directly. For neural networks, local strong convexity near a minimum can partially explain why SGD converges faster in practice than the worst-case convex rate suggests.
Mini-Batch SGD and Variance Reduction
Mini-Batch SGD
Instead of using a single sample, estimate the gradient from a mini-batch $\mathcal{B}_t$ of size $B$:
$$g_t \;=\; \frac{1}{B}\sum_{i\in\mathcal{B}_t} \nabla f_i(\theta_t)$$
The variance of the mini-batch gradient is $\sigma^2/B$. By the SGD convergence theorem, replacing $\sigma^2$ with $\sigma^2/B$ gives:
$$\mathbb{E}[F(\bar\theta_T)] - F(\theta^*) \;\le\; O\!\left(\frac{\|\theta_0-\theta^*\|\,\sigma}{\sqrt{BT}}\right)$$
Increasing batch size by a factor of 4 is equivalent to running 4x more iterations at the same convergence rate. However, it does not improve the rate in terms of total samples. The total number of gradient evaluations $BT$ is what matters.
When does batch size help? Larger batches reduce gradient variance, but the compute cost per iteration increases proportionally. The critical batch size $B_{\text{crit}}$ is the batch size at which the gradient noise is comparable to the signal. Below $B_{\text{crit}}$, increasing batch size improves efficiency (you waste less compute on noise). Above $B_{\text{crit}}$, further increases yield diminishing returns because the gradient is already accurate.
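The $\sigma^2/B$ scaling is easy to check numerically. The sketch below uses a hypothetical population of per-example gradient values and estimates the variance of the mini-batch mean for several batch sizes:

```python
import numpy as np

# Empirical check that mini-batch gradient variance scales as sigma^2 / B.
# A fixed population of scalar per-example "gradients" stands in for a dataset.
rng = np.random.default_rng(0)
population = rng.normal(loc=1.0, scale=2.0, size=10000)

def minibatch_grad(B):
    """Mean gradient over a random mini-batch of size B."""
    return rng.choice(population, size=B, replace=False).mean()

var_by_B = {B: np.var([minibatch_grad(B) for _ in range(20000)]) for B in (1, 4, 16)}
print(var_by_B)  # each 4x increase in B cuts the variance by roughly 4x
```

With these numbers the B=16 estimator has roughly 1/16 the variance of the B=1 estimator, matching the $\sigma^2/B$ prediction up to sampling error.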
Momentum Methods
Polyak Heavy Ball Momentum
SGD with heavy ball momentum maintains a velocity vector:
$$v_{t+1} \;=\; \beta v_t + g_t, \qquad \theta_{t+1} \;=\; \theta_t - \eta\, v_{t+1}$$
where $\beta \in [0, 1)$ is the momentum coefficient (typically $\beta = 0.9$).
The velocity is an exponential moving average of past gradients. This smooths the update direction: instead of responding only to the current gradient, the optimizer continues in the direction it has been moving.
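The velocity-form update can be sketched as follows (toy ill-conditioned quadratic; the hyperparameters are illustrative, not tuned):

```python
import numpy as np

def heavy_ball_step(theta, v, g, lr=0.01, beta=0.9):
    """Polyak momentum: v <- beta * v + g; theta <- theta - lr * v."""
    v = beta * v + g
    return theta - lr * v, v

# Quadratic with curvatures 100 and 1: momentum accumulates the consistent
# gradient along the shallow direction, speeding up progress there.
A = np.diag([100.0, 1.0])
theta, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(500):
    theta, v = heavy_ball_step(theta, v, A @ theta)
print(np.linalg.norm(theta))
```

On this problem plain gradient descent with the same step size contracts the shallow direction by only 0.99 per step; the momentum iterates contract by about $\sqrt{\beta} \approx 0.95$ per step, which compounds to a much smaller error after 500 steps.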
Nesterov Accelerated Gradient Descent
Statement
Nesterov accelerated gradient descent evaluates the gradient at a "lookahead" point:
$$y_t \;=\; \theta_t + \beta_t(\theta_t - \theta_{t-1}), \qquad \theta_{t+1} \;=\; y_t - \eta\,\nabla F(y_t)$$
For convex, $L$-smooth functions, this achieves:
$$F(\theta_T) - F(\theta^*) \;\le\; O\!\left(\frac{L\|\theta_0-\theta^*\|^2}{T^2}\right)$$
This $O(1/T^2)$ rate is optimal for first-order methods on convex smooth functions (by the Nemirovsky-Yudin lower bound). Standard gradient descent achieves only $O(1/T)$.
Intuition
The key idea is to evaluate the gradient not at the current point but at where you expect to be after the momentum step. This "look ahead" allows the method to correct overshooting before it happens, rather than after. The momentum coefficient $\beta_t$ (e.g., the standard schedule $\beta_t = \frac{t-1}{t+2}$) grows toward 1 over time, gradually increasing the aggressiveness of the extrapolation.
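The lookahead update can be sketched on a deterministic quadratic, using the common $\beta_t = (t-1)/(t+2)$ schedule (toy problem, not a benchmark; the comparison counts steps to reach a fixed tolerance):

```python
import numpy as np

# Deterministic quadratic: compare steps-to-tolerance for plain GD vs Nesterov.
A = np.diag([100.0, 1.0])
grad = lambda x: A @ x
lr = 1.0 / 100.0  # 1/L

def gd_steps(tol=0.05, T=2000):
    x = np.array([1.0, 1.0])
    for t in range(1, T + 1):
        x = x - lr * grad(x)
        if np.linalg.norm(x) < tol:
            return t
    return T

def nesterov_steps(tol=0.05, T=2000):
    x, x_prev = np.array([1.0, 1.0]), np.array([1.0, 1.0])
    for t in range(1, T + 1):
        beta = (t - 1) / (t + 2)        # momentum coefficient growing toward 1
        y = x + beta * (x - x_prev)     # lookahead point
        x, x_prev = y - lr * grad(y), x # gradient evaluated at y, not x
        if np.linalg.norm(x) < tol:
            return t
    return T

print(nesterov_steps(), gd_steps())  # Nesterov reaches the tolerance in fewer steps
```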
Why It Matters
Nesterov acceleration demonstrates that momentum is not just a heuristic: it provably speeds up convergence for smooth convex optimization by a quadratic factor, from $O(1/T)$ to $O(1/T^2)$. In the stochastic setting, however, Nesterov acceleration does not improve the $O(1/\sqrt{T})$ rate because the noise dominates the acceleration benefit. This is why practical deep learning mostly uses Polyak momentum (which is simpler) rather than Nesterov momentum.
Failure Mode
The $O(1/T^2)$ rate requires exact gradients. With stochastic gradients, the benefit of acceleration is reduced because the noise variance is not affected by the momentum scheme. For stochastic optimization, Nesterov momentum offers at best a constant-factor improvement over Polyak momentum, not an asymptotic improvement.
Adam: Adaptive + Momentum
Adam combines first-moment estimation (momentum) with second-moment estimation (per-parameter adaptive learning rates). The full algorithm is covered in the Adam optimizer topic. Here we focus on the theoretical comparison with SGD.
Adam convergence: For convex problems, Adam with appropriate step size achieves $O(1/\sqrt{T})$ convergence, matching SGD. However, Reddi et al. (2018) showed that Adam can diverge on simple convex problems because the effective learning rate $\eta/\sqrt{v_t}$ can increase without bound when the second-moment estimate $v_t$ shrinks.
Adam vs SGD generalization: The key empirical finding (Wilson et al., 2017; Smith et al., 2021): SGD with momentum often finds solutions that generalize better than Adam, even when Adam converges faster on the training set.
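For reference, the core Adam update can be sketched in a few lines (toy quadratic; the full algorithm with weight decay and implementation details is covered in the Adam optimizer topic):

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step: first moment m (momentum) + second moment v (adaptive scale)."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)  # bias correction for zero initialization
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy run on F(theta) = 0.5 * ||theta||^2, whose gradient is theta.
theta, m, v = np.array([1.0, -1.0]), np.zeros(2), np.zeros(2)
for t in range(1, 5001):
    theta, m, v = adam_step(theta, theta, m, v, t)
print(np.linalg.norm(theta))
```

Note the signSGD-like behavior: early on, $\hat m/\sqrt{\hat v} \approx \pm 1$ per coordinate, so each parameter moves by roughly the learning rate per step regardless of gradient magnitude.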
Flat minima and generalization (heuristic). One line of evidence connecting optimizer choice to generalization goes through a PAC-Bayes-style heuristic. For a minimum at $\theta^*$ with Hessian $H$, perturbing the weights with Gaussian noise of variance $\sigma^2$ raises the training loss by roughly $\frac{\sigma^2}{2}\,\mathrm{tr}(H)$, and a simplified bound on $n$ training examples reads:
$$\text{generalization gap} \;\lesssim\; \sqrt{\frac{\frac{\sigma^2}{2}\,\mathrm{tr}(H) + \text{KL terms}}{n}}$$
Read as a heuristic, this suggests that flatter minima (smaller Hessian trace) carry tighter generalization bounds, and that an optimizer which consistently finds flatter minima may produce models that generalize better.
This is not a canonical theorem. Dinh et al. (2017) showed that sharpness measures based on Hessian eigenvalues are reparameterization dependent: for any minimum, one can rescale parameters to make it arbitrarily sharp without changing the function or its generalization. Rigorous PAC-Bayes bounds for neural networks (Dziugaite and Roy 2017; Jiang et al. 2020) add further terms and assumptions (a fixed data-independent prior, specific posterior families), and the tightest known bounds still do not cleanly isolate $\mathrm{tr}(H)$ as the controlling quantity. Treat the inequality above as an intuition pump, not a canonical result.
Empirically, SGD with large learning rate and small batch size injects more noise into the optimization, which is widely reported to correlate with flatter minima and better generalization on some vision benchmarks. Adam, by adapting the learning rate per parameter, often converges to sharper regions. This is one leading narrative for why SGD can beat Adam in generalization on certain tasks, but it is a narrative, not a proof.
Recent Variants
Three more recent Adam-family optimizers are worth naming. Lion (Chen et al. 2023, arXiv:2302.06675) replaces Adam's second moment with the sign of the momentum-smoothed gradient, matching AdamW at roughly half the optimizer memory. Sophia (Liu et al. 2023, arXiv:2305.14342) uses a diagonal Hessian (or Gauss-Newton) preconditioner with clipping and reports a roughly 2x wall-clock speedup on GPT-2-scale LLM pretraining. Adafactor (Shazeer and Stern 2018, arXiv:1804.04235) targets matrix-shaped parameters $W \in \mathbb{R}^{n \times m}$. Adam stores a full second-moment matrix at $O(nm)$ memory; Adafactor instead stores per-row and per-column statistics and reconstructs an approximate second moment from their outer product, cutting second-moment memory from $O(nm)$ to $O(n + m)$. This is why Adafactor is standard in T5-style training where the embedding and feed-forward matrices dominate optimizer state.
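Adafactor's factored second moment can be sketched in a few lines (single update, no decay or clipping; a random matrix stands in for a gradient — see Shazeer and Stern 2018 for the full recipe). A useful property of the rank-1 reconstruction is that its row and column sums exactly match those of the true second-moment matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.normal(size=(512, 256))  # gradient of a matrix-shaped parameter
G2 = G * G                       # elementwise second moment, n x m numbers

row = G2.sum(axis=1)             # n numbers
col = G2.sum(axis=0)             # m numbers
V_hat = np.outer(row, col) / G2.sum()  # rank-1 reconstruction of the n x m matrix

# Memory: n + m numbers instead of n * m, at the cost of a rank-1 approximation.
print(V_hat.shape, row.size + col.size, G2.size)
```

The per-entry approximation is coarse, but it preserves the row-wise and column-wise scale of the second moment, which is what the adaptive step sizes mainly depend on.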
Implicit Regularization of SGD
Smith, Dherin, Barrett, and De (2021, arXiv:2101.12176) give a concrete handle on what SGD noise does. Using backward error analysis, they show that SGD with step size $\eta$ and batch size $B$ does not track the loss $L(\theta)$. At leading order it follows a modified loss
$$\tilde L(\theta) \;=\; L(\theta) \;+\; \frac{\eta}{4}\,\mathbb{E}_k\big\|\nabla \hat L_k(\theta)\big\|^2$$
where $\hat L_k$ is the loss on mini-batch $k$.
The extra term penalizes gradient norm, so SGD implicitly prefers regions where $\nabla L$ stays small as you move around, which lines up with the informal picture of flat minima. The noise magnitude scales as $\eta/B$. The ratio of learning rate to batch size, not either alone, controls the strength of this implicit regularizer. This explains linear scaling rules (scale $\eta$ with $B$ to keep $\eta/B$ fixed) and why very large batches often need extra regularization to recover the generalization of small-batch SGD.
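A toy numeric illustration (the per-batch gradient values are hypothetical): two minima with the same full-batch loss, where per-batch gradients cancel on average at both, but the "sharp" minimum has much larger per-batch gradient norms and therefore a larger implicit penalty:

```python
import numpy as np

eta = 0.1  # illustrative step size
# Per-batch gradients at two minima of the full-batch loss (both average to 0):
grads_sharp = np.array([5.0, -5.0, 4.0, -4.0])  # batches disagree strongly
grads_flat = np.array([0.5, -0.5, 0.4, -0.4])   # batches nearly agree

def penalty(g):
    """Implicit regularizer (eta/4) * mean ||grad L_k||^2 from the modified loss."""
    return eta / 4 * np.mean(g ** 2)

print(penalty(grads_sharp), penalty(grads_flat))  # sharp minimum pays a larger penalty
```

Even though both points are minima of the original loss, the modified loss that SGD actually follows is lower at the flat one, biasing the trajectory toward it.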
The Muon Optimizer
Muon Optimizer
Muon (Momentum + Orthogonalization) is an optimizer designed specifically for hidden-layer weight matrices in neural networks. For a weight matrix $W \in \mathbb{R}^{m \times n}$:
- Compute the gradient $G_t = \nabla_W L$
- Apply momentum: $M_t = \beta M_{t-1} + G_t$
- Orthogonalize the update using Newton-Schulz iterations to approximate the matrix sign function: with SVD $M_t = U \Sigma V^\top$, the update becomes $O_t \approx U V^\top$, and $W_{t+1} = W_t - \eta\, O_t$
This projects the update onto the Stiefel manifold (the set of matrices with orthonormal columns/rows), ensuring that the update direction has equal magnitude across all singular value directions.
Muon applies this orthogonalized update to hidden-layer weights, while using Adam for embedding and normalization parameters.
Why orthogonalization? In standard optimizers, the update direction is dominated by the top singular vectors of the gradient. Low-magnitude singular directions, which may carry important structural information, are suppressed. By orthogonalizing, Muon ensures that all directions of the gradient are treated equally, regardless of their magnitude.
Empirical results: Muon achieved state-of-the-art results in the NanoGPT speedrun benchmark, reaching the target validation loss faster than AdamW. The key advantage appears to be in the early-to-mid phase of training, where Muon's orthogonalized updates help the model learn useful representations faster.
Newton-Schulz iterations: The orthogonalization step uses 5-10 iterations of $X_{k+1} = \tfrac{3}{2}X_k - \tfrac{1}{2}X_k X_k^\top X_k$ to approximate the polar decomposition $M = QH$, where $Q$ is orthogonal. This is cheaper than computing a full SVD and is numerically stable.
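A sketch of the cubic Newton-Schulz iteration in numpy (Muon's actual implementation uses a tuned quintic polynomial and low-precision arithmetic, which this sketch omits):

```python
import numpy as np

def newton_schulz_orthogonalize(M, steps=15):
    """Approximate the orthogonal polar factor U V^T of M with the cubic
    Newton-Schulz iteration X <- 1.5*X - 0.5 * X @ X.T @ X."""
    X = M / np.linalg.norm(M)  # scale so all singular values are <= 1 (convergence region)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

rng = np.random.default_rng(0)
M = rng.normal(size=(64, 32))
O = newton_schulz_orthogonalize(M)
print(np.max(np.abs(O.T @ O - np.eye(32))))  # columns are (nearly) orthonormal
```

Each iteration pushes every singular value of $X$ toward 1 while leaving the singular vectors unchanged, which is exactly the "equalize all singular directions" effect described above, using only matmuls.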
Practical Considerations
Muon is most beneficial for transformer hidden-layer weights (the large matrices). It is not a drop-in replacement for Adam on every parameter shape: the original Muon recipe applies orthogonalized updates to hidden 2D weights and falls back to Adam for embeddings, normalization scales, and biases. The Newton-Schulz step is more expensive than a single Adam update, but the cost is small relative to the matmuls in a forward pass and is often outweighed by faster convergence in the early-to-mid phase of training.
Muon variants and recent results. The Muon literature is moving quickly. Four threads are worth naming:
- Moonlight (Liu et al. 2025, arXiv:2502.16982) scales Muon to a 16B-parameter MoE LLM trained on 5.7T tokens. Reports matching Qwen 2.5 and LLaMA 3 1T-token pretraining baselines at roughly 52% of the training FLOPs. Adds weight-decay and RMS-matching tweaks that make Muon behave well without retuning learning rate versus an AdamW baseline.
- MuonClip / Kimi K2 (Moonshot AI 2025, arXiv:2507.20534) productionizes a clipped-update variant of Muon inside Kimi K2, a 1.04T-parameter MoE trained on 15.5T tokens with no observed loss spike. The clip is QK-Clip: per-head rescaling of query and key projection weights after every step whenever the per-head max attention logit exceeds a threshold $\tau$. The fix targets a Muon-specific failure mode where orthogonalized updates accelerate attention-logit growth that AdamW used to keep implicit; without QK-Clip the run diverges. This is the first widely-reported trillion-parameter pretrain using Muon on hidden weights.
- SOAP (Vyas et al. 2024, arXiv:2409.11321) connects Shampoo-style full-matrix preconditioning to Adam. It runs Adam in the eigenbasis of a slowly updated Shampoo preconditioner, giving most of Shampoo's geometric benefit at a fraction of the wall-clock cost. SOAP is the current strongest non-Muon preconditioned optimizer in the LLM regime.
- Theoretical grounding (Bernstein and Newhouse 2024, arXiv:2409.20325). Frames Muon's orthogonalized step as steepest descent under the spectral norm on weight matrices, and Shampoo under a Kronecker-factored norm. This gives a principled reason the two families look alike in practice and suggests that the right question is not "Muon versus Adam" but "which norm matches the geometry of the parameter space."
Learning Rate Schedules
The learning rate schedule is as important as the optimizer itself for LLM training:
Warmup + Cosine Decay Schedule
The standard LLM learning rate schedule has two phases:
Linear warmup (first $T_{\text{warmup}}$ steps):
$$\eta_t \;=\; \eta_{\max} \cdot \frac{t}{T_{\text{warmup}}}$$
Cosine decay (remaining $T - T_{\text{warmup}}$ steps):
$$\eta_t \;=\; \eta_{\min} + \frac{1}{2}\left(\eta_{\max} - \eta_{\min}\right)\left(1 + \cos\left(\pi \cdot \frac{t - T_{\text{warmup}}}{T - T_{\text{warmup}}}\right)\right)$$
Typical values: $\eta_{\max} \approx 10^{-4}$ to $10^{-3}$, $\eta_{\min} \approx 0.1\,\eta_{\max}$.
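The two-phase schedule is a few lines of code (the step counts and learning-rate values below are illustrative, not a recommendation):

```python
import math

def lr_schedule(t, T, T_warmup, lr_max, lr_min):
    """Linear warmup to lr_max over T_warmup steps, then cosine decay to lr_min at step T."""
    if t < T_warmup:
        return lr_max * t / T_warmup
    progress = (t - T_warmup) / (T - T_warmup)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# Example: 10k warmup steps in a 100k-step run, peak 3e-4, floor 3e-5.
print(lr_schedule(5_000, 100_000, 10_000, 3e-4, 3e-5))    # mid-warmup: half of peak
print(lr_schedule(10_000, 100_000, 10_000, 3e-4, 3e-5))   # end of warmup: peak
print(lr_schedule(100_000, 100_000, 10_000, 3e-4, 3e-5))  # end of run: floor
```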
Why warmup: In the early steps of training, Adam's second-moment estimate is unreliable (based on very few gradient samples). Large learning rates with noisy denominator estimates cause wild parameter updates. Warmup gives the second-moment estimate time to stabilize.
Why cosine decay: As training progresses and the model approaches a minimum, the gradient noise increasingly dominates the signal. Reducing the learning rate allows the optimizer to settle into the minimum rather than bouncing around it. The cosine shape provides a smooth decay that is empirically superior to linear or step decay for language models.
Common Confusions
SGD convergence theory does not directly apply to neural networks
The $O(1/\sqrt{T})$ and $O(1/T)$ rates assume convexity (or strong convexity). Neural network loss functions are non-convex with saddle points, local minima, and plateaus. The practical success of SGD on neural networks is empirically well-established but not fully explained by classical convex theory. The theory gives useful intuition (larger batches = less noise, lower learning rate = smaller noise floor) but the quantitative bounds are vacuous for typical deep learning problems.
Momentum does not help asymptotically in the stochastic setting
Nesterov acceleration gives $O(1/T^2)$ vs $O(1/T)$ for deterministic convex optimization, a provable speedup. But for stochastic optimization, both momentum and no-momentum SGD achieve $O(1/\sqrt{T})$, and this is optimal. Momentum helps with the constant factors (smoother trajectory, faster initial convergence) but does not improve the asymptotic rate. The practical benefits of momentum in deep learning are real but not captured by worst-case convergence theory.
Adam is not universally better than SGD
Adam converges faster on the training loss for most problems. But faster training convergence does not imply better generalization. For CNNs on image classification, SGD with momentum consistently generalizes better than Adam. For transformers on language modeling, Adam (specifically AdamW) is standard because the generalization gap is small and the faster convergence matters. The optimal optimizer depends on the architecture and task.
Summary
- SGD convex rate: $O(1/\sqrt{T})$; strongly convex: $O(1/T)$
- Nesterov acceleration: $O(1/T^2)$ for deterministic convex (optimal), but no help in stochastic setting asymptotically
- Mini-batch reduces variance by $1/B$ but does not improve the fundamental rate in total gradient evaluations
- Critical batch size $B_{\text{crit}}$: above this, larger batches waste compute
- Adam = momentum + adaptive LR; can converge to sharper minima than SGD
- Flat minima (small Hessian eigenvalues) correlate with better generalization
- Muon: orthogonalize gradient updates for hidden layers; all singular directions treated equally
- Learning rate schedule: warmup (stabilize Adam's second moment) + cosine decay (settle into minimum)
- SGD generalizes better on vision; Adam/AdamW standard for transformers
Exercises
Problem
For SGD on a convex $L$-smooth function with gradient variance $\sigma^2$ and initial distance $D = \|\theta_0 - \theta^*\|$, what is the optimal constant learning rate $\eta$ to minimize the bound after $T$ iterations? What suboptimality does the bound guarantee?
Problem
Prove that mini-batch SGD with batch size $B$ has gradient variance $\sigma^2/B$. Then explain why doubling the batch size and halving the number of iterations does not change the convergence bound.
Problem
The Muon optimizer orthogonalizes the gradient update using Newton-Schulz iterations to approximate $U V^\top$ from the SVD $G = U \Sigma V^\top$. Why would this help compared to standard Adam updates? Consider a scenario where the gradient matrix has one dominant singular value and many small ones, and analyze what happens with Adam vs Muon.
References
Canonical:
- Polyak, "Some methods of speeding up the convergence of iteration methods" (1964). Heavy-ball momentum.
- Nesterov, "A method for unconstrained convex minimization problem with the rate of convergence $O(1/k^2)$" (1983).
- Kingma and Ba, "Adam: A Method for Stochastic Optimization" (2015), arXiv:1412.6980.
- Reddi, Kale and Kumar, "On the Convergence of Adam and Beyond" (ICLR 2018), arXiv:1904.09237. Introduces AMSGrad and the Adam divergence counterexample.
- Shazeer and Stern, "Adafactor: Adaptive Learning Rates with Sublinear Memory Cost" (2018), arXiv:1804.04235.
Adam vs SGD and generalization:
- Wilson, Roelofs, Stern, Srebro and Recht, "The Marginal Value of Adaptive Gradient Methods in Machine Learning" (NeurIPS 2017), arXiv:1705.08292.
- Dinh, Pascanu, Bengio and Bengio, "Sharp Minima Can Generalize for Deep Nets" (ICML 2017), arXiv:1703.04933. Shows reparameterization sensitivity of Hessian-based sharpness.
- Dziugaite and Roy, "Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data" (UAI 2017), arXiv:1703.11008.
- Jiang, Neyshabur, Mobahi, Krishnan and Bengio, "Fantastic Generalization Measures and Where to Find Them" (ICLR 2020), arXiv:1912.02178.
- Smith, Dherin, Barrett and De, "On the Origin of Implicit Regularization in Stochastic Gradient Descent" (ICLR 2021), arXiv:2101.12176.
Modern variants:
- Chen et al., "Symbolic Discovery of Optimization Algorithms" (2023), arXiv:2302.06675. Lion optimizer.
- Liu et al., "Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training" (2023), arXiv:2305.14342.
- Jordan, "Muon: An optimizer for the hidden layers of neural networks" (2024). The modded-nanogpt write-up.
- Vyas et al., "SOAP: Improving and Stabilizing Shampoo using Adam" (2024), arXiv:2409.11321.
- Bernstein and Newhouse, "Old Optimizer, New Norm: An Anthology" (2024), arXiv:2409.20325.
- Liu et al., "Moonlight: Scaling Muon to 16B-parameter MoE Language Models" (2025), arXiv:2502.16982.
Next Topics
Optimizer theory connects to:
- Scaling laws: how optimizer choice and learning rate schedule affect compute-optimal training
- Preconditioned optimizers: Shampoo, K-FAC, and the full-matrix preconditioning story
- Riemannian optimization: the geometric framework underlying Muon's orthogonalization
Last reviewed: April 26, 2026
Required prerequisites
- Automatic Differentiation (layer 1, tier 1)
- Convex Optimization Basics (layer 1, tier 1)
- Gradient Descent Variants (layer 1, tier 1)
- Adam Optimizer (layer 2, tier 1)
- Preconditioned Optimizers: Shampoo, K-FAC, and Natural Gradient (layer 3, tier 2)
Derived topics
- Scaling Laws (layer 4, tier 1)
- Continual Learning and Forgetting (layer 3, tier 2)
- Federated Learning (layer 3, tier 2)
- Distributed Training Theory (layer 5, tier 3)