Optimization Function Classes
Gradient Descent Variants
From full-batch to stochastic to mini-batch gradient descent, plus momentum, Nesterov acceleration, AdaGrad, RMSProp, and Adam. Why mini-batch SGD with momentum is the practical default.
Prerequisites
- Differentiation in Rⁿ
- Convex Optimization Basics
Why This Matters
Every neural network is trained by some variant of gradient descent (or gradient ascent when maximizing a reward or likelihood). The choice of optimizer affects convergence speed, final model quality, and generalization. Understanding the differences between SGD, momentum, and Adam is necessary for debugging training and making informed choices.
Full-Batch Gradient Descent
Full-Batch Gradient Descent
Given a loss function $L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell(\theta; x_i)$, full-batch gradient descent updates:

$$\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$$

where $\eta > 0$ is the learning rate and $\nabla L(\theta_t)$ is computed using all $N$ training examples.
Full-batch GD computes the exact gradient. Each step is expensive ($O(N)$ per update), and for large $N$ this is impractical. A single epoch requires one gradient computation and one parameter update.
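As a concrete (hypothetical) illustration, here is full-batch GD on a synthetic least-squares problem; the data, learning rate, and step count are arbitrary choices:

```python
import numpy as np

# Minimal full-batch GD on a synthetic least-squares problem (a sketch;
# the data, learning rate, and step count are arbitrary choices).
rng = np.random.default_rng(0)
N, d = 200, 3
X = rng.normal(size=(N, d))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true                      # noise-free targets

theta = np.zeros(d)
eta = 0.1
for _ in range(500):
    grad = X.T @ (X @ theta - y) / N    # exact gradient: one O(N) pass
    theta -= eta * grad                 # one parameter update per pass

assert np.allclose(theta, theta_true, atol=1e-3)
```

Note that each of the 500 updates touches all $N$ rows of `X`, which is the cost the text is referring to.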
Stochastic Gradient Descent
Stochastic Gradient Descent (SGD)
At each step, sample a single example $i_t$ uniformly at random and update:

$$\theta_{t+1} = \theta_t - \eta \nabla \ell(\theta_t; x_{i_t})$$

The stochastic gradient is an unbiased estimate of the full gradient: $\mathbb{E}_{i_t}[\nabla \ell(\theta; x_{i_t})] = \nabla L(\theta)$.
SGD is cheap per step ($O(1)$ per update) but noisy. The noise can be beneficial: it helps escape shallow local minima and saddle points. But it also means the iterates oscillate and do not converge to an exact minimum without a decreasing learning rate schedule.
Mini-Batch SGD
Mini-Batch Gradient Descent
At each step, sample a mini-batch $\mathcal{B}_t$ of size $B$ and update:

$$\theta_{t+1} = \theta_t - \frac{\eta}{B} \sum_{i \in \mathcal{B}_t} \nabla \ell(\theta_t; x_i)$$
Typical batch sizes: 32, 64, 128, 256, or 512.
Mini-batch SGD is the practical default. It balances the noise reduction of averaging over multiple samples with the computational efficiency of not using all samples. Larger batches reduce variance (smoother updates) but cost more per step. Smaller batches add noise but allow more updates per epoch.
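A minimal sketch of the mini-batch loop on a synthetic least-squares objective (data, batch size, and learning rate are arbitrary choices); $B = 1$ would recover plain SGD and $B = N$ full-batch GD:

```python
import numpy as np

# Mini-batch SGD on a synthetic least-squares problem (a sketch; data,
# batch size, and learning rate are arbitrary). B = 1 would be plain SGD,
# B = N full-batch GD.
rng = np.random.default_rng(0)
N, d, B = 512, 3, 32
X = rng.normal(size=(N, d))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true

theta = np.zeros(d)
eta = 0.05
for _ in range(100):                                # epochs
    perm = rng.permutation(N)                       # reshuffle each epoch
    for start in range(0, N, B):
        idx = perm[start:start + B]
        Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (Xb @ theta - yb) / B         # O(B) per update
        theta -= eta * grad

assert np.allclose(theta, theta_true, atol=1e-3)
```

Each epoch now performs $N/B$ parameter updates instead of one, which is where the extra progress per epoch comes from.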
SGD Convergence for Smooth Convex Functions
Statement
Let $L$ be convex and $\beta$-smooth with minimizer $\theta^*$. Let the stochastic gradient $g_t$ have bounded variance: $\mathbb{E}\|g_t - \nabla L(\theta_t)\|^2 \le \sigma^2$. With learning rate $\eta = \min\{1/\beta,\ \|\theta_0 - \theta^*\| / (\sigma \sqrt{T})\}$, after $T$ steps of SGD the averaged iterate $\bar\theta_T$ satisfies:

$$\mathbb{E}[L(\bar\theta_T)] - L(\theta^*) \le O\!\left(\frac{\beta \|\theta_0 - \theta^*\|^2}{T} + \frac{\sigma \|\theta_0 - \theta^*\|}{\sqrt{T}}\right)$$

The convergence rate is $O(1/\sqrt{T})$, which is slower than the $O(1/T)$ rate of full-batch GD for smooth convex functions.
Intuition
The $O(\sigma/\sqrt{T})$ term is the price of stochasticity: no matter how many steps you take, the noise in the gradient prevents exact convergence with a fixed learning rate. You need a decreasing schedule (e.g. $\eta_t \propto 1/\sqrt{t}$) for exact convergence, which slows things further.
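The noise floor can be seen numerically. The sketch below runs SGD on a 1-D quadratic with artificial gradient noise (all constants are arbitrary choices) and compares a fixed step size against a $1/\sqrt{t}$ schedule:

```python
import numpy as np

# SGD on L(theta) = theta^2 / 2 with additive gradient noise (a sketch;
# all constants are arbitrary). With a fixed step the iterate hovers at a
# noise floor; a 1/sqrt(t) schedule keeps shrinking the error.
def run(decay, seed):
    rng = np.random.default_rng(seed)
    theta, tail = 5.0, []
    for t in range(1, 20001):
        g = theta + rng.normal()                  # unbiased gradient, variance 1
        eta = 0.1 / np.sqrt(t) if decay else 0.1
        theta -= eta * g
        if t > 19000:
            tail.append(theta * theta)            # squared error near the end
    return float(np.mean(tail))

fixed, decayed = run(False, seed=1), run(True, seed=2)
assert decayed < fixed / 10   # the schedule gets far below the fixed-step floor
```

With the fixed step the final squared error stalls around $\eta \sigma^2 / (2 - \eta)$, while the decaying schedule keeps pushing it down.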
Proof Sketch
Apply the descent lemma: $L(\theta_{t+1}) \le L(\theta_t) + \langle \nabla L(\theta_t), \theta_{t+1} - \theta_t \rangle + \frac{\beta}{2}\|\theta_{t+1} - \theta_t\|^2$. Substitute the SGD update $\theta_{t+1} = \theta_t - \eta g_t$, take expectations, and telescope the sum. The variance term $\eta^2 \beta \sigma^2 / 2$ per step prevents the telescoping from collapsing to zero.
Why It Matters
This rate explains why SGD is slow in theory but fast in practice. The per-iteration cost is $O(N)$ for full-batch GD but $O(1)$ for SGD. For large $N$, SGD reaches a given accuracy much sooner in wall-clock time despite the worse rate per iteration.
Failure Mode
The bound requires convexity. For non-convex objectives (neural networks), SGD convergence to a stationary point (not a minimum) has rate $\min_t \mathbb{E}\|\nabla L(\theta_t)\|^2 = O(1/\sqrt{T})$ under smoothness, but there is no guarantee the stationary point is good.
Stationary Points in Nonconvex Optimization
For nonconvex $L$, not every stationary point is a local minimum. Saddle points satisfy $\nabla L(\theta) = 0$ but have directions of negative curvature. The standard taxonomy:
First-Order Stationary Point (FOSP)
A point $x$ is a first-order stationary point of a differentiable $f$ if and only if $\nabla f(x) = 0$. An $\epsilon$-FOSP satisfies $\|\nabla f(x)\| \le \epsilon$. FOSPs include local minima, local maxima, and saddle points.
Second-Order Stationary Point (SOSP)
A point $x$ is a second-order stationary point of a twice-differentiable $f$ if and only if $\nabla f(x) = 0$ and $\nabla^2 f(x) \succeq 0$ (no negative eigenvalue). An $\epsilon$-SOSP satisfies $\|\nabla f(x)\| \le \epsilon$ and $\lambda_{\min}(\nabla^2 f(x)) \ge -\sqrt{\rho \epsilon}$ for a Hessian-Lipschitz constant $\rho$. Strict saddle points are FOSPs but not SOSPs.
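These definitions can be checked mechanically. The sketch below classifies the origin of the toy function $f(x, y) = x^2 - y^2$ (a hypothetical example) by testing the gradient norm and the smallest Hessian eigenvalue:

```python
import numpy as np

# Classifying the origin of f(x, y) = x^2 - y^2 (a hypothetical example):
# the gradient vanishes there (FOSP) but the Hessian has a negative
# eigenvalue, so it is a saddle point, not an SOSP.
def grad(p):
    x, y = p
    return np.array([2 * x, -2 * y])

hessian = np.array([[2.0, 0.0], [0.0, -2.0]])   # constant for this f

origin = np.array([0.0, 0.0])
is_fosp = np.linalg.norm(grad(origin)) <= 1e-12
eigmin = np.linalg.eigvalsh(hessian).min()       # -2: negative curvature in y
is_sosp = is_fosp and eigmin >= 0.0

assert is_fosp and not is_sosp   # first-order but not second-order stationary
```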
Plain gradient descent can get stuck at strict saddle points. Jin, Ge, Netrapalli, Kakade, Jordan (2017), "How to Escape Saddle Points Efficiently", showed that perturbed gradient descent (adding occasional isotropic noise) finds an $\epsilon$-SOSP in $\tilde{O}(1/\epsilon^2)$ iterations for Hessian-Lipschitz objectives, where $\tilde{O}$ hides polylogarithmic factors. This matches the $O(1/\epsilon^2)$ rate for finding an $\epsilon$-FOSP up to polylogarithmic factors, so escaping saddles is nearly free.
Momentum
SGD with Momentum
Momentum adds a velocity term that accumulates past gradients:

$$v_{t+1} = \mu v_t + \nabla L(\theta_t), \qquad \theta_{t+1} = \theta_t - \eta v_{t+1}$$

where $\mu \in [0, 1)$ is the momentum coefficient, typically 0.9.
Momentum smooths out oscillations in directions with high curvature and accelerates progress in directions with consistent gradient. It acts like a heavy ball rolling downhill: it keeps moving even when the local gradient is small.
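A minimal sketch of the heavy-ball update on an ill-conditioned quadratic (the curvatures, learning rate, and momentum coefficient are arbitrary choices), compared against plain GD:

```python
import numpy as np

# Heavy-ball momentum vs plain GD on an ill-conditioned quadratic
# (a sketch; all constants are arbitrary choices).
H = np.diag([1.0, 100.0])          # curvatures differ by 100x

def grad(theta):
    return H @ theta

def optimize(mu, steps=300, eta=0.01):
    theta = np.array([1.0, 1.0])
    v = np.zeros(2)
    for _ in range(steps):
        v = mu * v + grad(theta)   # velocity accumulates past gradients
        theta = theta - eta * v
    return np.linalg.norm(theta)

plain = optimize(mu=0.0)           # vanilla GD, limited by the stiff direction
heavy = optimize(mu=0.9)           # momentum accelerates the flat direction
assert heavy < plain
```

The learning rate is capped by the stiff ($\lambda = 100$) direction, so plain GD crawls along the flat ($\lambda = 1$) direction; momentum makes far more progress there in the same number of steps.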
Nesterov Momentum
Nesterov's modification computes the gradient at the "look-ahead" position:

$$v_{t+1} = \mu v_t + \nabla L(\theta_t - \eta \mu v_t), \qquad \theta_{t+1} = \theta_t - \eta v_{t+1}$$
The idea: evaluate the gradient where momentum would take you, not where you currently are. This provides a corrective signal that reduces overshooting.
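The look-ahead variant changes only where the gradient is evaluated. A sketch under the same toy-quadratic assumptions as above (constants arbitrary):

```python
import numpy as np

# Nesterov momentum evaluates the gradient at the look-ahead point
# theta - eta*mu*v (a sketch; the quadratic and constants are arbitrary).
H = np.diag([1.0, 100.0])

def grad(theta):
    return H @ theta

theta, v = np.array([1.0, 1.0]), np.zeros(2)
eta, mu = 0.01, 0.9
for _ in range(300):
    lookahead = theta - eta * mu * v   # where momentum alone would take us
    v = mu * v + grad(lookahead)       # corrective gradient at that point
    theta = theta - eta * v

assert np.linalg.norm(theta) < 1e-3   # converges without overshooting
```

Several equivalent parameterizations of Nesterov momentum appear in the literature; this one matches the update equations above.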
Adaptive Learning Rate Methods
AdaGrad
AdaGrad
AdaGrad scales the learning rate per parameter by the accumulated squared gradients:

$$G_t = G_{t-1} + g_t \odot g_t, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t} + \epsilon} \odot g_t$$

where $g_t = \nabla L(\theta_t)$, $\odot$ is element-wise multiplication, and $\epsilon$ prevents division by zero.
AdaGrad gives larger updates to infrequent parameters and smaller updates to frequent ones. This is good for sparse features. The problem: $G_t$ only grows, so the effective learning rate monotonically decreases and eventually becomes too small to make progress.
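The monotone decay of the effective learning rate is easy to verify numerically (a sketch on a 1-D quadratic with arbitrary constants):

```python
import numpy as np

# AdaGrad on L(theta) = theta^2 / 2 (a sketch; constants are arbitrary).
# G accumulates squared gradients, so eta / (sqrt(G) + eps) only shrinks.
eta, eps = 0.5, 1e-8
theta, G = 5.0, 0.0
steps = []
for _ in range(100):
    g = theta                          # gradient of theta^2 / 2
    G += g * g                         # accumulation never decreases
    step = eta / (np.sqrt(G) + eps)    # effective per-parameter learning rate
    steps.append(step)
    theta -= step * g

# The effective learning rate is monotonically non-increasing.
assert all(a >= b for a, b in zip(steps, steps[1:]))
```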
RMSProp
RMSProp fixes AdaGrad's decaying learning rate by using an exponential moving average instead of an accumulation:

$$s_t = \rho s_{t-1} + (1 - \rho)\, g_t \odot g_t, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{s_t} + \epsilon} \odot g_t$$

where $\rho \approx 0.9$. The running average forgets old gradients, so the effective learning rate adapts to recent gradient magnitudes rather than all past gradients.
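A sketch of the RMSProp recursion on the same kind of 1-D quadratic (all constants arbitrary):

```python
import numpy as np

# RMSProp on L(theta) = theta^2 / 2 (a sketch; constants are arbitrary).
# The EMA of squared gradients replaces AdaGrad's ever-growing sum.
eta, rho, eps = 0.01, 0.9, 1e-8

def rmsprop_step(theta, g, s):
    s = rho * s + (1 - rho) * g * g            # EMA forgets old gradients
    theta = theta - eta * g / (np.sqrt(s) + eps)
    return theta, s

theta, s = 5.0, 0.0
for _ in range(200):
    g = theta                                  # gradient of theta^2 / 2
    theta, s = rmsprop_step(theta, g, s)

assert 2.0 < theta < 4.0   # steady progress at roughly eta per step
```

Because $s$ tracks the recent gradient magnitude, the update behaves like a step of size roughly $\eta$ in the sign direction rather than shrinking to zero.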
Adam
Adam Optimizer
Statement
Adam combines momentum (first moment) and RMSProp (second moment) with bias correction. Let $g_t = \nabla L(\theta_t)$:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t \odot g_t$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \qquad \theta_{t+1} = \theta_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

Default hyperparameters: $\eta = 10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
Intuition
Adam adapts the learning rate per parameter using the ratio of the first moment (gradient direction) to the square root of the second moment (gradient magnitude). Parameters with consistently large gradients get smaller effective learning rates; parameters with small gradients get larger ones. The bias correction accounts for the zero initialization of $m_0$ and $v_0$.
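Putting the four update equations together, a minimal Adam loop on a toy quadratic might look like this (a sketch; the objective and iteration count are arbitrary, the hyperparameters are the defaults above):

```python
import numpy as np

# A minimal Adam loop on L(theta) = ||theta||^2 / 2 (a sketch; the
# objective and iteration count are arbitrary, hyperparameters are the
# paper defaults).
eta, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
theta = np.array([5.0, -3.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)

for t in range(1, 10001):
    g = theta                                   # gradient of ||theta||^2 / 2
    m = beta1 * m + (1 - beta1) * g             # first moment (momentum)
    v = beta2 * v + (1 - beta2) * g * g         # second moment (RMSProp-style)
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)

assert np.linalg.norm(theta) < 0.05
```

Note that far from the optimum the effective step is close to $\eta$ per parameter regardless of gradient scale, which is why Adam's default learning rate transfers across problems better than SGD's.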
Proof Sketch
The bias correction ensures $\mathbb{E}[\hat{m}_t] \approx \mathbb{E}[g_t]$ and $\mathbb{E}[\hat{v}_t] \approx \mathbb{E}[g_t \odot g_t]$ under stationarity. Without correction, the initial estimates are biased toward zero because $m_0 = 0$ and $v_0 = 0$: unrolling the recursion gives $\mathbb{E}[m_t] = (1 - \beta_1^t)\, \mathbb{E}[g]$, so dividing by $1 - \beta_1^t$ removes the bias.
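The effect of the correction is easy to verify at $t = 1$ (a sketch with a hypothetical gradient value $g_1 = 2$):

```python
import numpy as np

# Numeric check of Adam's bias correction at t = 1, with m_0 = v_0 = 0
# and a hypothetical gradient g1 = 2.0.
beta1, beta2, g1 = 0.9, 0.999, 2.0

m1 = beta1 * 0.0 + (1 - beta1) * g1        # uncorrected: 0.2, biased toward 0
v1 = beta2 * 0.0 + (1 - beta2) * g1 ** 2   # uncorrected: 0.004
m1_hat = m1 / (1 - beta1 ** 1)             # corrected: recovers g1 = 2.0
v1_hat = v1 / (1 - beta2 ** 1)             # corrected: recovers g1^2 = 4.0

assert np.isclose(m1, 0.2) and np.isclose(m1_hat, g1)
assert np.isclose(v1, 0.004) and np.isclose(v1_hat, g1 ** 2)
```

Uncorrected, the first step would use moment estimates 10x and 1000x too small; the correction factors $1 - \beta_1^1 = 0.1$ and $1 - \beta_2^1 = 0.001$ undo exactly that shrinkage.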
Why It Matters
Adam is the most widely used optimizer for deep learning. Introduced by Kingma and Ba (2015), it converges reliably with its default hyperparameters on most supervised deep learning tasks, including Transformer language models at standard scales. Its per-parameter adaptive learning rate means it requires less tuning than SGD with momentum.
Failure Mode
Adam can converge to worse solutions than SGD with momentum on some problems, particularly in computer vision (ResNets on ImageNet). The adaptive learning rate can be too aggressive, leading to sharp minima that generalize poorly. AdamW (Adam with decoupled weight decay) fixes some of these issues.
Choosing an Optimizer
| Method | Update | Best for | Watch out for |
|---|---|---|---|
| SGD | $\theta \leftarrow \theta - \eta g$ | Convex objectives; small models where every FLOP matters | Oscillation around the optimum; needs a decreasing schedule for exact convergence |
| SGD + momentum | Velocity $v \leftarrow \mu v + g$ accumulates past gradients | Vision with strong augmentation; settings where you can afford LR tuning | Wrong $\mu$ overshoots; momentum hides instability |
| AdaGrad | Per-parameter LR scaled by $1/(\sqrt{G_t} + \epsilon)$ | Sparse features (bag-of-words NLP, sparse recommenders) | Effective LR decays to zero; useless for long training |
| RMSProp | EMA of squared gradient instead of full sum | Non-stationary objectives; RNN training | Lacks momentum; usually superseded by Adam |
| Adam | Momentum + per-param adaptive LR + bias correction | Default for transformer pretraining; works without tuning on most problems | Default LR is not comparable to SGD's |
| AdamW | Adam with decoupled weight decay | LLM pretraining; any Adam workload that uses regularization | Tune weight decay separately from L2 regularization; they are no longer the same |
Practical defaults: AdamW for transformers and most deep learning, SGD-momentum for computer vision when generalization matters and you can afford LR sweeps, AdaGrad only for genuinely sparse problems.
Common Confusions
Learning rate for Adam is not comparable to SGD learning rate
Adam's default learning rate is $10^{-3}$, while SGD often uses values around $10^{-1}$. These are not comparable because Adam divides by the second moment estimate. The effective step size for Adam is $\eta\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$, which adapts per parameter.
Batch size affects the implicit learning rate
Doubling the batch size halves the gradient variance. If you double the batch size without adjusting the learning rate, you get a smoother but slower optimizer. The linear scaling rule suggests multiplying the learning rate by the batch size increase factor, but this is an approximation that breaks for very large batches.
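The variance-halving claim can be checked empirically. The sketch below draws synthetic per-example gradients (an arbitrary distribution) and compares the variance of batch-mean gradients at $B = 32$ and $B = 64$:

```python
import numpy as np

# Empirical check that doubling the batch size roughly halves the variance
# of the averaged gradient (a sketch with synthetic per-example gradients).
rng = np.random.default_rng(0)
per_example = rng.normal(loc=1.0, scale=2.0, size=100_000)

def batch_grad_var(B, trials=2000):
    # Sample `trials` mini-batches of size B and average each one.
    draws = rng.choice(per_example, size=(trials, B))
    return draws.mean(axis=1).var()

v32, v64 = batch_grad_var(32), batch_grad_var(64)
assert 1.5 < v32 / v64 < 2.7   # ratio is close to 2
```

This is the statistical basis of the linear scaling rule: if the update noise halves, you can roughly double the learning rate and keep the same noise-to-signal ratio, within limits.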
Exercises
Problem
You train a model with SGD (no momentum) and learning rate 0.1. Training loss oscillates wildly. What two modifications would you try first, and why?
Problem
Adam uses bias correction terms $1 - \beta_1^t$ and $1 - \beta_2^t$. Explain what happens at $t = 1$ without bias correction. Compute the uncorrected and corrected first moment estimates at $t = 1$ with $\beta_1 = 0.9$ and gradient $g_1$.
References
Canonical:
- Robbins & Monro, "A Stochastic Approximation Method" (1951)
- Kingma & Ba, "Adam: A Method for Stochastic Optimization" (2015), ICLR
Current:
- Loshchilov & Hutter, "Decoupled Weight Decay Regularization" (2019), ICLR (AdamW)
- Gur-Ari, Roberts & Dyer, "Gradient Descent Happens in a Tiny Subspace" (2018), arXiv:1812.04754, for intuition on adaptive methods
- Jin, Ge, Netrapalli, Kakade & Jordan, "How to Escape Saddle Points Efficiently" (2017), ICML; arXiv:1703.00887. Perturbed GD finds an $\epsilon$-SOSP in $\tilde{O}(1/\epsilon^2)$ iterations.
- Boyd & Vandenberghe, Convex Optimization (2004), Chapters 2-5
- Nesterov, Introductory Lectures on Convex Optimization (2004), Chapters 1-3
Next Topics
- Learning rate scheduling: how to adjust the learning rate during training
- Optimizer theory: deeper analysis of SGD, Adam, and newer optimizers
Last reviewed: April 14, 2026