Optimization Function Classes
Gradient Descent Variants
From full-batch to stochastic to mini-batch gradient descent, plus momentum, Nesterov acceleration, AdaGrad, RMSProp, and Adam. Why mini-batch SGD with momentum is the practical default.
Prerequisites
- Differentiation in Rⁿ
- Convex Optimization Basics
Why This Matters
Every neural network is trained by some variant of gradient descent (or gradient ascent when maximizing a reward or likelihood). The choice of optimizer affects convergence speed, final model quality, and generalization. Understanding the differences between SGD, momentum, and Adam is necessary for debugging training and making informed choices.
Full-Batch Gradient Descent
Full-Batch Gradient Descent
Given a loss function $L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell(\theta; x_i)$, full-batch gradient descent updates:

$$\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$$

where $\eta > 0$ is the learning rate and $\nabla L(\theta_t)$ is computed using all $N$ training examples.
Full-batch GD computes the exact gradient. Each step is expensive ($O(N)$ per update), and for large $N$ this is impractical. A single epoch requires one gradient computation and one parameter update.
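As a concrete (hypothetical) illustration, here is full-batch GD on a synthetic least-squares problem; the data, learning rate, and step count are arbitrary choices:

```python
import numpy as np

# Minimal full-batch GD on a synthetic least-squares problem (a sketch;
# the data, learning rate, and step count are arbitrary choices).
rng = np.random.default_rng(0)
N, d = 200, 3
X = rng.normal(size=(N, d))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true                      # noise-free targets

theta = np.zeros(d)
eta = 0.1
for _ in range(500):
    grad = X.T @ (X @ theta - y) / N    # exact gradient: one O(N) pass
    theta -= eta * grad                 # one parameter update per pass

assert np.allclose(theta, theta_true, atol=1e-3)
```

Note that each of the 500 updates touches all $N$ rows of `X`, which is the cost the text is referring to.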
Stochastic Gradient Descent
Stochastic Gradient Descent (SGD)
At each step, sample a single example $i_t$ uniformly at random and update:

$$\theta_{t+1} = \theta_t - \eta \nabla \ell(\theta_t; x_{i_t})$$

The stochastic gradient is an unbiased estimate of the full gradient: $\mathbb{E}_{i_t}[\nabla \ell(\theta; x_{i_t})] = \nabla L(\theta)$.
SGD is cheap per step ($O(1)$ per update) but noisy. The noise can be beneficial: it helps escape shallow local minima and saddle points. But it also means the iterates oscillate and do not converge to an exact minimum without a decreasing learning rate schedule.
Mini-Batch SGD
Mini-Batch Gradient Descent
At each step, sample a mini-batch $\mathcal{B}_t$ of size $B$ and update:

$$\theta_{t+1} = \theta_t - \frac{\eta}{B} \sum_{i \in \mathcal{B}_t} \nabla \ell(\theta_t; x_i)$$
Typical batch sizes: 32, 64, 128, 256, or 512.
Mini-batch SGD is the practical default. It balances the noise reduction of averaging over multiple samples with the computational efficiency of not using all samples. Larger batches reduce variance (smoother updates) but cost more per step. Smaller batches add noise but allow more updates per epoch.
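A minimal sketch of the mini-batch loop on a synthetic least-squares objective (data, batch size, and learning rate are arbitrary choices); $B = 1$ would recover plain SGD and $B = N$ full-batch GD:

```python
import numpy as np

# Mini-batch SGD on a synthetic least-squares problem (a sketch; data,
# batch size, and learning rate are arbitrary). B = 1 would be plain SGD,
# B = N full-batch GD.
rng = np.random.default_rng(0)
N, d, B = 512, 3, 32
X = rng.normal(size=(N, d))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true

theta = np.zeros(d)
eta = 0.05
for _ in range(100):                                # epochs
    perm = rng.permutation(N)                       # reshuffle each epoch
    for start in range(0, N, B):
        idx = perm[start:start + B]
        Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (Xb @ theta - yb) / B         # O(B) per update
        theta -= eta * grad

assert np.allclose(theta, theta_true, atol=1e-3)
```

Each epoch now performs $N/B$ parameter updates instead of one, which is where the extra progress per epoch comes from.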
SGD Convergence for Smooth Convex Functions
Statement
Let $L$ be convex and $\beta$-smooth with minimizer $\theta^*$. Let the stochastic gradient $g_t$ have bounded variance: $\mathbb{E}\|g_t - \nabla L(\theta_t)\|^2 \le \sigma^2$. With learning rate $\eta = \min\{1/\beta,\ \|\theta_0 - \theta^*\| / (\sigma \sqrt{T})\}$, after $T$ steps of SGD the averaged iterate $\bar\theta_T$ satisfies:

$$\mathbb{E}[L(\bar\theta_T)] - L(\theta^*) \le O\!\left(\frac{\beta \|\theta_0 - \theta^*\|^2}{T} + \frac{\sigma \|\theta_0 - \theta^*\|}{\sqrt{T}}\right)$$

The convergence rate is $O(1/\sqrt{T})$, which is slower than the $O(1/T)$ rate of full-batch GD for smooth convex functions.
Intuition
The $O(\sigma/\sqrt{T})$ term is the price of stochasticity: no matter how many steps you take, the noise in the gradient prevents exact convergence with a fixed learning rate. You need a decreasing schedule (e.g. $\eta_t \propto 1/\sqrt{t}$) for exact convergence, which slows things further.
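The noise floor can be seen numerically. The sketch below runs SGD on a 1-D quadratic with artificial gradient noise (all constants are arbitrary choices) and compares a fixed step size against a $1/\sqrt{t}$ schedule:

```python
import numpy as np

# SGD on L(theta) = theta^2 / 2 with additive gradient noise (a sketch;
# all constants are arbitrary). With a fixed step the iterate hovers at a
# noise floor; a 1/sqrt(t) schedule keeps shrinking the error.
def run(decay, seed):
    rng = np.random.default_rng(seed)
    theta, tail = 5.0, []
    for t in range(1, 20001):
        g = theta + rng.normal()                  # unbiased gradient, variance 1
        eta = 0.1 / np.sqrt(t) if decay else 0.1
        theta -= eta * g
        if t > 19000:
            tail.append(theta * theta)            # squared error near the end
    return float(np.mean(tail))

fixed, decayed = run(False, seed=1), run(True, seed=2)
assert decayed < fixed / 10   # the schedule gets far below the fixed-step floor
```

With the fixed step the final squared error stalls around $\eta \sigma^2 / (2 - \eta)$, while the decaying schedule keeps pushing it down.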
Proof Sketch
Apply the descent lemma: $L(\theta_{t+1}) \le L(\theta_t) + \langle \nabla L(\theta_t), \theta_{t+1} - \theta_t \rangle + \frac{\beta}{2}\|\theta_{t+1} - \theta_t\|^2$. Substitute the SGD update $\theta_{t+1} = \theta_t - \eta g_t$, take expectations, and telescope the sum. The variance term $\eta^2 \beta \sigma^2 / 2$ per step prevents the telescoping from collapsing to zero.
Why It Matters
This rate explains why SGD is slow in theory but fast in practice. The per-iteration cost is $O(N)$ for full-batch GD but $O(1)$ for SGD. For large $N$, SGD reaches a given accuracy much sooner in wall-clock time despite the worse rate per iteration.
Failure Mode
The bound requires convexity. For non-convex objectives (neural networks), SGD convergence to a stationary point (not a minimum) has rate $\min_t \mathbb{E}\|\nabla L(\theta_t)\|^2 = O(1/\sqrt{T})$ under smoothness, but there is no guarantee the stationary point is good.
Stationary Points in Nonconvex Optimization
For nonconvex $L$, not every stationary point is a local minimum. Saddle points satisfy $\nabla L(\theta) = 0$ but have directions of negative curvature. The standard taxonomy:
First-Order Stationary Point (FOSP)
A point $x$ is a first-order stationary point of a differentiable $f$ if and only if $\nabla f(x) = 0$. An $\epsilon$-FOSP satisfies $\|\nabla f(x)\| \le \epsilon$. FOSPs include local minima, local maxima, and saddle points.
Second-Order Stationary Point (SOSP)
A point $x$ is a second-order stationary point of a twice-differentiable $f$ if and only if $\nabla f(x) = 0$ and $\nabla^2 f(x) \succeq 0$ (no negative eigenvalue). An $\epsilon$-SOSP satisfies $\|\nabla f(x)\| \le \epsilon$ and $\lambda_{\min}(\nabla^2 f(x)) \ge -\sqrt{\rho \epsilon}$ for a Hessian-Lipschitz constant $\rho$. Strict saddle points are FOSPs but not SOSPs.
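These definitions can be checked mechanically. The sketch below classifies the origin of the toy function $f(x, y) = x^2 - y^2$ (a hypothetical example) by testing the gradient norm and the smallest Hessian eigenvalue:

```python
import numpy as np

# Classifying the origin of f(x, y) = x^2 - y^2 (a hypothetical example):
# the gradient vanishes there (FOSP) but the Hessian has a negative
# eigenvalue, so it is a saddle point, not an SOSP.
def grad(p):
    x, y = p
    return np.array([2 * x, -2 * y])

hessian = np.array([[2.0, 0.0], [0.0, -2.0]])   # constant for this f

origin = np.array([0.0, 0.0])
is_fosp = np.linalg.norm(grad(origin)) <= 1e-12
eigmin = np.linalg.eigvalsh(hessian).min()       # -2: negative curvature in y
is_sosp = is_fosp and eigmin >= 0.0

assert is_fosp and not is_sosp   # first-order but not second-order stationary
```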
Plain gradient descent can get stuck at strict saddle points. Jin, Ge, Netrapalli, Kakade, Jordan (2017), "How to Escape Saddle Points Efficiently", showed that perturbed gradient descent (adding occasional isotropic noise) finds an $\epsilon$-SOSP in $\tilde{O}(1/\epsilon^2)$ iterations for Hessian-Lipschitz objectives, where $\tilde{O}$ hides polylogarithmic factors. This matches the $O(1/\epsilon^2)$ rate for finding an $\epsilon$-FOSP up to polylogarithmic factors, so escaping saddles is nearly free.
Momentum
SGD with Momentum
Momentum adds a velocity term that accumulates past gradients:

$$v_{t+1} = \mu v_t + \nabla L(\theta_t), \qquad \theta_{t+1} = \theta_t - \eta v_{t+1}$$

where $\mu \in [0, 1)$ is the momentum coefficient, typically 0.9.
Momentum smooths out oscillations in directions with high curvature and accelerates progress in directions with consistent gradient. It acts like a heavy ball rolling downhill: it keeps moving even when the local gradient is small.
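A minimal sketch of the heavy-ball update on an ill-conditioned quadratic (the curvatures, learning rate, and momentum coefficient are arbitrary choices), compared against plain GD:

```python
import numpy as np

# Heavy-ball momentum vs plain GD on an ill-conditioned quadratic
# (a sketch; all constants are arbitrary choices).
H = np.diag([1.0, 100.0])          # curvatures differ by 100x

def grad(theta):
    return H @ theta

def optimize(mu, steps=300, eta=0.01):
    theta = np.array([1.0, 1.0])
    v = np.zeros(2)
    for _ in range(steps):
        v = mu * v + grad(theta)   # velocity accumulates past gradients
        theta = theta - eta * v
    return np.linalg.norm(theta)

plain = optimize(mu=0.0)           # vanilla GD, limited by the stiff direction
heavy = optimize(mu=0.9)           # momentum accelerates the flat direction
assert heavy < plain
```

The learning rate is capped by the stiff ($\lambda = 100$) direction, so plain GD crawls along the flat ($\lambda = 1$) direction; momentum makes far more progress there in the same number of steps.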
Nesterov Momentum
Nesterov's modification computes the gradient at the "look-ahead" position:

$$v_{t+1} = \mu v_t + \nabla L(\theta_t - \eta \mu v_t), \qquad \theta_{t+1} = \theta_t - \eta v_{t+1}$$
The idea: evaluate the gradient where momentum would take you, not where you currently are. This provides a corrective signal that reduces overshooting.
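The look-ahead variant changes only where the gradient is evaluated. A sketch under the same toy-quadratic assumptions as above (constants arbitrary):

```python
import numpy as np

# Nesterov momentum evaluates the gradient at the look-ahead point
# theta - eta*mu*v (a sketch; the quadratic and constants are arbitrary).
H = np.diag([1.0, 100.0])

def grad(theta):
    return H @ theta

theta, v = np.array([1.0, 1.0]), np.zeros(2)
eta, mu = 0.01, 0.9
for _ in range(300):
    lookahead = theta - eta * mu * v   # where momentum alone would take us
    v = mu * v + grad(lookahead)       # corrective gradient at that point
    theta = theta - eta * v

assert np.linalg.norm(theta) < 1e-3   # converges without overshooting
```

Several equivalent parameterizations of Nesterov momentum appear in the literature; this one matches the update equations above.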
Adaptive Learning Rate Methods
AdaGrad
AdaGrad
AdaGrad scales the learning rate per parameter by the accumulated squared gradients:

$$G_t = G_{t-1} + g_t \odot g_t, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t} + \epsilon} \odot g_t$$

where $g_t = \nabla L(\theta_t)$, $\odot$ is element-wise multiplication, and $\epsilon$ prevents division by zero.
AdaGrad gives larger updates to infrequent parameters and smaller updates to frequent ones. This is good for sparse features. The problem: $G_t$ only grows, so the effective learning rate monotonically decreases and eventually becomes too small to make progress.
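The monotone decay of the effective learning rate is easy to verify numerically (a sketch on a 1-D quadratic with arbitrary constants):

```python
import numpy as np

# AdaGrad on L(theta) = theta^2 / 2 (a sketch; constants are arbitrary).
# G accumulates squared gradients, so eta / (sqrt(G) + eps) only shrinks.
eta, eps = 0.5, 1e-8
theta, G = 5.0, 0.0
steps = []
for _ in range(100):
    g = theta                          # gradient of theta^2 / 2
    G += g * g                         # accumulation never decreases
    step = eta / (np.sqrt(G) + eps)    # effective per-parameter learning rate
    steps.append(step)
    theta -= step * g

# The effective learning rate is monotonically non-increasing.
assert all(a >= b for a, b in zip(steps, steps[1:]))
```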
RMSProp
RMSProp fixes AdaGrad's decaying learning rate by using an exponential moving average instead of an accumulation:

$$s_t = \rho s_{t-1} + (1 - \rho)\, g_t \odot g_t, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{s_t} + \epsilon} \odot g_t$$

where $\rho \approx 0.9$. The running average forgets old gradients, so the effective learning rate adapts to recent gradient magnitudes rather than all past gradients.
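A sketch of the RMSProp recursion on the same kind of 1-D quadratic (all constants arbitrary):

```python
import numpy as np

# RMSProp on L(theta) = theta^2 / 2 (a sketch; constants are arbitrary).
# The EMA of squared gradients replaces AdaGrad's ever-growing sum.
eta, rho, eps = 0.01, 0.9, 1e-8

def rmsprop_step(theta, g, s):
    s = rho * s + (1 - rho) * g * g            # EMA forgets old gradients
    theta = theta - eta * g / (np.sqrt(s) + eps)
    return theta, s

theta, s = 5.0, 0.0
for _ in range(200):
    g = theta                                  # gradient of theta^2 / 2
    theta, s = rmsprop_step(theta, g, s)

assert 2.0 < theta < 4.0   # steady progress at roughly eta per step
```

Because $s$ tracks the recent gradient magnitude, the update behaves like a step of size roughly $\eta$ in the sign direction rather than shrinking to zero.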
Adam
Adam Optimizer
Statement
Adam combines momentum (first moment) and RMSProp (second moment) with bias correction. Let $g_t = \nabla L(\theta_t)$:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t \odot g_t$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \qquad \theta_{t+1} = \theta_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

Default hyperparameters: $\eta = 10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
Intuition
Adam adapts the learning rate per parameter using the ratio of the first moment (gradient direction) to the square root of the second moment (gradient magnitude). Parameters with consistently large gradients get smaller effective learning rates; parameters with small gradients get larger ones. The bias correction accounts for the zero initialization of $m_0$ and $v_0$.
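Putting the four update equations together, a minimal Adam loop on a toy quadratic might look like this (a sketch; the objective and iteration count are arbitrary, the hyperparameters are the defaults above):

```python
import numpy as np

# A minimal Adam loop on L(theta) = ||theta||^2 / 2 (a sketch; the
# objective and iteration count are arbitrary, hyperparameters are the
# paper defaults).
eta, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
theta = np.array([5.0, -3.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)

for t in range(1, 10001):
    g = theta                                   # gradient of ||theta||^2 / 2
    m = beta1 * m + (1 - beta1) * g             # first moment (momentum)
    v = beta2 * v + (1 - beta2) * g * g         # second moment (RMSProp-style)
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)

assert np.linalg.norm(theta) < 0.05
```

Note that far from the optimum the effective step is close to $\eta$ per parameter regardless of gradient scale, which is why Adam's default learning rate transfers across problems better than SGD's.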
Proof Sketch
The bias correction ensures $\mathbb{E}[\hat{m}_t] \approx \mathbb{E}[g_t]$ and $\mathbb{E}[\hat{v}_t] \approx \mathbb{E}[g_t \odot g_t]$ under stationarity. Without correction, the initial estimates are biased toward zero because $m_0 = 0$ and $v_0 = 0$: unrolling the recursion gives $\mathbb{E}[m_t] = (1 - \beta_1^t)\, \mathbb{E}[g]$, so dividing by $1 - \beta_1^t$ removes the bias.
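The effect of the correction is easy to verify at $t = 1$ (a sketch with a hypothetical gradient value $g_1 = 2$):

```python
import numpy as np

# Numeric check of Adam's bias correction at t = 1, with m_0 = v_0 = 0
# and a hypothetical gradient g1 = 2.0.
beta1, beta2, g1 = 0.9, 0.999, 2.0

m1 = beta1 * 0.0 + (1 - beta1) * g1        # uncorrected: 0.2, biased toward 0
v1 = beta2 * 0.0 + (1 - beta2) * g1 ** 2   # uncorrected: 0.004
m1_hat = m1 / (1 - beta1 ** 1)             # corrected: recovers g1 = 2.0
v1_hat = v1 / (1 - beta2 ** 1)             # corrected: recovers g1^2 = 4.0

assert np.isclose(m1, 0.2) and np.isclose(m1_hat, g1)
assert np.isclose(v1, 0.004) and np.isclose(v1_hat, g1 ** 2)
```

Uncorrected, the first step would use moment estimates 10x and 1000x too small; the correction factors $1 - \beta_1^1 = 0.1$ and $1 - \beta_2^1 = 0.001$ undo exactly that shrinkage.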
Why It Matters
Adam is the most widely used optimizer for deep learning. Introduced by Kingma and Ba (2015), it converges reliably with its default hyperparameters on most supervised deep learning tasks, including Transformer language models at standard scales. Its per-parameter adaptive learning rate means it requires less tuning than SGD with momentum.
Failure Mode
Adam can converge to worse solutions than SGD with momentum on some problems, particularly in computer vision (ResNets on ImageNet). The adaptive learning rate can be too aggressive, leading to sharp minima that generalize poorly. AdamW (Adam with decoupled weight decay) fixes some of these issues.
Choosing an Optimizer
| Method | Update | Best for | Watch out for |
|---|---|---|---|
| SGD | $\theta \leftarrow \theta - \eta g$ | Convex objectives; small models where every FLOP matters | Oscillation around the optimum; needs a decreasing schedule for exact convergence |
| SGD + momentum | Velocity $v \leftarrow \mu v + g$ accumulates past gradients | Vision with strong augmentation; settings where you can afford LR tuning | Wrong $\mu$ overshoots; momentum hides instability |
| AdaGrad | Per-parameter LR scaled by $1/(\sqrt{G_t} + \epsilon)$ | Sparse features (bag-of-words NLP, sparse recommenders) | Effective LR decays to zero; useless for long training |
| RMSProp | EMA of squared gradient instead of full sum | Non-stationary objectives; RNN training | Lacks momentum; usually superseded by Adam |
| Adam | Momentum + per-param adaptive LR + bias correction | Default for transformer pretraining; works without tuning on most problems | Default LR is not comparable to SGD's |
| AdamW | Adam with decoupled weight decay | LLM pretraining; any Adam workload that uses regularization | Tune weight decay separately from L2 regularization; they are no longer the same |
Practical defaults: AdamW for transformers and most deep learning, SGD-momentum for computer vision when generalization matters and you can afford LR sweeps, AdaGrad only for genuinely sparse problems.
Common Confusions
Learning rate for Adam is not comparable to SGD learning rate
Adam's default learning rate is $10^{-3}$, while SGD often uses values around $10^{-1}$. These are not comparable because Adam divides by the second moment estimate. The effective step size for Adam is $\eta\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$, which adapts per parameter.
Batch size affects the implicit learning rate
Doubling the batch size halves the gradient variance. If you double the batch size without adjusting the learning rate, you get a smoother but slower optimizer. The linear scaling rule suggests multiplying the learning rate by the batch size increase factor, but this is an approximation that breaks for very large batches.
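The variance-halving claim can be checked empirically. The sketch below draws synthetic per-example gradients (an arbitrary distribution) and compares the variance of batch-mean gradients at $B = 32$ and $B = 64$:

```python
import numpy as np

# Empirical check that doubling the batch size roughly halves the variance
# of the averaged gradient (a sketch with synthetic per-example gradients).
rng = np.random.default_rng(0)
per_example = rng.normal(loc=1.0, scale=2.0, size=100_000)

def batch_grad_var(B, trials=2000):
    # Sample `trials` mini-batches of size B and average each one.
    draws = rng.choice(per_example, size=(trials, B))
    return draws.mean(axis=1).var()

v32, v64 = batch_grad_var(32), batch_grad_var(64)
assert 1.5 < v32 / v64 < 2.7   # ratio is close to 2
```

This is the statistical basis of the linear scaling rule: if the update noise halves, you can roughly double the learning rate and keep the same noise-to-signal ratio, within limits.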
Exercises
Problem
You train a model with SGD (no momentum) and learning rate 0.1. Training loss oscillates wildly. What two modifications would you try first, and why?
Problem
Adam uses bias correction terms $1 - \beta_1^t$ and $1 - \beta_2^t$. Explain what happens at $t = 1$ without bias correction. Compute the uncorrected and corrected first moment estimates at $t = 1$ with $\beta_1 = 0.9$ and gradient $g_1$.
References
Canonical:
- Robbins & Monro, "A Stochastic Approximation Method" (1951)
- Kingma & Ba, "Adam: A Method for Stochastic Optimization" (2015), ICLR
Current:
- Loshchilov & Hutter, "Decoupled Weight Decay Regularization" (2019), ICLR (AdamW)
- Gur-Ari, Roberts & Dyer, "Gradient Descent Happens in a Tiny Subspace" (2018), arXiv:1812.04754, for intuition on adaptive methods
- Jin, Ge, Netrapalli, Kakade & Jordan, "How to Escape Saddle Points Efficiently" (2017), ICML; arXiv:1703.00887. Perturbed GD finds an $\epsilon$-SOSP in $\tilde{O}(1/\epsilon^2)$ iterations.
- Boyd & Vandenberghe, Convex Optimization (2004), Chapters 2-5
- Nesterov, Introductory Lectures on Convex Optimization (2004), Chapters 1-3
Next Topics
- Learning rate scheduling: how to adjust the learning rate during training
- Optimizer theory: deeper analysis of SGD, Adam, and newer optimizers
Last reviewed: April 14, 2026