
Training Techniques

Adam Optimizer

Adaptive moment estimation: first and second moment tracking, bias correction, AdamW vs Adam+L2, learning rate warmup, and when Adam fails compared to SGD.


Why This Matters

Adam is a common default optimizer for training deep neural networks, especially Transformer-style models. It combines momentum with coordinate-wise adaptive learning rates, which helps on noisy, high-dimensional optimization problems where different parameters live on different scales. See the Adam paper breakdown for the original 2014 derivation, the bias-correction proof, and the AMSGrad/AdamW correction line.

Understanding Adam means understanding four things cold: bias correction, squared-gradient scaling, why AdamW is not the same as Adam plus L2 regularization, and why the optimizer that wins training loss may not win generalization. Adam absorbs many coordinate-scale mistakes, but it does not remove the need to reason about step size, weight decay, warmup, numerical precision, and validation curves.

Visual

Adam separates direction from scale

The gradient feeds two memories: one estimates direction, the other estimates coordinate-wise magnitude. Bias correction makes the early steps match the intended scale.

[Diagram: $g_t$ (gradient) feeds $m_t$ (direction) and $v_t$ (scale); after debiasing to $\hat{m}_t$ and $\hat{v}_t$, the update is the debiased direction divided by the debiased scale.]

first moment

Momentum memory. It smooths gradient direction.

second moment

Scale memory. Large recent gradients shrink future coordinate-wise steps.

Adam step

Bias correction prevents the early update ratio from being distorted by zero initialization.

Mental Model

SGD with momentum keeps a running average of gradients to smooth out noise. RMSprop keeps a running average of squared gradients to adapt the learning rate per-parameter (parameters with large gradients get smaller steps). Adam combines both: it maintains a momentum vector (first moment) and an adaptive scaling vector based on squared gradients (second raw moment), with bias correction to handle initialization.

The Algorithm

Definition

Adam Update Rule

Given parameters $\theta$, learning rate $\eta$, decay rates $\beta_1, \beta_2$, and small constant $\epsilon$:

At step $t$, with gradient $g_t = \nabla_\theta \mathcal{L}(\theta_{t-1})$:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \quad \text{(first moment estimate)}$$

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \quad \text{(squared-gradient estimate)}$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \quad \text{(bias-corrected first moment)}$$

$$\hat{v}_t = \frac{v_t}{1 - \beta_2^t} \quad \text{(bias-corrected second moment)}$$

$$\theta_t = \theta_{t-1} - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

Common starting hyperparameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
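The update rule translates directly into a few lines of NumPy. This is a minimal sketch for illustration, not a production optimizer (real implementations such as torch.optim.Adam add weight decay, AMSGrad, and fused kernels); the quadratic toy loss in the usage loop is an assumption for the demo.

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-indexed step count."""
    m = beta1 * m + (1 - beta1) * g        # first moment: EMA of gradients
    v = beta2 * v + (1 - beta2) * g**2     # second moment: EMA of squared gradients
    m_hat = m / (1 - beta1**t)             # bias correction for zero init
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage on a toy quadratic loss ||theta||^2 (illustrative only):
theta = np.array([1.0, -2.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 1001):
    g = 2 * theta                          # gradient of the toy loss
    theta, m, v = adam_step(theta, g, m, v, t)
```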

Components Explained

Definition

First Moment (Momentum)

$m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t$ is an exponential moving average of gradients. With $\beta_1 = 0.9$, this averages roughly the last 10 gradients. It smooths out gradient noise and accumulates direction, like a ball rolling downhill with momentum.

Expanding: $m_t = (1-\beta_1)\sum_{i=1}^t \beta_1^{t-i} g_i$.
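A quick numerical check of this expansion, with randomly generated gradients standing in for a real training run:

```python
import numpy as np

rng = np.random.default_rng(0)
beta1, T = 0.9, 50
g = rng.normal(size=T)                     # stand-in gradient sequence

m = 0.0
for t in range(T):                         # the recurrence
    m = beta1 * m + (1 - beta1) * g[t]

# Closed form (0-indexed): (1 - beta1) * sum_i beta1^(T-1-i) * g_i
closed = (1 - beta1) * sum(beta1 ** (T - 1 - i) * g[i] for i in range(T))
assert np.isclose(m, closed)
```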

Definition

Second Moment (Adaptive Scaling)

$v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2$ is an exponential moving average of squared gradients (elementwise). With $\beta_2 = 0.999$, this averages roughly the last 1000 squared gradients. It estimates an uncentered second raw moment, not a centered variance: Adam tracks gradient magnitude, not $\mathbb{E}[(g_t-\mathbb{E} g_t)^2]$.

Dividing by $\sqrt{v_t}$ gives each parameter its own effective learning rate: parameters with consistently large gradients get smaller steps, and parameters with small gradients get larger steps.

Bias Correction

Theorem

Bias Correction for Exponential Moving Averages

Statement

If $m_0 = 0$ and the $g_t$ are drawn from a stationary distribution with mean $\mathbb{E}[g_t] = g$, then the raw exponential moving average is biased:

$$\mathbb{E}[m_t] = (1 - \beta^t)\, g$$

The bias-corrected estimate $\hat{m}_t = m_t / (1 - \beta^t)$ satisfies $\mathbb{E}[\hat{m}_t] = g$.

Intuition

When you initialize $m_0 = 0$ and start averaging, the early estimates are biased toward zero. After one step with $\beta = 0.9$, you have $m_1 = 0.1\, g_1$, which underestimates the true gradient by a factor of 10. Dividing by $(1 - 0.9^1) = 0.1$ corrects this. The correction matters most in the first few iterations and becomes negligible as $t$ grows (since $\beta^t \to 0$).

Proof Sketch

$$m_t = (1-\beta)\sum_{i=1}^t \beta^{t-i} g_i.$$

$$\mathbb{E}[m_t] = (1-\beta)\, g \sum_{i=1}^t \beta^{t-i} = (1-\beta)\, g \cdot \frac{1-\beta^t}{1-\beta} = (1-\beta^t)\, g.$$

So $\mathbb{E}[m_t/(1-\beta^t)] = g$. For the second moment, the same argument applies to $g_t^2$, with $g$ replaced by $\mathbb{E}[g_t^2]$.
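A Monte Carlo sketch of the statement, assuming Gaussian gradients with a constant mean (the stationarity assumption of the theorem): the empirical mean of $m_t$ should track $(1-\beta^t)\, g$.

```python
import numpy as np

rng = np.random.default_rng(0)
beta, g_mean, n_runs = 0.9, 2.0, 200_000    # illustrative values

m = np.zeros(n_runs)                        # one EMA per simulated run
for t in range(1, 6):
    g = rng.normal(loc=g_mean, size=n_runs)
    m = beta * m + (1 - beta) * g
    print(t, m.mean(), (1 - beta**t) * g_mean)   # empirical vs. predicted bias
```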

Why It Matters

Without bias correction, the first few Adam steps are destabilized in magnitude, not shrunk. Both $m_t$ and $v_t$ are biased toward zero at initialization, but by different factors: at step $t$, $m_t$ carries a factor of $(1 - \beta_1^t)$ and $v_t$ carries a factor of $(1 - \beta_2^t)$. With the defaults $\beta_1 = 0.9$ and $\beta_2 = 0.999$, at $t = 1$ the first-moment factor is $0.1$ while the second-moment factor is $0.001$. The uncorrected update ratio $m_t / \sqrt{v_t}$ therefore scales like $0.1 / \sqrt{0.001} \approx 3.16$ times the corrected ratio, so the first step is roughly $3\times$ larger in magnitude than intended. With $\beta_2 = 0.999$ the second-moment bias decays slowly: even at $t = 1000$ the factor $(1 - \beta_2^{1000}) \approx 0.63$ is materially different from 1. Bias correction prevents this early overshoot, which is why Adam includes it. The Canonical Example below works through the factor-of-$3\times$ overshoot at $t = 1$ explicitly.

Failure Mode

Bias correction assumes a stationary gradient distribution. In the early phase of training when the loss landscape changes rapidly, the stationarity assumption is violated. This is one motivation for learning rate warmup.

AdamW: Decoupled Weight Decay

Theorem

AdamW Decouples Weight Decay from Gradient Adaptation

Statement

Adam + L2 regularization adds the L2 gradient to the gradient before moment estimation:

$$g_t^{\mathrm{L2}} = \nabla \mathcal{L}(\theta_{t-1}) + \lambda \theta_{t-1}$$

then runs standard Adam on $g_t^{\mathrm{L2}}$. In a coordinate-wise view, the immediate decay contribution is scaled by the adaptive denominator, so parameter $j$ is shrunk roughly in proportion to $\eta \lambda / (\sqrt{\hat{v}_{t,j}} + \epsilon)$. The regularization term also enters the future moment estimates.

AdamW applies weight decay directly to the parameters, after the adaptive step:

$$\theta_t = (1 - \eta\lambda)\,\theta_{t-1} - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

In AdamW, the weight decay $\lambda$ is the same for all parameters, regardless of gradient magnitude.

Intuition

In SGD, L2 regularization and weight decay are equivalent. In Adam, they are not. When Adam divides the gradient by $\sqrt{v_t}$, it also divides the L2 gradient, weakening the regularization for parameters with large gradient history. AdamW avoids this by applying decay separately. This makes the decay strength easier to tune because it is not tied to the adaptive gradient denominator.

Proof Sketch

With Adam + L2: the effective update for parameter $j$ includes $\eta \lambda \theta_j / (\sqrt{\hat{v}_{t,j}} + \epsilon)$ in the simplified coordinate view. Parameters with large $\hat{v}_{t,j}$ (historically large gradients) receive weaker immediate decay, and the L2 term also changes the moment statistics used in later steps.

With AdamW: the decay term is $\eta\lambda\theta_j$ regardless of $\hat{v}_{t,j}$. The gradient-based update and the decay are fully decoupled.
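The coordinate view from this proof sketch in code. This isolates just the immediate decay term for two hypothetical coordinates with very different squared-gradient histories (the $\hat{v}$ values are made up for illustration):

```python
import numpy as np

lr, lam, eps = 1e-3, 0.1, 1e-8
theta = np.array([1.0, 1.0])
v_hat = np.array([1e-4, 1e2])   # hypothetical: small vs. large gradient history

# Adam + L2: the decay term lam*theta passes through the adaptive denominator.
decay_adam_l2 = lr * lam * theta / (np.sqrt(v_hat) + eps)
print(decay_adam_l2)            # [1e-2, 1e-5] -- a 1000x spread across coordinates

# AdamW: the decay term bypasses the denominator entirely.
decay_adamw = lr * lam * theta
print(decay_adamw)              # [1e-4, 1e-4] -- uniform decay
```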

Why It Matters

Loshchilov and Hutter (2019) showed, in the settings they tested, that decoupled weight decay fixes a real mismatch between adaptive scaling and L2 regularization. The key insight is operational: regularization should not be silently rescaled by the optimizer's denominator. AdamW is the usual starting point for Transformer training. In some vision and CNN regimes, tuned SGD with momentum remains competitive and can generalize better (Wilson et al. 2017).

Failure Mode

The optimal $\lambda$ for AdamW is different from the optimal $\lambda$ for Adam+L2. You cannot simply swap one for the other without retuning. Common Transformer configurations often use weight decay around 0.01 to 0.1, but the right value depends on schedule, batch size, model size, and which parameters are excluded from decay.

Learning Rate Warmup

In practice, Adam is often combined with learning rate warmup: start with a very small learning rate and linearly increase it over the first $T_w$ steps to the target value. Why?

  1. Second moment initialization: At step 1, $\hat{v}_1 = g_1^2$, a noisy single-sample estimate. A large learning rate with a noisy denominator produces wild parameter updates. Warmup gives $v_t$ time to stabilize.

  2. Loss landscape curvature: Early in training, the loss landscape may have regions of very high curvature. Large steps in these regions can be catastrophic. Warmup allows the model to reach a better-conditioned region before taking large steps.

A common schedule is linear warmup for 1-10% of total training steps, followed by cosine decay.
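That schedule is a few lines of code. A sketch, assuming linear warmup and cosine decay to a floor; base_lr, warmup_frac, and min_lr are illustrative knobs, not recommendations:

```python
import math

def lr_at(step, total_steps, base_lr=3e-4, warmup_frac=0.05, min_lr=0.0):
    """Learning rate at a 0-indexed step: linear warmup, then cosine decay."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps       # linear ramp from ~0
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Usage: set the optimizer's lr to lr_at(step, total_steps) each step.
```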

When Adam Fails

Adam is not universally superior to SGD:

  1. Generalization gap (contested): Wilson et al. (2017) reported that SGD with momentum often generalizes better than Adam on image classification. One proposed mechanism is that Adam finds sharper minima while SGD's larger noise finds flatter ones (Keskar et al. 2017), but the flat-minima hypothesis itself has been challenged on reparameterization grounds (Dinh et al. 2017). The generalization gap is real; its causal mechanism is not settled.

  2. Non-convergence: Reddi et al. (2018) showed that Adam can diverge on simple convex problems because the adaptive learning rate can increase without bound when $v_t$ shrinks. AMSGrad fixes this by taking $\hat{v}_t = \max(\hat{v}_{t-1}, v_t)$ to ensure the effective learning rate never increases (see the sketch after this list). In practice AMSGrad is rarely used; the modification has not consistently improved empirical performance, and many deep-learning codebases default to AdamW.

  3. Domain dependence: AdamW is the usual first optimizer for NLP and Transformers. SGD with momentum is often competitive or stronger for CNNs on vision tasks when the schedule and regularization are tuned. The optimizer choice depends on architecture, data distribution, batch size, and compute budget.
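A sketch of the AMSGrad modification against the plain Adam step above; implementations differ in where bias correction enters, so treat the exact placement here as one common choice:

```python
import numpy as np

def amsgrad_step(theta, g, m, v, v_max, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam with the AMSGrad fix: the denominator never shrinks."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    v_max = np.maximum(v_max, v)           # the key change: monotone second moment
    m_hat = m / (1 - beta1**t)
    theta = theta - lr * m_hat / (np.sqrt(v_max) + eps)
    return theta, m, v, v_max
```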

Practical Training Checklist

| Symptom | Adam knob to inspect | What to check |
| --- | --- | --- |
| Training loss spikes early | Warmup and base learning rate | The second moment estimate may be too noisy for the chosen step size |
| Validation lags training | Weight decay and schedule | Adam can fit quickly while still needing regularization and decay tuning |
| fp16 or bf16 instability | $\epsilon$, loss scaling, fused kernels | The denominator and precision path can dominate small-gradient coordinates |
| Slow adaptation after a regime change | $\beta_2$ | A large $\beta_2$ remembers old squared gradients for many steps |
| LayerNorm or bias parameters decay oddly | Parameter groups | Many Transformer recipes exclude bias and normalization parameters from weight decay |
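For the last row, a common way to build the parameter groups in PyTorch. The exclusion rule here (all one-dimensional tensors and anything named *.bias) is a frequent convention, not a universal one; adjust it to the model at hand.

```python
import torch

def param_groups(model, weight_decay=0.01):
    """Split parameters so biases and normalization weights skip decay."""
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if p.ndim <= 1 or name.endswith(".bias"):   # biases, LayerNorm/RMSNorm scales
            no_decay.append(p)
        else:
            decay.append(p)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

# Usage:
# optimizer = torch.optim.AdamW(param_groups(model), lr=3e-4)
```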

Canonical Examples

Example

Why bias correction matters early

With $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $m_0 = v_0 = 0$, at step $t = 1$:

  • Raw first moment: $m_1 = 0.1\, g_1$ (biased by a factor of $0.1$).
  • Raw second moment: $v_1 = 0.001\, g_1^2$ (biased by a factor of $0.001$).
  • Corrected: $\hat{m}_1 = g_1$, $\hat{v}_1 = g_1^2$.

Because the two biases cancel partially, the uncorrected update ratio is

$$\frac{m_1}{\sqrt{v_1} + \epsilon} \;\approx\; \frac{0.1\, g_1}{0.0316\, |g_1|} \;\approx\; 3.16\, \operatorname{sign}(g_1),$$

while the corrected ratio is $\hat{m}_1 / \sqrt{\hat{v}_1} = \operatorname{sign}(g_1)$. So the raw first step overshoots by roughly $3\times$, not the naive $\sim 30\times$ that comes from looking at $1/\sqrt{v_1}$ alone. The overshoot is still large enough to destabilize training at realistic learning rates, which is why bias correction is part of the algorithm.
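The arithmetic is easy to verify numerically (the value of $g_1$ here is arbitrary; the ratio is independent of it):

```python
import numpy as np

beta1, beta2, g1 = 0.9, 0.999, 0.5           # g1 is arbitrary
m1, v1 = (1 - beta1) * g1, (1 - beta2) * g1**2

raw = m1 / np.sqrt(v1)                       # ~3.162 * sign(g1)
corrected = (m1 / (1 - beta1)) / np.sqrt(v1 / (1 - beta2))   # exactly sign(g1)
print(raw, corrected, raw / corrected)       # ratio = 0.1 / sqrt(0.001) ~ 3.162
```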

Example

How adaptive scaling changes a step

Suppose two coordinates have the same debiased first moment, $\hat{m}_{t,1} = \hat{m}_{t,2} = 0.01$, but different squared-gradient histories: $\hat{v}_{t,1} = 10^{-4}$ and $\hat{v}_{t,2} = 10^{-2}$. Ignoring $\epsilon$, the update ratios are:

$$\frac{0.01}{\sqrt{10^{-4}}} = 1, \qquad \frac{0.01}{\sqrt{10^{-2}}} = 0.1$$

The second coordinate moves ten times less because its recent gradients have been larger. Adam is not just adding momentum; it is changing the geometry of the update by rescaling coordinates.

Common Confusions

Watch Out

Adam and AdamW are NOT interchangeable

Adam with L2 regularization and AdamW with weight decay produce different parameter trajectories, even with the same $\lambda$ value. In Adam+L2, the regularization strength varies per parameter (inversely with gradient history). In AdamW, it is uniform. Always use AdamW when you want weight decay with adaptive optimizers.

Watch Out

The second moment is not the variance

$v_t$ averages $g_t^2$, not $(g_t - \bar g)^2$. It is an uncentered squared-gradient scale used to normalize coordinates. Calling it a variance is common shorthand, but it hides the fact that a large mean gradient also makes $v_t$ large.

Watch Out

The epsilon parameter is not negligible

The default $\epsilon = 10^{-8}$ prevents division by zero, but in half-precision training this value may be below the useful numerical scale of the denominator path. IEEE fp16 has a minimum normal value around $6 \times 10^{-5}$ and unit roundoff around $5 \times 10^{-4}$. Many mixed-precision setups keep optimizer state in fp32 or use fused kernels, so the right fix is implementation-specific: inspect the precision path, then tune $\epsilon$, loss scaling, or the optimizer implementation.
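A toy demonstration of the underflow, assuming the denominator path were actually computed in fp16 (many real setups keep optimizer state in fp32, which is exactly why the precision path is worth checking):

```python
import numpy as np

print(np.float16(1e-8))              # 0.0 -- the default eps underflows in fp16
print(np.finfo(np.float16).tiny)     # ~6.1e-5, smallest normal fp16 value

v_hat = np.float16(0.0)              # a dead coordinate with no gradient history
eps = np.float16(1e-8)
print(np.sqrt(v_hat) + eps)          # 0.0 -- eps provided no protection at all
```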

Watch Out

Fast training loss does not prove a better optimizer

Adam often reduces training loss quickly. That is not the same as better validation performance. Compare validation loss, calibration, robustness, and sensitivity to weight decay and schedule before concluding that Adam is better than SGD for a given model family.

Summary

  • Adam = momentum (first moment) + adaptive LR (second moment) + bias correction
  • First moment: $m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t$ (smooths gradients)
  • Second moment: $v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2$ (uncentered squared-gradient scale)
  • Bias correction: divide by $(1-\beta^t)$ to correct for zero initialization
  • AdamW decouples weight decay from adaptive scaling. Use AdamW, not Adam+L2
  • Adam can fit quickly while still generalizing worse than tuned SGD in some regimes
  • Warmup stabilizes the second moment estimate in early training

Exercises

Exercise (Core)

Problem

Derive the bias-corrected first moment estimate. Starting from $m_0 = 0$ and $m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$, show that $\mathbb{E}[m_t] = (1-\beta_1^t)\, g$ when the gradients have constant expectation $\mathbb{E}[g_t] = g$.

Exercise (Core)

Problem

Two coordinates have the same $\hat{m}_{t,j}$ but one has much larger $\hat{v}_{t,j}$. Which coordinate gets the smaller Adam step? Explain the answer in terms of squared-gradient history.

Exercise (Advanced)

Problem

Show that for SGD (no adaptive scaling), L2 regularization ($g_t \leftarrow g_t + \lambda\theta$) is equivalent to weight decay ($\theta \leftarrow (1-\eta\lambda)\theta - \eta g_t$). Then explain why this equivalence breaks for Adam.


References

Canonical:

  • Kingma & Ba, "Adam: A Method for Stochastic Optimization" (ICLR 2015).
  • Loshchilov & Hutter, "Decoupled Weight Decay Regularization" (ICLR 2019). Introduces AdamW.
  • Goodfellow, Bengio, and Courville, Deep Learning, Chapter 8. Optimization for training deep models.

Current:

  • Reddi et al., "On the Convergence of Adam and Beyond" (ICLR 2018). AMSGrad and non-convergence examples.
  • Wilson et al., "The Marginal Value of Adaptive Gradient Methods in Machine Learning" (NeurIPS 2017).
  • Bottou, Curtis, and Nocedal, "Optimization Methods for Large-Scale Machine Learning" (SIAM Review 2018), Sections 4-6.
  • Keskar et al., "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima" (ICLR 2017).
  • Dinh et al., "Sharp Minima Can Generalize for Deep Nets" (ICML 2017).


Last reviewed: April 25, 2026
