Training Techniques
Adam Optimizer
Adaptive moment estimation: first and second moment tracking, bias correction, AdamW vs Adam+L2, learning rate warmup, and when Adam fails compared to SGD.
Why This Matters
Adam is a common default optimizer for training deep neural networks, especially Transformer-style models. It combines momentum with coordinate-wise adaptive learning rates, which helps on noisy, high-dimensional optimization problems where different parameters live on different scales. See the Adam paper breakdown for the original 2014 derivation, the bias-correction proof, and the AMSGrad/AdamW correction line.
Understanding Adam means understanding four things cold: bias correction, squared-gradient scaling, why AdamW is not the same as Adam plus L2 regularization, and why the optimizer that wins training loss may not win generalization. Adam absorbs many coordinate-scale mistakes, but it does not remove the need to reason about step size, weight decay, warmup, numerical precision, and validation curves.
Mental Model
SGD with momentum keeps a running average of gradients to smooth out noise. RMSprop keeps a running average of squared gradients to adapt the learning rate per-parameter (parameters with large gradients get smaller steps). Adam combines both: it maintains a momentum vector (first moment) and an adaptive scaling vector based on squared gradients (second raw moment), with bias correction to handle initialization.
The Algorithm
Adam Update Rule
Given parameters $\theta$, learning rate $\alpha$, decay rates $\beta_1, \beta_2 \in [0, 1)$, and small constant $\epsilon$:

At step $t$, with gradient $g_t = \nabla_\theta f_t(\theta_{t-1})$:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
$$\theta_t = \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

Common starting hyperparameters: $\alpha = 10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
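To make the update concrete, here is a minimal sketch of one Adam step in NumPy. The function name `adam_step` and the functional state-passing style are illustrative choices, not a reference implementation:

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; theta, g, m, v are arrays of the same shape, t >= 1."""
    m = beta1 * m + (1 - beta1) * g        # first moment: EMA of gradients
    v = beta2 * v + (1 - beta2) * g**2     # second raw moment: EMA of g^2
    m_hat = m / (1 - beta1**t)             # bias correction for zero init
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

State starts at $m = v = 0$ with $t$ counting from 1; in a training loop the returned `m` and `v` are fed back into the next call.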
Components Explained
First Moment (Momentum)
$m_t$ is an exponential moving average of gradients. With $\beta_1 = 0.9$, this averages roughly the last 10 gradients. It smooths out gradient noise and accumulates direction, like a ball rolling downhill with momentum.
Expanding: $m_t = (1 - \beta_1) \sum_{i=1}^{t} \beta_1^{t-i}\, g_i$.
Second Moment (Adaptive Scaling)
$v_t$ is an exponential moving average of squared gradients (elementwise). With $\beta_2 = 0.999$, this averages roughly the last 1000 squared gradients. It estimates an uncentered second raw moment, not a centered variance: Adam tracks gradient magnitude $\mathbb{E}[g^2]$, not $\mathbb{E}[(g - \mathbb{E}[g])^2]$.
Dividing by $\sqrt{\hat{v}_t} + \epsilon$ gives each parameter its own effective learning rate: parameters with consistently large gradients get smaller steps, and parameters with small gradients get larger steps.
Bias Correction
Bias Correction for Exponential Moving Averages
Statement
If $g_1, \dots, g_t$ are drawn from a stationary distribution with mean $\mathbb{E}[g]$, then the raw exponential moving average $m_t$ (initialized at $m_0 = 0$) is biased:
$$\mathbb{E}[m_t] = (1 - \beta_1^t)\, \mathbb{E}[g].$$
The bias-corrected estimate $\hat{m}_t = m_t / (1 - \beta_1^t)$ satisfies $\mathbb{E}[\hat{m}_t] = \mathbb{E}[g]$.
Intuition
When you initialize $m_0 = 0$ and start averaging, the early estimates are biased toward zero. After one step with $\beta_1 = 0.9$, you have $m_1 = 0.1\, g_1$, which underestimates the true gradient by a factor of 10. Dividing by $1 - \beta_1^1 = 0.1$ corrects this. The correction matters most in the first few iterations and becomes negligible as $t$ grows (since $\beta_1^t \to 0$).
Proof Sketch
Unrolling the recursion from $m_0 = 0$: $m_t = (1 - \beta_1) \sum_{i=1}^{t} \beta_1^{t-i}\, g_i$.
Taking expectations: $\mathbb{E}[m_t] = (1 - \beta_1) \sum_{i=1}^{t} \beta_1^{t-i}\, \mathbb{E}[g] = (1 - \beta_1^t)\, \mathbb{E}[g]$, using the geometric sum $\sum_{i=1}^{t} \beta_1^{t-i} = \frac{1 - \beta_1^t}{1 - \beta_1}$.
So $\mathbb{E}[m_t / (1 - \beta_1^t)] = \mathbb{E}[g]$. For the second moment, the same argument applies with $g_i$ replaced by $g_i^2$ and $\beta_1$ by $\beta_2$.
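A quick numeric check of this sketch, assuming i.i.d. Gaussian gradients with mean 2.0 (the distribution and sample count are arbitrary illustration choices): averaged over many runs, the raw EMA tracks $(1 - \beta_1^t)\,\mathbb{E}[g]$ while the corrected estimate tracks $\mathbb{E}[g]$.

```python
import numpy as np

rng = np.random.default_rng(0)
beta1, true_mean, runs = 0.9, 2.0, 100_000
m = np.zeros(runs)
for t in range(1, 6):
    g = true_mean + rng.standard_normal(runs)  # stationary gradients, E[g] = 2.0
    m = beta1 * m + (1 - beta1) * g
    raw, corrected = m.mean(), (m / (1 - beta1**t)).mean()
    print(t, round(raw, 3), round(corrected, 3))  # raw ~ (1 - 0.9**t) * 2.0
```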
Why It Matters
Without bias correction, the first few Adam steps are destabilized in magnitude, not shrunk. Both $m_t$ and $v_t$ are biased toward zero at initialization, but they are biased by different factors: at step $t = 1$, $m_1$ carries a factor of $1 - \beta_1$ and $v_1$ carries a factor of $1 - \beta_2$. With the default $\beta_1 = 0.9$ and $\beta_2 = 0.999$, at $t = 1$ the first-moment bias factor is $0.1$ while the second-moment bias factor is $0.001$. The uncorrected update ratio $m_1 / \sqrt{v_1}$ therefore scales like $0.1 / \sqrt{0.001} = \sqrt{10}$ times the corrected ratio, so the first step is roughly $3.2\times$ larger in magnitude than intended. With $\beta_2 = 0.999$ the second-moment bias decays slowly: even at $t = 100$ the factor $1 - \beta_2^{100} \approx 0.095$ is materially different from 1. Bias correction prevents this early overshoot, which is why Adam includes it. The Canonical Example below works through the factor-of-$\sqrt{10}$ overshoot at $t = 1$ explicitly.
Failure Mode
Bias correction assumes a stationary gradient distribution. In the early phase of training when the loss landscape changes rapidly, the stationarity assumption is violated. This is one motivation for learning rate warmup.
AdamW: Decoupled Weight Decay
AdamW Decouples Weight Decay from Gradient Adaptation
Statement
Adam + L2 regularization adds the L2 gradient to the loss gradient before moment estimation:
$$g_t \leftarrow \nabla_\theta f_t(\theta_{t-1}) + \lambda\, \theta_{t-1},$$
then runs standard Adam on $g_t$. In a coordinate-wise view, the immediate decay contribution is scaled by the adaptive denominator, so parameter $i$ is shrunk roughly in proportion to $\lambda \theta_i / (\sqrt{\hat{v}_i} + \epsilon)$. The regularization term also enters the future moment estimates.
AdamW applies weight decay directly to the parameters, alongside the adaptive step:
$$\theta_t = \theta_{t-1} - \alpha \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\, \theta_{t-1} \right).$$
In AdamW, the weight decay is the same for all parameters, regardless of gradient magnitude.
Intuition
In SGD, L2 regularization and weight decay are equivalent. In Adam, they are not. When Adam divides the gradient by $\sqrt{\hat{v}_t} + \epsilon$, it also divides the L2 gradient, weakening the regularization for parameters with large gradient history. AdamW avoids this by applying decay separately. This makes the decay strength easier to tune because it is not tied to the adaptive gradient denominator.
Proof Sketch
With Adam + L2: the effective update for parameter $i$ includes $-\alpha\, \lambda \theta_i / (\sqrt{\hat{v}_i} + \epsilon)$ in the simplified coordinate view. Parameters with large $\hat{v}_i$ (historically large gradients) receive weaker immediate decay, and the L2 term also changes the moment statistics used in later steps.
With AdamW: the decay term is $-\alpha\, \lambda \theta_i$ regardless of $\hat{v}_i$. The gradient-based update and the decay are fully decoupled.
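The coordinate view can be checked numerically. A minimal sketch, assuming two coordinates with hypothetical second-moment histories and looking only at the immediate decay contribution (it ignores the loss gradient and the feedback of the L2 term into future moments):

```python
import numpy as np

lr, lam, eps = 1e-3, 0.1, 1e-8
theta = np.array([1.0, 1.0])
v_hat = np.array([1e-4, 1.0])  # hypothetical: coord 2 has a large gradient history

adam_l2 = lr * lam * theta / (np.sqrt(v_hat) + eps)  # decay scaled by denominator
adamw   = lr * lam * theta                           # decay uniform across coords
print(adam_l2)  # ~[1e-2, 1e-4]: coord 2's immediate decay is 100x weaker
print(adamw)    # ~[1e-4, 1e-4]: identical decay for both coordinates
```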
Why It Matters
Loshchilov and Hutter (2019) showed in their tested settings that decoupled weight decay fixes a real mismatch between adaptive scaling and L2 regularization. The key insight is operational: regularization should not be silently rescaled by the optimizer's denominator. AdamW is the usual starting point for Transformer training. For some vision and CNN regimes, tuned SGD with momentum remains competitive and can generalize better (Wilson et al. 2017).
Failure Mode
The optimal $\lambda$ for AdamW is different from the optimal $\lambda$ for Adam+L2. You cannot simply swap one for the other without retuning. Common Transformer configurations often use weight decay around 0.01 to 0.1, but the right value depends on schedule, batch size, model size, and which parameters are excluded from decay.
Learning Rate Warmup
In practice, Adam is often combined with learning rate warmup: start with a very small learning rate and linearly increase it over the first $T_{\text{warmup}}$ steps to the target value. Why?
- Second moment initialization: At step 1, $v_1 = (1 - \beta_2)\, g_1^2$, a noisy single-sample estimate. A large learning rate with a noisy denominator produces wild parameter updates. Warmup gives $v_t$ time to stabilize.
- Loss landscape curvature: Early in training, the loss landscape may have regions of very high curvature. Large steps in these regions can be catastrophic. Warmup allows the model to reach a better-conditioned region before taking large steps.
A common schedule is linear warmup for 1-10% of total training steps, followed by cosine decay.
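A minimal sketch of that schedule, assuming linear warmup over a fixed fraction of training followed by cosine decay; the function name `lr_at` and its defaults are illustrative:

```python
import math

def lr_at(step, total_steps, base_lr=1e-3, warmup_frac=0.05, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr (step is 0-indexed)."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps            # linear ramp
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```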
When Adam Fails
Adam is not universally superior to SGD:
- Generalization gap (contested): Wilson et al. (2017) reported that SGD with momentum often generalizes better than Adam on image classification. One proposed mechanism is that Adam finds sharper minima while SGD's larger noise finds flatter ones (Keskar et al. 2017), but the flat-minima hypothesis itself has been challenged on reparameterization grounds (Dinh et al. 2017). The generalization gap is real; its causal mechanism is not settled.
- Non-convergence: Reddi et al. (2018) showed that Adam can diverge on simple convex problems because the adaptive learning rate can increase without bound when $v_t$ shrinks. AMSGrad fixes this by taking $\hat{v}_t = \max(\hat{v}_{t-1}, v_t)$ to ensure the effective learning rate never increases. In practice AMSGrad is rarely used; the modification has not consistently improved empirical performance, and many deep-learning codebases default to AdamW.
- Domain dependence: AdamW is the usual first optimizer for NLP and Transformers. SGD with momentum is often competitive or stronger for CNNs on vision tasks when the schedule and regularization are tuned. The optimizer choice depends on architecture, data distribution, batch size, and compute budget.
Practical Training Checklist
| Symptom | Adam knob to inspect | What to check |
|---|---|---|
| Training loss spikes early | Warmup and base learning rate | The second moment estimate may be too noisy for the chosen step size |
| Validation lags training | Weight decay and schedule | Adam can fit quickly while still needing regularization and decay tuning |
| fp16 or bf16 instability | $\epsilon$, loss scaling, fused kernels | The denominator and precision path can dominate small-gradient coordinates |
| Slow adaptation after a regime change | $\beta_2$ | A large $\beta_2$ remembers old squared gradients for many steps |
| LayerNorm or bias parameters decay oddly | Parameter groups | Many Transformer recipes exclude bias and normalization parameters from weight decay |
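For the last row, a common pattern is to build optimizer parameter groups that exclude one-dimensional parameters from decay. A minimal sketch, assuming PyTorch; the exact exclusion rule varies by recipe:

```python
import torch

def param_groups(model, weight_decay=0.01):
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # 1-D tensors are biases and norm scales in most Transformer stacks
        (no_decay if p.ndim <= 1 or name.endswith(".bias") else decay).append(p)
    return [{"params": decay, "weight_decay": weight_decay},
            {"params": no_decay, "weight_decay": 0.0}]

# optimizer = torch.optim.AdamW(param_groups(model), lr=3e-4)
```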
Canonical Examples
Why bias correction matters early
With $\beta_1 = 0.9$, $\beta_2 = 0.999$, and a constant gradient $g$, at step $t = 1$:
- Raw first moment: $m_1 = 0.1\, g$ (biased by a factor of $10$).
- Raw second moment: $v_1 = 0.001\, g^2$ (biased by a factor of $1000$).
- Corrected: $\hat{m}_1 = m_1 / 0.1 = g$, $\hat{v}_1 = v_1 / 0.001 = g^2$.
Because the two biases cancel partially, the uncorrected update ratio is
$$\frac{m_1}{\sqrt{v_1}} = \frac{0.1\, g}{\sqrt{0.001\, g^2}} = \sqrt{10}\, \operatorname{sign}(g) \approx 3.16\, \operatorname{sign}(g),$$
while the corrected ratio is $\hat{m}_1 / \sqrt{\hat{v}_1} = \operatorname{sign}(g)$. So the raw first step overshoots by roughly $\sqrt{10} \approx 3.2\times$, not the naive $10\times$ that comes from looking at $m_1$ alone. The overshoot is still large enough to destabilize training at realistic learning rates, which is why bias correction is part of the algorithm.
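A short numeric check of the $\sqrt{10}$ ratio (the constant gradient value is arbitrary):

```python
import numpy as np

beta1, beta2, g = 0.9, 0.999, 0.5                # any nonzero constant gradient
m1, v1 = (1 - beta1) * g, (1 - beta2) * g**2
raw = m1 / np.sqrt(v1)                           # uncorrected update ratio
corrected = (m1 / (1 - beta1)) / np.sqrt(v1 / (1 - beta2))
print(raw / corrected)                           # 3.1623... = sqrt(10)
```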
How adaptive scaling changes a step
Suppose two coordinates have the same debiased first moment, $\hat{m} = 1$, but different squared-gradient histories: $\hat{v}_1 = 1$ and $\hat{v}_2 = 100$. Ignoring $\epsilon$, the update ratios are:
$$\frac{\hat{m}}{\sqrt{\hat{v}_1}} = 1, \qquad \frac{\hat{m}}{\sqrt{\hat{v}_2}} = 0.1.$$
The second coordinate moves ten times less because its recent gradients have been larger. Adam is not just adding momentum; it is changing the geometry of the update by rescaling coordinates.
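The same computation in NumPy, with the two hypothetical histories from above:

```python
import numpy as np

m_hat = np.array([1.0, 1.0])    # same debiased momentum in both coordinates
v_hat = np.array([1.0, 100.0])  # different squared-gradient histories
print(m_hat / np.sqrt(v_hat))   # [1.  0.1]: the second coordinate moves 10x less
```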
Common Confusions
Adam and AdamW are NOT interchangeable
Adam with L2 regularization and AdamW with weight decay produce different parameter trajectories, even with the same $\lambda$ value. In Adam+L2, the regularization strength varies per parameter (inversely with gradient history). In AdamW, it is uniform. Always use AdamW when you want weight decay with adaptive optimizers.
The second moment is not the variance
$v_t$ averages $g^2$, not $(g - \mathbb{E}[g])^2$. It is an uncentered squared-gradient scale used to normalize coordinates. Calling it a variance is common shorthand, but it hides the fact that a large mean gradient also makes $v_t$ large.
The epsilon parameter is not negligible
The default $\epsilon = 10^{-8}$ prevents division by zero, but in half-precision training this value may be below the useful numerical scale of the denominator path. IEEE fp16 has a minimum normal value around $6.1 \times 10^{-5}$ and unit roundoff around $4.9 \times 10^{-4}$. Many mixed-precision setups keep optimizer state in fp32 or use fused kernels, so the right fix is implementation-specific: inspect the precision path, then tune $\epsilon$, loss scaling, or the optimizer implementation.
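A small demonstration of the scale mismatch, assuming NumPy's IEEE-conformant `float16`:

```python
import numpy as np

print(np.float16(1e-8))                              # 0.0: 1e-8 underflows in fp16
print(np.sqrt(np.float16(1e-6)) + np.float16(1e-8))  # eps adds nothing to the denominator
print(np.float16(1.0) + np.float16(4e-4))            # 1.0: lost below unit roundoff near 1
```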
Fast training loss does not prove a better optimizer
Adam often reduces training loss quickly. That is not the same as better validation performance. Compare validation loss, calibration, robustness, and sensitivity to weight decay and schedule before concluding that Adam is better than SGD for a given model family.
Summary
- Adam = momentum (first moment) + adaptive LR (second moment) + bias correction
- First moment: $m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$ (smooths gradients)
- Second moment: $v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$ (uncentered squared-gradient scale)
- Bias correction: divide by $1 - \beta^t$ to correct for zero initialization
- AdamW decouples weight decay from adaptive scaling. Use AdamW, not Adam+L2
- Adam can fit quickly while still generalizing worse than tuned SGD in some regimes
- Warmup stabilizes the second moment estimate in early training
Exercises
Problem
Derive the bias-corrected first moment estimate. Starting from $m_0 = 0$ and $m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$, show that $\mathbb{E}[\hat{m}_t] = \mathbb{E}[g]$ when the gradients have constant expectation $\mathbb{E}[g]$.
Problem
Two coordinates have the same $\hat{m}$ but one has much larger $\hat{v}$. Which coordinate gets the smaller Adam step? Explain the answer in terms of squared-gradient history.
Problem
Show that for SGD (no adaptive scaling), L2 regularization (adding $\frac{\lambda}{2} \|\theta\|^2$ to the loss) is equivalent to weight decay (multiplying $\theta$ by $1 - \alpha\lambda$ each step). Then explain why this equivalence breaks for Adam.
References
Canonical:
- Kingma & Ba, "Adam: A Method for Stochastic Optimization" (ICLR 2015).
- Loshchilov & Hutter, "Decoupled Weight Decay Regularization" (ICLR 2019). Introduces AdamW.
- Goodfellow, Bengio, and Courville, Deep Learning, Chapter 8. Optimization for training deep models.
Current:
- Reddi et al., "On the Convergence of Adam and Beyond" (ICLR 2018). AMSGrad and non-convergence examples.
- Wilson et al., "The Marginal Value of Adaptive Gradient Methods in Machine Learning" (NeurIPS 2017).
- Bottou, Curtis, and Nocedal, "Optimization Methods for Large-Scale Machine Learning" (SIAM Review 2018), Sections 4-6.
- Keskar et al., "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima" (ICLR 2017).
- Dinh et al., "Sharp Minima Can Generalize for Deep Nets" (ICML 2017).
Next Topics
Adam connects to broader optimization topics:
- Learning rate schedules: warmup, cosine decay, and their interaction with Adam.
- Optimizer theory: SGD, Adam, Muon: why optimizer choice depends on geometry and architecture.
- Mixed precision training: why denominator precision, loss scaling, and fused kernels matter.
- Transformer architecture: where AdamW became the standard starting point.
Last reviewed: April 25, 2026
Canonical graph
Required prerequisites
- Gradient Descent Variants (layer 1 · tier 1)
- Stochastic Gradient Descent Convergence (layer 2 · tier 1)
Derived topics
- Learning Rate Scheduling (layer 2 · tier 1)
- Optimizer Theory: SGD, Adam, and Muon (layer 3 · tier 1)
- Batch Size and Learning Dynamics (layer 2 · tier 2)
- Mixed Precision Training (layer 3 · tier 2)
- Transformer Architecture (layer 4 · tier 2)