Training Techniques
Batch Size and Learning Dynamics
How batch size affects what SGD finds: gradient noise, implicit regularization, the linear scaling rule, sharp vs flat minima, and the gradient noise scale as the key quantity governing the tradeoff.
Why This Matters
Batch size is not just a memory/compute tradeoff. It changes the noise structure of SGD, which changes the loss landscape regions the optimizer explores, which changes the generalization properties of the trained model. Small-batch SGD and large-batch SGD can converge to different solutions with measurably different test performance.
Understanding this requires connecting optimization (convergence rate), statistics (gradient variance), and geometry (curvature of the loss surface).
Mental Model
SGD computes a gradient estimate from a mini-batch of size $B$. Small $B$ means noisy gradients: the optimizer takes jittery steps that help it escape sharp minima. Large $B$ means accurate gradients: the optimizer takes confident steps but may get stuck in the nearest sharp minimum. The noise is not a bug; it is an implicit regularizer.
The key quantity is the ratio of gradient noise to gradient signal. This ratio determines whether the optimizer behaves more like SGD (noisy, exploratory) or more like full-batch gradient descent (deterministic, exploitative).
Formal Setup
Consider minimizing the expected loss $L(\theta) = \mathbb{E}_{x \sim \mathcal{D}}\big[\ell(\theta; x)\big]$ using mini-batch SGD. A mini-batch of size $B$ gives the gradient estimate:
$$\hat{g}_B(\theta) = \frac{1}{B}\sum_{i=1}^{B} \nabla_\theta \ell(\theta; x_i), \qquad x_i \sim \mathcal{D} \text{ i.i.d.}$$
Gradient Noise Covariance
The covariance of the mini-batch gradient estimate is:
$$\mathrm{Cov}\big(\hat{g}_B(\theta)\big) = \frac{1}{B}\,\Sigma(\theta),$$
where $\Sigma(\theta) = \mathrm{Cov}_{x \sim \mathcal{D}}\big(\nabla_\theta \ell(\theta; x)\big)$ is the per-sample gradient covariance. The noise scales as $1/B$: doubling the batch size halves the gradient variance.
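To see the $1/B$ scaling concretely, here is a small sketch (numpy; the mean gradient, noise level, and dimension are made-up values for illustration) that checks the empirical covariance of mini-batch gradient means against $\mathrm{tr}\,\Sigma / B$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic per-sample gradients: fixed mean g_true, isotropic noise sigma^2 * I.
dim, n_trials = 10, 5000
g_true = rng.normal(size=dim)   # stand-in for the true gradient
sigma = 2.0                     # per-sample gradient noise std (made up)

def minibatch_grad(batch_size):
    """Average of batch_size noisy per-sample gradients."""
    samples = g_true + sigma * rng.normal(size=(batch_size, dim))
    return samples.mean(axis=0)

for B in (8, 16, 32, 64):
    grads = np.stack([minibatch_grad(B) for _ in range(n_trials)])
    total_var = grads.var(axis=0).sum()          # empirical tr Cov(g_hat)
    expected = dim * sigma**2 / B                # tr Sigma / B
    print(f"B={B:3d}  tr Cov ~ {total_var:.3f}  (expected {expected:.3f})")
```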
Gradient Noise Scale
The gradient noise scale (McCandlish et al., 2018) is:
$$B_{\mathrm{noise}} = \frac{\mathrm{tr}\,\Sigma(\theta)}{\|\nabla L(\theta)\|^{2}}.$$
This is the ratio of the total gradient variance to the squared gradient signal. When $B \ll B_{\mathrm{noise}}$, noise dominates and increasing $B$ gives near-linear speedup. When $B \gg B_{\mathrm{noise}}$, signal dominates and increasing $B$ gives diminishing returns.
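$B_{\mathrm{noise}}$ can be estimated during training by comparing gradient norms at two batch sizes, in the spirit of McCandlish et al. (2018), Appendix A. The sketch below is one plausible implementation on synthetic gradients; the dimension, signal, and noise level are invented for the demo, and in a real training loop the two gradients would come from a small accumulation window and a large one:

```python
import numpy as np

def noise_scale_estimate(g_small, g_big, b_small, b_big):
    """Estimate B_noise = tr(Sigma) / |G|^2 from gradient estimates computed
    at two batch sizes, using E[|g_B|^2] = |G|^2 + tr(Sigma) / B."""
    sq_small, sq_big = np.sum(g_small**2), np.sum(g_big**2)
    g_sq = (b_big * sq_big - b_small * sq_small) / (b_big - b_small)
    trace_sigma = (sq_small - sq_big) / (1.0 / b_small - 1.0 / b_big)
    return trace_sigma / g_sq

# Toy check: per-coordinate signal 0.1, per-sample noise std 1.0 (made up),
# so the true noise scale is tr(Sigma)/|G|^2 = 1.0 / 0.1^2 = 100.
rng = np.random.default_rng(1)
dim, signal, sigma = 1000, 0.1, 1.0
estimates = []
for _ in range(200):
    per_sample = signal + sigma * rng.normal(size=(4096, dim))
    g_small = per_sample[:64].mean(axis=0)    # "local" batch of 64
    g_big = per_sample.mean(axis=0)           # "global" batch of 4096
    estimates.append(noise_scale_estimate(g_small, g_big, 64, 4096))
print(f"estimated B_noise ~ {np.mean(estimates):.1f} (true value 100)")
```

Single estimates are noisy; in practice the numerator and denominator are typically smoothed (for example with exponential moving averages) before taking the ratio.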
Main Theorems
Linear Scaling Rule
Statement
If SGD with learning rate $\epsilon$ and batch size $B$ produces a certain training trajectory (in the continuous-time SDE approximation), then SGD with learning rate $k\epsilon$ and batch size $kB$ produces approximately the same trajectory, for any scaling factor $k$.
The effective noise temperature of SGD is:
$$T \propto \frac{\epsilon}{B}.$$
Scaling both $\epsilon$ and $B$ by $k$ preserves $T$.
Intuition
What matters for the dynamics is the ratio $\epsilon/B$, not $\epsilon$ or $B$ individually. This ratio controls the magnitude of the noise injected per step. If you use larger batches, you must use a proportionally larger learning rate to maintain the same effective noise.
Proof Sketch
In the continuous-time limit, SGD on the loss $L(\theta)$ is approximated by the SDE:
$$d\theta = -\nabla L(\theta)\,dt + \sqrt{\frac{\epsilon}{B}}\;\Sigma(\theta)^{1/2}\,dW_t.$$
The drift term is independent of $\epsilon$ and $B$ (after rescaling time as $t = \epsilon \cdot \text{step}$). The diffusion coefficient depends only on the ratio $\epsilon/B$. Scaling $\epsilon \to k\epsilon$ and $B \to kB$ preserves the diffusion coefficient.
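As a quick sanity check of the rule (under the quadratic, constant-noise assumptions above), the sketch below runs noisy SGD on a one-dimensional quadratic with $(\epsilon, B)$ and with $(k\epsilon, kB)$ for $1/k$ as many steps, and compares the stationary loss levels. All constants are made up for the toy; the point is only that both settings land at roughly the same noise floor:

```python
import numpy as np

rng = np.random.default_rng(2)
a, sigma = 1.0, 5.0   # curvature and per-sample gradient noise std (made up)

def run_sgd(lr, batch, steps, theta0=3.0, n_chains=2000):
    """Noisy SGD on L(theta) = 0.5*a*theta^2; returns the mean final loss."""
    theta = np.full(n_chains, theta0)
    for _ in range(steps):
        noise = sigma / np.sqrt(batch) * rng.normal(size=n_chains)
        theta = theta - lr * (a * theta + noise)
    return 0.5 * a * np.mean(theta**2)

# Baseline (eps, B) vs linearly scaled (k*eps, k*B) with k = 8; the scaled run
# takes 1/k as many steps so both cover the same "SDE time" eps * steps.
base = run_sgd(lr=0.01, batch=4, steps=4000)
scaled = run_sgd(lr=0.08, batch=32, steps=500)
print(f"stationary loss, baseline (eps=0.01, B=4):  {base:.4f}")
print(f"stationary loss, scaled   (eps=0.08, B=32): {scaled:.4f}")
# Both should sit near eps * sigma^2 / (4 * B) ~ 0.0156 under the SDE picture.
```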
Why It Matters
This is the theoretical justification for the practice of scaling the learning rate linearly with batch size, used in large-scale distributed training (Goyal et al., 2017). Without this rule, increasing the batch size would reduce noise and change the implicit regularization of SGD.
Failure Mode
The linear scaling rule breaks when: (1) the learning rate becomes so large that the continuous-time SDE approximation fails (discrete effects dominate), (2) the loss landscape is not smooth enough for the local diffusion approximation, or (3) during the initial transient before SGD reaches a stationary regime. In practice, a warm-up period is needed for large learning rates.
Critical Batch Size and Training Efficiency
Statement
Let $B_{\mathrm{noise}}$ be the gradient noise scale. The number of serial optimization steps $S$ needed to reach a target loss scales as:
$$S(B) = S_{\min}\left(1 + \frac{B_{\mathrm{noise}}}{B}\right),$$
where $S_{\min}$ is the minimum number of steps achievable (in the limit $B \to \infty$). The total compute (in units of samples processed) is:
$$E(B) = B \cdot S(B) = E_{\min}\left(1 + \frac{B}{B_{\mathrm{noise}}}\right), \qquad E_{\min} = S_{\min}\,B_{\mathrm{noise}}.$$
These two quantities trace out a Pareto frontier. $S$ is monotonically decreasing in $B$ (large batches minimize wall-clock time). $E$ is monotonically increasing in $B$ (small batches minimize total compute). $B_{\mathrm{noise}}$ is the knee of this frontier: below it, serial steps explode while compute barely falls; above it, compute grows roughly linearly in $B$ while steps barely decrease.
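The formulas are easy to tabulate directly. The sketch below plugs in placeholder values for $S_{\min}$ and $B_{\mathrm{noise}}$ (illustrative numbers, not measurements) and prints serial steps and total examples for a range of batch sizes, which makes the knee at $B \approx B_{\mathrm{noise}}$ visible:

```python
def steps_and_compute(batch, s_min, b_noise):
    """Tradeoff model of McCandlish et al. (2018):
    S(B) = S_min * (1 + B_noise / B),  E(B) = B * S(B)."""
    steps = s_min * (1.0 + b_noise / batch)
    return steps, batch * steps

s_min, b_noise = 10_000, 2048   # placeholder values for illustration only
for b in (256, 1024, 2048, 8192, 32768):
    s, e = steps_and_compute(b, s_min, b_noise)
    print(f"B={b:6d}  serial steps={s:9.0f}  examples processed={e/1e6:7.1f}M")
```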
Intuition
For $B \ll B_{\mathrm{noise}}$: noise dominates. Doubling $B$ nearly halves the serial step count at roughly constant total compute (near-linear speedup in time, nearly free in compute). For $B \gg B_{\mathrm{noise}}$: signal dominates. Doubling $B$ barely reduces steps, so total compute roughly doubles with almost no time reduction (pure waste). At the knee $B = B_{\mathrm{noise}}$, you pay roughly double the compute to halve the serial time: a balanced tradeoff, not a compute minimum.
Proof Sketch
Expand the loss to second order around the current iterate. The expected per-step improvement with learning rate $\epsilon$ is approximately $\Delta L(\epsilon) = \epsilon\,\|\nabla L\|^{2} - \tfrac{1}{2}\epsilon^{2}\big(\nabla L^{\top} H \nabla L + \mathrm{tr}(H\Sigma)/B\big)$. With the optimal $\epsilon$ (balancing progress against noise), the per-step improvement is $\Delta L_{\mathrm{opt}} = \Delta L_{\max}\big/\big(1 + B_{\mathrm{noise}}/B\big)$. Inverting gives the step count formula.
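To make the inversion explicit, here is the arithmetic under the same second-order model (a sketch, not a rigorous derivation). Note that $B_{\mathrm{noise}}$ in this derivation is the Hessian-weighted noise scale $\mathrm{tr}(H\Sigma)\,/\,(\nabla L^{\top} H \nabla L)$, which reduces to the simple noise scale defined earlier when the Hessian is roughly isotropic:
$$\epsilon_{\mathrm{opt}} = \frac{\|\nabla L\|^{2}}{\nabla L^{\top} H \nabla L + \mathrm{tr}(H\Sigma)/B}, \qquad \Delta L_{\mathrm{opt}} = \frac{\|\nabla L\|^{4}}{2\big(\nabla L^{\top} H \nabla L + \mathrm{tr}(H\Sigma)/B\big)} = \frac{\Delta L_{\max}}{1 + B_{\mathrm{noise}}/B}.$$
$$S(B) \approx \frac{L_{0} - L^{*}}{\Delta L_{\mathrm{opt}}} = S_{\min}\left(1 + \frac{B_{\mathrm{noise}}}{B}\right), \qquad S_{\min} := \frac{L_{0} - L^{*}}{\Delta L_{\max}}.$$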
Why It Matters
$B_{\mathrm{noise}}$ is measurable during training (estimate the gradient variance from multiple mini-batches). It marks the Pareto knee between compute efficiency and wall-clock speed: below it, time gains are cheap; above it, time gains are expensive in compute. McCandlish et al. (2018), Section 2.2, frame this as the tradeoff curve between serial steps and total examples, and measure $B_{\mathrm{noise}}$ for language models, finding that it increases during training, which is why larger batches become useful later.
Failure Mode
The analysis assumes the per-sample gradient covariance and the gradient magnitude are roughly constant over the relevant window, but both change as training progresses. The noise scale itself therefore varies over training, so the optimal batch size is not a single number but a trajectory.
Sharp vs Flat Minima
Small Batch SGD Favors Flat Minima
Statement
Under a constant-noise approximation of the SGD diffusion (Mandt, Hoffman, Blei 2017), the stationary density takes the Gibbs-like form:
$$p(\theta) \propto \exp\!\left(-\frac{L(\theta)}{T}\right), \qquad T \propto \frac{\epsilon}{B}.$$
A common heuristic extension adds a curvature-dependent prefactor to argue that flatter minima (smaller Hessian eigenvalues) have higher stationary mass. This is a useful baseline, not a rigorous result: real SGD has a state-dependent gradient covariance $\Sigma(\theta)$, which breaks detailed balance and the Gibbs form (Chaudhari and Soatto, 2018, "SGD performs variational inference").
Intuition
SGD noise acts like a temperature that helps the optimizer escape sharp minima (high curvature) but not flat minima (low curvature). The wider a minimum is, the harder it is for noise to push the iterate out. A higher noise temperature ($\epsilon/B$ large) means only the flattest minima are stable. This picture is clean only when the noise is isotropic and state-independent, which is an idealization.
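A toy simulation makes this concrete, using exactly the idealization just mentioned (constant, state-independent noise). The loss below is a hand-built one-dimensional double well with a sharp minimum and a flat minimum of equal depth; all constants are invented for the demo. Under the Gibbs heuristic, the iterate should spend the majority of its time in the flat well, roughly in the ratio $\sqrt{A/C} : 1$:

```python
import numpy as np

rng = np.random.default_rng(3)

# Piecewise-quadratic double well: sharp minimum at x = -1 (curvature A),
# flat minimum at x = +1 (curvature C), equal depth. Constants are made up.
A, C = 25.0, 1.0
crossover = (np.sqrt(C) - np.sqrt(A)) / (np.sqrt(A) + np.sqrt(C))  # basin boundary

def grad(x):
    # L(x) = min(0.5*A*(x+1)^2, 0.5*C*(x-1)^2)
    left, right = 0.5 * A * (x + 1)**2, 0.5 * C * (x - 1)**2
    return A * (x + 1) if left < right else C * (x - 1)

lr, noise_std, steps = 0.02, 7.0, 400_000
x, in_flat = 0.0, 0
for _ in range(steps):
    # Constant additive gradient noise, mimicking small-batch SGD noise.
    x = x - lr * (grad(x) + noise_std * rng.normal())
    in_flat += x > crossover
print(f"fraction of time in the flat well: {in_flat / steps:.2f}")
# The Gibbs heuristic predicts roughly sqrt(A/C) = 5:1 in favor of the flat
# well (~0.83); discrete-step effects shift the simulated value somewhat.
```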
Proof Sketch
Approximate SGD as a Langevin diffusion with a constant noise covariance $\Sigma$, following Mandt et al. (2017), who analyze the Ornstein-Uhlenbeck limit around a local minimum. Under constant covariance, the stationary distribution of Langevin dynamics is the Gibbs measure $p(\theta) \propto \exp(-L(\theta)/T)$. When $\Sigma$ depends on $\theta$, this derivation no longer applies: Chaudhari and Soatto (2018) show the stationary measure is generally non-Gibbsian and picks up correction terms.
Why It Matters
The Gibbs heuristic gives a theoretical narrative for the empirical observation that small-batch SGD often generalizes better than large-batch SGD: flat minima are associated with better generalization because nearby points have similar loss. This is an observation, not a theorem about generalization itself, and the underlying stationary-density formula is an approximation, not a rigorous identity for real SGD.
Failure Mode
The sharp/flat minima story has known weaknesses. Dinh et al. (2017) showed you can reparameterize a network to make any minimum arbitrarily sharp without changing the function it computes. The SDE approximation also breaks for large learning rates. The state-dependent gradient covariance in real SGD violates the constant-noise assumption that makes the Gibbs form exact. Flatness measured by the Hessian trace may not correlate with generalization in all architectures.
Practical Implications
Warm-up. When using the linear scaling rule with large batch sizes, the target learning rate is large. A warm-up period (increasing $\epsilon$ linearly from a small value over the first few epochs) prevents divergence during the initial transient, when the loss landscape curvature is high.
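A minimal sketch of the combined recipe (linear scaling plus linear warm-up). The base learning rate, reference batch size, target batch size, and warm-up length below are illustrative placeholders, not recommendations:

```python
def scaled_lr_with_warmup(step, *, base_lr=0.1, base_batch=256,
                          batch=8192, warmup_steps=2500):
    """Linear scaling rule with a linear warm-up: ramp from near zero to the
    scaled target learning rate base_lr * batch / base_batch, then hold it."""
    target_lr = base_lr * batch / base_batch
    if step < warmup_steps:
        return target_lr * (step + 1) / warmup_steps
    return target_lr

# With these placeholder numbers, the peak learning rate is 0.1 * 8192/256 = 3.2.
for step in (0, 1249, 2499, 2500, 10_000):
    print(step, round(scaled_lr_with_warmup(step), 4))
```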
LARS and LAMB. For very large batch sizes, layer-wise adaptive learning rates (LARS for SGD, LAMB for Adam) adjust the step size per layer based on the ratio of parameter norm to gradient norm. This compensates for different layers having different gradient noise scales.
Diminishing returns. When training large language models, $B_{\mathrm{noise}}$ is measured in tokens and typically grows over the course of training. Beyond it, more parallelism costs compute roughly linearly in $B$ while reducing serial steps only marginally, so wall-clock time does not fall in proportion to the compute you spend.
Canonical Examples
ResNet-50 on ImageNet: batch size scaling
Goyal et al. (2017) trained ResNet-50 with batch sizes from 256 to 8192. With the linear scaling rule ($\epsilon \propto B$) and warm-up, they achieved equivalent accuracy across all batch sizes. At $B = 8192$ (32x the 256 baseline), training completed in 1 hour on 256 GPUs. Beyond $B = 8192$, accuracy began to degrade, consistent with $B_{\mathrm{noise}}$ being roughly in that range for this task.
Gradient noise scale in language modeling
McCandlish et al. (2018) measured $B_{\mathrm{noise}}$ for a Transformer language model during training. Early in training, $B_{\mathrm{noise}}$ is small: the gradient signal is large relative to the per-sample noise, so small batches suffice. Late in training, $B_{\mathrm{noise}}$ is large: the model is near a minimum, gradients are small relative to the noise, so you need large batches to estimate the descent direction accurately.
Common Confusions
B_noise is not a compute minimum
$B_{\mathrm{noise}}$ is the knee of a Pareto frontier, not a minimizer of total compute. Total compute is strictly increasing in $B$: the compute-optimal batch is the smallest one you can run. Serial steps are strictly decreasing in $B$: the time-optimal batch is as large as you can parallelize. $B_{\mathrm{noise}}$ is a reasonable default because it roughly doubles compute to halve serial time, a balanced point on the tradeoff curve.
The linear scaling rule is not a law
It is an approximation that holds when the SDE limit is valid, which requires $\epsilon$ to be small relative to the inverse curvature of the loss. For very large learning rates or very large batch sizes, discrete-time effects (finite step size) break the continuous-time approximation. In practice, the rule works well up to some critical batch size and then fails.
Flat minima are not guaranteed to generalize better
The flat minima hypothesis is plausible but not proven. Reparameterization can change the curvature without changing the function. PAC-Bayes bounds provide some theoretical support (flat minima correspond to wide posteriors with low KL penalty), but the connection is not definitive. Treat the flat minima story as a useful heuristic, not a theorem.
Summary
- Batch size controls the noise-to-signal ratio of SGD gradients
- The gradient noise scale $B_{\mathrm{noise}}$ is the critical batch size where noise and signal are balanced
- Linear scaling rule: scale learning rate proportionally with batch size to preserve dynamics
- Small batches inject more noise, which can help escape sharp minima (implicit regularization)
- Beyond $B_{\mathrm{noise}}$, larger batches give diminishing returns: serial steps barely decrease while total compute grows roughly linearly
- The connection between flatness and generalization is suggestive but not proven
Exercises
Problem
You are training with batch size $B$ and learning rate $\epsilon$. You want to switch to batch size $kB$. What learning rate should you use according to the linear scaling rule? What quantity is preserved?
Problem
Suppose you have measured your model's gradient noise scale $B_{\mathrm{noise}}$. You have access to 16 GPUs, each handling a local batch of 128, so your global batch size is $B = 2048$. A colleague offers you 16 more GPUs. Should you double the global batch size to 4096? Justify using the step count formula; your answer should depend on how $B_{\mathrm{noise}}$ compares to the current batch size.
References
Canonical:
- Goyal et al., "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour" (2017)
- McCandlish et al., "An Empirical Model of Large-Batch Training" (2018)
Current:
- Smith et al., "Don't Decay the Learning Rate, Increase the Batch Size" (ICLR, 2018)
- Hoffer et al., "Train Longer, Generalize Better" (NeurIPS, 2017)
- Dinh et al., "Sharp Minima Can Generalize for Deep Nets" (ICML, 2017)
- Mandt, Hoffman, Blei, "Stochastic Gradient Descent as Approximate Bayesian Inference" (JMLR, 2017)
- Chaudhari and Soatto, "Stochastic Gradient Descent Performs Variational Inference, Converges to Limit Cycles for Deep Networks" (ICLR, 2018)
Next Topics
Natural extensions from batch size dynamics:
- Learning rate schedules: how to adjust learning rate over training, complementary to batch size choices
- Distributed training theory: communication costs and gradient compression when parallelizing across many workers
Last reviewed: April 18, 2026