Paper breakdown
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Sergey Ioffe and Christian Szegedy · ICML 2015
Inserts a normalization layer that standardizes activations per feature across the batch dimension, then learns a per-channel affine rescaling. Enables higher learning rates, reduces sensitivity to initialization, and was the default normalization layer in deep CNNs for half a decade.
Overview
Ioffe and Szegedy (2015) inserted a normalization layer between the linear and nonlinear sublayers of a deep network. For each feature dimension and each minibatch, the layer subtracts the batch mean and divides by the batch standard deviation, then applies a learnable affine map. The paper reports matching the ImageNet accuracy of a baseline Inception (GoogLeNet) network with 14× fewer training steps, and surpassing its final accuracy.
The proposed mechanism — preventing the input distribution to each layer from drifting during training, called "internal covariate shift" — was the paper's headline claim. Subsequent work (Santurkar et al., 2018; Bjorck et al., 2018) showed that the empirical benefits do not actually correlate with reduced covariate shift: networks where covariate shift is artificially injected after batch norm train just as well. The accepted modern explanation is different — batch norm makes the loss surface smoother and effectively rescales gradients per layer — but the architectural change itself stuck.
By 2017 batch norm was the default in almost every CNN. Transformers, which came to dominate by 2020, use LayerNorm instead, which avoids the batch-statistics dependence and works in token-by-token decoding. The original paper, however, set the template for normalization layers across deep learning.
Mathematical Contributions
The transformation
Given a minibatch $\{x_1, \dots, x_m\}$ of activations at one feature dimension, batch normalization computes the batch statistics

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2,$$

then normalizes and rescales:

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta.$$

The parameters $\gamma$ and $\beta$ are learned per feature. The constant $\epsilon$ avoids division by zero. For convolutional layers the same $\gamma, \beta$ are shared across spatial positions and the statistics are averaged across the spatial dimensions as well, so normalization is per feature map rather than per pixel.
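A minimal NumPy sketch of the training-time forward pass; the function and variable names (`batchnorm_forward`, `gamma`, `beta`, `eps`) are illustrative, not from the paper.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Training-time batch norm for an input of shape (batch, features).

    Statistics are computed per feature over the batch dimension, then the
    learned per-feature affine map (gamma, beta) is applied.
    """
    mu = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                    # per-feature batch variance (biased, 1/m)
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    y = gamma * x_hat + beta               # learned rescale and shift
    return y, (x_hat, mu, var)

# For a convolutional input of shape (batch, H, W, channels) the same idea
# applies with axis=(0, 1, 2): one (gamma, beta) pair per feature map.
```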
Learning the affine rescale
The pair $(\gamma, \beta)$ is critical. Without it, the layer would force every activation distribution to mean 0 and variance 1, which limits the representational capacity of the network; in particular it prevents the layer below from saturating its nonlinearity if that is what the loss prefers. With the affine rescale, the network can recover any output distribution, including the un-normalized pre-norm distribution, by setting $\gamma = \sqrt{\sigma_B^2 + \epsilon}$ and $\beta = \mu_B$. Initialization is $\gamma = 1$, $\beta = 0$, which starts training in the normalized regime.
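A quick numerical check of the recovery claim, reusing the hypothetical `batchnorm_forward` sketch above: with $\gamma$ set to the batch standard deviation and $\beta$ to the batch mean, the layer returns its input unchanged.

```python
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 8))  # un-normalized activations

mu, var = x.mean(axis=0), x.var(axis=0)
y, _ = batchnorm_forward(x, gamma=np.sqrt(var + 1e-5), beta=mu)

print(np.allclose(y, x))  # True: the affine rescale undoes the normalization
```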
Train vs test statistics
At inference time minibatches are not available, so the paper replaces the batch statistics with population estimates accumulated as moving averages during training:

$$\mathrm{E}[x] \leftarrow \mathrm{E}_B[\mu_B], \qquad \mathrm{Var}[x] \leftarrow \frac{m}{m-1}\,\mathrm{E}_B[\sigma_B^2].$$

The factor $\frac{m}{m-1}$ is Bessel's correction. Inference becomes a deterministic affine transformation

$$y = \frac{\gamma}{\sqrt{\mathrm{Var}[x] + \epsilon}}\, x + \left(\beta - \frac{\gamma\,\mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}}\right),$$

which can be folded into the preceding linear layer at deployment for zero runtime cost.
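A sketch of the inference-time bookkeeping under common assumptions: exponential moving averages controlled by a `momentum` hyperparameter (the paper leaves the exact averaging scheme open), and folding into a preceding fully connected layer with weights `W` and bias `b`. All names here are illustrative.

```python
def update_running_stats(running_mean, running_var, mu, var, m, momentum=0.1):
    # Moving averages of the batch statistics; the m/(m-1) factor is Bessel's
    # correction so the running variance estimates the population variance.
    running_mean = (1 - momentum) * running_mean + momentum * mu
    running_var = (1 - momentum) * running_var + momentum * var * m / (m - 1)
    return running_mean, running_var

def fold_into_linear(W, b, running_mean, running_var, gamma, beta, eps=1e-5):
    """Fold gamma * (W @ x + b - mean) / sqrt(var + eps) + beta into one affine map."""
    scale = gamma / np.sqrt(running_var + eps)  # one scale per output feature
    W_folded = W * scale[:, None]               # rows of W produce output features
    b_folded = (b - running_mean) * scale + beta
    return W_folded, b_folded
```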
Backpropagation through the normalization
The Jacobian is non-trivial because $\hat{x}_i$ depends on the entire batch through $\mu_B$ and $\sigma_B^2$. The paper writes out the chain rule:

$$\frac{\partial \ell}{\partial \hat{x}_i} = \frac{\partial \ell}{\partial y_i}\,\gamma$$

$$\frac{\partial \ell}{\partial \sigma_B^2} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_i}\,(x_i - \mu_B)\cdot\left(-\tfrac{1}{2}\right)\left(\sigma_B^2 + \epsilon\right)^{-3/2}$$

$$\frac{\partial \ell}{\partial \mu_B} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_i}\cdot\frac{-1}{\sqrt{\sigma_B^2 + \epsilon}} + \frac{\partial \ell}{\partial \sigma_B^2}\cdot\frac{1}{m}\sum_{i=1}^{m} -2\,(x_i - \mu_B)$$

$$\frac{\partial \ell}{\partial x_i} = \frac{\partial \ell}{\partial \hat{x}_i}\cdot\frac{1}{\sqrt{\sigma_B^2 + \epsilon}} + \frac{\partial \ell}{\partial \sigma_B^2}\cdot\frac{2\,(x_i - \mu_B)}{m} + \frac{\partial \ell}{\partial \mu_B}\cdot\frac{1}{m}$$

$$\frac{\partial \ell}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i}\,\hat{x}_i, \qquad \frac{\partial \ell}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i}$$

The two correction terms in $\partial \ell / \partial x_i$, flowing through $\partial \ell / \partial \sigma_B^2$ and $\partial \ell / \partial \mu_B$, are what couple gradients across the batch: a sample's gradient depends on the batch mean of the upstream gradient and on the batch dot product of the upstream gradient with $\hat{x}$. This coupling is the source of the implicit regularization but also of the train/test mismatch under small batches.
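The same chain rule as code, a sketch that assumes the cache returned by the hypothetical `batchnorm_forward` above.

```python
def batchnorm_backward(dy, x, gamma, cache, eps=1e-5):
    """Backward pass; dy is dL/dy with shape (batch, features)."""
    x_hat, mu, var = cache
    m = x.shape[0]
    inv_std = 1.0 / np.sqrt(var + eps)

    dgamma = np.sum(dy * x_hat, axis=0)
    dbeta = np.sum(dy, axis=0)

    dx_hat = dy * gamma
    dvar = np.sum(dx_hat * (x - mu), axis=0) * -0.5 * inv_std**3
    dmu = np.sum(dx_hat, axis=0) * -inv_std + dvar * np.mean(-2.0 * (x - mu), axis=0)
    # The dvar and dmu terms are the batch-coupling corrections discussed above.
    dx = dx_hat * inv_std + dvar * 2.0 * (x - mu) / m + dmu / m
    return dx, dgamma, dbeta
```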
What the paper claims it does
The paper frames the mechanism as reducing "internal covariate shift", the change in the distribution of layer inputs during training. The argument is that if each layer must keep up with a moving input distribution, learning is harder; freezing the input distribution to mean 0 and variance 1 should help. This explanation has not aged well.
What it actually does
Santurkar et al. (2018) inject controlled covariate shift after batch norm and show training proceeds at the same rate. Their measurement of the loss surface's Lipschitz and smoothness constants is the surviving explanation: batch norm reduces the largest eigenvalue of the loss Hessian along the gradient direction by a factor on the order of $\gamma^2/\sigma_B^2$, which makes larger learning rates safe. Bjorck et al. (2018) trace the same effect to a per-layer rescaling of the effective learning rate. Both results are post-hoc justifications, not the original argument.
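A sketch of the kind of intervention Santurkar et al. describe: a fresh random per-feature scale and shift applied after the batch norm output at every step, which reintroduces distribution drift while leaving the normalization itself in place. Ranges and names here are illustrative.

```python
def inject_covariate_shift(y_normalized, rng, scale_range=(0.5, 2.0), shift_range=(-1.0, 1.0)):
    # Time-varying, per-feature change of distribution applied *after* batch norm.
    num_features = y_normalized.shape[1]
    scale = rng.uniform(*scale_range, size=num_features)
    shift = rng.uniform(*shift_range, size=num_features)
    return y_normalized * scale + shift
```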
Connections to TheoremPath Topics
- Batch normalization — modern treatment including pre/post-activation placement and the LayerNorm/RMSNorm families.
- Weight initialization — He and Glorot variances are derived to keep activation scale stable across depth, which batch norm enforces directly.
- Gradient flow and vanishing gradients — the depth-of-training problem batch norm partially addressed before residual connections.
- Dropout — the prior dominant regularizer; batch norm partially replaced it in practice.
- Convolutional neural networks — the architecture family this paper transformed.
- Residual stream and transformer internals — why transformers use LayerNorm and not batch norm.
Why It Matters Now
Batch normalization is the canonical example of a deep learning method that works without the mechanistic story being correct. The paper's covariate-shift framing motivated a generation of follow-on work (LayerNorm, GroupNorm, InstanceNorm, RMSNorm), all of which inherit the architectural pattern — affine rescaling of normalized features — but explicitly reject the batch-statistics dependence.
For modern training the picture has shifted. Transformers use LayerNorm because batch norm breaks under sequence-length variability, distributed training, and small per-device batches. Diffusion models often use GroupNorm. Recent LLMs (the LLaMA family among others) use RMSNorm, which drops the mean subtraction, normalizes only by the root mean square, drops $\beta$, and keeps $\gamma$; it is faster and works just as well. None of these would exist without the original paper.
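A minimal RMSNorm sketch for comparison, normalizing each sample over its feature dimension with no centering and no bias; names are illustrative.

```python
def rmsnorm(x, gamma, eps=1e-6):
    """RMSNorm over the last (feature) axis: x / rms(x) * gamma."""
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    return x / rms * gamma
```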
The empirical legacy is also that "throwing in a normalization layer" became a default reflex. If a deep network struggles to train, the first things to try are residual connections and a normalization layer; this paper is the second of those.
References
Canonical:
- Ioffe, S., & Szegedy, C. (2015). "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift." ICML. arXiv:1502.03167.
Critique:
- Santurkar, S., Tsipras, D., Ilyas, A., & Madry, A. (2018). "How Does Batch Normalization Help Optimization?" NeurIPS. arXiv:1805.11604. Shows the covariate-shift explanation is wrong; the surviving mechanism is loss-surface smoothing.
- Bjorck, J., Gomes, C., Selman, B., & Weinberger, K. Q. (2018). "Understanding Batch Normalization." NeurIPS. arXiv:1806.02375. Per-layer effective learning-rate rescaling.
Direct descendants:
- Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). "Layer Normalization." arXiv:1607.06450. Normalize across features instead of across the batch — required by transformers.
- Wu, Y., & He, K. (2018). "Group Normalization." ECCV. arXiv:1803.08494. Channel groups; works at small batch size.
- Ulyanov, D., Vedaldi, A., & Lempitsky, V. (2016). "Instance Normalization: The Missing Ingredient for Fast Stylization." arXiv:1607.08022. Per-sample, per-channel.
- Zhang, B., & Sennrich, R. (2019). "Root Mean Square Layer Normalization." NeurIPS. arXiv:1910.07467. Drops mean centering; the LLaMA/Mistral default.
Follow-on architectural work:
- He, K., Zhang, X., Ren, S., & Sun, J. (2016). "Deep Residual Learning for Image Recognition." CVPR. arXiv:1512.03385. Solves the depth problem more decisively than batch norm did.
- Brock, A., et al. (2021). "High-Performance Large-Scale Image Recognition Without Normalization." ICML. arXiv:2102.06171. NFNets: ImageNet at the top of the leaderboard with no normalization layer at all.
Standard textbook:
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Section 8.7.1.
- Prince, S. J. D. (2023). Understanding Deep Learning. MIT Press. Section 11.4.
Last reviewed: May 5, 2026