Training Techniques
Label Smoothing and Regularization
Replace hard targets with soft targets to prevent overconfidence: the label smoothing formula, its connection to maximum entropy regularization, calibration effects, and when it hurts.
Prerequisites
Why This Matters
A classifier trained with standard cross-entropy loss on one-hot labels is incentivized to push logits toward infinity: the loss decreases as the predicted probability of the correct class approaches 1. This produces overconfident predictions where the model assigns near-zero probability to all incorrect classes, even when the true label is ambiguous.
Label smoothing is a simple fix: replace the one-hot target with a soft target that reserves some probability mass for incorrect classes. This single change improves calibration, reduces overfitting (complementing other techniques like dropout and batch normalization), and often improves test accuracy.
Formal Setup
Hard Target
For a $K$-class classification problem, the standard one-hot target for class $c$ is:

$$q(k) = \mathbb{1}[k = c]$$

That is, $q(c) = 1$ and every incorrect class receives probability $0$.
Label-Smoothed Target
For smoothing parameter $\varepsilon \in (0, 1)$, the label-smoothed target is:

$$q'(k) = (1 - \varepsilon)\, q(k) + \frac{\varepsilon}{K}$$

For the correct class $c$: $q'(c) = 1 - \varepsilon + \varepsilon/K$. For incorrect classes: $q'(k) = \varepsilon/K$.
The cross-entropy loss with label-smoothed targets is:

$$\mathcal{L}_{\mathrm{LS}} = -\sum_{k=1}^{K} q'(k) \log p(k)$$
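As a concrete sketch of the setup above (NumPy; the helper names are my own, not from the source):

```python
import numpy as np

def smooth_targets(labels, num_classes, eps):
    """q'(k) = (1 - eps) * q(k) + eps / K, applied to a batch of integer labels."""
    one_hot = np.eye(num_classes)[labels]
    return (1.0 - eps) * one_hot + eps / num_classes

def smoothed_cross_entropy(logits, labels, eps):
    """Cross-entropy of softmax(logits) against the label-smoothed targets."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_p = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    q = smooth_targets(labels, logits.shape[-1], eps)
    return -(q * log_p).sum(axis=-1)
```

With `eps=0` this reduces to standard cross-entropy, since the targets collapse back to one-hot.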
Main Theorem
Label Smoothing as KL Regularization
Statement
The label-smoothed cross-entropy loss decomposes as:

$$\mathcal{L}_{\mathrm{LS}} = (1 - \varepsilon)\, H(q, p) + \varepsilon\, H(u, p)$$

where $H(q, p)$ is the standard cross-entropy with the hard label, $H(u, p)$ is the cross-entropy with the uniform distribution $u(k) = 1/K$, and $p$ is the model's softmax output.

Equivalently:

$$\mathcal{L}_{\mathrm{LS}} = (1 - \varepsilon)\, H(q, p) + \varepsilon\, \bigl( D_{\mathrm{KL}}(u \,\|\, p) + H(u) \bigr)$$

The term $\varepsilon\, D_{\mathrm{KL}}(u \,\|\, p)$ (where $D_{\mathrm{KL}}$ is the KL divergence) penalizes the model for being far from the uniform distribution, encouraging higher entropy in predictions.
Intuition
Label smoothing adds a penalty for overconfidence. The term $\varepsilon\, H(u, p)$ is minimized when $p$ is uniform (maximum uncertainty). The standard loss term $(1 - \varepsilon)\, H(q, p)$ pushes $p$ toward the correct class. The balance between these two forces prevents the model from pushing logits to extreme values.
Proof Sketch
Expand $H(q', p) = -\sum_k \bigl[(1 - \varepsilon)\, q(k) + \varepsilon/K\bigr] \log p(k) = (1 - \varepsilon)\, H(q, p) + \varepsilon\, H(u, p)$. Substitute $H(u, p) = D_{\mathrm{KL}}(u \,\|\, p) + H(u)$ using the definition of KL divergence.
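Both identities in the decomposition can be checked numerically; this is a sketch with arbitrary logits, not code from the source:

```python
import numpy as np

rng = np.random.default_rng(0)
K, eps = 5, 0.1
c = 2  # true class

logits = rng.normal(size=K)
p = np.exp(logits) / np.exp(logits).sum()   # softmax output

q_hard = np.eye(K)[c]                        # one-hot target
q_smooth = (1 - eps) * q_hard + eps / K      # smoothed target
u = np.full(K, 1.0 / K)                      # uniform distribution

def H(q, p):
    """Cross-entropy H(q, p) = -sum_k q(k) log p(k)."""
    return -(q * np.log(p)).sum()

# L_LS = (1 - eps) H(q, p) + eps H(u, p)
assert np.isclose(H(q_smooth, p), (1 - eps) * H(q_hard, p) + eps * H(u, p))

# H(u, p) = KL(u || p) + H(u), with H(u) = log K
kl = (u * np.log(u / p)).sum()
assert np.isclose(H(u, p), kl + np.log(K))
```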
Why It Matters
This decomposition reveals that label smoothing is equivalent to standard training plus a maximum entropy regularizer. The strength of regularization is controlled by $\varepsilon$. Typical values are $\varepsilon \in [0.05, 0.2]$. The original Inception v2 paper (Szegedy et al., 2016) used $\varepsilon = 0.1$ and the original Transformer paper (Vaswani et al., 2017) used $\varepsilon_{ls} = 0.1$.
Failure Mode
Label smoothing assumes all incorrect classes are equally plausible (the $\varepsilon$ mass is spread uniformly across wrong classes). When there is strong class hierarchy (e.g., misclassifying a dog as a cat is more reasonable than misclassifying a dog as a truck), uniform smoothing wastes probability mass on implausible classes. Non-uniform smoothing based on class similarity can help but requires additional information.
Effects on Calibration
A model is well-calibrated if and only if its predicted probabilities match empirical frequencies: when it says "80% probability of class A," class A occurs about 80% of the time.
Standard training with hard labels produces overconfident models: predicted probabilities are too close to 0 and 1. Label smoothing reduces overconfidence because the smoothed loss can never reach zero: its minimum, attained at $p = q'$, is the entropy $H(q')$ of the smoothed target. (A model that still outputs the one-hot prediction does worse, not better, under smoothing: assigning probability near $0$ to incorrect classes makes the $-(\varepsilon/K) \log p(k)$ terms blow up.) For small $\varepsilon$ and moderate $K$, $H(q')$ is small, but the loss floor is $H(q')$ exactly, not $0$.
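A minimal sketch of the loss-floor argument (values $K = 10$, $\varepsilon = 0.1$ chosen for illustration):

```python
import numpy as np

K, eps = 10, 0.1
q = np.full(K, eps / K)
q[0] = 1 - eps + eps / K        # smoothed target for class 0

# Loss floor: the smoothed loss is minimized at p = q', where it equals H(q')
floor = -(q * np.log(q)).sum()

# A near-one-hot prediction is worse: the eps/K entries multiply log(tiny)
p = np.full(K, 1e-8)
p[0] = 1.0 - (K - 1) * 1e-8
near_one_hot_loss = -(q * np.log(p)).sum()

assert near_one_hot_loss > floor > 0
```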
Muller et al. (2019) showed that label smoothing improves calibration (lower expected calibration error) but introduces a subtle bias: the penultimate layer representations become more clustered. This clustering can hurt knowledge distillation because the teacher's soft outputs contain less inter-class information.
Effect on Logit Magnitudes
Without label smoothing, minimizing cross-entropy drives the logit of the correct class toward $+\infty$ and all other logits toward $-\infty$. The loss approaches zero only in this limit.
With label smoothing at parameter $\varepsilon$, the optimal softmax output matches the smoothed target, so the optimal logits satisfy:

$$z_c - z_k = \log \frac{1 - \varepsilon + \varepsilon/K}{\varepsilon/K} = \log\!\left( \frac{K(1 - \varepsilon)}{\varepsilon} + 1 \right)$$

This is finite. The gap between the correct-class logit and other logits is bounded by a quantity that depends on $\varepsilon$ and $K$. For $K = 10$ and $\varepsilon = 0.1$: the optimal gap is $\log 91 \approx 4.51$, which is moderate.
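The finite gap at the optimum $p^* = q'$ can be evaluated directly; a sketch (function name is my own):

```python
import numpy as np

def optimal_logit_gap(K, eps):
    """z_c - z_k at the optimum p* = q': log((1 - eps + eps/K) / (eps/K))."""
    return np.log((1 - eps + eps / K) / (eps / K))

gap = optimal_logit_gap(10, 0.1)   # log(91), roughly 4.51
```

As `eps` shrinks toward 0 the gap grows without bound, recovering the hard-label limit.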
When NOT to Use Label Smoothing
Calibrated probability estimates are required. If downstream decisions depend on the exact predicted probability (e.g., medical diagnosis thresholds, betting markets), label smoothing changes the probability scale in ways that require recalibration. The model no longer tries to output true probabilities but instead targets a smoothed version.
Knowledge distillation from the model. Muller et al. (2019) found that label-smoothed teachers produce worse students than hard-label teachers. The clustering of penultimate features reduces the information content of soft targets.
Extreme class imbalance. With $K$ classes and smoothing $\varepsilon$, each wrong class gets probability $\varepsilon/K$ from smoothing; for large $K$ (e.g., $K = 10^4$ and $\varepsilon = 0.1$ give $10^{-5}$ per class), this negligible amount provides no regularization benefit for rare classes while still degrading the signal for common classes.
Common Confusions
Label smoothing is not the same as mixup
Mixup creates new training examples by interpolating between input-label pairs: $\tilde{x} = \lambda x_i + (1 - \lambda) x_j$, $\tilde{y} = \lambda y_i + (1 - \lambda) y_j$. The soft labels in mixup reflect actual interpolation of inputs. Label smoothing applies a fixed softening to every example regardless of the input. Mixup is data augmentation; label smoothing is regularization.
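The contrast can be made concrete; a sketch with made-up two-feature inputs and $\lambda = 0.7$:

```python
import numpy as np

K, eps = 3, 0.1
x_i, x_j = np.array([1.0, 0.0]), np.array([0.0, 1.0])
y_i, y_j = np.eye(K)[0], np.eye(K)[2]

# Mixup: inputs AND labels interpolated with a per-example lambda
lam = 0.7
x_mix = lam * x_i + (1 - lam) * x_j
y_mix = lam * y_i + (1 - lam) * y_j      # (0.7, 0.0, 0.3): depends on the pair drawn

# Label smoothing: the same fixed softening for every example; the input is ignored
y_ls = (1 - eps) * y_i + eps / K
```

Note the mixup label puts mass only on the two classes actually mixed, while the smoothed label spreads mass over all classes.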
Label smoothing does not change the argmax
The optimal prediction under label smoothing still assigns the highest probability to the correct class. Smoothing only prevents the model from being infinitely confident. The ranking of classes is preserved; only the magnitude of probabilities changes.
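A quick sketch confirming the claim: smoothing rescales the target but leaves the class ordering intact (values chosen for illustration):

```python
import numpy as np

K, eps = 4, 0.2
q_hard = np.eye(K)[1]
q_smooth = (1 - eps) * q_hard + eps / K   # (0.05, 0.85, 0.05, 0.05)

# Same argmax, same ranking; only the magnitudes change
assert q_smooth.argmax() == q_hard.argmax() == 1
assert q_smooth[1] > q_smooth[0]
```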
Canonical Examples
Effect on a 3-class problem
With $K = 3$ and $\varepsilon = 0.1$, the target for class 1 changes from $(1, 0, 0)$ to approximately $(0.933, 0.033, 0.033)$. The cross-entropy loss for a model predicting $p = (0.9, 0.05, 0.05)$: standard loss is $-\log 0.9 \approx 0.105$; smoothed loss is $\approx 0.298$. The smoothed loss is higher because it also penalizes the low probability assigned to incorrect classes.
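This worked example can be checked numerically; a sketch assuming $K = 3$, $\varepsilon = 0.1$, and a prediction of $(0.9, 0.05, 0.05)$:

```python
import numpy as np

K, eps = 3, 0.1
q = (1 - eps) * np.eye(K)[0] + eps / K   # smoothed target for class 1
p = np.array([0.9, 0.05, 0.05])          # assumed model prediction

standard = -np.log(p[0])                 # hard-label cross-entropy
smoothed = -(q * np.log(p)).sum()        # label-smoothed cross-entropy
```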
Exercises
Problem
For $K = 5$ classes and $\varepsilon = 0.1$, what is the label-smoothed target vector for class 3? What is the minimum achievable smoothed cross-entropy loss, attained by a predictor that always outputs the label-smoothed target?
Problem
Prove that the optimal softmax output $p^*$ minimizing the label-smoothed loss satisfies $p^*(c)/p^*(k) = (1 - \varepsilon + \varepsilon/K)/(\varepsilon/K)$ for all $k \neq c$. What does this ratio approach as $\varepsilon \to 0$?
References
Canonical:
- Szegedy et al., "Rethinking the Inception Architecture" (CVPR 2016), Section 7. Introduces label smoothing as an Inception-v2 regularizer with $\varepsilon = 0.1$.
- Vaswani et al., "Attention Is All You Need" (NeurIPS 2017), Section 5.4. Transformer training with $\varepsilon_{ls} = 0.1$; notes smoothing hurts perplexity but helps accuracy and BLEU.
- Pereyra, Tucker, Chorowski, Kaiser, Hinton, "Regularizing Neural Networks by Penalizing Confident Output Distributions" (ICLR Workshop 2017, arXiv:1701.06548). Formal confidence-penalty / max-entropy framing that label smoothing implements.
Current:
- Muller, Kornblith, Hinton, "When Does Label Smoothing Help?" (NeurIPS 2019, arXiv:1906.02629). Calibration improves; penultimate features cluster; distillation from a smoothed teacher is worse.
- Guo, Pleiss, Sun, Weinberger, "On Calibration of Modern Neural Networks" (ICML 2017, arXiv:1706.04599). Temperature scaling as a post-hoc calibration alternative to label smoothing.
- Lin, Goyal, Girshick, He, Dollar, "Focal Loss for Dense Object Detection" (ICCV 2017, arXiv:1708.02002). Alternative cross-entropy modification via a focusing factor for class-imbalanced settings where label smoothing is ineffective.
Next Topics
Label smoothing connects to the broader study of regularization, calibration, and training techniques for neural networks.
Last reviewed: April 26, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Logistic Regression (layer 1 · tier 1)
Derived topics
No published topic currently declares this as a prerequisite.