Loss Functions Catalog
A systematic catalog of loss functions for classification, regression, and representation learning: cross-entropy, MSE, hinge, Huber, focal, KL divergence, and contrastive loss.
Prerequisites
Logistic Regression
Why This Matters
The loss function defines what "good" means for your model. Two models with identical architectures trained with different loss functions will learn different things. In many practical settings, switching the loss function improves performance more than switching the architecture. The choice of loss encodes your assumptions about the problem: noise distribution, outlier sensitivity, class balance, and what errors cost. See cross-entropy loss deep dive for a detailed treatment of the most common classification loss.
From the convex optimization perspective, convexity of the loss in the parameters matters: it determines whether local minima are global, and whether subgradient methods converge. Most standard loss functions (cross-entropy, MSE, Huber, hinge) are convex in the predictions, but the composed loss through a neural network is non-convex in the weights.
Mental Model
A loss function measures the cost of predicting $\hat{y}$ when the truth is $y$. Different losses penalize different types of errors. MSE penalizes large errors quadratically, making it the natural choice for linear regression under Gaussian noise. MAE penalizes all errors linearly (robust to outliers). Cross-entropy penalizes confident wrong classification predictions severely. The right loss depends on what errors matter in your application.
Classification Losses
Cross-Entropy Loss
For a classification problem with $K$ classes, the cross-entropy loss for a single example with true label $\mathbf{y}$ (one-hot encoded) and predicted probabilities $\hat{\mathbf{y}}$ is:
$$\mathcal{L}_{CE} = -\sum_{k=1}^{K} y_k \log \hat{y}_k$$
For binary classification with $y \in \{0, 1\}$ and predicted probability $\hat{y} = P(y = 1)$:
$$\mathcal{L}_{BCE} = -\left[y \log \hat{y} + (1 - y)\log(1 - \hat{y})\right]$$
Cross-entropy has a critical property: as $\hat{y}_k \to 0$ for the true class, the loss goes to infinity. This severe penalty for confident wrong predictions drives the model to assign high probability to the correct class.
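A minimal NumPy sketch of both forms (the function names and the clipping epsilon are illustrative choices, not from the text above):

```python
import numpy as np

def cross_entropy(y_onehot, y_prob, eps=1e-12):
    """Multi-class cross-entropy for one example.
    y_onehot: one-hot true label; y_prob: predicted probabilities (sum to 1)."""
    y_prob = np.clip(y_prob, eps, 1.0)   # avoid log(0) for confident wrong predictions
    return -np.sum(y_onehot * np.log(y_prob))

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Binary cross-entropy: y in {0, 1}, y_hat is the predicted P(y = 1)."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# The loss explodes as the probability assigned to the true class goes to 0:
print(cross_entropy(np.array([0, 1, 0]), np.array([0.2, 0.7, 0.1])))    # ~0.357
print(cross_entropy(np.array([0, 1, 0]), np.array([0.7, 0.01, 0.29])))  # ~4.6
print(binary_cross_entropy(1, 0.7))                                     # ~0.357
```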
Hinge Loss
For binary classification with labels $y \in \{-1, +1\}$ and raw prediction $f(x)$:
$$\mathcal{L}_{hinge} = \max(0,\, 1 - y\, f(x))$$
The loss is zero when $y\, f(x) \geq 1$ (correct prediction with margin at least 1). This is the loss used by support vector machines.
Hinge loss does not require probability outputs and is not differentiable at $y\, f(x) = 1$. In practice, subgradient methods handle the non-differentiability.
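A short sketch under the same conventions ($y \in \{-1, +1\}$, raw score $f(x)$); the example scores are made up for illustration:

```python
import numpy as np

def hinge_loss(y, f_x):
    """Hinge loss: y in {-1, +1}, f_x is the raw (unthresholded) score."""
    return np.maximum(0.0, 1.0 - y * f_x)

# Zero loss once the margin y*f(x) >= 1 is met; grows linearly for violations.
print(hinge_loss(+1, 2.0))   # 0.0  (correct, comfortable margin)
print(hinge_loss(+1, 0.5))   # 0.5  (correct, but inside the margin)
print(hinge_loss(+1, -1.0))  # 2.0  (wrong side of the boundary)
```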
Focal Loss
For binary classification with true class probability $p_t$ (the model's predicted probability for the true class):
$$\mathcal{L}_{focal} = -(1 - p_t)^{\gamma} \log p_t$$
where $\gamma \geq 0$ is a focusing parameter. When $\gamma = 0$, this reduces to cross-entropy.
Focal loss down-weights easy examples (where $p_t$ is high). For $\gamma = 2$, an example with $p_t = 0.9$ gets weight $(1 - 0.9)^2 = 0.01$, while an example with $p_t = 0.5$ gets weight $(1 - 0.5)^2 = 0.25$. This concentrates learning on hard examples, which is critical for class-imbalanced problems like object detection where 99%+ of candidates are background.
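A short sketch of the focal weighting, assuming the $p_t$ formulation above (the helper name and epsilon are illustrative):

```python
import numpy as np

def focal_loss(p_t, gamma=2.0, eps=1e-12):
    """Focal loss on the probability of the true class p_t.
    gamma = 0 recovers plain cross-entropy."""
    p_t = np.clip(p_t, eps, 1.0)
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

# Easy example (p_t = 0.9) vs. hard example (p_t = 0.5) with gamma = 2:
print(focal_loss(0.9))           # ~0.001  (down-weighted by (1 - 0.9)^2 = 0.01)
print(focal_loss(0.5))           # ~0.173  (down-weighted by only 0.25)
print(focal_loss(0.9, gamma=0))  # ~0.105  (plain cross-entropy, -log 0.9)
```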
Regression Losses
Mean Squared Error
For a regression problem with prediction $\hat{y}$ and target $y$, the squared error is $(y - \hat{y})^2$; averaged over a dataset of $n$ examples:
$$\mathcal{L}_{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Minimizing MSE is equivalent to maximum likelihood estimation under a Gaussian noise model: $y = f(x) + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2)$ with fixed variance.
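As a sanity check on the Gaussian connection, a small sketch (hypothetical helper names, fixed $\sigma = 1$) showing that the Gaussian negative log-likelihood is MSE scaled and shifted by constants, so both are minimized by the same predictions:

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def gaussian_nll(y, y_hat, sigma=1.0):
    """Average negative log-likelihood of y under N(y_hat, sigma^2)."""
    return np.mean(0.5 * ((y - y_hat) / sigma) ** 2
                   + 0.5 * np.log(2 * np.pi * sigma ** 2))

# For fixed sigma, the Gaussian NLL equals MSE / (2 * sigma^2) plus a constant.
y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.1, 1.8, 3.5])
print(mse(y, y_hat))          # 0.1
print(gaussian_nll(y, y_hat)) # 0.5 * 0.1 + 0.5 * log(2*pi) ~= 0.969
```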
Huber Loss
For a threshold $\delta > 0$:
$$\mathcal{L}_\delta(r) = \begin{cases} \frac{1}{2} r^2 & \text{if } |r| \leq \delta \\ \delta\left(|r| - \frac{1}{2}\delta\right) & \text{if } |r| > \delta \end{cases}$$
where $r = y - \hat{y}$ is the residual. Huber loss is quadratic for small errors and linear for large errors.
Huber loss combines the benefits of MSE (smooth, efficient for Gaussian errors) and MAE (robust to outliers). The parameter $\delta$ controls the transition. When $\delta$ is large, Huber approaches MSE. When $\delta$ is small, it approaches MAE.
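A sketch of the piecewise definition above, evaluated on a few residuals to show the linear tail (the specific $\delta$ and residual values are illustrative):

```python
import numpy as np

def huber_loss(r, delta=1.0):
    """Huber loss on a residual r = y - y_hat: quadratic for |r| <= delta, linear beyond."""
    abs_r = np.abs(r)
    return np.where(abs_r <= delta,
                    0.5 * r ** 2,
                    delta * (abs_r - 0.5 * delta))

residuals = np.array([0.5, 2.0, 10.0])
print(huber_loss(residuals, delta=1.0))  # [0.125, 1.5, 9.5]
print(0.5 * residuals ** 2)              # squared error 0.5*r^2: [0.125, 2.0, 50.0] -- the outlier dominates
```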
Divergence-Based Losses
KL Divergence Loss
The Kullback-Leibler divergence between a target distribution $P$ and an approximating distribution $Q$ is:
$$D_{KL}(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$
KL divergence is non-negative ($D_{KL}(P \,\|\, Q) \geq 0$ by Gibbs' inequality) and equals zero if and only if $P = Q$. It is not symmetric: $D_{KL}(P \,\|\, Q) \neq D_{KL}(Q \,\|\, P)$ in general.
KL divergence is used in knowledge distillation (matching a student's output distribution to a teacher's), variational autoencoders (regularizing the latent distribution toward a prior), and reinforcement learning from human feedback (penalizing deviation from a reference policy).
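A minimal discrete-case sketch (the clipping epsilon and example distributions are illustrative) that also shows the asymmetry:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) for discrete distributions given as probability vectors."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])  # e.g. a teacher's soft labels
q = np.array([0.5, 0.3, 0.2])  # a student's predictions
print(kl_divergence(p, q))     # ~0.085
print(kl_divergence(q, p))     # ~0.092 -- not equal: KL is asymmetric
print(kl_divergence(p, p))     # 0.0    -- zero iff P = Q
```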
Contrastive Loss
For a pair of examples with embeddings $z_1, z_2$ and label $y \in \{0, 1\}$ indicating whether they are similar ($y = 1$) or dissimilar ($y = 0$):
$$\mathcal{L} = y\, d^2 + (1 - y)\, \max(0,\, m - d)^2$$
where $d = d(z_1, z_2)$ is a distance function and $m > 0$ is a margin. Similar pairs are pulled together; dissimilar pairs are pushed apart until they are at least distance $m$ apart.
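A sketch of the pairwise form above on toy 2-D embeddings (margin value and points chosen for illustration; some formulations add a factor of one half):

```python
import numpy as np

def contrastive_loss(z1, z2, similar, margin=1.0):
    """Pairwise contrastive loss (Hadsell et al. style).
    similar = 1 pulls the pair together; similar = 0 pushes it apart to at least `margin`."""
    d = np.linalg.norm(z1 - z2)  # Euclidean distance between embeddings
    return similar * d ** 2 + (1 - similar) * max(0.0, margin - d) ** 2

a = np.array([0.0, 0.0])
b = np.array([0.3, 0.4])  # distance 0.5 from a
print(contrastive_loss(a, b, similar=1))  # 0.25 -- similar pair still 0.5 apart
print(contrastive_loss(a, b, similar=0))  # 0.25 -- dissimilar pair inside the margin
c = np.array([3.0, 4.0])  # distance 5 from a
print(contrastive_loss(a, c, similar=0))  # 0.0  -- already farther than the margin
```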
Main Theorems
Cross-Entropy Minimization Equals Maximum Likelihood
Statement
For a model parameterized by $\theta$ that outputs class probabilities $p_\theta(y \mid x)$, minimizing the average cross-entropy loss on a dataset $\{(x_i, y_i)\}_{i=1}^{n}$ is equivalent to maximizing the log-likelihood:
$$\arg\min_\theta \; \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}_{CE}\!\big(y_i,\, p_\theta(\cdot \mid x_i)\big) \;=\; \arg\max_\theta \; \sum_{i=1}^{n} \log p_\theta(y_i \mid x_i)$$
Intuition
Cross-entropy loss for a one-hot target is just $-\log p_\theta(y_i \mid x_i)$: the negative log-probability of the true class. Summing over examples gives the negative log-likelihood. Minimizing the negative is maximizing the positive.
Proof Sketch
Expand the cross-entropy: $\mathcal{L}_{CE} = -\sum_k y_k \log \hat{y}_k$. For a one-hot label where $y_c = 1$ and $y_k = 0$ for $k \neq c$, this simplifies to $-\log \hat{y}_c$. Sum over the dataset and negate.
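A quick numeric check of the equivalence on random softmax outputs (the names and toy data are illustrative):

```python
import numpy as np

# Average cross-entropy against one-hot labels equals the negative average log-likelihood.
rng = np.random.default_rng(0)
n, k = 5, 3
logits = rng.normal(size=(n, k))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # row-wise softmax
labels = rng.integers(0, k, size=n)
onehot = np.eye(k)[labels]

avg_cross_entropy = -np.mean(np.sum(onehot * np.log(probs), axis=1))
avg_neg_log_lik = -np.mean(np.log(probs[np.arange(n), labels]))
print(np.isclose(avg_cross_entropy, avg_neg_log_lik))  # True
```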
Why It Matters
This equivalence connects two perspectives: the information-theoretic view (cross-entropy measures how many extra bits your model needs) and the statistical view (maximum likelihood is the optimal estimator under regularity conditions). It justifies using cross-entropy as the default classification loss.
Failure Mode
The equivalence holds only when the model outputs valid probability distributions (non-negative, sum to 1). If the model is miscalibrated (probabilities do not reflect true frequencies), cross-entropy still works for discrimination but the probabilistic interpretation breaks down.
Huber Loss Bounded Influence Function
Statement
The influence function of the Huber loss estimator is bounded: for any observation with residual $r$,
$$|\psi_\delta(r)| \leq \delta$$
where $\psi_\delta(r) = \partial \mathcal{L}_\delta(r) / \partial r$. In contrast, the influence function of MSE is unbounded: $\psi_{MSE}(r) \propto r$, which grows without limit.
Intuition
MSE's gradient is proportional to the residual, so a single outlier with residual 1000 exerts 1000x more influence than a typical point. Huber's gradient is capped at $\delta$, so no single point can dominate the gradient regardless of how far it is from the prediction.
Proof Sketch
The gradient of Huber loss with respect to the residual is $r$ for $|r| \leq \delta$ and $\delta \cdot \mathrm{sign}(r)$ for $|r| > \delta$. The maximum absolute value is $\delta$, achieved for all $|r| \geq \delta$.
Why It Matters
In real datasets, outliers are common (mislabeled examples, sensor errors, data entry mistakes). Huber loss provides a principled way to limit their influence without requiring explicit outlier removal. The parameter $\delta$ controls the tradeoff: smaller $\delta$ means more robustness but less statistical efficiency under Gaussian noise.
Failure Mode
Huber loss is robust to outliers in the target $y$, not in the input $x$. A leverage point (outlier in input space) can still distort the fit. For robustness to both, you need methods from robust regression (e.g., M-estimators with bounded leverage).
Gradient Expressions
For gradient descent and backpropagation, what matters is the gradient of the loss with respect to the prediction (or raw logit before the activation).
Cross-entropy (binary, sigmoid output): Let $z$ be the pre-sigmoid logit and $\hat{y} = \sigma(z)$. Then:
$$\frac{\partial \mathcal{L}_{BCE}}{\partial z} = \hat{y} - y$$
The gradient is simply the prediction error. This clean form is why cross-entropy pairs naturally with sigmoid: the sigmoid's derivative cancels against the log's derivative.
MSE:
$$\frac{\partial \mathcal{L}}{\partial \hat{y}} = 2(\hat{y} - y)$$
Gradient is proportional to the residual; large residuals dominate.
Huber loss:
$$\frac{\partial \mathcal{L}_\delta}{\partial \hat{y}} = \begin{cases} \hat{y} - y & \text{if } |y - \hat{y}| \leq \delta \\ \delta \cdot \mathrm{sign}(\hat{y} - y) & \text{otherwise} \end{cases}$$
Gradient is capped at $\delta$ in magnitude, so outliers exert bounded influence on the update.
MAE:
$$\frac{\partial \mathcal{L}}{\partial \hat{y}} = \mathrm{sign}(\hat{y} - y)$$
Gradient is constant magnitude regardless of residual size. This makes MAE robust but can cause slow convergence near the optimum (no gradient decay).
Hinge loss:
$$\frac{\partial \mathcal{L}_{hinge}}{\partial f(x)} = \begin{cases} -y & \text{if } y\, f(x) < 1 \\ 0 & \text{if } y\, f(x) \geq 1 \end{cases}$$
Zero gradient when the margin condition is satisfied; the loss only responds to margin violations.
KL divergence (as $Q$ is being optimized to match $P$):
$$\frac{\partial D_{KL}(P \,\|\, Q)}{\partial Q(x)} = -\frac{P(x)}{Q(x)}$$
This is why minimizing the forward KL divergence is mass-covering (mean-seeking): the gradient blows up when $Q(x) \to 0$ in a region where $P(x) > 0$, forcing $Q$ to cover all modes of $P$.
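A short sketch tying these expressions together: a finite-difference check of the sigmoid/cross-entropy cancellation, and a comparison of how the MSE, Huber, and MAE gradients respond to an outlier (all values are illustrative):

```python
import numpy as np

# 1) Finite-difference check that d(BCE)/dz = sigmoid(z) - y for the pre-sigmoid logit z.
def bce_from_logit(z, y):
    y_hat = 1.0 / (1.0 + np.exp(-z))
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

z, y, h = 0.8, 1.0, 1e-6
numeric = (bce_from_logit(z + h, y) - bce_from_logit(z - h, y)) / (2 * h)
analytic = 1.0 / (1.0 + np.exp(-z)) - y
print(round(numeric, 4), round(analytic, 4))  # both ~= -0.31

# 2) Per-example gradients with respect to the prediction y_hat,
#    written in terms of the residual r = y - y_hat.
def grad_mse(r):              return -2.0 * r
def grad_mae(r):              return -np.sign(r)
def grad_huber(r, delta=1.0): return -np.clip(r, -delta, delta)

for r in [0.1, 1.0, 100.0]:  # typical residuals vs. an extreme outlier
    print(f"r={r:6}: MSE {grad_mse(r):8.1f}  Huber {grad_huber(r):5.1f}  MAE {grad_mae(r):4.1f}")
# MSE's update is 1000x larger for the outlier; Huber and MAE cap its influence.
```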
Loss and gradient side by side
Every row of the table below collapses two shapes into one name. The loss curve says how much a residual or margin costs; the gradient curve says what the optimizer will do about it. Comparing the two makes the Huber interpolation concrete: quadratic like MSE near zero, linear like MAE in the tails.
Loss Function Comparison Table
| Loss | Convex? | Smooth? | Robust to outliers? | Probabilistic | Typical use |
|---|---|---|---|---|---|
| Cross-entropy | Yes | Yes | No | Yes | Classification |
| Binary cross-entropy | Yes | Yes | No | Yes | Binary classification |
| MSE | Yes | Yes | No | Yes (Gaussian) | Regression, linear models |
| MAE | Yes | No (at 0) | Yes | Yes (Laplace) | Regression, robust fitting |
| Huber | Yes | Yes | Yes (bounded) | Approximate | Regression with outliers |
| Hinge | Yes | No (at 1) | Partially | No | SVM, max-margin classifiers |
| Focal | No (in general) | Yes | Partially | Yes | Imbalanced classification |
| KL divergence | Yes (in $Q$) | Yes | No | Yes | Distillation, VAEs, RLHF |
| Contrastive | No | Partially | Partially | No | Metric learning |
Notes on the table:
- "Smooth" means differentiable everywhere. MAE and hinge have kink points requiring subgradients.
- "Robust to outliers" means the influence function is bounded. MSE and cross-entropy have unbounded influence.
- Focal loss is convex in the prediction logit for some $\gamma$ values but not all; treat as non-convex in general.
- "Probabilistic" means the loss corresponds to a negative log-likelihood for some distribution.
When-to-Use Decision Guide
Classification:
- Start with cross-entropy. It is the MLE loss for the categorical distribution and pairs cleanly with softmax.
- Switch to focal loss when positive examples are fewer than 1% of the data (object detection, rare event detection).
- Use hinge loss only when you specifically want maximum-margin geometry (SVM-based models or structured prediction).
Regression:
- Start with MSE when you expect Gaussian noise and have no clear outliers.
- Switch to Huber when residuals have heavy tails or when mislabeled examples are present. Tune $\delta$ to the scale of typical residuals.
- Use MAE when the median (not mean) is the target quantity, or when the noise is Laplace-distributed.
Distribution matching / generative modeling:
- Use KL divergence for knowledge distillation (forward KL: mean-seeking, covers all modes of the teacher) or for RLHF penalty terms.
- Use reverse KL in variational inference (reverse KL: mode-seeking, latches onto one mode of the posterior).
Metric / representation learning:
- Use contrastive loss or triplet loss for embedding-space problems where class boundaries are not pre-defined.
- InfoNCE (used in CLIP, SimCLR) is a contrastive loss that has been shown to maximize a lower bound on mutual information; a minimal sketch follows this list.
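A minimal NumPy sketch of the batch-wise InfoNCE objective mentioned above, where row $i$ of one view is the positive for row $i$ of the other and every other row serves as a negative (function name, temperature, and toy data are illustrative):

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE over a batch of positive pairs (row i of z_a matches row i of z_b)."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature           # (N, N) cosine-similarity logits
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # true match for row i is column i

rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 16))
positives = anchors + 0.05 * rng.normal(size=(8, 16))  # slightly perturbed views
print(info_nce(anchors, positives))  # low loss: each anchor is closest to its own positive
```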
Why Loss Choice Matters More Than Architecture
For a fixed architecture, the loss function determines what the model optimizes. Concrete examples:
- Object detection with cross-entropy treats all misclassifications equally. With focal loss, the model focuses on hard negatives and achieves significantly higher mAP.
- Regression with MSE on heavy-tailed data produces estimates pulled toward outliers. Switching to Huber or MAE can reduce test error by 20%+ without changing the model.
- Knowledge distillation with hard labels (cross-entropy on argmax) loses information. Soft labels with KL divergence preserve the teacher's inter-class relationships.
Common Confusions
Cross-entropy and log loss are the same thing
In the binary case, cross-entropy loss and log loss (logistic loss) are identical: $-[y \log \hat{y} + (1 - y)\log(1 - \hat{y})]$. In the multi-class case, "log loss" typically refers to the same formula as multi-class cross-entropy. The terms are interchangeable.
KL divergence is not a distance
$D_{KL}(P \,\|\, Q) \neq D_{KL}(Q \,\|\, P)$, and KL divergence does not satisfy the triangle inequality. It is a divergence, not a metric. The direction matters: the reverse KL $D_{KL}(Q \,\|\, P)$ penalizes places where $Q(x) > 0$ but $P(x) \approx 0$ (mode-seeking when optimizing $Q$), while the forward KL $D_{KL}(P \,\|\, Q)$ does the reverse (mean-seeking).
Hinge loss does not produce probability estimates
Unlike cross-entropy, hinge loss does not require or produce probability outputs. An SVM's raw output is a signed distance from the decision boundary, not a probability. To get probabilities from an SVM, you need Platt scaling as a post-processing step.
Summary
- Cross-entropy = negative log-likelihood for classification; the default choice
- MSE assumes Gaussian noise; use Huber or MAE when outliers are present
- Focal loss addresses class imbalance by down-weighting easy examples
- Hinge loss creates maximum-margin classifiers (SVMs)
- KL divergence measures distributional mismatch; critical for distillation and VAEs
- Contrastive loss learns representations by comparing pairs
- The choice of loss encodes assumptions about noise, class balance, and error costs
Exercises
Problem
Compute the cross-entropy loss for a 3-class problem where the true label is class 2 (zero-indexed) and the model predicts a probability vector $\hat{\mathbf{y}} = (\hat{y}_0, \hat{y}_1, \hat{y}_2)$. Show that the loss reduces to $-\log \hat{y}_2$, then evaluate it for a concrete $\hat{\mathbf{y}}$ of your choosing.
Problem
For Huber loss with a threshold $\delta$ of your choosing, compute the loss for one residual with $|r| \leq \delta$, one slightly above $\delta$, and one far above $\delta$. Compare with MSE for the same residuals.
Problem
Show that focal loss with $\gamma = 0$ reduces to cross-entropy, and explain why increasing $\gamma$ concentrates the loss on hard examples. Compute the ratio of focal loss at $p_t = 0.9$ to focal loss at $p_t = 0.5$ for $\gamma = 0$ and $\gamma = 2$.
References
Canonical:
- Bishop, Pattern Recognition and Machine Learning (2006), Chapter 4.3: cross-entropy, logistic regression, softmax; Chapter 7.1: SVM and hinge loss derivation
- Huber, "Robust Estimation of a Location Parameter" (1964), Annals of Mathematical Statistics: original Huber loss paper and the bounded-influence philosophy
- Hastie, Tibshirani & Friedman, Elements of Statistical Learning (2009), Chapter 10: loss functions for boosting; Chapter 11: MSE, MAE, and their robustness properties
Current:
- Lin et al., "Focal Loss for Dense Object Detection" (2017), ICCV. arXiv:1708.02002: introduces focal loss and the class-imbalance argument
- Khosla et al., "Supervised Contrastive Learning" (2020), NeurIPS. arXiv:2004.11362: extends contrastive loss to supervised setting
- Hadsell, Chopra & LeCun, "Dimensionality Reduction by Learning an Invariant Mapping" (2006), CVPR: original contrastive loss formulation
- Kullback & Leibler, "On Information and Sufficiency" (1951): original KL divergence paper and its connection to maximum likelihood
Last reviewed: April 14, 2026