Training Techniques

Curriculum Learning

Train on easy examples first, gradually increase difficulty. Curriculum learning can speed convergence and improve generalization, but defining difficulty is the hard part. This page covers self-paced learning, anti-curriculum, and the connection to importance sampling.

Core · Tier 3 · Stable · Supporting · ~35 min

Why This Matters

Humans learn arithmetic before calculus. We learn letters before words, words before sentences. The order in which examples are presented affects learning speed and final performance.

Curriculum learning applies this principle to machine learning: present training examples in order of increasing difficulty. Bengio et al. (2009) showed that this can both speed up convergence and lead to better local optima. The idea is simple. The difficulty is defining "easy" and implementing the schedule.

Formal Setup

Definition

Curriculum

A curriculum is a sequence of distributions $D_1, D_2, \ldots, D_T$ over training examples, where the support of $D_t$ is typically a subset of the full training set, and difficulty increases with $t$. The final distribution $D_T$ is the uniform distribution over all training examples.

Formally, let $w(x, t) \geq 0$ be a weighting function over examples $x$ at step $t$. The curriculum defines a weighted empirical risk:

$$\hat{R}_t(h) = \frac{1}{n} \sum_{i=1}^{n} w(x_i, t) \cdot \ell(h(x_i), y_i)$$

A curriculum schedules $w(x_i, t)$ so that easy examples have higher weight early in training.
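As a concrete sketch of this weighted risk, the following computes $\hat{R}_t$ under a hypothetical schedule that admits examples at or below the $t/T$ difficulty quantile (the schedule and all numbers are invented for illustration):

```python
import numpy as np

def weighted_risk(losses, weights):
    """Weighted empirical risk: (1/n) * sum_i w(x_i, t) * loss_i."""
    losses = np.asarray(losses, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.mean(weights * losses))

def curriculum_weights(difficulty, t, T):
    """Hypothetical schedule: at step t, weight 1 for examples whose
    difficulty is at or below the t/T quantile, 0 otherwise.
    At t = T every example has weight 1 (the uniform final distribution)."""
    difficulty = np.asarray(difficulty, dtype=float)
    threshold = np.quantile(difficulty, t / T)
    return (difficulty <= threshold).astype(float)

losses = np.array([0.2, 1.5, 0.4, 2.0])
difficulty = np.array([0.1, 0.9, 0.3, 1.0])

early = curriculum_weights(difficulty, t=2, T=4)  # easier half only
late = curriculum_weights(difficulty, t=4, T=4)   # all examples
```

Early in training the risk is dominated by easy examples; by $t = T$ it reduces to the ordinary empirical risk.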

Definition

Difficulty Score

A difficulty score $d(x_i)$ assigns a scalar to each training example measuring how hard it is to learn. Common choices:

  1. Loss-based: $d(x_i) = \ell(h_0(x_i), y_i)$ where $h_0$ is an initial or pretrained model
  2. Confidence-based: $d(x_i) = 1 - p(y_i \mid x_i; h_0)$
  3. Human-defined: annotator agreement, label noise estimates
  4. Data complexity: input length, number of objects in an image
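The first two scores can be computed directly from a reference model's predictions. A minimal sketch, assuming $h_0$ outputs a probability per class (all numbers here are invented):

```python
import numpy as np

def difficulty_scores(probs, labels):
    """Loss-based and confidence-based difficulty from a reference model h_0.

    probs:  (n, k) predicted class probabilities from h_0
    labels: (n,) integer class labels
    """
    p_true = probs[np.arange(len(labels)), labels]  # p(y_i | x_i; h_0)
    loss_based = -np.log(p_true)                    # cross-entropy under h_0
    confidence_based = 1.0 - p_true                 # 1 - p(y_i | x_i; h_0)
    return loss_based, confidence_based

probs = np.array([[0.9, 0.1],
                  [0.4, 0.6],
                  [0.2, 0.8]])
labels = np.array([0, 0, 1])
loss_d, conf_d = difficulty_scores(probs, labels)
order = np.argsort(conf_d)  # easy-to-hard ordering for a curriculum
```

Both scores rank the second example (true-class probability 0.4) as hardest; they always agree on the ordering because $-\log p$ is monotone in $1 - p$.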

Why Curricula Can Help

Heuristic: Curriculum as Continuation Method

Bengio et al. (2009) sketch an informal analogy between curriculum learning and continuation (homotopy) methods for non-convex optimization. This is a heuristic argument, not a theorem: no assumption-to-conclusion guarantee is proved.

The analogy: consider a sequence of objectives $L_1, L_2, \ldots, L_T = L$ where $L_t$ is a smoothed version of the full loss using only easy examples. Define $L_\lambda = (1-\lambda) L_{\text{easy}} + \lambda L_{\text{full}}$ and increase $\lambda$ from 0 to 1. If each $L_t$ has a simpler landscape than $L_{t+1}$ and the minimizer of $L_t$ lies in the basin of attraction of a good minimizer of $L_{t+1}$, tracking the trajectory of minimizers $w^*(\lambda)$ may land in a better region of $L$ than random initialization.

Two things to note. First, the continuous-trajectory claim is only suggestive: curricula in practice use discrete weight changes, stochastic optimization, and non-smooth losses, so homotopy continuation is a picture rather than a proof. Second, the assumption that easy-example basins contain full-loss optima is not generally true. If easy and hard examples require qualitatively different features, the curriculum can guide the model to a minimum that is good for easy examples but poor overall. Empirical results on curricula vary across tasks; treat the continuation-method framing as motivation, not as an existence result.
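To make the picture concrete, here is a toy 1-D continuation run (both loss functions are invented for illustration, not taken from Bengio et al.): anneal $\lambda$ from 0 to 1 and warm-start each stage's gradient descent from the previous stage's minimizer.

```python
import numpy as np

def blended_grad(w, lam, grad_easy, grad_full):
    """Gradient of L_lambda = (1 - lambda) * L_easy + lambda * L_full."""
    return (1.0 - lam) * grad_easy(w) + lam * grad_full(w)

# Toy losses: the "easy" loss w^2 is a smooth bowl; the "full" loss
# (w - 1)^2 - cos(3w) adds wiggles and hence spurious local minima.
grad_easy = lambda w: 2.0 * w
grad_full = lambda w: 2.0 * (w - 1.0) + 3.0 * np.sin(3.0 * w)

w = 0.0
for lam in np.linspace(0.0, 1.0, 11):  # anneal lambda in 11 stages
    for _ in range(200):               # gradient descent, warm-started
        w -= 0.05 * blended_grad(w, lam, grad_easy, grad_full)
# w now sits at a stationary point of the full loss, reached by tracking
# the minimizer trajectory out of the easy loss's single basin.
```

This is exactly the caveat in the text: the run converges to whichever full-loss minimum lies on the path from the easy-loss basin, which need not be the global one.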

Self-Paced Learning

Kumar, Packer, and Koller (2010): instead of fixing the curriculum in advance, let the model decide what is easy. At each step, include examples that the current model can handle (low loss) and exclude examples it cannot.

Proposition

Self-Paced Learning as Joint Optimization

Statement

Kumar, Packer, Koller (2010) define self-paced learning (binary variant) as

$$\min_{w,\, v \in \{0,1\}^n} \sum_{i=1}^{n} v_i \ell(h_w(x_i), y_i) - \lambda \sum_{i=1}^{n} v_i,$$

where $v_i \in \{0,1\}$ are hard selection indicators and $\lambda > 0$ is a pace parameter. For fixed $w$, the closed form is $v_i^* = \mathbb{1}[\ell_i \leq \lambda]$. Examples with loss at or below $\lambda$ are included; the rest are excluded. Increasing $\lambda$ over training admits harder examples.

Jiang et al. (2014, 2015) propose soft variants that relax $v_i \in [0,1]$ and replace the linear reward $-\lambda v_i$ with a regularizer $f(v_i; \lambda)$. Two common choices:

  • Linear soft regularizer $f(v_i; \lambda) = \tfrac{1}{2}\lambda(v_i^2 - 2v_i)$ gives $v_i^* = \max(0,\, 1 - \ell_i / \lambda)$.
  • Logarithmic regularizer $f(v_i; \lambda, \zeta) = \zeta v_i - \zeta^{v_i}/\log\zeta$ (with $\zeta \in (0,1)$) yields a smoother weight.

Soft variants interpolate between full inclusion and exclusion and track per-example confidence rather than a hard threshold.

Intuition

The $-\lambda v_i$ term rewards inclusion (without it, $v_i = 0$ trivially minimizes the objective). In the binary case this yields a hard threshold at $\lambda$. In the soft case the optimal weight decreases smoothly from 1 at $\ell_i = 0$ to 0 at $\ell_i = \lambda$, giving graded confidence.

Proof Sketch

Binary case: for fixed $w$, the objective decomposes across $i$ as $v_i(\ell_i - \lambda)$. On $v_i \in \{0,1\}$, the minimum is $v_i = 1$ when $\ell_i \leq \lambda$ and $v_i = 0$ otherwise.

Linear soft case: for fixed $w$, minimize $v_i \ell_i + \tfrac{1}{2}\lambda(v_i^2 - 2v_i)$ over $v_i \in [0,1]$. The first-order condition gives $v_i = 1 - \ell_i / \lambda$, clipped to $[0,1]$.
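Both closed forms are one line of array code. A sketch, with made-up losses and pace parameter:

```python
import numpy as np

def spl_binary(losses, lam):
    """Binary self-paced weights: v_i = 1 iff loss_i <= lambda."""
    return (np.asarray(losses) <= lam).astype(float)

def spl_linear_soft(losses, lam):
    """Linear soft regularizer: v_i = clip(1 - loss_i / lambda, 0, 1)."""
    return np.clip(1.0 - np.asarray(losses) / lam, 0.0, 1.0)

losses = np.array([0.05, 0.9, 0.25, 0.7])
v_hard = spl_binary(losses, lam=0.4)       # [1, 0, 1, 0]
v_soft = spl_linear_soft(losses, lam=0.4)  # [0.875, 0, 0.375, 0]
```

Raising $\lambda$ over training moves the hard threshold (or the soft ramp) to admit harder examples.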

Why It Matters

Self-paced learning removes the need for external difficulty labels. The model's own loss serves as the difficulty measure. This makes curriculum learning applicable even when human difficulty labels are unavailable.

Failure Mode

Self-paced learning can ignore hard but important examples indefinitely if $\lambda$ increases too slowly. It can also create a feedback loop: the model never trains on hard examples, so they always have high loss, so they are never included.

Anti-Curriculum: Hard Examples First

Bengio et al. (2009) found curricula generally help, but subsequent work showed exceptions. Presenting hard or diverse examples first can sometimes work better, especially when:

  1. Hard examples contain the most information about decision boundaries.
  2. Easy examples are redundant (the model learns them quickly regardless).
  3. The difficulty ordering is inaccurate.

This connects to importance sampling: upweighting examples with high loss (hard examples) reduces the variance of the gradient estimator. The optimal importance sampling distribution is proportional to the per-example gradient norm, which correlates with difficulty.

Connection to Importance Sampling

The gradient of the empirical risk is $\nabla L = \frac{1}{n} \sum_i \nabla \ell_i$. If we sample example $i$ with probability $p_i \propto \|\nabla \ell_i\|$ instead of uniformly, and reweight each sampled gradient by $1/(n p_i)$ to keep the estimator unbiased, the variance of the stochastic gradient is minimized.

This is the opposite of curriculum learning: importance sampling upweights hard examples (high gradient norm) while curriculum learning downweights them. The resolution is that they solve different problems. Curriculum learning addresses non-convex optimization (finding good basins), while importance sampling addresses stochastic gradient variance (faster convergence within a basin).
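A sketch of the variance-reduction side with synthetic per-example gradients (the data is invented; the $1/(n p_i)$ reweighting is what keeps the estimator unbiased):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic per-example gradients: 100 examples, 5 parameters, with
# heavy-tailed norms so uniform sampling would be high-variance.
grads = rng.normal(size=(100, 5)) * rng.exponential(size=(100, 1))
full_grad = grads.mean(axis=0)

# Importance distribution proportional to per-example gradient norm.
norms = np.linalg.norm(grads, axis=1)
p = norms / norms.sum()

def importance_sampled_grad(batch_size):
    """Unbiased estimate of full_grad: sample i ~ p, reweight by 1/(n p_i)."""
    idx = rng.choice(len(grads), size=batch_size, p=p)
    w = 1.0 / (len(grads) * p[idx])
    return (w[:, None] * grads[idx]).mean(axis=0)

# Averaging many estimates recovers the full gradient (unbiasedness check).
avg = np.mean([importance_sampled_grad(8) for _ in range(5000)], axis=0)
```

Without the reweighting, sampling hard examples more often would bias the gradient toward them, which is closer in spirit to an anti-curriculum than to variance reduction.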

Common Confusions

Watch Out

Curriculum learning is not the same as data augmentation

Data augmentation adds modified copies of examples. Curriculum learning changes the order or weighting of existing examples. They are complementary and can be combined.

Watch Out

Shuffling is not a curriculum

Standard training shuffles data randomly each epoch. A curriculum is a deliberate ordering. Random shuffling is the control against which curriculum benefits are measured.

Modern Directions

RHO-Loss (Mindermann et al. 2022). Prioritizes points that are learnable, worth learning, and not yet learnt. The selection score uses the reducible holdout loss: training loss minus a held-out irreducible loss estimate. This skips noisy and already-learned points and outperforms uniform sampling on large web-scale datasets.
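A minimal sketch of the selection score, assuming we already have per-example training losses and an irreducible-loss estimate from a holdout-trained model (all numbers invented):

```python
import numpy as np

def rho_loss_scores(train_losses, irreducible_losses):
    """Reducible holdout loss: current training loss minus the loss of a
    model trained on held-out data. High score = learnable, worth
    learning, and not yet learnt."""
    return np.asarray(train_losses) - np.asarray(irreducible_losses)

# Example 0: already learnt (low train loss).
# Example 1: noisy (high loss even for the holdout model).
# Example 2: learnable and not yet learnt (high train, low irreducible).
train = np.array([0.1, 2.0, 1.8])
irreducible = np.array([0.05, 1.9, 0.2])
selected = int(np.argmax(rho_loss_scores(train, irreducible)))  # picks 2
```

Subtracting the irreducible term is what distinguishes this from plain hard-example mining: the noisy example 1 has high loss but almost no reducible loss, so it is skipped.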

Automated curricula (Graves et al. 2017). Treat task or data-slice selection as a multi-armed bandit where the reward is learning progress (loss decrease or complexity gain). A policy over curriculum tasks is learned online, avoiding a fixed hand-tuned schedule.

Data Shapley (Ghorbani and Zou 2019). Assigns each training point a Shapley value that measures its marginal contribution to model performance. Negative-value points are noisy or harmful; removing them acts as an offline curriculum. Cost is exponential in the number of points without approximation.

LLM pretraining curricula. Modern pretraining uses data-mix curricula rather than per-example ordering. Source weights, quality filters, and annealing schedules shape what the model sees. DeepSeek-V3 details multi-stage data mixes; Phi models (Gunasekar et al. 2023) train on small synthetic "textbook" data; DataComp-LM (Li et al. 2024) benchmarks curation strategies. These treat the full pretraining corpus as the unit of curriculum design.

Focal loss and hard-example mining. Focal loss (Lin et al. 2017) uses $FL(p_t) = -(1 - p_t)^\gamma \log p_t$ to down-weight easy, high-confidence examples, which is an inverse-curriculum principle: train more on hard points, not easy ones. Hard negative mining in contrastive learning (Robinson et al. 2021) similarly reweights toward hard negatives. The inverse-curriculum intuition aligns with the importance-sampling view above.
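A quick numerical check of the down-weighting (the probabilities are chosen for illustration):

```python
import numpy as np

def focal_loss(p_t, gamma=2.0):
    """Focal loss FL(p_t) = -(1 - p_t)^gamma * log(p_t).
    gamma = 0 recovers cross-entropy; larger gamma suppresses
    easy (high-confidence) examples more aggressively."""
    p_t = np.asarray(p_t, dtype=float)
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

p = np.array([0.9, 0.5])       # an easy and a hard example
ce = focal_loss(p, gamma=0.0)  # plain cross-entropy
fl = focal_loss(p, gamma=2.0)
# Cross-entropy already weights the hard example ~6.6x more than the
# easy one; gamma = 2 stretches that ratio to ~165x.
```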

Why Curriculum Learning Is Underused

Despite positive results, curriculum learning is rare in practice because:

  1. Difficulty is hard to define: loss-based difficulty changes during training, human annotations are expensive, data complexity metrics are domain-specific.
  2. Hyperparameter sensitivity: the pace schedule (how fast to add harder examples) requires tuning.
  3. Modern regularization works well: dropout, data augmentation, and large batch training often provide sufficient generalization without curricula.
  4. Conflicting evidence: results vary across tasks and architectures, with no clear consensus on when curricula help.

Exercises

ExerciseCore

Problem

In self-paced learning with pace parameter $\lambda = 0.5$, which examples are included if the per-example losses are $\ell = (0.1, 0.8, 0.3, 0.6, 0.2)$?

ExerciseAdvanced

Problem

Explain why curriculum learning and importance sampling give opposite prescriptions for example weighting. Under what conditions would you prefer each approach?

References

Canonical:

  • Bengio, Louradour, Collobert, Weston, "Curriculum Learning", ICML 2009 (Section 4 for continuation-method analogy, informal)
  • Kumar, Packer, Koller, "Self-Paced Learning for Latent Variable Models", NeurIPS 2010 (binary variant, Equation 4)
  • Jiang, Meng, Zhao, Shan, Hauptmann, "Self-Paced Learning with Diversity", NeurIPS 2014 (soft variant with diversity regularizer)
  • Jiang, Zhou, Leung, Li, Hauptmann, "Self-Paced Curriculum Learning", AAAI 2015 (general self-paced regularizer family)

Modern methods:

  • Mindermann, Brauner, Razzak, Sharma, Kirsch, Xu, Holtgen, Gomez, Morisot, Farquhar, Gal, "Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt" (RHO-Loss), ICML 2022
  • Graves, Bellemare, Menick, Munos, Kavukcuoglu, "Automated Curriculum Learning for Neural Networks", ICML 2017
  • Ghorbani, Zou, "Data Shapley: Equitable Valuation of Data for Machine Learning", ICML 2019
  • Lin, Goyal, Girshick, He, Dollar, "Focal Loss for Dense Object Detection", ICCV 2017 (inverse-curriculum via $(1-p_t)^\gamma$ weighting)
  • Robinson, Chuang, Sra, Jegelka, "Contrastive Learning with Hard Negative Samples", ICLR 2021

LLM pretraining curricula:

  • Gunasekar et al., "Textbooks Are All You Need" (Phi-1), 2023
  • Li et al., "DataComp-LM: In Search of the Next Generation of Training Sets for Language Models", 2024
  • DeepSeek-AI, "DeepSeek-V3 Technical Report", 2024 (data-mix annealing, Section 3)

Surveys:

  • Soviany, Ionescu, Rota, Sebe, "Curriculum Learning: A Survey", IJCV 2022
  • Katharopoulos, Fleuret, "Not All Samples Are Created Equal: Deep Learning with Importance Sampling", ICML 2018

Last reviewed: April 18, 2026
