ML Methods

Contrastive Learning

Learning representations by pulling positive pairs together and pushing negative pairs apart, with theoretical grounding in mutual information maximization.

Advanced · Tier 2 · Current · Supporting · ~55 min

Why This Matters

Contrastive learning is the dominant paradigm for self-supervised representation learning. It produces representations competitive with supervised pretraining while using no labels. The key models in this space (CLIP, SimCLR, MoCo) all use contrastive objectives. Understanding the theory behind contrastive learning explains why augmentation choices matter, why large batch sizes help, and what the loss function actually optimizes.

The Contrastive Setup

Given an input $x$, create a positive view $x^+$ by applying random augmentations. Sample $N-1$ other inputs $\{x_1^-, \ldots, x_{N-1}^-\}$ as negatives. Learn an encoder $f$ such that $f(x)$ and $f(x^+)$ are close while $f(x)$ and $f(x_i^-)$ are far apart.

Definition

Positive Pair

Two views $(x_i, x_j)$ form a positive pair if and only if they are derived from the same underlying data point (e.g., two augmentations of the same image). The contrastive objective trains the encoder to map positive pairs to nearby points in representation space.

Definition

InfoNCE Loss

For anchor $z$, positive $z^+$, and $N-1$ negatives $\{z_1^-, \ldots, z_{N-1}^-\}$, where $z = f(x)$:

$$\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp(\text{sim}(z, z^+)/\tau)}{\exp(\text{sim}(z, z^+)/\tau) + \sum_{j=1}^{N-1} \exp(\text{sim}(z, z_j^-)/\tau)}$$

Here $\text{sim}(\cdot,\cdot)$ is cosine similarity and $\tau > 0$ is a temperature parameter. This is a softmax cross-entropy over $N$ options where the correct answer is the positive pair.
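As a concrete sketch, the loss for a single anchor can be computed directly in NumPy. The toy embeddings and the choice $\tau = 0.1$ are illustrative, not taken from any particular paper:

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE loss for one anchor; tau = 0.1 is an arbitrary choice here."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # Logits are cosine similarities divided by temperature; positive first.
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / tau
    logits -= logits.max()  # numerical stability
    # Softmax cross-entropy with the positive (index 0) as the correct class.
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

rng = np.random.default_rng(0)
z = rng.normal(size=128)
z_pos = z + 0.1 * rng.normal(size=128)   # a nearby "view" of z
z_neg = rng.normal(size=(255, 128))      # 255 unrelated samples
loss = info_nce(z, z_pos, z_neg)
print(loss)  # small: the positive is easy to pick out of 256 candidates
```

With a random vector substituted for the true positive, the loss rises toward $\log 256$, matching the $N$-way classification reading above.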

Core Theory

Theorem

InfoNCE as Mutual Information Lower Bound

Statement

The InfoNCE loss with $N-1$ negative samples ($N$ candidates in total) satisfies:

$$I(X; X^+) \geq \log N - \mathcal{L}_{\text{InfoNCE}}$$

where $I(X; X^+)$ is the mutual information between the anchor and positive views. Minimizing InfoNCE maximizes a lower bound on mutual information.

Intuition

InfoNCE is an $N$-way classification problem: identify the positive among $N$ candidates. The quantity $\log N - \mathcal{L}_{\text{InfoNCE}}$ is a lower bound on $I(X; X^+)$, and perfect classification (loss $= 0$) certifies at least $\log N$ nats of shared information. The representation itself can encode strictly more MI than $\log N$; the bound just cannot witness it without more negatives.

Proof Sketch

Write the InfoNCE objective as a density-ratio estimation problem. The optimal critic satisfies $\text{sim}(z, z^+)/\tau \propto \log \frac{p(z^+ \mid z)}{p(z^+)}$. Apply the variational bound on mutual information from Barber and Agakov (2003). The $\log N$ term appears because the bound saturates: you cannot extract more than $\log N$ nats from an $N$-way classification.

Why It Matters

This explains two empirical observations. First, larger batch sizes (more negatives) give a tighter MI lower bound: $\log N$ caps how much MI the InfoNCE estimator can certify, not how much MI the encoder can actually capture, so small $N$ both loosens the bound and weakens the gradient signal at high MI. Second, the quality of augmentations matters because they determine what information is shared between positive pairs and thus what $I(X; X^+)$ contains.

Failure Mode

The bound becomes loose when the true MI is much larger than $\log N$. With $N = 256$, the InfoNCE estimator can only certify up to $\log 256 \approx 5.5$ nats, even though the encoder may carry far more shared information. If the task requires recovering more MI than the bound can witness, increasing $N$ tightens the bound and sharpens the gradient. Also, the bound says nothing about which information is captured; bad augmentations can make the model learn shortcuts.
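The saturation can be checked numerically. For a bivariate Gaussian pair with correlation $\rho$, the true MI is $-\tfrac{1}{2}\log(1-\rho^2)$ and the optimal critic is the exact log density ratio, so a Monte Carlo InfoNCE estimate can be computed in closed form; the values of $\rho$, the trial counts, and the candidate sizes below are toy choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.99999
true_mi = -0.5 * np.log(1 - rho**2)   # about 5.4 nats

def critic(x, y):
    # Optimal critic: log p(y|x)/p(y) for the correlated Gaussian pair.
    return (-0.5 * np.log(1 - rho**2)
            - (y - rho * x) ** 2 / (2 * (1 - rho**2))
            + y ** 2 / 2)

def infonce_mi(N, trials=1000):
    """Monte Carlo InfoNCE MI estimate with 1 positive and N-1 negatives."""
    total = 0.0
    for _ in range(trials):
        x = rng.normal()
        y_pos = rho * x + np.sqrt(1 - rho**2) * rng.normal()
        cands = np.concatenate([[y_pos], rng.normal(size=N - 1)])
        scores = critic(x, cands)
        # contribution: f(x, y+) - log( (1/N) * sum_j exp(f(x, y_j)) )
        total += scores[0] - (np.logaddexp.reduce(scores) - np.log(N))
    return total / trials

for N in (8, 64, 4096):
    print(N, infonce_mi(N), "cap:", np.log(N))  # estimate never exceeds log N
```

For small $N$ the estimate pins to the $\log N$ cap; only once $\log N$ exceeds the true MI does the estimate approach 5.4 nats.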

Proposition

Alignment and Uniformity Decomposition

Statement

The contrastive loss decomposes into two competing objectives:

Alignment: $\mathbb{E}_{(x,x^+) \sim p_{\text{pos}}} \|f(x) - f(x^+)\|^2$

Uniformity: $\log \mathbb{E}_{(x,x') \sim p_{\text{data}}^2} \, e^{-2\|f(x) - f(x')\|^2}$

Good contrastive representations minimize the alignment loss (positive pairs are close) while also minimizing the uniformity loss (representations spread evenly on the hypersphere).
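Both quantities can be computed directly from a batch of normalized embeddings (this sketch uses the Wang & Isola form with $t = 2$ in the Gaussian kernel; the toy data is illustrative):

```python
import numpy as np

def normalize(z):
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def alignment_loss(z1, z2):
    # Mean squared distance between normalized positive pairs (lower = better).
    return np.mean(np.sum((z1 - z2) ** 2, axis=1))

def uniformity_loss(z, t=2.0):
    # Log of the average Gaussian potential over all distinct pairs
    # (more negative = points spread more evenly on the hypersphere).
    sq = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    i, j = np.triu_indices(len(z), k=1)
    return np.log(np.mean(np.exp(-t * sq[i, j])))

rng = np.random.default_rng(0)
spread = normalize(rng.normal(size=(512, 8)))   # roughly uniform on the sphere
collapsed = normalize(np.ones((512, 8)) + 1e-3 * rng.normal(size=(512, 8)))

print(uniformity_loss(spread))     # strongly negative: well spread
print(uniformity_loss(collapsed))  # near 0: collapse, the worst case
```

The collapsed batch scores near the maximum (0) on uniformity even though its alignment would be perfect, which is exactly the degenerate solution the uniformity term rules out.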

Intuition

Alignment alone leads to collapse (map everything to one point). Uniformity alone leads to random representations. The contrastive loss balances both: pull positives together, spread everything else uniformly.

Proof Sketch

Decompose the InfoNCE gradient into a term pulling positive pairs together (alignment) and a term pushing random pairs apart (uniformity). On the unit sphere, the uniform distribution maximizes entropy, and the repulsive term in InfoNCE pushes toward this maximum entropy distribution.

Why It Matters

This decomposition explains why contrastive methods avoid representation collapse without explicit mechanisms like stop-gradients (used in BYOL/SimSiam). The negative samples provide the uniformity pressure that prevents collapse.

Failure Mode

If the number of negatives is too small, uniformity pressure is weak and representations can partially collapse (cluster into a few modes instead of spreading uniformly). Temperature $\tau$ also mediates this tradeoff: very small $\tau$ overweights hard negatives at the expense of uniformity.
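The temperature effect is easy to see in isolation: the relative repulsion weight InfoNCE places on each negative is a softmax of similarities over $\tau$. The similarity values below are toy numbers:

```python
import numpy as np

def negative_weights(sims, tau):
    # softmax(sim / tau): the relative gradient weight on each negative
    e = np.exp((sims - sims.max()) / tau)
    return e / e.sum()

sims = np.array([0.9, 0.5, 0.1, -0.3])   # one hard negative, three easier ones

print(negative_weights(sims, tau=1.0))   # weight spread fairly evenly
print(negative_weights(sims, tau=0.05))  # almost all weight on the hard one
```

At $\tau = 0.05$ essentially all repulsion lands on the single hardest negative, so the remaining negatives contribute little uniformity pressure.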

Key Architectures

SimCLR

SimCLR (Chen et al., 2020) applies two random augmentations to each image in a batch of size $B$, producing $2B$ views. Every other augmented view in the batch serves as a negative. The loss uses cosine similarity with temperature $\tau = 0.5$ and a 2-layer MLP projection head on top of the encoder.

Key finding: the projection head is critical during training, yet representations taken before the projection head transfer better than those after it. The head learns to discard augmentation-variant information (such as color) that downstream tasks may still need.

MoCo

MoCo (He et al., 2020) decouples the number of negatives from the batch size using a momentum-updated key encoder and a queue of past representations. The query encoder is updated by gradient descent; the key encoder tracks it via an exponential moving average: $\theta_k \leftarrow m \theta_k + (1-m) \theta_q$ with $m = 0.999$. This allows a large, consistent negative set (65,536 negatives) with normal batch sizes.
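The two MoCo-specific pieces, the momentum update and the FIFO queue, are a few lines of bookkeeping. A minimal sketch with toy sizes (MoCo itself uses a 65,536-entry queue and high-dimensional keys):

```python
import numpy as np
from collections import deque

m = 0.999                 # momentum coefficient from the paper
theta_q = np.zeros(4)     # query-encoder params (updated by the optimizer)
theta_k = np.zeros(4)     # key-encoder params (EMA copy, receives no gradients)
queue = deque(maxlen=16)  # FIFO of past key embeddings = the negative set

# Pretend one optimizer step moved the query encoder by +1 in each coordinate.
theta_q = theta_q + 1.0

# Momentum update: theta_k <- m * theta_k + (1 - m) * theta_q
theta_k = m * theta_k + (1 - m) * theta_q
print(theta_k[0])         # ~0.001: the key encoder moves only 0.1% of the way

# Each training step enqueues the batch's keys; the oldest drop off the end.
for step in range(20):
    queue.append(np.full(4, float(step)))  # stand-in for one key embedding
print(len(queue))                          # 16: bounded regardless of steps
```

The slow EMA keeps the queued keys approximately consistent with the current key encoder, which is why the stale negatives remain usable.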

CLIP: Image-Text Contrastive

CLIP (Radford et al., 2021) applies contrastive learning across modalities. Given a batch of (image, text) pairs, pull matching pairs together and push all non-matching pairs apart. The loss is symmetric InfoNCE applied to both the image-to-text and text-to-image directions. CLIP learns representations that enable zero-shot transfer via natural language prompts.
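A sketch of the symmetric loss over a batch, with random toy vectors standing in for the image and text encoder outputs ($\tau = 0.07$ is CLIP's temperature initialization; in CLIP the temperature is learned):

```python
import numpy as np

def clip_loss(img_emb, txt_emb, tau=0.07):
    """Symmetric InfoNCE: row i of each matrix is one matched (image, text) pair."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / tau              # (B, B) similarity matrix

    def xent_diag(lg):
        # Cross-entropy per row with the diagonal (matching pair) as the label.
        lg = lg - lg.max(axis=1, keepdims=True)
        logprob = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logprob))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))

rng = np.random.default_rng(0)
B, d = 32, 64
img = rng.normal(size=(B, d))
txt = img + 0.1 * rng.normal(size=(B, d))  # toy "matched" captions
matched = clip_loss(img, txt)
print(matched)  # small: every pair sits on the diagonal
```

Replacing `txt` with unrelated random vectors drives the loss toward $\log B$, since the model can do no better than chance over the batch.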

Common Confusions

Watch Out

log N caps the InfoNCE bound, not the representation

The $\log N$ ceiling lives in the bound $I \geq \log N - \mathcal{L}$, not in the representation itself. The encoder can encode more MI than $\log N$; InfoNCE just cannot certify it with $N$ candidates. In practice, beyond a certain $N$ the bound is tight enough that further gains diminish while compute cost grows quadratically (all pairwise similarities). Diminishing returns set in around $N \sim 4096$ for vision tasks.

Watch Out

Contrastive learning does not maximize all mutual information

It maximizes MI between views as filtered by the augmentation distribution. If augmentations destroy color information, the learned representation will not encode color. The augmentation policy implicitly defines what information is task-relevant.

Watch Out

The projection head is not the final representation

SimCLR trains with a projection head, but you discard it at evaluation time. The representation before the projection head generalizes better because the head absorbs augmentation-specific information that hurts transfer.

Summary

  • Contrastive learning optimizes a lower bound on mutual information between views; the bound itself saturates at $\log N$ for $N$ candidates, but the encoder can carry more MI than the bound certifies
  • The loss balances alignment (positive pairs close) and uniformity (representations spread on the hypersphere)
  • Augmentation choice determines what information is preserved in the learned representation
  • Large negative sets improve the bound; MoCo achieves this with a momentum encoder and queue
  • CLIP extends the paradigm to cross-modal (image-text) contrastive learning

Exercises

ExerciseCore

Problem

You train SimCLR with batch size $B = 128$, producing $2B = 256$ views. How many negative pairs does each anchor have? What is the upper bound on mutual information recoverable from the InfoNCE loss?

ExerciseAdvanced

Problem

Explain why the projection head in SimCLR improves downstream performance even though it is discarded. Specifically: if color jitter is used as an augmentation, what information does the projection head learn to discard, and why is that harmful for downstream tasks?

References

Canonical:

  • Oord et al., "Representation Learning with Contrastive Predictive Coding" (2018), Section 2
  • Chen et al., "A Simple Framework for Contrastive Learning" (SimCLR, 2020)

Current:

  • He et al., "Momentum Contrast for Unsupervised Visual Representation Learning" (MoCo, 2020)
  • Wang & Isola, "Understanding Contrastive Representation Learning through Alignment and Uniformity" (ICML 2020)
  • Radford et al., "Learning Transferable Visual Models From Natural Language Supervision" (CLIP, 2021)

Last reviewed: April 26, 2026
