
Numerical Optimization

Mirror Descent and Frank-Wolfe

Mirror descent generalizes gradient descent via Bregman divergences, recovering multiplicative weights and exponentiated gradient as special cases. Frank-Wolfe replaces projections with linear minimization, making it projection-free.

Advanced · Tier 2 · Stable · Supporting · ~55 min

Why This Matters

Standard gradient descent uses the Euclidean distance to measure how far you move at each step. This is natural for unconstrained optimization in $\mathbb{R}^d$, but it is often the wrong geometry for constrained problems. Optimizing over the simplex (probability distributions), positive definite matrices, or combinatorial polytopes calls for a distance measure that respects the constraint geometry.

[Infographic] Mirror descent generalizes gradient descent to constrained domains via Bregman divergences; the choice of mirror map determines the geometry of the update. The graphic shows the dual/primal mapping via a Bregman divergence, the update rule using a strongly convex mirror map, classical examples (the entropy mirror map gives multiplicative weights / exponentiated gradient on the simplex; the squared-norm mirror map recovers gradient descent), and connections to online learning, bandits, and simplex-constrained optimization.

Mirror descent replaces Euclidean distance with a Bregman divergence, adapting the algorithm to the problem geometry. Frank-Wolfe goes further: it avoids projection entirely by solving a linear subproblem over the constraint set at each step.

[Visual: Constraint Geometry Split] Mirror descent shapes distance to the constraint geometry, taking a geometry-aware interior step: the mirror map reweights distance via the Bregman divergence, bending the naive Euclidean step so the update respects the constraint set. Frank-Wolfe instead uses a boundary oracle and moves along a feasible segment.

Bregman Divergences

Definition

Bregman Divergence

Given a strictly convex, differentiable function $\phi: \mathcal{X} \to \mathbb{R}$ (the mirror map or distance-generating function), the Bregman divergence is:

$$D_\phi(x, y) = \phi(x) - \phi(y) - \langle \nabla \phi(y), x - y \rangle$$

This measures the gap between $\phi(x)$ and its first-order Taylor approximation at $y$. It is always non-negative by convexity of $\phi$, but it is generally not symmetric: $D_\phi(x, y) \neq D_\phi(y, x)$.

Key examples:

  • $\phi(x) = \frac{1}{2}\|x\|_2^2$: the Bregman divergence is $D_\phi(x, y) = \frac{1}{2}\|x - y\|_2^2$ (squared Euclidean distance). Recovers standard gradient descent.
  • $\phi(x) = \sum_i x_i \log x_i$ (negative entropy): the Bregman divergence is the KL divergence $D_{\text{KL}}(x \| y) = \sum_i x_i \log(x_i / y_i)$. Natural for the simplex.
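Both examples can be checked numerically from the definition. A minimal sketch (the function names here are made up for the snippet, not from any library):

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    return phi(x) - phi(y) - np.dot(grad_phi(y), x - y)

# Squared-norm mirror map: phi(x) = 0.5 * ||x||^2.
sq = lambda x: 0.5 * np.dot(x, x)
sq_grad = lambda x: x

# Negative entropy (on the simplex): phi(x) = sum_i x_i log x_i.
negent = lambda x: np.sum(x * np.log(x))
negent_grad = lambda x: np.log(x) + 1.0

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.5, 0.25, 0.25])          # both are probability vectors

d_sq = bregman(sq, sq_grad, x, y)
d_kl = bregman(negent, negent_grad, x, y)

print(d_sq, 0.5 * np.sum((x - y) ** 2))  # squared Euclidean distance
print(d_kl, np.sum(x * np.log(x / y)))   # KL divergence
print(bregman(negent, negent_grad, y, x))  # differs from d_kl: not symmetric
```

Note that the entropic divergence equals the KL divergence exactly only when both arguments sum to one, since the linear terms $\sum_i x_i - \sum_i y_i$ then cancel.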

Mirror Descent

Definition

Mirror Descent Update

Given a convex function $f$ to minimize over a convex set $\mathcal{X}$, the mirror descent update with step size $\eta$ is:

$$x_{t+1} = \arg\min_{x \in \mathcal{X}} \left\{ \eta \langle \nabla f(x_t), x \rangle + D_\phi(x, x_t) \right\}$$

Equivalently, in the "dual space": compute $\theta_{t+1} = \nabla \phi(x_t) - \eta \nabla f(x_t)$, then "mirror back" via $x_{t+1} = \arg\min_{x \in \mathcal{X}} D_\phi(x, (\nabla \phi)^{-1}(\theta_{t+1}))$.

The update finds the point in $\mathcal{X}$ that best balances two objectives: move in the direction of the negative gradient (minimize the linear term) while staying close to $x_t$ (minimize the Bregman divergence term).

Special cases:

  • Euclidean mirror map ($\phi = \frac{1}{2}\|\cdot\|_2^2$): recovers projected gradient descent.
  • Entropic mirror map ($\phi = \sum_i x_i \log x_i$) over the simplex: recovers the exponentiated gradient / multiplicative weights update $x_{t+1,i} \propto x_{t,i} \exp(-\eta \nabla_i f(x_t))$.
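In the entropic case the abstract argmin has a closed form, so the update is just a multiplicative reweighting followed by normalization. A minimal sketch on the simplex, assuming a small quadratic objective chosen purely for illustration:

```python
import numpy as np

# Entropic mirror descent on the probability simplex (illustrative sketch;
# the objective and step size are chosen for demonstration).
# Minimize f(x) = 0.5 * ||x - p||^2 over the simplex; since p is itself a
# probability vector, the constrained minimizer is p. grad f(x) = x - p.
p = np.array([0.1, 0.2, 0.3, 0.4])

def entropic_md(grad, x0, eta, steps):
    x = x0.copy()
    for _ in range(steps):
        # Closed-form mirror step for the negative-entropy mirror map:
        # exponentiated gradient, then normalization back onto the simplex.
        x = x * np.exp(-eta * grad(x))
        x = x / x.sum()
    return x

x_star = entropic_md(lambda x: x - p, np.full(4, 0.25), eta=0.5, steps=2000)
print(np.round(x_star, 6))
```

The iterates stay strictly positive and on the simplex by construction, with no projection step anywhere.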
Theorem

Mirror Descent Convergence

Statement

After $T$ iterations of mirror descent with step size $\eta = \frac{R\sqrt{2}}{L\sqrt{T}}$, where $R^2 = \max_{x \in \mathcal{X}} D_\phi(x, x_1)$ and $\|\nabla f(x)\|_* \leq L$ for all $x$:

$$f\left(\frac{1}{T}\sum_{t=1}^T x_t\right) - f(x^*) \leq \frac{RL\sqrt{2}}{\sqrt{T}}$$

This is the $O(1/\sqrt{T})$ rate for convex Lipschitz optimization.

Intuition

The rate depends on the product $RL$, where $R$ is the "diameter" measured in Bregman divergence and $L$ is the gradient bound in the dual norm. Choosing $\phi$ to match the geometry of the constraint set can make $R$ much smaller than the Euclidean diameter, giving faster effective convergence.

Proof Sketch

The standard mirror descent analysis telescopes the Bregman divergence to the optimum: $D_\phi(x^*, x_{t+1}) \leq D_\phi(x^*, x_t) - \eta(f(x_t) - f(x^*)) + \frac{\eta^2}{2\sigma}\|\nabla f(x_t)\|_*^2$, where $\sigma$ is the strong-convexity constant of $\phi$ with respect to the chosen norm. Sum from $t = 1$ to $T$, use $D_\phi(x^*, x_1) \leq R^2$ and $D_\phi(x^*, x_{T+1}) \geq 0$, then optimize over $\eta$.

Why It Matters

Mirror descent with the entropic mirror map on the simplex achieves $O(\sqrt{\log d / T})$ regret, compared to $O(\sqrt{d / T})$ for projected gradient descent. The logarithmic dependence on the dimension $d$ is a huge advantage in high-dimensional settings like online learning with exponentially many experts.

Failure Mode

The choice of mirror map $\phi$ is crucial. A poorly chosen $\phi$ gives a Bregman diameter $R$ that is no better (or even worse) than the Euclidean one. The analysis also requires that the mirror map be strongly convex with respect to the chosen norm, which may be difficult to verify. For non-Lipschitz objectives, the rate degrades.

Frank-Wolfe (Conditional Gradient)

Definition

Frank-Wolfe Update

The Frank-Wolfe (conditional gradient) update for minimizing a smooth convex function $f$ over a compact convex set $\mathcal{X}$:

  1. Solve the linear minimization oracle (LMO): $v_t = \arg\min_{v \in \mathcal{X}} \langle \nabla f(x_t), v \rangle$
  2. Update: $x_{t+1} = x_t + \gamma_t (v_t - x_t)$ where $\gamma_t \in [0, 1]$

The step direction $v_t - x_t$ always points into the feasible set, so no projection is needed.

The LMO is often much cheaper than projection. Over the nuclear norm ball, the LMO requires only the top singular vector pair (roughly $O(d^2)$ via iterative methods), while projection requires a full SVD ($O(d^3)$). Over polytopes, the LMO is a linear program.
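Another case where the LMO is trivially cheap is the $\ell_1$ ball: the minimizer of a linear function over $\{x : \|x\|_1 \leq 1\}$ is the signed coordinate vertex at the largest gradient magnitude. A minimal Frank-Wolfe sketch using it (the objective and data are chosen for illustration):

```python
import numpy as np

# Frank-Wolfe over the l1 ball {x : ||x||_1 <= 1} (illustrative sketch).
# LMO: argmin_v <g, v> over the ball is -sign(g_i) * e_i at the coordinate
# with the largest |g_i| -- no projection is ever needed.
def lmo_l1(g):
    i = np.argmax(np.abs(g))
    v = np.zeros_like(g)
    v[i] = -np.sign(g[i])
    return v

def frank_wolfe(grad, x0, steps):
    x = x0.copy()
    for t in range(steps):
        v = lmo_l1(grad(x))
        gamma = 2.0 / (t + 2.0)       # classic step-size schedule
        x = x + gamma * (v - x)       # convex combination: stays feasible
    return x

# Minimize f(x) = 0.5 * ||x - b||^2 with b outside the ball, so the
# constrained optimum sits on the boundary (||x*||_1 = 1).
b = np.array([0.9, -0.6, 0.3, 0.0])
x = frank_wolfe(lambda x: x - b, np.zeros(4), steps=2000)
print(np.round(x, 3), np.abs(x).sum())
```

Every iterate is a convex combination of the starting point and vertices, so feasibility is maintained automatically.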

Theorem

Frank-Wolfe Convergence Rate

Statement

With step sizes $\gamma_t = 2/(t+2)$, the Frank-Wolfe algorithm satisfies:

$$f(x_T) - f(x^*) \leq \frac{2L \cdot \mathrm{diam}(\mathcal{X})^2}{T + 2}$$

where $L$ is the smoothness constant of $f$ and $\mathrm{diam}(\mathcal{X}) = \max_{x, y \in \mathcal{X}} \|x - y\|$.

Intuition

The $O(1/T)$ rate is slower than the $O(1/T^2)$ of accelerated gradient descent, but each iteration only requires a linear minimization, not a projection. The trade-off is worth it when the LMO is cheap and projection is expensive.

Proof Sketch

By smoothness, $f(x_{t+1}) \leq f(x_t) + \gamma_t \langle \nabla f(x_t), v_t - x_t \rangle + \frac{L \gamma_t^2}{2}\|v_t - x_t\|^2$. Since $v_t$ minimizes the linear term over $\mathcal{X}$, convexity gives $\langle \nabla f(x_t), v_t - x_t \rangle \leq \langle \nabla f(x_t), x^* - x_t \rangle \leq -(f(x_t) - f(x^*))$. Substituting and setting $\gamma_t = 2/(t+2)$ gives the result by induction.

Why It Matters

Frank-Wolfe iterates are sparse: each iterate is a convex combination of at most $T$ vertices of $\mathcal{X}$. This makes it natural for problems where sparse solutions are desirable (e.g., sparse optimization, low-rank matrix completion). The method also never leaves the feasible set, which matters when feasibility has physical meaning.

Failure Mode

The $O(1/T)$ rate cannot be improved in general for Frank-Wolfe without modifications (away steps, pairwise steps). The "zig-zagging" phenomenon arises when the optimum lies on the boundary of $\mathcal{X}$, especially on a low-dimensional face: the iterates oscillate between vertices of the active face instead of moving cleanly along it, which is precisely what the away-step and pairwise variants are designed to fix. For strongly convex objectives, vanilla Frank-Wolfe does not achieve the linear rate that projected gradient descent achieves.

Common Confusions

Watch Out

Mirror descent is not just projected gradient descent with a different norm

Projected gradient descent can be set up with non-Euclidean norms, but that is not the same as mirror descent with the entropic mirror map. Mirror descent changes the geometry of the update step itself (using a Bregman divergence instead of a squared distance), not just the metric used in the projection. The two coincide only when $\phi = \frac{1}{2}\|\cdot\|_2^2$.
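The distinction is easy to see numerically: starting from the same point with the same gradient, one entropic mirror step and one Euclidean projected-gradient step land at different points on the simplex. A minimal sketch (the point and gradient are made up; the projection uses the standard sort-based method):

```python
import numpy as np

def project_simplex(y):
    """Euclidean projection onto {x >= 0, sum x = 1} (sort-based)."""
    u = np.sort(y)[::-1]
    css = np.cumsum(u)
    k = np.arange(1, len(y) + 1)
    rho = np.nonzero(u * k > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(y - theta, 0.0)

x = np.array([0.5, 0.3, 0.2])
g = np.array([1.0, -2.0, 0.5])
eta = 0.1

x_md = x * np.exp(-eta * g)
x_md = x_md / x_md.sum()              # entropic mirror step
x_pgd = project_simplex(x - eta * g)  # Euclidean step, then projection

print(np.round(x_md, 4), np.round(x_pgd, 4))
```

Both results lie on the simplex, but the multiplicative update rescales coordinates relative to their current values, while the Euclidean projection shifts and clips them.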

Watch Out

Frank-Wolfe and mirror descent solve different bottlenecks

Mirror descent replaces Euclidean distance with a better geometry but still requires a Bregman projection. Frank-Wolfe eliminates projection entirely by using a linear oracle. If projection is cheap (e.g., onto a ball), use projected gradient descent or mirror descent. If projection is expensive but linear minimization is cheap (e.g., over a polytope or nuclear norm ball), use Frank-Wolfe.

Canonical Examples

Example

Multiplicative weights from mirror descent

On the probability simplex $\Delta_d = \{x \geq 0 : \sum_i x_i = 1\}$, use the negative entropy $\phi(x) = \sum_i x_i \log x_i$. The mirror descent update for a linear loss $f(x) = \langle g_t, x \rangle$ gives:

$$x_{t+1,i} = \frac{x_{t,i} \exp(-\eta g_{t,i})}{\sum_j x_{t,j} \exp(-\eta g_{t,j})}$$

This is the multiplicative weights update, with $O(\sqrt{\log d / T})$ regret. The $\log d$ (instead of $d$) dependence on dimension is the key advantage over Euclidean methods.
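A quick simulation makes the regret behavior concrete (the loss sequence, seed, and step size here are synthetic, chosen only for demonstration):

```python
import numpy as np

# Multiplicative weights against a synthetic sequence of linear losses.
rng = np.random.default_rng(0)
d, T = 10, 2000
losses = rng.uniform(0.0, 1.0, size=(T, d))
losses[:, 3] -= 0.3                  # make expert 3 consistently better
losses = np.clip(losses, 0.0, 1.0)

eta = np.sqrt(2.0 * np.log(d) / T)   # standard tuning for losses in [0, 1]
x = np.full(d, 1.0 / d)
alg_loss = 0.0
for g in losses:
    alg_loss += np.dot(x, g)         # pay the expected loss under x
    x = x * np.exp(-eta * g)         # multiplicative update
    x = x / x.sum()                  # renormalize onto the simplex

best = losses.sum(axis=0).min()      # best single expert in hindsight
regret = alg_loss - best
print(regret)
```

The weights concentrate on the best expert, and the cumulative regret stays on the $\sqrt{T \log d}$ scale predicted by the theory.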

Exercises

ExerciseCore

Problem

Verify that the Bregman divergence for $\phi(x) = \frac{1}{2}\|x\|_2^2$ is the squared Euclidean distance $\frac{1}{2}\|x - y\|_2^2$, and that the mirror descent update reduces to the projected gradient descent update.

ExerciseAdvanced

Problem

Consider Frank-Wolfe on the $\ell_1$ ball $\{x : \|x\|_1 \leq 1\}$ in $\mathbb{R}^d$. What does the linear minimization oracle compute? What is the sparsity of the iterate $x_T$ after $T$ steps if $x_0 = 0$?

References

Canonical:

  • Nemirovski & Yudin, Problem Complexity and Method Efficiency in Optimization (1983), Chapter 5
  • Frank & Wolfe, "An algorithm for quadratic programming" (1956)
  • Bubeck, "Convex Optimization: Algorithms and Complexity" (2015), Sections 4.2, 4.3

Current:

  • Jaggi, "Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization" (2013)
  • Hazan, Introduction to Online Convex Optimization (2016), Chapter 5

Next Topics

  • Projected gradient descent: the standard projection-based approach that mirror descent generalizes
  • Coordinate descent: another projection-free approach for structured problems

Last reviewed: April 26, 2026
