
Numerical Optimization

Proximal Gradient Methods

Optimize composite objectives by alternating gradient steps on smooth terms with proximal operators on nonsmooth terms. ISTA and its accelerated variant FISTA.

Core · Tier 1 · Stable · Supporting · ~55 min

Why This Matters

Many ML objectives have the form "smooth loss plus nonsmooth regularizer." Lasso regression is the canonical example: squared error (smooth) plus an $\ell_1$ penalty (nonsmooth). You cannot apply standard gradient descent to the full objective because the $\ell_1$ norm is not differentiable at zero. Subgradient methods apply, but converge slowly ($O(1/\sqrt{k})$ for convex problems).

Proximal gradient methods solve this cleanly. They split the smooth and nonsmooth parts: a gradient step on the smooth part, then a proximal step on the nonsmooth part. ISTA recovers the $O(1/k)$ rate of gradient descent on smooth functions; FISTA reaches the optimal $O(1/k^2)$ rate via Nesterov acceleration. This splitting strategy is the algorithmic backbone of sparse optimization, constrained optimization, and structured regularization across ML.

Mental Model

Think of two operations in sequence. First, slide downhill along the smooth loss surface (a gradient step on $f$). This produces an intermediate point that ignores the nonsmooth regularizer entirely. Second, apply the proximal operator of $g$ to that intermediate point: find the nearest point (in a squared-distance sense) that also keeps $g$ small. For the $\ell_1$ norm, this second step is soft-thresholding, which pushes small coordinates exactly to zero and shrinks large ones toward zero. The two-step rhythm of gradient, then proximal, repeats until convergence.

When $g$ is the indicator function of a convex constraint set, the proximal step reduces to Euclidean projection, and the algorithm becomes projected gradient descent.

Formal Setup and Notation

We want to solve the composite minimization problem:

$$\min_{x \in \mathbb{R}^d} \; F(x) = f(x) + g(x)$$

where $f \colon \mathbb{R}^d \to \mathbb{R}$ is convex with $L$-Lipschitz continuous gradient, meaning $\|\nabla f(x) - \nabla f(y)\| \leq L\|x - y\|$ for all $x, y$. The function $g \colon \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ is convex, lower semicontinuous, and proper (its domain is nonempty), but possibly nonsmooth.

The Lipschitz constant $L$ of $\nabla f$ controls the step size: we set $t \leq 1/L$ to ensure the gradient step does not overshoot. For quadratic losses like $f(x) = \frac{1}{2}\|Ax - b\|^2$, the constant is $L = \lambda_{\max}(A^T A)$, the largest eigenvalue of $A^T A$.

The composite structure $f + g$ appears throughout ML:

| Problem | $f(x)$ | $g(x)$ |
| --- | --- | --- |
| Lasso | $\frac{1}{2}\|Ax - b\|^2$ | $\lambda\|x\|_1$ |
| Ridge | $\frac{1}{2}\|Ax - b\|^2$ | $\frac{\lambda}{2}\|x\|^2$ |
| Elastic net | $\frac{1}{2}\|Ax - b\|^2$ | $\lambda_1\|x\|_1 + \frac{\lambda_2}{2}\|x\|^2$ |
| Constrained minimization | $f(x)$ | $I_C(x)$ (indicator of set $C$) |
| Group lasso | $\frac{1}{2}\|Ax - b\|^2$ | $\lambda \sum_j \|x_{G_j}\|_2$ |
| Nuclear norm regularization | loss on matrix | $\lambda \|X\|_*$ |
Definition

Proximal Operator

The proximal operator of a convex function $g$ at point $x$ is:

$$\mathrm{prox}_g(x) = \arg\min_{y} \left( g(y) + \frac{1}{2}\|y - x\|^2 \right)$$

It finds the point that balances being close to $x$ (the quadratic term) with having a small $g$ value. The proximal operator always exists and is unique when $g$ is convex, lower semicontinuous, and proper.

When $g$ carries a scaling parameter $\gamma > 0$, write $\mathrm{prox}_{\gamma g}(x) = \arg\min_y \left( g(y) + \frac{1}{2\gamma}\|y - x\|^2 \right)$. The parameter $\gamma$ controls the trade-off: large $\gamma$ penalizes $g$ more heavily; small $\gamma$ keeps the output closer to $x$.

Key properties of the proximal operator

The proximal operator has several properties that make it well-suited to iterative algorithms:

  1. Firm nonexpansiveness. For any $x, y$: $\|\mathrm{prox}_g(x) - \mathrm{prox}_g(y)\|^2 \leq \langle x - y, \mathrm{prox}_g(x) - \mathrm{prox}_g(y) \rangle$. This is stronger than nonexpansiveness ($\|\mathrm{prox}_g(x) - \mathrm{prox}_g(y)\| \leq \|x - y\|$) and guarantees that iterating the proximal operator does not amplify errors.

  2. Fixed-point characterization. $x^*$ minimizes $g$ if and only if $x^* = \mathrm{prox}_{\gamma g}(x^*)$ for any $\gamma > 0$. So minimizers of $g$ are exactly the fixed points of its proximal operator.

  3. Moreau decomposition. $\mathrm{prox}_g(x) + \mathrm{prox}_{g^*}(x) = x$, where $g^*$ is the convex conjugate (Fenchel conjugate) of $g$. This duality between the proximal operators of $g$ and $g^*$ underlies many algorithmic decompositions.

  4. Composition with affine maps. If $g(x) = h(Ax + b)$ and $A$ is orthogonal, then $\mathrm{prox}_g(x) = A^T(\mathrm{prox}_h(Ax + b) - b)$. For general $A$, no simple composition rule exists; this is why ADMM introduces auxiliary variables to handle affine compositions.

Definition

Moreau Envelope

The Moreau envelope of $g$ with parameter $\gamma > 0$ is:

$$M_g^\gamma(x) = \min_{y} \left( g(y) + \frac{1}{2\gamma}\|y - x\|^2 \right)$$

It is a smooth approximation of $g$. Even when $g$ is nonsmooth, the Moreau envelope $M_g^\gamma$ is differentiable with gradient $\nabla M_g^\gamma(x) = \frac{1}{\gamma}(x - \mathrm{prox}_{\gamma g}(x))$. As $\gamma \to 0$, the Moreau envelope converges pointwise to $g$.

The Moreau envelope preserves minimizers: $\arg\min_x g(x) = \arg\min_x M_g^\gamma(x)$ for any $\gamma > 0$. It also preserves the minimum value. This means we can study the smooth function $M_g^\gamma$ instead of the nonsmooth $g$ without changing the optimization problem's solution set.

The gradient $\nabla M_g^\gamma(x) = \frac{1}{\gamma}(x - \mathrm{prox}_{\gamma g}(x))$ is $\frac{1}{\gamma}$-Lipschitz continuous. This makes $M_g^\gamma$ a $C^1$ function with bounded curvature, even when $g$ itself is piecewise linear (like the $\ell_1$ norm). The Moreau envelope therefore provides a principled way to "smooth" nonsmooth objectives for analysis.
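
The gradient identity is easy to sanity-check numerically. Here is a minimal sketch for $g = \|\cdot\|_1$, whose prox is the soft-thresholding operator given in closed form in the next section; function names are illustrative:

```python
import numpy as np

def prox_l1(x, t):
    """Prox of t * ||.||_1: coordinate-wise soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def moreau_env(x, gam, lam=1.0):
    """Moreau envelope of g = lam * ||.||_1, evaluated through the prox:
    M(x) = g(p) + ||p - x||^2 / (2*gam) with p = prox_{gam*g}(x)."""
    p = prox_l1(x, gam * lam)
    return lam * np.abs(p).sum() + np.sum((p - x) ** 2) / (2 * gam)

x = np.array([1.5, -0.3, 0.0, 2.0])
gam = 0.5

# Gradient from the formula: (x - prox_{gam*g}(x)) / gam.
grad = (x - prox_l1(x, gam)) / gam

# Central finite differences on the envelope agree with the formula.
eps = 1e-6
fd = np.array([(moreau_env(x + eps * e, gam) - moreau_env(x - eps * e, gam))
               / (2 * eps) for e in np.eye(x.size)])
print(np.allclose(grad, fd, atol=1e-4))  # True
```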

Closed-Form Proximal Operators

The practical usefulness of proximal methods depends on whether $\mathrm{prox}_g$ has a closed form. For many regularizers used in ML, it does.

Soft-thresholding ($\ell_1$ norm). For $g(x) = \lambda \|x\|_1$:

$$[\mathrm{prox}_{\lambda \|\cdot\|_1}(x)]_i = \mathrm{sign}(x_i) \max(|x_i| - \lambda, 0)$$

This shrinks each coordinate toward zero and sets coordinates with $|x_i| \leq \lambda$ exactly to zero. The $\ell_1$ proximal operator is the mechanism behind sparsity in lasso and sparse recovery.
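
In numpy this is one line; the function name and sample values below are illustrative:

```python
import numpy as np

def soft_threshold(x, lam):
    """Prox of lam * ||.||_1: shrink each coordinate toward zero by lam."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

print(soft_threshold(np.array([2.0, -0.4, 0.9, -3.0]), 1.0))
# [ 1. -0.  0. -2.]  -- coordinates with |x_i| <= 1 are zeroed
```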

Indicator function. For $g(x) = I_C(x)$ (zero if $x \in C$, $+\infty$ otherwise), the proximal operator is the Euclidean projection onto $C$: $\mathrm{prox}_{I_C}(x) = \Pi_C(x)$.

$\ell_2$ norm (group soft-thresholding). For $g(x) = \lambda\|x\|_2$ (not squared):

$$\mathrm{prox}_{\lambda\|\cdot\|_2}(x) = \left(1 - \frac{\lambda}{\|x\|_2}\right)_+ x$$

where $(a)_+ = \max(a, 0)$. This shrinks the entire vector toward zero and sets $\mathrm{prox}(x) = 0$ when $\|x\|_2 \leq \lambda$. It is the building block of group lasso, where each group is shrunk as a unit.
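
A sketch of the group operator, applied to two hypothetical groups (one above and one below the threshold):

```python
import numpy as np

def block_soft_threshold(x, lam):
    """Prox of lam * ||.||_2 (not squared): shrink the whole vector as a unit."""
    nrm = np.linalg.norm(x)
    if nrm <= lam:
        return np.zeros_like(x)
    return (1.0 - lam / nrm) * x

print(block_soft_threshold(np.array([3.0, 4.0]), 1.0))  # [2.4 3.2]: shrunk, direction kept
print(block_soft_threshold(np.array([0.3, 0.4]), 1.0))  # [0. 0.]: zeroed as a unit
```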

Squared $\ell_2$ norm. For $g(x) = \frac{\lambda}{2}\|x\|^2$ (ridge penalty):

$$\mathrm{prox}_{\frac{\lambda}{2}\|\cdot\|^2}(x) = \frac{1}{1 + \lambda} x$$

This scales all coordinates uniformly. No coordinate is set to zero, which is why ridge shrinks but does not select variables.

Nuclear norm. For $g(X) = \lambda\|X\|_* = \lambda \sum_i \sigma_i(X)$, the proximal operator applies soft-thresholding to the singular values: compute the SVD $X = U\Sigma V^T$, threshold the diagonal entries of $\Sigma$, and reconstruct. This costs one SVD per iteration.

Elastic net. For $g(x) = \lambda_1\|x\|_1 + \frac{\lambda_2}{2}\|x\|^2$, the proximal operator composes scaling and soft-thresholding: first soft-threshold with parameter $\lambda_1$, then scale by $\frac{1}{1+\lambda_2}$.

The ISTA Algorithm

The Iterative Shrinkage-Thresholding Algorithm (ISTA) repeats:

$$x_{k+1} = \mathrm{prox}_{t_k g}\!\left(x_k - t_k \nabla f(x_k)\right)$$

where $t_k \leq 1/L$ is the step size and $L$ is the Lipschitz constant of $\nabla f$.

Step by step: (1) compute the gradient $\nabla f(x_k)$; (2) take a gradient step on the smooth part, producing an intermediate point $z = x_k - t_k \nabla f(x_k)$; (3) apply the proximal operator of $g$ to $z$, handling the nonsmooth part. The gradient step minimizes a local quadratic model of $f$ around $x_k$; the proximal step incorporates $g$ at minimal additional cost.
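
A minimal generic sketch of this loop, under the convention (assumed here, not fixed by the text) that `prox_g(v, t)` computes $\mathrm{prox}_{t g}(v)$:

```python
import numpy as np

def ista(grad_f, prox_g, x0, step, n_iters=500):
    """Proximal gradient (ISTA): gradient step on f, then prox step on g.
    `step` should satisfy step <= 1/L for L the Lipschitz constant of grad_f."""
    x = x0.copy()
    for _ in range(n_iters):
        z = x - step * grad_f(x)  # steps (1)-(2): gradient step on the smooth part
        x = prox_g(z, step)       # step (3): proximal step on the nonsmooth part
    return x
```

The lasso example later on this page plugs in $\nabla f(x) = A^T(Ax - b)$ for `grad_f` and soft-thresholding for `prox_g`.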

Why the step size is $1/L$

The step size $t = 1/L$ comes from the quadratic upper bound on smooth convex functions. Because $\nabla f$ is $L$-Lipschitz, we have the descent lemma:

$$f(y) \leq f(x) + \nabla f(x)^T(y - x) + \frac{L}{2}\|y - x\|^2$$

Minimizing the right side plus $g(y)$ over $y$ gives exactly the proximal gradient step with $t = 1/L$. The quadratic upper bound acts as a majorizer: ISTA minimizes this majorizer at each step, guaranteeing that $F(x_{k+1}) \leq F(x_k)$ (monotone decrease).

Backtracking line search

When $L$ is unknown or expensive to compute (e.g., computing $\lambda_{\max}(A^T A)$ for a large matrix), a backtracking line search finds a suitable step size. Start with some $\hat{L}$, then increase $\hat{L} \leftarrow \eta \hat{L}$ (with $\eta > 1$, typically $\eta = 2$) until the sufficient decrease condition holds:

$$f\!\left(\mathrm{prox}_{g/\hat{L}}\!\left(x - \frac{1}{\hat{L}}\nabla f(x)\right)\right) \leq f(x) + \nabla f(x)^T(p - x) + \frac{\hat{L}}{2}\|p - x\|^2$$

where $p$ is the proximal step output. This avoids computing $L$ exactly while preserving the $O(1/k)$ convergence guarantee.
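
A sketch of a single backtracking step under the same conventions as the ISTA sketch above (returning the accepted $\hat{L}$ lets the next iteration start from it; names are illustrative):

```python
import numpy as np

def backtracking_step(f, grad_f, prox_g, x, L_hat=1.0, eta=2.0):
    """Increase L_hat until the sufficient decrease condition holds,
    then return the accepted proximal step and the current estimate."""
    fx, gx = f(x), grad_f(x)
    while True:
        p = prox_g(x - gx / L_hat, 1.0 / L_hat)
        d = p - x
        # f(p) <= quadratic model of f at x with curvature L_hat?
        if f(p) <= fx + gx @ d + 0.5 * L_hat * (d @ d):
            return p, L_hat
        L_hat *= eta
```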

The FISTA Algorithm

Fast ISTA (FISTA), introduced by Beck and Teboulle (2009), adds Nesterov momentum to accelerate convergence from $O(1/k)$ to $O(1/k^2)$.

Initialize $x_0 = y_1$, $t_1 = 1$, and fix a step size $s \leq 1/L$. Then repeat:

$$x_k = \mathrm{prox}_{s \cdot g}\!\left(y_k - s \cdot \nabla f(y_k)\right)$$

$$t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2}$$

$$y_{k+1} = x_k + \frac{t_k - 1}{t_{k+1}}(x_k - x_{k-1})$$

The momentum coefficient $\frac{t_k - 1}{t_{k+1}}$ starts near zero and approaches $1$ as $k$ grows. At large $k$, $t_k \approx k/2$, so the coefficient is approximately $\frac{k-2}{k+1}$, similar to the Nesterov acceleration schedule for smooth gradient descent.

The extrapolation step $y_{k+1}$ overshoots past $x_k$ in the direction of recent movement. This look-ahead is what distinguishes acceleration from plain gradient descent; the algorithm uses trajectory information, not just the current gradient.
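
With the same conventions as the ISTA sketch, the three updates give a compact loop:

```python
import numpy as np

def fista(grad_f, prox_g, x0, step, n_iters=500):
    """FISTA with the Beck-Teboulle momentum schedule."""
    x_prev = x0.copy()
    y = x0.copy()
    t = 1.0
    for _ in range(n_iters):
        x = prox_g(y - step * grad_f(y), step)       # proximal gradient step at y
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x + ((t - 1.0) / t_next) * (x - x_prev)  # extrapolate past x
        x_prev, t = x, t_next
    return x_prev
```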

FISTA with monotone variant

Because FISTA's momentum can cause the objective to increase between iterations, a monotone variant (Beck and Teboulle, 2009; O'Donoghue and Candès, 2015) modifies the update: the extrapolation

$$y_{k+1} = x_k + \frac{t_k - 1}{t_{k+1}}(x_k - x_{k-1})$$

is replaced by choosing $y_{k+1}$ based on whichever of $x_k, x_{k-1}$ has the smaller objective value. This sacrifices some theoretical elegance but avoids the oscillations that slow FISTA on ill-conditioned problems.

Restarting for strongly convex problems

When $f$ is strongly convex with parameter $\mu > 0$, the optimal rate for first-order methods improves from $O(1/k^2)$ to linear convergence $O((1 - \sqrt{\mu/L})^k)$. FISTA does not automatically exploit strong convexity, but a restarting scheme does: restart the momentum sequence ($t_k \leftarrow 1$) whenever $F(x_k) > F(x_{k-1})$ or at fixed intervals of $O(\sqrt{L/\mu})$ iterations. This achieves the optimal linear rate for strongly convex composite problems.

The condition number $\kappa = L/\mu$ controls the restart interval. Ill-conditioned problems ($\kappa \gg 1$) restart less frequently; well-conditioned problems restart often.
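
A sketch of the function-value restart on top of the FISTA loop above; exact restart rules vary across implementations, and this version follows the "reset momentum when $F$ increases" test:

```python
import numpy as np

def fista_restart(F, grad_f, prox_g, x0, step, n_iters=500):
    """FISTA with adaptive restart: reset t <- 1 when the objective rises."""
    x_prev, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(n_iters):
        x = prox_g(y - step * grad_f(y), step)
        if F(x) > F(x_prev):
            y, t = x.copy(), 1.0  # momentum overshot: drop it, keep the point
        else:
            t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
            y = x + ((t - 1.0) / t_next) * (x - x_prev)
            t = t_next
        x_prev = x
    return x_prev
```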

Main Theorems

Theorem

ISTA Convergence Rate

Statement

Let $F^* = \min_x F(x)$. ISTA with constant step size $t = 1/L$ satisfies:

$$F(x_k) - F^* \leq \frac{L \|x_0 - x^*\|^2}{2k}$$

where $x^*$ is a minimizer of $F$.

Intuition

The convergence rate is $O(1/k)$: after $k$ iterations, suboptimality shrinks proportionally to $1/k$. This matches gradient descent on smooth convex functions. The proximal step handles nonsmoothness without degrading the rate. Compare to subgradient methods, which achieve only $O(1/\sqrt{k})$ on the same problem class.

Proof Sketch

The proof uses the descent lemma for smooth functions combined with the optimality condition for the proximal step. Since $\nabla f$ is $L$-Lipschitz:

$$f(x_{k+1}) \leq f(x_k) + \nabla f(x_k)^T(x_{k+1} - x_k) + \frac{L}{2}\|x_{k+1} - x_k\|^2$$

The proximal optimality condition gives $0 \in \partial g(x_{k+1}) + L\left(x_{k+1} - x_k + \frac{1}{L}\nabla f(x_k)\right)$. Rearranging and using convexity of $g$:

$$g(x_{k+1}) \leq g(x) + L\left\langle x_k - \frac{1}{L}\nabla f(x_k) - x_{k+1},\; x_{k+1} - x \right\rangle$$

for any $x$. Combining with the descent inequality and setting $x = x^*$:

$$F(x_{k+1}) - F^* \leq \frac{L}{2}\left(\|x_k - x^*\|^2 - \|x_{k+1} - x^*\|^2\right)$$

Telescoping from $k = 0$ to $K-1$ and using $\|x_K - x^*\|^2 \geq 0$ bounds $\sum_{k=1}^{K} (F(x_k) - F^*)$ by $\frac{L}{2}\|x_0 - x^*\|^2$; since ISTA decreases $F$ monotonically, the last term is the smallest of the sum, giving $F(x_K) - F^* \leq \frac{L\|x_0 - x^*\|^2}{2K}$.

Why It Matters

This rate shows that proximal gradient descent matches gradient descent even though part of the objective is nonsmooth. The proximal operator absorbs the nonsmoothness at no cost to the convergence rate; the price is only computational (evaluating $\mathrm{prox}_g$ once per iteration).

Failure Mode

The $O(1/k)$ rate is tight for ISTA: there exist convex instances where the bound is achieved. The rate also depends on $L$; if $L$ is large (ill-conditioned smooth part), convergence is slow per iteration. If $L$ is unknown and overestimated, the step size becomes needlessly small. If underestimated, the algorithm may diverge. Backtracking resolves this at the cost of extra gradient evaluations.

Theorem

FISTA Accelerated Convergence Rate

Statement

FISTA satisfies:

$$F(x_k) - F^* \leq \frac{2L \|x_0 - x^*\|^2}{(k+1)^2}$$

Intuition

Acceleration improves the rate from $O(1/k)$ to $O(1/k^2)$. After 100 iterations, ISTA reduces the gap by a factor of $100$; FISTA reduces it by a factor of $10{,}000$. The momentum term lets the algorithm use trajectory information, not just the current gradient, to take longer effective steps.

Proof Sketch

Define a Lyapunov function:

$$E_k = t_k^2(F(x_k) - F^*) + \frac{L}{2}\|v_k - x^*\|^2$$

where $v_k = x_{k-1} + t_k(x_k - x_{k-1})$ is an auxiliary sequence. The proof shows $E_{k+1} \leq E_k$ by substituting the FISTA update rules and using the descent property of the proximal step. The key algebraic identity is $t_{k+1}^2 - t_{k+1} \leq t_k^2$, which follows from the specific recurrence $t_{k+1} = (1 + \sqrt{1 + 4t_k^2})/2$. Since $E_k$ is nonincreasing and the $\|v_k - x^*\|^2$ term is nonnegative:

$$t_k^2(F(x_k) - F^*) \leq E_k \leq E_1 = F(x_0) - F^* + \frac{L}{2}\|x_0 - x^*\|^2$$

Since $t_k \geq (k+1)/2$, dividing both sides by $t_k^2$ yields the $O(1/k^2)$ bound.

Why It Matters

$O(1/k^2)$ is the optimal rate for first-order methods on this problem class (Nesterov, 1983). No method using only gradient evaluations and proximal operators can do better in the worst case. FISTA achieves this optimal rate for composite objectives with the same per-iteration cost as ISTA.

Failure Mode

FISTA's iterates can oscillate: $F(x_{k+1})$ may exceed $F(x_k)$. This non-monotonicity makes FISTA harder to use as a subroutine (e.g., stopping criteria based on objective decrease may trigger prematurely). Monotone variants and adaptive restart address this. For strongly convex $f$, plain FISTA wastes the strong convexity; use restarting or a modified momentum sequence to get linear convergence.

Optimality and Lower Bounds

The $O(1/k^2)$ rate of FISTA is optimal among first-order methods for convex optimization with Lipschitz gradients. Nesterov (1983, 2004) proved a matching lower bound: for any first-order method that queries $\nabla f$ at $k$ points, there exists a convex function with $L$-Lipschitz gradient such that:

$$F(x_k) - F^* \geq \frac{3L\|x_0 - x^*\|^2}{32(k+1)^2}$$

So FISTA's upper bound $\frac{2L\|x_0 - x^*\|^2}{(k+1)^2}$ matches the lower bound up to a constant factor. For ISTA, the $O(1/k)$ rate is also tight: there exist instances achieving it.

Under strong convexity ($f$ has parameter $\mu > 0$), the optimal rate improves to $O((1 - \sqrt{\mu/L})^k)$ (linear convergence). Plain ISTA achieves $O((1 - \mu/L)^k)$; FISTA with restart achieves the optimal $O((1 - \sqrt{\mu/L})^k)$. The gap between $1 - \mu/L$ and $1 - \sqrt{\mu/L}$ is significant when the condition number $\kappa = L/\mu$ is large: ISTA needs $O(\kappa)$ iterations for $\epsilon$-accuracy; FISTA with restart needs $O(\sqrt{\kappa})$.

Canonical Examples

Example

Lasso via ISTA

The Lasso problem is $\min_x \frac{1}{2}\|Ax - b\|^2 + \lambda\|x\|_1$. Here $f(x) = \frac{1}{2}\|Ax - b\|^2$ with $\nabla f(x) = A^T(Ax - b)$ and $L = \|A^T A\| = \sigma_{\max}^2(A)$, the squared largest singular value of $A$. The proximal step is coordinate-wise soft-thresholding. Each ISTA iteration: compute $z = x_k - \frac{1}{L} A^T(Ax_k - b)$, then $x_{k+1} = \mathrm{sign}(z) \odot \max(|z| - \lambda/L, 0)$.

For a $1000 \times 500$ matrix $A$, each iteration costs $O(nd)$ for the matrix-vector products $Ax_k$ and $A^T(\cdot)$, plus $O(d)$ for the soft-thresholding. The total per-iteration cost is dominated by the two matrix-vector multiplies.
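
A runnable sketch at exactly this scale, on synthetic data; choosing $\lambda$ as a fraction of $\|A^T b\|_\infty$ is a common heuristic added here, not something the text prescribes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 500
A = rng.standard_normal((n, d))
x_true = np.zeros(d)
x_true[:25] = rng.standard_normal(25)           # sparse ground truth
b = A @ x_true + 0.01 * rng.standard_normal(n)

L = np.linalg.norm(A, 2) ** 2                   # sigma_max(A)^2
lam = 0.1 * np.max(np.abs(A.T @ b))             # heuristic regularization level

x = np.zeros(d)
for _ in range(300):
    z = x - A.T @ (A @ x - b) / L               # two O(nd) matrix-vector products
    x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # O(d) soft-threshold

obj = 0.5 * np.sum((A @ x - b) ** 2) + lam * np.abs(x).sum()
print(f"objective {obj:.3f}, nonzeros {np.count_nonzero(x)} of {d}")
```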

Example

Projected gradient descent as a special case

When $g = I_C$ is the indicator of a convex set $C$, the proximal operator is projection onto $C$. So ISTA becomes projected gradient descent: $x_{k+1} = \Pi_C(x_k - t \nabla f(x_k))$. Constrained smooth optimization is a special case of proximal gradient methods. Common constraint sets with cheap projections include the simplex ($O(d \log d)$), the $\ell_2$ ball ($O(d)$), and box constraints ($O(d)$ coordinate-wise clipping).
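
A tiny sketch with box constraints, where the projection is coordinate-wise clipping; the quadratic objective is illustrative:

```python
import numpy as np

# Projected gradient descent for min 0.5 * ||x - c||^2 subject to x in [0, 1]^3.
c = np.array([1.7, -0.2, 0.4])
x = np.zeros(3)
for _ in range(100):
    grad = x - c                           # gradient of the smooth part (L = 1)
    x = np.clip(x - 0.5 * grad, 0.0, 1.0)  # gradient step, then project onto the box
print(x)  # -> [1.  0.  0.4]: the unconstrained minimizer c, clipped to the box
```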

Example

Elastic net

The elastic net combines $\ell_1$ and $\ell_2$ penalties: $g(x) = \lambda_1\|x\|_1 + \frac{\lambda_2}{2}\|x\|^2$. Its proximal operator is:

$$\mathrm{prox}_g(x) = \frac{1}{1 + \lambda_2}\,\mathrm{sign}(x) \odot \max(|x| - \lambda_1, 0)$$

First soft-threshold with parameter $\lambda_1$, then scale by $\frac{1}{1+\lambda_2}$. This produces solutions that are sparse (from the $\ell_1$ part) and have bounded $\ell_2$ norm (from the $\ell_2$ part). The elastic net proximal operator is separable across coordinates, so it costs $O(d)$.
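
In numpy form, a direct transcription of the formula above (sample values illustrative):

```python
import numpy as np

def prox_elastic_net(x, lam1, lam2):
    """Prox of lam1 * ||.||_1 + (lam2/2) * ||.||^2:
    soft-threshold by lam1, then scale by 1 / (1 + lam2)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam1, 0.0) / (1.0 + lam2)

print(prox_elastic_net(np.array([2.0, -0.5, 1.2]), lam1=1.0, lam2=0.5))
# approx [0.667 -0.    0.133]: sparse from the l1 part, shrunk by the l2 part
```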

Example

Low-rank matrix completion

For matrix completion with nuclear norm regularization: $\min_X \frac{1}{2}\sum_{(i,j) \in \Omega}(X_{ij} - M_{ij})^2 + \lambda\|X\|_*$, where $\Omega$ is the set of observed entries. Here $f$ is a smooth quadratic loss on observed entries, and $g(X) = \lambda\|X\|_*$. The proximal operator requires a singular value decomposition: compute $X = U\Sigma V^T$, then soft-threshold the singular values $\sigma_i \mapsto \max(\sigma_i - \lambda, 0)$. Each iteration costs $O(\min(m,n) \cdot mn)$ for the SVD, which is expensive for large matrices. Randomized SVD approximations reduce this in practice.
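
A sketch of the proximal step alone (singular value thresholding); the dense SVD here is the per-iteration bottleneck the text mentions:

```python
import numpy as np

def svt(X, lam):
    """Prox of lam * ||X||_*: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

rng = np.random.default_rng(0)
M = rng.standard_normal((50, 8)) @ rng.standard_normal((8, 40))  # rank-8 matrix
print(np.linalg.matrix_rank(svt(M, lam=5.0)))  # <= 8: thresholding can only lower rank
```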

Connection to Other Methods

Proximal gradient methods sit at a crossroads in optimization.

Gradient descent. When $g = 0$, ISTA reduces to gradient descent with step size $1/L$, and FISTA reduces to Nesterov's accelerated gradient method.

Subgradient methods. When both $f$ and $g$ are nonsmooth, the composite splitting is unavailable, and one resorts to subgradient methods. These converge at $O(1/\sqrt{k})$, strictly slower than ISTA's $O(1/k)$. The proximal framework exploits the partial smoothness of $f$ to gain a factor of $\sqrt{k}$.

Coordinate descent. For separable $g$ (like the $\ell_1$ norm), coordinate descent updates one coordinate at a time. Each coordinate update is cheap ($O(n)$ for lasso) and the proximal step on one coordinate is trivial. Coordinate descent can outperform ISTA when $d$ is large and $A$ is dense, but ISTA parallelizes better.

ADMM. The alternating direction method of multipliers handles more general structured problems, including those where $g$ involves a linear operator (e.g., total variation $\lambda\|Dx\|_1$ where $D$ is a difference matrix). ADMM introduces auxiliary variables to decouple the linear operator from the proximal step. It converges at $O(1/k)$ for convex problems.

Mirror descent. Mirror descent uses Bregman divergences instead of the Euclidean distance in the proximal step. This is useful when the geometry of the constraint set is non-Euclidean (e.g., the simplex with KL divergence). Frank-Wolfe methods, by contrast, avoid proximal and projection steps entirely by solving linear subproblems over the constraint set. Proximal gradient methods with Bregman divergences are called Bregman proximal gradient methods.

Second-order methods. Newton's method and quasi-Newton methods can handle smooth parts more aggressively by using curvature information, but incorporating a nonsmooth $g$ requires proximal-Newton or proximal-quasi-Newton variants that solve a subproblem at each iteration.

Common Confusions

Watch Out

Proximal operator is not projection

The proximal operator for a general $g$ is not a projection onto a set. It is a generalized projection that balances proximity to the input with minimizing $g$. Projection is the special case where $g$ is an indicator function. For the $\ell_1$ norm, the proximal operator is soft-thresholding, which shrinks coordinates rather than projecting them onto a set.

Watch Out

FISTA does not always beat ISTA in practice

FISTA has a better worst-case rate ($O(1/k^2)$ vs $O(1/k)$), but its oscillatory behavior can make it slower on some problems, particularly ill-conditioned or strongly convex ones where the momentum overshoots. Restarting FISTA (resetting momentum when the objective increases) often works better in practice. The theoretical guarantee is about worst-case complexity, not every instance.

Watch Out

Step size 1/L does not require computing L exactly

Many practitioners believe ISTA requires knowing $L$ exactly. In practice, backtracking line search estimates $L$ adaptively and sometimes finds a local Lipschitz constant much smaller than the global $L$. This can make each step larger and convergence faster than the worst-case bound predicts.

Watch Out

Proximal gradient is not the same as subgradient descent

Subgradient methods apply to general nonsmooth convex problems by replacing gradients with subgradients and using diminishing step sizes. Proximal gradient methods exploit the composite structure $f + g$ where $f$ is smooth. When this structure exists, the proximal approach is strictly better: $O(1/k)$ vs $O(1/\sqrt{k})$. Using subgradient descent on a composite problem wastes the smoothness of $f$.

Summary

  • Proximal operator: $\mathrm{prox}_g(x) = \arg\min_y (g(y) + \|y-x\|^2/2)$
  • ISTA: gradient step on smooth $f$, then proximal step on nonsmooth $g$
  • ISTA converges at $O(1/k)$; FISTA with momentum at $O(1/k^2)$
  • $O(1/k^2)$ is optimal for first-order methods on this problem class
  • For $\ell_1$ regularization, the proximal operator is soft-thresholding
  • The Moreau envelope smooths a nonsmooth function while preserving minimizers
  • Restarting FISTA under strong convexity gives linear convergence $O((1 - \sqrt{\mu/L})^k)$
  • Key special cases: lasso, projected gradient descent, elastic net, nuclear norm minimization

Exercises

Exercise (Core)

Problem

Compute $\mathrm{prox}_{\lambda \|\cdot\|_1}(x)$ for $x = (3, -1, 0.5)$ and $\lambda = 1$. Which coordinates become zero?

Exercise (Core)

Problem

Compute the proximal operator of $g(x) = \frac{\lambda}{2}\|x\|^2$ (the squared $\ell_2$ penalty used in ridge regression). Show that it never sets any coordinate to zero.

Exercise (Advanced)

Problem

Show that the proximal operator of the indicator function $I_C$ of a closed convex set $C$ is the Euclidean projection $\Pi_C$.

Exercise (Advanced)

Problem

Prove that the proximal operator is firmly nonexpansive: for any convex $g$ and any $x, y$, $\|\mathrm{prox}_g(x) - \mathrm{prox}_g(y)\|^2 \leq \langle x - y, \mathrm{prox}_g(x) - \mathrm{prox}_g(y)\rangle$.

Exercise (Research)

Problem

Prove that the Moreau envelope $M_g^\gamma$ is differentiable even when $g$ is not, and show $\nabla M_g^\gamma(x) = \frac{1}{\gamma}(x - \mathrm{prox}_{\gamma g}(x))$.

References

Canonical:

  • Beck & Teboulle, "A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems," SIAM J. Imaging Sci. (2009), Sections 2-4
  • Nesterov, Introductory Lectures on Convex Optimization (2004), Chapter 2 (smooth minimization) and Chapter 4 (nonsmooth problems)
  • Rockafellar, Convex Analysis (1970), Chapter 31 (Moreau-Yosida regularization and proximal mappings)

Textbook treatments:

  • Boyd & Vandenberghe, Convex Optimization (2004), Chapter 6 (proximal and projection operators in the context of decomposition)
  • Parikh & Boyd, "Proximal Algorithms," Foundations and Trends in Optimization 1(3), 2014, Sections 1-6 (comprehensive survey with closed-form proximal operators for 30+ functions)
  • Beck, First-Order Methods in Optimization (2017), Chapters 6 (proximal gradient) and 10 (accelerated methods)

Convergence analysis and lower bounds:

  • Nesterov, "A method for solving a convex programming problem with convergence rate $O(1/k^2)$," Dokl. Akad. Nauk SSSR 269 (1983)
  • Tseng, "On accelerated proximal gradient methods for convex-concave optimization," SIAM preprint (2008)

Next Topics

The natural next steps from proximal gradient methods:

  • Coordinate descent: an alternative for separable penalties that updates one variable at a time
  • Stochastic gradient descent: scaling proximal methods to large datasets via stochastic proximal gradient (prox-SGD)
  • ADMM: handles composite objectives where $g$ involves a linear operator
  • Mirror descent: replaces the Euclidean proximal step with Bregman divergences for non-Euclidean geometry

Last reviewed: April 13, 2026
