
Numerical Optimization

Proximal Gradient Methods

Optimize composite objectives by alternating gradient steps on smooth terms with proximal operators on nonsmooth terms. ISTA and its accelerated variant FISTA.

Core · Tier 1 · Stable · Supporting · ~55 min

Why This Matters

Many ML objectives have the form "smooth loss plus nonsmooth regularizer." Lasso regression is the canonical example: squared error (smooth) plus an $\ell_1$ penalty (nonsmooth). You cannot apply standard gradient descent to the full objective because the $\ell_1$ norm is not differentiable at zero. Subgradient methods apply, but converge slowly ($O(1/\sqrt{k})$ for convex problems).

Proximal gradient methods solve this cleanly. They split the smooth and nonsmooth parts: a gradient step on the smooth part, then a proximal step on the nonsmooth part. ISTA recovers the $O(1/k)$ rate of gradient descent on smooth functions; FISTA reaches the optimal $O(1/k^2)$ rate via Nesterov acceleration. This splitting strategy is the algorithmic backbone of sparse optimization, constrained optimization, and structured regularization across ML.

Mental Model

Think of two operations in sequence. First, slide downhill along the smooth loss surface (a gradient step on $f$). This produces an intermediate point that ignores the nonsmooth regularizer entirely. Second, apply the proximal operator of $g$ to that intermediate point: find the nearest point (in a squared-distance sense) that also keeps $g$ small. For the $\ell_1$ norm, this second step is soft-thresholding, which pushes small coordinates exactly to zero and shrinks large ones toward zero. The two-step rhythm of gradient, then proximal, repeats until convergence.

When $g$ is the indicator function of a convex constraint set, the proximal step reduces to Euclidean projection, and the algorithm becomes projected gradient descent.

Formal Setup and Notation

We want to solve the composite minimization problem:

$$\min_{x \in \mathbb{R}^d} \; F(x) = f(x) + g(x)$$

where $f \colon \mathbb{R}^d \to \mathbb{R}$ is convex with $L$-Lipschitz continuous gradient, meaning $\|\nabla f(x) - \nabla f(y)\| \leq L\|x - y\|$ for all $x, y$. The function $g \colon \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ is convex, lower semicontinuous, and proper (its domain is nonempty), but possibly nonsmooth.

The Lipschitz constant $L$ of $\nabla f$ controls the step size: we set $t \leq 1/L$ to ensure the gradient step does not overshoot. For quadratic losses like $f(x) = \frac{1}{2}\|Ax - b\|^2$, the constant is $L = \lambda_{\max}(A^T A)$, the largest eigenvalue of $A^T A$.

The composite structure $f + g$ appears throughout ML:

| Problem | $f(x)$ | $g(x)$ |
| --- | --- | --- |
| Lasso | $\frac{1}{2}\|Ax - b\|^2$ | $\lambda\|x\|_1$ |
| Ridge | $\frac{1}{2}\|Ax - b\|^2$ | $\frac{\lambda}{2}\|x\|^2$ |
| Elastic net | $\frac{1}{2}\|Ax - b\|^2$ | $\lambda_1\|x\|_1 + \frac{\lambda_2}{2}\|x\|^2$ |
| Constrained minimization | $f(x)$ | $I_C(x)$ (indicator of set $C$) |
| Group lasso | $\frac{1}{2}\|Ax - b\|^2$ | $\lambda \sum_j \|x_{G_j}\|_2$ |
| Nuclear norm regularization | loss on matrix | $\lambda \|X\|_*$ |
Definition

Proximal Operator

The proximal operator of a convex function $g$ at point $x$ is:

$$\mathrm{prox}_g(x) = \arg\min_{y} \left( g(y) + \frac{1}{2}\|y - x\|^2 \right)$$

It finds the point that balances being close to $x$ (the quadratic term) with having a small $g$ value. The proximal operator always exists and is unique when $g$ is convex, lower semicontinuous, and proper.

When $g$ carries a scaling parameter $\gamma > 0$, write $\mathrm{prox}_{\gamma g}(x) = \arg\min_y \left( g(y) + \frac{1}{2\gamma}\|y - x\|^2 \right)$. The parameter $\gamma$ controls the trade-off: large $\gamma$ penalizes $g$ more heavily; small $\gamma$ keeps the output closer to $x$.

Key properties of the proximal operator

The proximal operator has several properties that make it well-suited to iterative algorithms:

  1. Firm nonexpansiveness. For any $x, y$: $\|\mathrm{prox}_g(x) - \mathrm{prox}_g(y)\|^2 \leq \langle x - y, \mathrm{prox}_g(x) - \mathrm{prox}_g(y) \rangle$. This is stronger than nonexpansiveness ($\|\mathrm{prox}_g(x) - \mathrm{prox}_g(y)\| \leq \|x - y\|$) and guarantees that iterating the proximal operator does not amplify errors.

  2. Fixed-point characterization. $x^*$ minimizes $g$ if and only if $x^* = \mathrm{prox}_{\gamma g}(x^*)$ for any $\gamma > 0$. So minimizers of $g$ are exactly the fixed points of its proximal operator.

  3. Moreau decomposition. $\mathrm{prox}_g(x) + \mathrm{prox}_{g^*}(x) = x$, where $g^*$ is the convex conjugate (Fenchel conjugate) of $g$. This duality between the proximal operators of $g$ and $g^*$ underlies many algorithmic decompositions.

  4. Composition with affine maps. If $g(x) = h(Ax + b)$ and $A$ is orthogonal, then $\mathrm{prox}_g(x) = A^T(\mathrm{prox}_h(Ax + b) - b)$. For general $A$, no simple composition rule exists; this is why ADMM introduces auxiliary variables to handle affine compositions.

Definition

Moreau Envelope

The Moreau envelope of $g$ with parameter $\gamma > 0$ is:

$$M_g^\gamma(x) = \min_{y} \left( g(y) + \frac{1}{2\gamma}\|y - x\|^2 \right)$$

It is a smooth approximation of $g$. Even when $g$ is nonsmooth, the Moreau envelope $M_g^\gamma$ is differentiable with gradient $\nabla M_g^\gamma(x) = \frac{1}{\gamma}(x - \mathrm{prox}_{\gamma g}(x))$. As $\gamma \to 0$, the Moreau envelope converges pointwise to $g$.

The Moreau envelope preserves minimizers: $\arg\min_x g(x) = \arg\min_x M_g^\gamma(x)$ for any $\gamma > 0$. It also preserves the minimum value. This means we can study the smooth function $M_g^\gamma$ instead of the nonsmooth $g$ without changing the optimization problem's solution set.

The gradient $\nabla M_g^\gamma(x) = \frac{1}{\gamma}(x - \mathrm{prox}_{\gamma g}(x))$ is $\frac{1}{\gamma}$-Lipschitz continuous. This makes $M_g^\gamma$ a $C^1$ function with bounded curvature, even when $g$ itself is piecewise linear (like the $\ell_1$ norm). The Moreau envelope therefore provides a principled way to "smooth" nonsmooth objectives for analysis.
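
The gradient identity is easy to sanity-check numerically. Here is a minimal sketch for $g = \|\cdot\|_1$, whose prox is the soft-thresholding operator given in closed form in the next section; function names are illustrative:

```python
import numpy as np

def prox_l1(x, t):
    """Prox of t * ||.||_1: coordinate-wise soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def moreau_env(x, gam, lam=1.0):
    """Moreau envelope of g = lam * ||.||_1, evaluated through the prox:
    M(x) = g(p) + ||p - x||^2 / (2*gam) with p = prox_{gam*g}(x)."""
    p = prox_l1(x, gam * lam)
    return lam * np.abs(p).sum() + np.sum((p - x) ** 2) / (2 * gam)

x = np.array([1.5, -0.3, 0.0, 2.0])
gam = 0.5

# Gradient from the formula: (x - prox_{gam*g}(x)) / gam.
grad = (x - prox_l1(x, gam)) / gam

# Central finite differences on the envelope agree with the formula.
eps = 1e-6
fd = np.array([(moreau_env(x + eps * e, gam) - moreau_env(x - eps * e, gam))
               / (2 * eps) for e in np.eye(x.size)])
print(np.allclose(grad, fd, atol=1e-4))  # True
```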

Closed-Form Proximal Operators

The practical usefulness of proximal methods depends on whether $\mathrm{prox}_g$ has a closed form. For many regularizers used in ML, it does.

Soft-thresholding ($\ell_1$ norm). For $g(x) = \lambda \|x\|_1$:

$$[\mathrm{prox}_{\lambda \|\cdot\|_1}(x)]_i = \mathrm{sign}(x_i) \max(|x_i| - \lambda, 0)$$

This shrinks each coordinate toward zero and sets coordinates with $|x_i| \leq \lambda$ exactly to zero. The $\ell_1$ proximal operator is the mechanism behind sparsity in lasso and sparse recovery.
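
In numpy this is one line; the function name and sample values below are illustrative:

```python
import numpy as np

def soft_threshold(x, lam):
    """Prox of lam * ||.||_1: shrink each coordinate toward zero by lam."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

print(soft_threshold(np.array([2.0, -0.4, 0.9, -3.0]), 1.0))
# [ 1. -0.  0. -2.]  -- coordinates with |x_i| <= 1 are zeroed
```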

Indicator function. For $g(x) = I_C(x)$ (zero if $x \in C$, $+\infty$ otherwise), the proximal operator is the Euclidean projection onto $C$: $\mathrm{prox}_{I_C}(x) = \Pi_C(x)$.

$\ell_2$ norm (group soft-thresholding). For $g(x) = \lambda\|x\|_2$ (not squared):

$$\mathrm{prox}_{\lambda\|\cdot\|_2}(x) = \left(1 - \frac{\lambda}{\|x\|_2}\right)_+ x$$

where $(a)_+ = \max(a, 0)$. This shrinks the entire vector toward zero and sets $\mathrm{prox}(x) = 0$ when $\|x\|_2 \leq \lambda$. It is the building block of group lasso, where each group is shrunk as a unit.
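
A sketch of the group operator, applied to two hypothetical groups (one above and one below the threshold):

```python
import numpy as np

def block_soft_threshold(x, lam):
    """Prox of lam * ||.||_2 (not squared): shrink the whole vector as a unit."""
    nrm = np.linalg.norm(x)
    if nrm <= lam:
        return np.zeros_like(x)
    return (1.0 - lam / nrm) * x

print(block_soft_threshold(np.array([3.0, 4.0]), 1.0))  # [2.4 3.2]: shrunk, direction kept
print(block_soft_threshold(np.array([0.3, 0.4]), 1.0))  # [0. 0.]: zeroed as a unit
```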

Squared $\ell_2$ norm. For $g(x) = \frac{\lambda}{2}\|x\|^2$ (ridge penalty):

$$\mathrm{prox}_{\frac{\lambda}{2}\|\cdot\|^2}(x) = \frac{1}{1 + \lambda} x$$

This scales all coordinates uniformly. No coordinate is set to zero, which is why ridge shrinks but does not select variables.

Nuclear norm. For $g(X) = \lambda\|X\|_* = \lambda \sum_i \sigma_i(X)$, the proximal operator applies soft-thresholding to the singular values: compute the SVD $X = U\Sigma V^T$, threshold the diagonal entries of $\Sigma$, and reconstruct. This costs one SVD per iteration.

Elastic net. For $g(x) = \lambda_1\|x\|_1 + \frac{\lambda_2}{2}\|x\|^2$, the proximal operator composes scaling and soft-thresholding: first soft-threshold with parameter $\lambda_1$, then scale by $\frac{1}{1+\lambda_2}$.

The ISTA Algorithm

The Iterative Shrinkage-Thresholding Algorithm (ISTA) repeats:

$$x_{k+1} = \mathrm{prox}_{t_k g}\!\left(x_k - t_k \nabla f(x_k)\right)$$

where $t_k \leq 1/L$ is the step size and $L$ is the Lipschitz constant of $\nabla f$.

Step by step: (1) compute the gradient $\nabla f(x_k)$; (2) take a gradient step on the smooth part, producing an intermediate point $z = x_k - t_k \nabla f(x_k)$; (3) apply the proximal operator of $g$ to $z$, handling the nonsmooth part. The gradient step minimizes a local quadratic model of $f$ around $x_k$; the proximal step incorporates $g$ at minimal additional cost.
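
A minimal generic sketch of this loop, under the convention (assumed here, not fixed by the text) that `prox_g(v, t)` computes $\mathrm{prox}_{t g}(v)$:

```python
import numpy as np

def ista(grad_f, prox_g, x0, step, n_iters=500):
    """Proximal gradient (ISTA): gradient step on f, then prox step on g.
    `step` should satisfy step <= 1/L for L the Lipschitz constant of grad_f."""
    x = x0.copy()
    for _ in range(n_iters):
        z = x - step * grad_f(x)  # steps (1)-(2): gradient step on the smooth part
        x = prox_g(z, step)       # step (3): proximal step on the nonsmooth part
    return x
```

The lasso example later on this page plugs in $\nabla f(x) = A^T(Ax - b)$ for `grad_f` and soft-thresholding for `prox_g`.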

Why the step size is $1/L$

The step size $t = 1/L$ comes from the quadratic upper bound on smooth convex functions. Because $\nabla f$ is $L$-Lipschitz, we have the descent lemma:

$$f(y) \leq f(x) + \nabla f(x)^T(y - x) + \frac{L}{2}\|y - x\|^2$$

Minimizing the right side plus $g(y)$ over $y$ gives exactly the proximal gradient step with $t = 1/L$. The quadratic upper bound acts as a majorizer: ISTA minimizes this majorizer at each step, guaranteeing that $F(x_{k+1}) \leq F(x_k)$ (monotone decrease).

Backtracking line search

When $L$ is unknown or expensive to compute (e.g., computing $\lambda_{\max}(A^T A)$ for a large matrix), a backtracking line search finds a suitable step size. Start with some $\hat{L}$, then increase $\hat{L} \leftarrow \eta \hat{L}$ (with $\eta > 1$, typically $\eta = 2$) until the sufficient decrease condition holds:

$$f\!\left(\mathrm{prox}_{g/\hat{L}}\!\left(x - \frac{1}{\hat{L}}\nabla f(x)\right)\right) \leq f(x) + \nabla f(x)^T(p - x) + \frac{\hat{L}}{2}\|p - x\|^2$$

where $p$ is the proximal step output. This avoids computing $L$ exactly while preserving the $O(1/k)$ convergence guarantee.
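
A sketch of a single backtracking step under the same conventions as the ISTA sketch above (returning the accepted $\hat{L}$ lets the next iteration start from it; names are illustrative):

```python
import numpy as np

def backtracking_step(f, grad_f, prox_g, x, L_hat=1.0, eta=2.0):
    """Increase L_hat until the sufficient decrease condition holds,
    then return the accepted proximal step and the current estimate."""
    fx, gx = f(x), grad_f(x)
    while True:
        p = prox_g(x - gx / L_hat, 1.0 / L_hat)
        d = p - x
        # f(p) <= quadratic model of f at x with curvature L_hat?
        if f(p) <= fx + gx @ d + 0.5 * L_hat * (d @ d):
            return p, L_hat
        L_hat *= eta
```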

The FISTA Algorithm

Fast ISTA (FISTA), introduced by Beck and Teboulle (2009), adds Nesterov momentum to accelerate convergence from $O(1/k)$ to $O(1/k^2)$.

Initialize $x_0 = y_1$, $t_1 = 1$, and fix a step size $s \leq 1/L$. Then repeat:

$$x_k = \mathrm{prox}_{s \cdot g}\!\left(y_k - s \cdot \nabla f(y_k)\right)$$

$$t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2}$$

$$y_{k+1} = x_k + \frac{t_k - 1}{t_{k+1}}(x_k - x_{k-1})$$

The momentum coefficient $\frac{t_k - 1}{t_{k+1}}$ starts near zero and approaches $1$ as $k$ grows. At large $k$, $t_k \approx k/2$, so the coefficient is approximately $\frac{k-2}{k+1}$, similar to the Nesterov acceleration schedule for smooth gradient descent.

The extrapolation step $y_{k+1}$ overshoots past $x_k$ in the direction of recent movement. This look-ahead is what distinguishes acceleration from plain gradient descent; the algorithm uses trajectory information, not just the current gradient.
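
With the same conventions as the ISTA sketch, the three updates give a compact loop:

```python
import numpy as np

def fista(grad_f, prox_g, x0, step, n_iters=500):
    """FISTA with the Beck-Teboulle momentum schedule."""
    x_prev = x0.copy()
    y = x0.copy()
    t = 1.0
    for _ in range(n_iters):
        x = prox_g(y - step * grad_f(y), step)       # proximal gradient step at y
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x + ((t - 1.0) / t_next) * (x - x_prev)  # extrapolate past x
        x_prev, t = x, t_next
    return x_prev
```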

FISTA with monotone variant

Because FISTA's momentum can cause the objective to increase between iterations, a monotone variant (Beck and Teboulle, 2009; O'Donoghue and Candès, 2015) modifies the update: the extrapolation

$$y_{k+1} = x_k + \frac{t_k - 1}{t_{k+1}}(x_k - x_{k-1})$$

is replaced by choosing $y_{k+1}$ based on whichever of $x_k, x_{k-1}$ has the smaller objective value. This sacrifices some theoretical elegance but avoids the oscillations that slow FISTA on ill-conditioned problems.

Restarting for strongly convex problems

When $f$ is strongly convex with parameter $\mu > 0$, the optimal rate for first-order methods improves from $O(1/k^2)$ to linear convergence $O((1 - \sqrt{\mu/L})^k)$. FISTA does not automatically exploit strong convexity, but a restarting scheme does: restart the momentum sequence ($t_k \leftarrow 1$) whenever $F(x_k) > F(x_{k-1})$ or at fixed intervals of $O(\sqrt{L/\mu})$ iterations. This achieves the optimal linear rate for strongly convex composite problems.

The condition number $\kappa = L/\mu$ controls the restart interval. Ill-conditioned problems ($\kappa \gg 1$) restart less frequently; well-conditioned problems restart often.
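
A sketch of the function-value restart on top of the FISTA loop above; exact restart rules vary across implementations, and this version follows the "reset momentum when $F$ increases" test:

```python
import numpy as np

def fista_restart(F, grad_f, prox_g, x0, step, n_iters=500):
    """FISTA with adaptive restart: reset t <- 1 when the objective rises."""
    x_prev, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(n_iters):
        x = prox_g(y - step * grad_f(y), step)
        if F(x) > F(x_prev):
            y, t = x.copy(), 1.0  # momentum overshot: drop it, keep the point
        else:
            t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
            y = x + ((t - 1.0) / t_next) * (x - x_prev)
            t = t_next
        x_prev = x
    return x_prev
```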

Main Theorems

Theorem

ISTA Convergence Rate

Statement

Let $F^* = \min_x F(x)$. ISTA with constant step size $t = 1/L$ satisfies:

$$F(x_k) - F^* \leq \frac{L \|x_0 - x^*\|^2}{2k}$$

where $x^*$ is a minimizer of $F$.

Intuition

The convergence rate is $O(1/k)$: after $k$ iterations, suboptimality shrinks proportionally to $1/k$. This matches gradient descent on smooth convex functions. The proximal step handles nonsmoothness without degrading the rate. Compare to subgradient methods, which achieve only $O(1/\sqrt{k})$ on the same problem class.

Proof Sketch

The proof uses the descent lemma for smooth functions combined with the optimality condition for the proximal step. Since $\nabla f$ is $L$-Lipschitz:

$$f(x_{k+1}) \leq f(x_k) + \nabla f(x_k)^T(x_{k+1} - x_k) + \frac{L}{2}\|x_{k+1} - x_k\|^2$$

The proximal optimality condition gives $0 \in \partial g(x_{k+1}) + L\left(x_{k+1} - x_k + \frac{1}{L}\nabla f(x_k)\right)$. Rearranging and using convexity of $g$:

$$g(x_{k+1}) \leq g(x) + L\left\langle x_k - \frac{1}{L}\nabla f(x_k) - x_{k+1},\; x_{k+1} - x \right\rangle$$

for any $x$. Combining with the descent inequality and setting $x = x^*$:

$$F(x_{k+1}) - F^* \leq \frac{L}{2}\left(\|x_k - x^*\|^2 - \|x_{k+1} - x^*\|^2\right)$$

Telescoping from $k = 0$ to $K-1$ and using $\|x_K - x^*\|^2 \geq 0$ bounds $\sum_{k=1}^{K} (F(x_k) - F^*)$ by $\frac{L}{2}\|x_0 - x^*\|^2$; since ISTA decreases $F$ monotonically, the last term is the smallest of the sum, giving $F(x_K) - F^* \leq \frac{L\|x_0 - x^*\|^2}{2K}$.

Why It Matters

This rate shows that proximal gradient descent matches gradient descent even though part of the objective is nonsmooth. The proximal operator absorbs the nonsmoothness at no cost to the convergence rate; the price is only computational (evaluating $\mathrm{prox}_g$ once per iteration).

Failure Mode

The $O(1/k)$ rate is tight for ISTA: there exist convex instances where the bound is achieved. The rate also depends on $L$; if $L$ is large (ill-conditioned smooth part), convergence is slow per iteration. If $L$ is unknown and overestimated, the step size becomes needlessly small. If underestimated, the algorithm may diverge. Backtracking resolves this at the cost of extra gradient evaluations.

Theorem

FISTA Accelerated Convergence Rate

Statement

FISTA satisfies:

$$F(x_k) - F^* \leq \frac{2L \|x_0 - x^*\|^2}{(k+1)^2}$$

Intuition

Acceleration improves the rate from $O(1/k)$ to $O(1/k^2)$. After 100 iterations, ISTA reduces the gap by a factor of $100$; FISTA reduces it by a factor of $10{,}000$. The momentum term lets the algorithm use trajectory information, not just the current gradient, to take longer effective steps.

Proof Sketch

Define a Lyapunov function:

$$E_k = t_k^2(F(x_k) - F^*) + \frac{L}{2}\|v_k - x^*\|^2$$

where $v_k = x_{k-1} + t_k(x_k - x_{k-1})$ is an auxiliary sequence. The proof shows $E_{k+1} \leq E_k$ by substituting the FISTA update rules and using the descent property of the proximal step. The key algebraic identity is $t_{k+1}^2 - t_{k+1} \leq t_k^2$, which follows from the specific recurrence $t_{k+1} = (1 + \sqrt{1 + 4t_k^2})/2$. Since $E_k$ is nonincreasing and the $\|v_k - x^*\|^2$ term is nonnegative:

$$t_k^2(F(x_k) - F^*) \leq E_k \leq E_1 = F(x_0) - F^* + \frac{L}{2}\|x_0 - x^*\|^2$$

Since $t_k \geq (k+1)/2$, dividing both sides by $t_k^2$ yields the $O(1/k^2)$ bound.

Why It Matters

$O(1/k^2)$ is the optimal rate for first-order methods on this problem class (Nesterov, 1983). No method using only gradient evaluations and proximal operators can do better in the worst case. FISTA achieves this optimal rate for composite objectives with the same per-iteration cost as ISTA.

Failure Mode

FISTA's iterates can oscillate: $F(x_{k+1})$ may exceed $F(x_k)$. This non-monotonicity makes FISTA harder to use as a subroutine (e.g., stopping criteria based on objective decrease may trigger prematurely). Monotone variants and adaptive restart address this. For strongly convex $f$, plain FISTA wastes the strong convexity; use restarting or a modified momentum sequence to get linear convergence.

Optimality and Lower Bounds

The $O(1/k^2)$ rate of FISTA is optimal among first-order methods for convex optimization with Lipschitz gradients. Nesterov (1983, 2004) proved a matching lower bound: for any first-order method that queries $\nabla f$ at $k$ points, there exists a convex function with $L$-Lipschitz gradient such that:

$$F(x_k) - F^* \geq \frac{3L\|x_0 - x^*\|^2}{32(k+1)^2}$$

So FISTA's upper bound $\frac{2L\|x_0 - x^*\|^2}{(k+1)^2}$ matches the lower bound up to a constant factor. For ISTA, the $O(1/k)$ rate is also tight: there exist instances achieving it.

Under strong convexity ($f$ has parameter $\mu > 0$), the optimal rate improves to $O((1 - \sqrt{\mu/L})^k)$ (linear convergence). Plain ISTA achieves $O((1 - \mu/L)^k)$; FISTA with restart achieves the optimal $O((1 - \sqrt{\mu/L})^k)$. The gap between $1 - \mu/L$ and $1 - \sqrt{\mu/L}$ is significant when the condition number $\kappa = L/\mu$ is large: ISTA needs $O(\kappa)$ iterations for $\epsilon$-accuracy; FISTA with restart needs $O(\sqrt{\kappa})$.

Canonical Examples

Example

Lasso via ISTA

The Lasso problem is $\min_x \frac{1}{2}\|Ax - b\|^2 + \lambda\|x\|_1$. Here $f(x) = \frac{1}{2}\|Ax - b\|^2$ with $\nabla f(x) = A^T(Ax - b)$ and $L = \|A^T A\| = \sigma_{\max}^2(A)$, the squared largest singular value of $A$. The proximal step is coordinate-wise soft-thresholding. Each ISTA iteration: compute $z = x_k - \frac{1}{L} A^T(Ax_k - b)$, then $x_{k+1} = \mathrm{sign}(z) \odot \max(|z| - \lambda/L, 0)$.

For a $1000 \times 500$ matrix $A$, each iteration costs $O(nd)$ for the matrix-vector products $Ax_k$ and $A^T(\cdot)$, plus $O(d)$ for the soft-thresholding. The total per-iteration cost is dominated by the two matrix-vector multiplies.
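
A runnable sketch at exactly this scale, on synthetic data; choosing $\lambda$ as a fraction of $\|A^T b\|_\infty$ is a common heuristic added here, not something the text prescribes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 500
A = rng.standard_normal((n, d))
x_true = np.zeros(d)
x_true[:25] = rng.standard_normal(25)           # sparse ground truth
b = A @ x_true + 0.01 * rng.standard_normal(n)

L = np.linalg.norm(A, 2) ** 2                   # sigma_max(A)^2
lam = 0.1 * np.max(np.abs(A.T @ b))             # heuristic regularization level

x = np.zeros(d)
for _ in range(300):
    z = x - A.T @ (A @ x - b) / L               # two O(nd) matrix-vector products
    x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # O(d) soft-threshold

obj = 0.5 * np.sum((A @ x - b) ** 2) + lam * np.abs(x).sum()
print(f"objective {obj:.3f}, nonzeros {np.count_nonzero(x)} of {d}")
```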

Example

Projected gradient descent as a special case

When $g = I_C$ is the indicator of a convex set $C$, the proximal operator is projection onto $C$. So ISTA becomes projected gradient descent: $x_{k+1} = \Pi_C(x_k - t \nabla f(x_k))$. Constrained smooth optimization is a special case of proximal gradient methods. Common constraint sets with cheap projections include the simplex ($O(d \log d)$), the $\ell_2$ ball ($O(d)$), and box constraints ($O(d)$ coordinate-wise clipping).
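
A tiny sketch with box constraints, where the projection is coordinate-wise clipping; the quadratic objective is illustrative:

```python
import numpy as np

# Projected gradient descent for min 0.5 * ||x - c||^2 subject to x in [0, 1]^3.
c = np.array([1.7, -0.2, 0.4])
x = np.zeros(3)
for _ in range(100):
    grad = x - c                           # gradient of the smooth part (L = 1)
    x = np.clip(x - 0.5 * grad, 0.0, 1.0)  # gradient step, then project onto the box
print(x)  # -> [1.  0.  0.4]: the unconstrained minimizer c, clipped to the box
```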

Example

Elastic net

The elastic net combines $\ell_1$ and $\ell_2$ penalties: $g(x) = \lambda_1\|x\|_1 + \frac{\lambda_2}{2}\|x\|^2$. Its proximal operator is:

$$\mathrm{prox}_g(x) = \frac{1}{1 + \lambda_2}\,\mathrm{sign}(x) \odot \max(|x| - \lambda_1, 0)$$

First soft-threshold with parameter $\lambda_1$, then scale by $\frac{1}{1+\lambda_2}$. This produces solutions that are sparse (from the $\ell_1$ part) and have bounded $\ell_2$ norm (from the $\ell_2$ part). The elastic net proximal operator is separable across coordinates, so it costs $O(d)$.
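
In numpy form, a direct transcription of the formula above (sample values illustrative):

```python
import numpy as np

def prox_elastic_net(x, lam1, lam2):
    """Prox of lam1 * ||.||_1 + (lam2/2) * ||.||^2:
    soft-threshold by lam1, then scale by 1 / (1 + lam2)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam1, 0.0) / (1.0 + lam2)

print(prox_elastic_net(np.array([2.0, -0.5, 1.2]), lam1=1.0, lam2=0.5))
# approx [0.667 -0.    0.133]: sparse from the l1 part, shrunk by the l2 part
```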

Example

Low-rank matrix completion

For matrix completion with nuclear norm regularization: $\min_X \frac{1}{2}\sum_{(i,j) \in \Omega}(X_{ij} - M_{ij})^2 + \lambda\|X\|_*$, where $\Omega$ is the set of observed entries. Here $f$ is a smooth quadratic loss on observed entries, and $g(X) = \lambda\|X\|_*$. The proximal operator requires a singular value decomposition: compute $X = U\Sigma V^T$, then soft-threshold the singular values $\sigma_i \mapsto \max(\sigma_i - \lambda, 0)$. Each iteration costs $O(\min(m,n) \cdot mn)$ for the SVD, which is expensive for large matrices. Randomized SVD approximations reduce this in practice.
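
A sketch of the proximal step alone (singular value thresholding); the dense SVD here is the per-iteration bottleneck the text mentions:

```python
import numpy as np

def svt(X, lam):
    """Prox of lam * ||X||_*: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

rng = np.random.default_rng(0)
M = rng.standard_normal((50, 8)) @ rng.standard_normal((8, 40))  # rank-8 matrix
print(np.linalg.matrix_rank(svt(M, lam=5.0)))  # <= 8: thresholding can only lower rank
```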

Connection to Other Methods

Proximal gradient methods sit at a crossroads in optimization.

Gradient descent. When $g = 0$, ISTA reduces to gradient descent with step size $1/L$, and FISTA reduces to Nesterov's accelerated gradient method.

Subgradient methods. When both $f$ and $g$ are nonsmooth, the composite splitting is unavailable, and one resorts to subgradient methods. These converge at $O(1/\sqrt{k})$, strictly slower than ISTA's $O(1/k)$. The proximal framework exploits the partial smoothness of $f$ to gain a factor of $\sqrt{k}$.

Coordinate descent. For separable $g$ (like the $\ell_1$ norm), coordinate descent updates one coordinate at a time. Each coordinate update is cheap ($O(n)$ for lasso) and the proximal step on one coordinate is trivial. Coordinate descent can outperform ISTA when $d$ is large and $A$ is dense, but ISTA parallelizes better.

ADMM. The alternating direction method of multipliers handles more general structured problems, including those where $g$ involves a linear operator (e.g., total variation $\lambda\|Dx\|_1$ where $D$ is a difference matrix). ADMM introduces auxiliary variables to decouple the linear operator from the proximal step. It converges at $O(1/k)$ for convex problems.

Mirror descent. Mirror descent uses Bregman divergences instead of the Euclidean distance in the proximal step. This is useful when the geometry of the constraint set is non-Euclidean (e.g., the simplex with KL divergence). Frank-Wolfe methods, by contrast, avoid proximal and projection steps entirely by solving linear subproblems over the constraint set. Proximal gradient methods with Bregman divergences are called Bregman proximal gradient methods.

Second-order methods. Newton's method and quasi-Newton methods can handle smooth parts more aggressively by using curvature information, but incorporating a nonsmooth $g$ requires proximal-Newton or proximal-quasi-Newton variants that solve a subproblem at each iteration.

Common Confusions

Watch Out

Proximal operator is not projection

The proximal operator for a general $g$ is not a projection onto a set. It is a generalized projection that balances proximity to the input with minimizing $g$. Projection is the special case where $g$ is an indicator function. For the $\ell_1$ norm, the proximal operator is soft-thresholding, which shrinks coordinates rather than projecting them onto a set.

Watch Out

FISTA does not always beat ISTA in practice

FISTA has a better worst-case rate ($O(1/k^2)$ vs $O(1/k)$), but its oscillatory behavior can make it slower on some problems, particularly ill-conditioned or strongly convex ones where the momentum overshoots. Restarting FISTA (resetting momentum when the objective increases) often works better in practice. The theoretical guarantee is about worst-case complexity, not every instance.

Watch Out

Step size 1/L does not require computing L exactly

Many practitioners believe ISTA requires knowing $L$ exactly. In practice, backtracking line search estimates $L$ adaptively and sometimes finds a local Lipschitz constant much smaller than the global $L$. This can make each step larger and convergence faster than the worst-case bound predicts.

Watch Out

Proximal gradient is not the same as subgradient descent

Subgradient methods apply to general nonsmooth convex problems by replacing gradients with subgradients and using diminishing step sizes. Proximal gradient methods exploit the composite structure $f + g$ where $f$ is smooth. When this structure exists, the proximal approach is strictly better: $O(1/k)$ vs $O(1/\sqrt{k})$. Using subgradient descent on a composite problem wastes the smoothness of $f$.

Summary

  • Proximal operator: $\mathrm{prox}_g(x) = \arg\min_y (g(y) + \|y-x\|^2/2)$
  • ISTA: gradient step on smooth $f$, then proximal step on nonsmooth $g$
  • ISTA converges at $O(1/k)$; FISTA with momentum at $O(1/k^2)$
  • $O(1/k^2)$ is optimal for first-order methods on this problem class
  • For $\ell_1$ regularization, the proximal operator is soft-thresholding
  • The Moreau envelope smooths a nonsmooth function while preserving minimizers
  • Restarting FISTA under strong convexity gives linear convergence $O((1 - \sqrt{\mu/L})^k)$
  • Key special cases: lasso, projected gradient descent, elastic net, nuclear norm minimization

Exercises

Exercise (Core)

Problem

Compute $\mathrm{prox}_{\lambda \|\cdot\|_1}(x)$ for $x = (3, -1, 0.5)$ and $\lambda = 1$. Which coordinates become zero?

Exercise (Core)

Problem

Compute the proximal operator of $g(x) = \frac{\lambda}{2}\|x\|^2$ (the squared $\ell_2$ penalty used in ridge regression). Show that it never sets any coordinate to zero.

Exercise (Advanced)

Problem

Show that the proximal operator of the indicator function $I_C$ of a closed convex set $C$ is the Euclidean projection $\Pi_C$.

Exercise (Advanced)

Problem

Prove that the proximal operator is firmly nonexpansive: for any convex $g$ and any $x, y$, $\|\mathrm{prox}_g(x) - \mathrm{prox}_g(y)\|^2 \leq \langle x - y, \mathrm{prox}_g(x) - \mathrm{prox}_g(y)\rangle$.

Exercise (Research)

Problem

Prove that the Moreau envelope $M_g^\gamma$ is differentiable even when $g$ is not, and show $\nabla M_g^\gamma(x) = \frac{1}{\gamma}(x - \mathrm{prox}_{\gamma g}(x))$.

References

Canonical:

  • Beck & Teboulle, "A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems," SIAM J. Imaging Sci. (2009), Sections 2-4
  • Nesterov, Introductory Lectures on Convex Optimization (2004), Chapter 2 (smooth minimization) and Chapter 4 (nonsmooth problems)
  • Rockafellar, Convex Analysis (1970), Chapter 31 (Moreau-Yosida regularization and proximal mappings)

Textbook treatments:

  • Boyd & Vandenberghe, Convex Optimization (2004), Chapter 6 (proximal and projection operators in the context of decomposition)
  • Parikh & Boyd, "Proximal Algorithms," Foundations and Trends in Optimization 1(3), 2014, Sections 1-6 (comprehensive survey with closed-form proximal operators for 30+ functions)
  • Beck, First-Order Methods in Optimization (2017), Chapters 6 (proximal gradient) and 10 (accelerated methods)

Convergence analysis and lower bounds:

  • Nesterov, "A method for solving a convex programming problem with convergence rate $O(1/k^2)$," Dokl. Akad. Nauk SSSR 269 (1983)
  • Tseng, "On accelerated proximal gradient methods for convex-concave optimization," SIAM preprint (2008)

Next Topics

The natural next steps from proximal gradient methods:

  • Coordinate descent: an alternative for separable penalties that updates one variable at a time
  • Stochastic gradient descent: scaling proximal methods to large datasets via stochastic proximal gradient (prox-SGD)
  • ADMM: handles composite objectives where $g$ involves a linear operator
  • Mirror descent: replaces the Euclidean proximal step with Bregman divergences for non-Euclidean geometry

Last reviewed: April 13, 2026
