Optimization Function Classes
Subgradients and Subdifferentials
The non-smooth generalization of the gradient for convex functions. Subgradients enable optimality conditions, calculus rules, and convergence guarantees for L1-regularized problems, hinge loss SVMs, and proximal algorithms where the objective is not differentiable.
Prerequisites
Why This Matters
Most modern ML objectives are not smooth. The L1 penalty in lasso regression is not differentiable at zero. The hinge loss in support vector machines kinks at the margin. The ReLU activation has no derivative at zero. The total-variation regularizer is non-differentiable wherever the signal is locally constant.
Standard gradient-based optimization assumes the gradient exists, which is exactly what fails on these problems. The right replacement is the subgradient, which generalizes the gradient to convex but non-smooth functions. Every convex function has a non-empty subdifferential at every interior point of its domain, even where the gradient does not exist. This is what makes proximal gradient methods, the lasso, hinge-loss SVMs, and ADMM work as theory rather than heuristics.
This page is the foundational reference. Convex Optimization Basics covers the smooth case; this page covers the non-smooth case that nearly every sparsity-inducing or regularized ML method actually uses.
Definition
Subgradient
Let $f: \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ be convex. A vector $g \in \mathbb{R}^n$ is a subgradient of $f$ at $x$ if
$$f(y) \ge f(x) + g^\top (y - x) \quad \text{for all } y.$$
The set of all subgradients of $f$ at $x$ is the subdifferential:
$$\partial f(x) = \{\, g \in \mathbb{R}^n : f(y) \ge f(x) + g^\top (y - x) \ \text{for all } y \,\}.$$
A subgradient is any vector $g$ such that the affine function $y \mapsto f(x) + g^\top (y - x)$ stays below $f$ everywhere and touches it at $x$. This is the convex-analytic generalization of the tangent line: at a point where a smooth convex function is differentiable, only one such affine support exists (the tangent), and $\partial f(x) = \{\nabla f(x)\}$. At a kink, an entire family of supporting affine functions exists, and $\partial f(x)$ is the set of their slopes.
Examples
Absolute value $f(x) = |x|$ on $\mathbb{R}$.
$$\partial f(x) = \begin{cases} \{-1\} & x < 0 \\ [-1, 1] & x = 0 \\ \{+1\} & x > 0 \end{cases}$$
Away from zero, $f$ is differentiable and the subdifferential is the singleton containing the derivative. At zero, every slope $g \in [-1, 1]$ gives a supporting line $y \mapsto g\,y$ that stays below $|y|$. The subdifferential at the kink is the closed interval of all such slopes.
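The supporting-line picture can be checked numerically by testing the subgradient inequality on a grid. A minimal sketch; the helper name `is_subgradient`, the grid, and the tolerance are illustrative choices, not from the source:

```python
import numpy as np

def is_subgradient(f, g, x, test_points):
    """Check the subgradient inequality f(y) >= f(x) + g*(y - x) on a grid."""
    return all(f(y) >= f(x) + g * (y - x) - 1e-12 for y in test_points)

f = abs
ys = np.linspace(-5, 5, 1001)

# Every slope in [-1, 1] supports |x| at the kink x = 0 ...
assert all(is_subgradient(f, g, 0.0, ys) for g in np.linspace(-1, 1, 21))
# ... but slopes outside [-1, 1] do not.
assert not is_subgradient(f, 1.5, 0.0, ys)
# Away from zero the subdifferential is the singleton {sign(x)}.
assert is_subgradient(f, 1.0, 2.0, ys) and not is_subgradient(f, 0.9, 2.0, ys)
```

A grid check like this is only a sanity test, not a proof: it certifies the inequality at sampled points, while the definition quantifies over all $y$.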
$\ell_1$ norm $f(x) = \|x\|_1$ on $\mathbb{R}^n$.
$$\partial \|x\|_1 = \{\, g : g_i = \operatorname{sign}(x_i) \text{ if } x_i \ne 0,\ g_i \in [-1, 1] \text{ if } x_i = 0 \,\}.$$
Coordinate-wise: each $g_i$ is the sign of $x_i$ at non-zero coordinates and any value in $[-1, 1]$ at zero coordinates. This is the subdifferential that drives soft-thresholding in proximal methods.
Hinge loss $f(z) = \max(0,\, 1 - z)$.
$$\partial f(z) = \begin{cases} \{-1\} & z < 1 \\ [-1, 0] & z = 1 \\ \{0\} & z > 1 \end{cases}$$
The slope is $-1$ in the loss region, $0$ in the no-loss region, and the entire interval $[-1, 0]$ at the kink. The kink at $z = 1$ is exactly the SVM margin.
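The case analysis translates directly into the subgradient used when training a linear SVM, where $z = y\,(w^\top x)$ is the margin. A minimal sketch; the function name and the midpoint choice at the kink are illustrative (any value in the kink interval is equally valid):

```python
import numpy as np

def hinge_subgradient(w, x, y):
    """One subgradient w.r.t. w of the hinge loss max(0, 1 - y * (w @ x))."""
    margin = y * (w @ x)
    if margin < 1:
        return -y * x             # loss region: outer slope -1 times d(margin)/dw
    if margin > 1:
        return np.zeros_like(w)   # no-loss region: slope 0
    return -0.5 * y * x           # at the kink: any point of [-1, 0] * y * x works

w = np.array([1.0, -1.0])
x = np.array([0.5, 0.5])
# margin = 1 * (0.5 - 0.5) = 0 < 1, so we are in the loss region
assert np.allclose(hinge_subgradient(w, x, 1.0), -x)
```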
Indicator function $\iota_C$ of a closed convex set $C$.
$$\partial \iota_C(x) = N_C(x) = \{\, g : g^\top (y - x) \le 0 \ \text{for all } y \in C \,\}.$$
This is the normal cone to $C$ at $x$: the set of outward-pointing directions that do not enter $C$. At interior points, $N_C(x) = \{0\}$; on the boundary, the normal cone is non-trivial. This is how convex constraints enter optimality conditions.
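The normal cone can be made concrete with a brute-force check of its definition on a simple set. A sketch using the hypothetical example $C = [0, 1]$; the grid and tolerance are illustrative:

```python
import numpy as np

# Normal cone of C = [0, 1] at a point x, checked against the definition
# N_C(x) = {g : g*(y - x) <= 0 for all y in C}, sampled on a grid.
def in_normal_cone(g, x, C_points):
    return all(g * (y - x) <= 1e-12 for y in C_points)

C = np.linspace(0.0, 1.0, 101)

assert in_normal_cone(0.0, 0.5, C)        # interior point: only g = 0
assert not in_normal_cone(0.1, 0.5, C)
assert in_normal_cone(3.0, 1.0, C)        # right boundary: every g >= 0
assert not in_normal_cone(-0.1, 1.0, C)
assert in_normal_cone(-2.0, 0.0, C)       # left boundary: every g <= 0
```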
Smooth convex $f$. $\partial f(x) = \{\nabla f(x)\}$ at every point of differentiability.
Existence
Existence of Subgradients
Statement
For any proper convex $f: \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ and any point $x$ in the relative interior of $\operatorname{dom} f$, the subdifferential $\partial f(x)$ is non-empty, closed, and convex; at points in the interior of $\operatorname{dom} f$ it is additionally bounded.
If $f$ is differentiable at $x$, then $\partial f(x) = \{\nabla f(x)\}$.
If $x$ is on the boundary of $\operatorname{dom} f$, or $f$ is not lower semicontinuous at $x$, the subdifferential may be empty.
Intuition
Geometrically: a convex function has supporting hyperplanes to its epigraph at every interior point of its domain (this is the supporting hyperplane theorem applied to the epigraph). Each non-vertical supporting hyperplane has a normal vector whose first $n$ coordinates, after normalizing the last coordinate to $-1$, form a subgradient. The set of such normals forms a closed convex set, which is exactly the subdifferential.
Proof Sketch
Apply the supporting hyperplane theorem to the epigraph $\operatorname{epi} f = \{(y, t) : t \ge f(y)\}$ at the boundary point $(x, f(x))$. The supporting hyperplane has normal $(g, -\beta)$ for some $\beta \ge 0$ (the sign of $\beta$ is forced because the epigraph is unbounded in the $+t$ direction); for $x$ in the relative interior of $\operatorname{dom} f$ the hyperplane is non-vertical, so $\beta > 0$ and we may normalize to $\beta = 1$. The supporting condition then becomes $f(y) \ge f(x) + g^\top (y - x)$ for all $y$, which is exactly the subgradient inequality. Closure and convexity of $\partial f(x)$ follow because it is an intersection of half-spaces, one per $y$. Boundedness at interior points follows from $f$ being locally Lipschitz on the interior of its domain.
Why It Matters
Existence on the interior is what licenses every algorithm that picks "any subgradient" at each step. Without existence you could not even define a subgradient method, let alone analyze it. The boundary caveat matters for indicator functions of constraints, where the subdifferential at constraint boundaries is the normal cone (which can be unbounded).
Failure Mode
The subdifferential is empty at points outside $\operatorname{dom} f$, and can be empty at boundary points of $\operatorname{dom} f$, where $f$ jumps from finite values to $+\infty$. Example: $f(x) = -\sqrt{x}$ for $x \ge 0$, $f(x) = +\infty$ for $x < 0$. At $x = 0$, every candidate subgradient $g$ would need $-\sqrt{y} \ge g\,y$ for all $y > 0$, which forces $g \le -1/\sqrt{y} \to -\infty$ as $y \to 0^+$. No finite $g$ works, so $\partial f(0) = \emptyset$.
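The infinite-slope failure at the boundary can be illustrated numerically for $f(x) = -\sqrt{x}$: every finite candidate slope violates the subgradient inequality close enough to zero. The helper name and the logarithmic grid are illustrative choices:

```python
import numpy as np

# f(x) = -sqrt(x) on [0, inf). A subgradient g at x = 0 would need
# -sqrt(y) >= g * y for all y > 0, i.e. g <= -1/sqrt(y), which blows up as y -> 0+.
def violates_at_zero(g, ys):
    return any(-np.sqrt(y) < g * y - 1e-12 for y in ys)

ys = np.logspace(-12, 0, 100)          # test points approaching 0 from the right
for g in [-1.0, -10.0, -1e3, -1e5]:
    assert violates_at_zero(g, ys)     # every finite slope fails near the boundary
```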
Optimality Condition
The single most important consequence of the subdifferential framework: it gives a clean optimality condition for non-smooth convex minimization.
Subdifferential Optimality Condition
Statement
$x^\star$ is a global minimizer of $f$ if and only if
$$0 \in \partial f(x^\star).$$
This is the non-smooth analogue of $\nabla f(x^\star) = 0$ for smooth convex functions.
Intuition
The subgradient inequality with $g = 0$ becomes $f(y) \ge f(x^\star)$ for all $y$, which is exactly the definition of $x^\star$ being a global minimizer. Conversely, if $x^\star$ is a global minimizer, the constant subgradient $g = 0$ satisfies the defining inequality.
Proof Sketch
If $0 \in \partial f(x^\star)$: by the subgradient inequality at $x^\star$, $f(y) \ge f(x^\star) + 0^\top (y - x^\star) = f(x^\star)$ for all $y$. So $x^\star$ is a global minimizer.
If $x^\star$ is a global minimizer: take $g = 0$. Then $f(y) \ge f(x^\star) + 0^\top (y - x^\star)$ holds for all $y$, so $0 \in \partial f(x^\star)$.
Why It Matters
This is the optimality condition used to derive the soft-thresholding operator (the proximal operator of the $\ell_1$ norm), to prove KKT conditions for constrained problems, and to characterize lasso solutions. For the lasso
$$\min_x \tfrac{1}{2}\|Ax - b\|_2^2 + \lambda \|x\|_1,$$
the optimality condition becomes $0 \in A^\top (Ax - b) + \lambda\,\partial\|x\|_1$, which gives the explicit characterization $|a_j^\top (b - Ax)| \le \lambda$ for zero coordinates of $x$ and $a_j^\top (b - Ax) = \lambda \operatorname{sign}(x_j)$ for non-zero coordinates, where $a_j$ is the $j$-th column of $A$.
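This characterization can be verified on a case where the lasso solution is known in closed form: with $A = I$, the solution is the soft-thresholding of $b$ (a standard fact about the $\ell_1$ proximal operator). The specific numbers below are illustrative:

```python
import numpy as np

def soft_threshold(z, lam):
    """Prox of lam*|.| applied coordinate-wise: shrink toward zero by lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

A = np.eye(3)
b = np.array([2.0, 0.3, -1.5])
lam = 0.5
x = soft_threshold(b, lam)          # lasso solution when A is the identity

r = A.T @ (b - A @ x)               # correlation of the residual with each column
for j in range(3):
    if x[j] == 0:
        assert abs(r[j]) <= lam + 1e-12                  # inactive: |a_j^T(b-Ax)| <= lam
    else:
        assert np.isclose(r[j], lam * np.sign(x[j]))     # active: equality with the sign
```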
Failure Mode
This optimality condition characterizes global minimizers only for convex $f$; for non-convex $f$, the condition $0 \in \partial f(x)$ (with $\partial f$ generalized to Clarke subgradients) is necessary but not sufficient for local minimality, and saddle points satisfy it. The full strength of "iff global minimum" requires convexity.
Subgradient Calculus
The subdifferential satisfies calculus rules analogous to gradient rules, with a few caveats around equality versus containment.
Sum rule. For convex $f, g$: $\partial(f + g)(x) \supseteq \partial f(x) + \partial g(x)$, with equality if $\operatorname{relint}(\operatorname{dom} f) \cap \operatorname{relint}(\operatorname{dom} g) \ne \emptyset$.
Scalar multiplication. For $\alpha > 0$: $\partial(\alpha f)(x) = \alpha\,\partial f(x)$.
Affine pre-composition. For convex $f$ and $h(x) = f(Ax + b)$: $\partial h(x) = A^\top \partial f(Ax + b)$.
Pointwise maximum. For convex $f_1, \dots, f_m$ and $f(x) = \max_i f_i(x)$:
$$\partial f(x) = \operatorname{conv}\Big( \bigcup_{i \in I(x)} \partial f_i(x) \Big), \qquad I(x) = \{\, i : f_i(x) = f(x) \,\},$$
where $I(x)$ is the set of "active" functions at $x$.
The maximum rule is what gives the subdifferential of the hinge loss and of norms expressed as maxima over the dual-norm ball, and it is the source of the convex hull appearing in many non-smooth optimality conditions.
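In one dimension the maximum rule reduces to the interval between the smallest and largest active slopes. A minimal sketch with hypothetical affine pieces chosen so that $f(x) = \max(1 - x,\ 0,\ 1 + x)$:

```python
import numpy as np

# Pointwise max of affine pieces f(x) = max_i (a_i * x + b_i) in 1D:
# the subdifferential at x is the interval spanned by the active slopes.
a = np.array([-1.0, 0.0, 1.0])
b = np.array([1.0, 0.0, 1.0])

def subdifferential(x, tol=1e-12):
    vals = a * x + b
    active = np.abs(vals - vals.max()) < tol   # indices achieving the max
    return a[active].min(), a[active].max()

# At x = 0 the pieces 1 - x and 1 + x are both active, so the
# subdifferential is conv({-1, +1}) = [-1, 1].
assert subdifferential(0.0) == (-1.0, 1.0)
# At x = 2 only 1 + x is active, so the subdifferential is the singleton {1}.
assert subdifferential(2.0) == (1.0, 1.0)
```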
The Subgradient Method
The simplest non-smooth optimization algorithm replaces the gradient in gradient descent with any subgradient.
Algorithm. Choose step sizes $\alpha_k > 0$. At iteration $k$:
- Pick any $g_k \in \partial f(x_k)$.
- Update $x_{k+1} = x_k - \alpha_k g_k$.
Unlike gradient descent on smooth functions, the subgradient method is not a descent method: $f(x_{k+1})$ can exceed $f(x_k)$ even with optimal step sizes. The standard guarantee is on the best iterate $\min_{i \le k} f(x_i)$.
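A minimal runnable sketch of the method on $f(x) = \|x\|_1$; the starting point, step-size schedule, and tolerances are illustrative choices, not from the source:

```python
import numpy as np

# Subgradient method on f(x) = ||x||_1 (minimizer x* = 0, f* = 0), illustrating
# that individual steps may increase f while the best iterate keeps improving.
def l1_subgradient(x):
    return np.sign(x)     # valid: sign(x_i) at nonzeros, 0 in [-1, 1] at zeros

f = lambda x: np.abs(x).sum()
x = np.array([3.0, -2.0])
best = np.inf
increased = False
for k in range(1, 201):
    fx = f(x)
    increased |= fx > best            # did the objective ever rise above the best?
    best = min(best, fx)
    x = x - (1.0 / np.sqrt(k)) * l1_subgradient(x)

assert best < 0.2      # the best iterate approaches f* = 0
assert increased       # but the method is not a descent method
```

Once the iterates reach the kink, they oscillate around the minimizer with amplitude on the order of the current step size, which is why diminishing steps are needed for convergence.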
Subgradient Method Convergence Rate
Statement
For convex $f$ with $\|g\|_2 \le G$ for all $g \in \partial f(x)$ and all $x$, and step sizes $\alpha_i > 0$:
$$\min_{i \le k} f(x_i) - f^\star \;\le\; \frac{\|x_1 - x^\star\|_2^2 + G^2 \sum_{i=1}^k \alpha_i^2}{2 \sum_{i=1}^k \alpha_i}.$$
With the optimal constant-step-size choice $\alpha_i = \frac{R}{G\sqrt{k}}$, where $R = \|x_1 - x^\star\|_2$, the bound becomes
$$f(\bar{x}_k) - f^\star \;\le\; \frac{RG}{\sqrt{k}},$$
where $\bar{x}_k = \frac{1}{k}\sum_{i=1}^k x_i$ is the average iterate; the same bound holds for the best iterate.
Intuition
Each subgradient step reduces the squared distance to $x^\star$ by an amount proportional to $\alpha_k\,(f(x_k) - f^\star)$ but increases it by $\alpha_k^2 G^2$ from the noise-like subgradient variation. Balancing these at $\alpha_k \sim 1/\sqrt{k}$ gives the $O(1/\sqrt{k})$ rate.
Proof Sketch
Expand the squared distance:
$$\|x_{k+1} - x^\star\|^2 = \|x_k - x^\star\|^2 - 2\alpha_k\, g_k^\top (x_k - x^\star) + \alpha_k^2 \|g_k\|^2.$$
By convexity, $g_k^\top (x_k - x^\star) \ge f(x_k) - f^\star$. Summing over iterations and rearranging:
$$2 \sum_{i=1}^k \alpha_i \big(f(x_i) - f^\star\big) \;\le\; \|x_1 - x^\star\|^2 + G^2 \sum_{i=1}^k \alpha_i^2.$$
Bounding the left side below by $2\big(\min_{i \le k} f(x_i) - f^\star\big) \sum_i \alpha_i$ and dividing gives the stated bound. For $\alpha_i = R/(G\sqrt{k})$, the right side equals $2R^2$ and $\sum_i \alpha_i = R\sqrt{k}/G$, giving the $RG/\sqrt{k}$ bound on the best iterate.
Why It Matters
The $O(1/\sqrt{k})$ rate is optimal among first-order methods for non-smooth convex optimization (Nemirovski-Yudin 1983). This gap from the smooth $O(1/k)$ rate is what motivates proximal methods: when the non-smooth term has a closed-form proximal operator (like the $\ell_1$ norm), proximal-gradient methods recover the smooth $O(1/k)$ rate by handling the non-smoothness exactly.
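A minimal ISTA-style sketch on a small synthetic lasso, showing the proximal idea in action: the smooth part is handled by a gradient step and the $\ell_1$ term exactly by its prox (soft-thresholding). The random data, step size $1/L$, and iteration count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
lam = 0.1
L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of the smooth gradient

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

x = np.zeros(5)
for _ in range(500):
    grad = A.T @ (A @ x - b)                      # gradient of the smooth part
    x = soft_threshold(x - grad / L, lam / L)     # prox step handles the l1 term

# Optimality check: 0 in A^T(Ax - b) + lam * d||x||_1
r = A.T @ (b - A @ x)
assert all(abs(r[j]) <= lam + 1e-6 if x[j] == 0
           else np.isclose(r[j], lam * np.sign(x[j]), atol=1e-6)
           for j in range(5))
```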
Failure Mode
The constant in the $O(1/\sqrt{k})$ bound degrades with $R$ and $G$, and the rate itself cannot be improved without structural assumptions. For strongly convex $f$, step sizes $\alpha_k = O(1/k)$ give an $O(1/k)$ rate, but for general non-smooth convex problems $O(1/\sqrt{k})$ is tight. The method also requires knowing $G$ and $R$ for optimal step-size selection; poor choices can slow convergence or prevent the iterates from reaching the optimum.
Common Confusions
Subgradient method is not a descent method
For smooth convex functions, gradient descent with a suitably small step size reduces the objective at every step. For non-smooth convex functions with the subgradient method, the objective can increase between iterations. The convergence guarantee is on the best iterate seen so far, $\min_{i \le k} f(x_i)$, not on $f(x_k)$. This is why averaging schemes (Polyak averaging, Nemirovski-Yudin) are common in subgradient analyses.
Subgradients are only defined for convex functions
For non-convex functions, the convex-analytic subgradient does not exist in general. The Clarke subdifferential and the limiting subdifferential extend the concept to locally Lipschitz non-convex functions and provide necessary conditions for minimality, but the clean "iff global min" of the convex case is lost. ReLU networks are not convex, so the "subgradients" used in their training are Clarke subgradients (any element of the convex hull of limit-of-gradient sequences), not the subgradients defined here.
The subdifferential of a sum can be larger than the sum of subdifferentials
$\partial(f + g)(x) \supseteq \partial f(x) + \partial g(x)$, with equality only guaranteed under a constraint qualification ($\operatorname{relint}(\operatorname{dom} f) \cap \operatorname{relint}(\operatorname{dom} g) \ne \emptyset$). Without that qualification, strict containment is possible.
The standard textbook examples (Rockafellar 1970, §23) use functions whose effective domains touch at a single boundary point. A natural first attempt on $\mathbb{R}$: take $f = \iota_{(-\infty, 0]}$ and $g = \iota_{[0, \infty)}$. Their domains intersect only at $\{0\}$, with empty relative-interior intersection, so the qualification fails. Both indicators have normal cones at $0$: $\partial f(0) = N_{(-\infty, 0]}(0) = [0, \infty)$ and $\partial g(0) = N_{[0, \infty)}(0) = (-\infty, 0]$, giving $\partial f(0) + \partial g(0) = \mathbb{R}$. But $f + g = \iota_{\{0\}}$, so $\partial(f + g)(0) = \mathbb{R}$ as well: equality, not strict containment, in this particular case. Genuine strict containment needs a function with an empty subdifferential at the touching point: take $f(x) = -\sqrt{x}$ for $x \ge 0$ ($+\infty$ otherwise) and $g = \iota_{(-\infty, 0]}$. Then $\partial f(0) = \emptyset$, so $\partial f(0) + \partial g(0) = \emptyset$, while $f + g = \iota_{\{0\}}$ gives $\partial(f + g)(0) = \mathbb{R}$; see Rockafellar §23, Theorem 23.8. The qualification "relint domains intersect" is what guarantees you do not need to worry about these pathologies.
Summary
- A subgradient of a convex $f$ at $x$ is any $g$ such that $f(y) \ge f(x) + g^\top (y - x)$ for all $y$. The set of subgradients is the subdifferential $\partial f(x)$.
- Subdifferentials are non-empty, closed, and convex at every point in the relative interior of $\operatorname{dom} f$, and additionally bounded at interior points.
- $x^\star$ is a global minimizer of convex $f$ iff $0 \in \partial f(x^\star)$. This is the non-smooth analogue of $\nabla f(x^\star) = 0$.
- Subgradient calculus rules (sum, chain, pointwise max) follow gradient rules with constraint-qualification caveats.
- The subgradient method runs at $O(1/\sqrt{k})$, slower than gradient descent on smooth functions; proximal methods recover $O(1/k)$ when the non-smooth term has a tractable proximal operator.
Exercises
Problem
Compute the subdifferential of $f(x) = \max(0, x)$ (the ReLU function viewed as a univariate convex function) at every point. Then verify the optimality condition $0 \in \partial f(x^\star)$ at a global minimizer.
Problem
Derive the soft-thresholding operator from the subdifferential optimality condition. Specifically, show that the unique solution to $\min_x \frac{1}{2}(x - z)^2 + \lambda |x|$ for $\lambda > 0$ is given by
$$x^\star = S_\lambda(z) = \operatorname{sign}(z)\,\max(|z| - \lambda,\, 0).$$
This is the proximal operator of $\lambda|\cdot|$ at the point $z$.
References
Canonical:
- Rockafellar, "Convex Analysis" (Princeton, 1970), Sections 23-25
- Hiriart-Urruty and Lemarechal, "Fundamentals of Convex Analysis" (Springer, 2001), Chapter D
- Boyd and Vandenberghe, "Convex Optimization" (Cambridge, 2004), Section 3.1.5 (gradient and subgradient overlap), Appendix C
Convergence theory:
- Nesterov, "Introductory Lectures on Convex Optimization" (Springer, 2004), Section 3.2 (subgradient method)
- Bubeck, "Convex Optimization: Algorithms and Complexity" (Foundations and Trends in ML, 2015), Sections 3.1-3.2
ML applications:
- Beck, "First-Order Methods in Optimization" (SIAM, 2017), Chapters 3, 6, 10 (proximal operators, subgradient computations for ML losses)
- Parikh and Boyd, "Proximal Algorithms" (Foundations and Trends in Optimization, 2014)
Next Topics
- Proximal gradient methods: how to use subgradient structure efficiently when the non-smooth term is "simple"
- Lasso regression: the canonical application via soft-thresholding
- Convex duality: KKT conditions are subgradient optimality conditions for the Lagrangian
Last reviewed: April 26, 2026