Numerical Optimization
Augmented Lagrangian and ADMM
The augmented Lagrangian fixes the conditioning problem of pure penalties by adding dual feedback. ADMM uses that idea to split structured convex problems into alternating subproblems with primal and dual residual diagnostics.
Prerequisites
Convex optimization basics, convex duality, projected gradient descent, and nonlinear Gauss-Seidel (see the canonical graph at the end of this page).
Why This Matters
Many optimization problems in ML are structured:
- Lasso has a smooth least-squares loss plus a non-smooth penalty.
- Distributed learning has many local objectives that must agree on one model.
- Matrix completion has a data-fit term and a low-rank or nuclear-norm term.
- Constrained estimation has a likelihood objective plus feasibility conditions.
A single black-box optimizer can ignore that structure. ADMM splits the problem so each part gets the solver it deserves, while a dual variable remembers how badly the split variables disagree.
The Problem With Pure Penalties
For the equality-constrained problem $\min_x f(x)$ subject to $Ax = b$, a pure quadratic penalty minimizes

$$f(x) + \frac{\rho}{2}\lVert Ax - b \rVert_2^2.$$

As $\rho$ grows, feasibility improves, but conditioning gets worse. The penalty method is asking one scalar to do two jobs: enforce the constraint and keep the subproblem numerically healthy. That tension is why pure penalties can be painfully slow.
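A minimal numpy sketch of that conditioning effect, using a hypothetical quadratic objective $f(x) = \tfrac{1}{2}\lVert x \rVert_2^2$ and a synthetic rank-deficient $A$ (both are illustrative assumptions, not from the text): the subproblem Hessian is $I + \rho A^\top A$, and its condition number grows roughly linearly in $\rho$.

```python
import numpy as np

# Condition number of the quadratic-penalty Hessian I + rho * A^T A
# for f(x) = 0.5*||x||^2 with a synthetic, rank-deficient A.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 10))  # fewer constraints than variables

for rho in [1e0, 1e2, 1e4, 1e6]:
    H = np.eye(10) + rho * A.T @ A
    print(f"rho = {rho:10.0e}   cond(H) = {np.linalg.cond(H):.3e}")
```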
Augmented Lagrangian
For the equality-constrained problem $\min_x f(x)$ subject to $Ax = b$, the augmented Lagrangian is

$$L_\rho(x, \lambda) = f(x) + \lambda^\top (Ax - b) + \frac{\rho}{2}\lVert Ax - b \rVert_2^2,$$

where $\lambda$ is the Lagrange multiplier and $\rho > 0$ is a penalty parameter.
Method of Multipliers
The method of multipliers alternates

$$x^{k+1} = \arg\min_x \; L_\rho(x, \lambda^k),$$

then

$$\lambda^{k+1} = \lambda^k + \rho\,(Ax^{k+1} - b).$$

The quadratic term discourages infeasibility. The multiplier update learns the right linear price for violating the constraint. This is the central improvement over a pure penalty method: a finite, moderate $\rho$ can work because $\lambda$ adapts.
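A minimal sketch of the method of multipliers on a hypothetical equality-constrained quadratic (the instance, sizes, and seed are illustrative assumptions); each $x$-subproblem is itself quadratic, so it is solved exactly with one linear solve:

```python
import numpy as np

# Method of multipliers for: min 0.5*||x - x0||^2 subject to A x = b.
rng = np.random.default_rng(1)
n, m = 10, 3
x0 = rng.standard_normal(n)
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

rho = 1.0
lam = np.zeros(m)
for k in range(50):
    # x-update: argmin_x 0.5||x - x0||^2 + lam^T (Ax - b) + (rho/2)||Ax - b||^2
    H = np.eye(n) + rho * A.T @ A
    x = np.linalg.solve(H, x0 - A.T @ lam + rho * A.T @ b)
    # dual update: learn the linear price for constraint violation
    lam = lam + rho * (A @ x - b)

print("constraint violation:", np.linalg.norm(A @ x - b))
```

Note that $\rho$ stays fixed at a moderate value throughout; only $\lambda$ adapts.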
Method Of Multipliers Convergence
Statement
For the equality-constrained convex problem, the method of multipliers drives the primal residual to zero and converges to the optimal value under standard saddle-point assumptions. When the optimal multiplier is not unique, the multiplier sequence may converge to one valid multiplier rather than a uniquely determined one.
Intuition
The dual update is a feedback controller. If the constraint is violated, the next subproblem receives a stronger linear signal against that violation. The method does not need $\rho$ to grow without bound.
Why It Matters
This is the conceptual ancestor of ADMM. If you understand why dual feedback fixes pure penalties, ADMM stops looking like a bag of mysterious alternating formulas.
Failure Mode
If the subproblem is solved poorly or the problem is non-convex without extra assumptions, the dual feedback can amplify bad iterates instead of stabilizing them.
ADMM As Splitting
ADMM targets the split problem

$$\min_{x,\,z} \; f(x) + g(z) \quad \text{subject to} \quad Ax + Bz = c.$$

The scaled ADMM updates are

$$x^{k+1} = \arg\min_x \left( f(x) + \tfrac{\rho}{2}\lVert Ax + Bz^k - c + u^k \rVert_2^2 \right),$$

$$z^{k+1} = \arg\min_z \left( g(z) + \tfrac{\rho}{2}\lVert Ax^{k+1} + Bz - c + u^k \rVert_2^2 \right),$$

$$u^{k+1} = u^k + Ax^{k+1} + Bz^{k+1} - c.$$

Here $u = \lambda/\rho$ is the scaled dual variable. ADMM is not just "alternate $x$ and $z$." The dual update is what makes the split remember disagreement.
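A minimal skeleton of this loop, assuming the caller supplies the two subproblem solvers (`argmin_x` and `argmin_z` are placeholder names; the Lasso example below shows concrete instances):

```python
def admm(argmin_x, argmin_z, A, B, c, z0, u0, iters=100):
    """Scaled two-block ADMM skeleton over numpy arrays.

    argmin_x(z, u): solves min_x f(x) + (rho/2)*||A@x + B@z - c + u||^2
    argmin_z(x, u): solves min_z g(z) + (rho/2)*||A@x + B@z - c + u||^2
    (rho lives inside the two supplied solvers.)
    """
    z, u = z0, u0
    for _ in range(iters):
        x = argmin_x(z, u)
        z = argmin_z(x, u)
        u = u + A @ x + B @ z - c  # dual update accumulates the primal residual
    return x, z, u
```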
Residuals: The Dashboard That Matters
The primal residual is

$$r^{k+1} = A x^{k+1} + B z^{k+1} - c.$$

It measures feasibility. The dual residual, in the common two-block scaled form, is

$$s^{k+1} = \rho\, A^\top B\,(z^{k+1} - z^k).$$

It measures how much the $z$-update is changing the dual stationarity conditions.
| If this happens | Meaning | Response |
|---|---|---|
| primal residual high, dual residual low | split variables still disagree | increase $\rho$ cautiously |
| dual residual high, primal residual low | penalty is forcing agreement too aggressively | decrease $\rho$ cautiously |
| both decrease slowly | subproblem or scaling issue | rescale variables or precondition |
| objective improves but residuals stall | not solving the constrained problem yet | do not trust objective alone |
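A sketch of a residual-based stopping test in the style of Boyd et al. (2011), Section 3.3; `eps_abs` and `eps_rel` are user-chosen tolerances, and $y = \rho u$ is the unscaled dual:

```python
import numpy as np

def should_stop(A, B, c, x, z, z_prev, u, rho, eps_abs=1e-4, eps_rel=1e-3):
    """Stop when both residual norms fall below their mixed tolerances."""
    r = A @ x + B @ z - c                  # primal residual
    s = rho * A.T @ (B @ (z - z_prev))     # dual residual
    y = rho * u                            # unscaled dual variable
    eps_pri = np.sqrt(c.size) * eps_abs + eps_rel * max(
        np.linalg.norm(A @ x), np.linalg.norm(B @ z), np.linalg.norm(c))
    eps_dual = np.sqrt(x.size) * eps_abs + eps_rel * np.linalg.norm(A.T @ y)
    return np.linalg.norm(r) <= eps_pri and np.linalg.norm(s) <= eps_dual
```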
Convergence
Two-Block Convex ADMM Convergence
Statement
For the two-block convex problem, ADMM produces primal residuals satisfying $r^k \to 0$, dual residuals satisfying $s^k \to 0$, and objective values converging to the optimal value. Standard ergodic convergence certificates are $O(1/k)$ under the classical assumptions; stronger structure can yield faster rates.
Intuition
The - and -updates improve the augmented Lagrangian in directions that respect each function's structure, while the dual update prevents the split variables from drifting apart.
Proof Sketch
The standard proof compares each iterate to a primal-dual saddle point and constructs a Lyapunov quantity involving the dual error and constraint residual. The optimality conditions for the - and -subproblems make successive differences telescope. Summing those inequalities forces residuals to vanish and yields sublinear averaged convergence bounds.
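One standard choice of Lyapunov quantity, following Boyd et al. (2011, Appendix A), is sketched below; $y^k$ is the unscaled dual and $(x^\star, z^\star, y^\star)$ a saddle point:

```latex
% Lyapunov quantity for two-block ADMM (Boyd et al., 2011, Appendix A).
V^k \;=\; \frac{1}{\rho}\,\lVert y^k - y^\star \rVert_2^2
      \;+\; \rho\,\lVert B\,(z^k - z^\star) \rVert_2^2,
\qquad
V^{k+1} \;\le\; V^k \;-\; \rho\,\lVert r^{k+1} \rVert_2^2
        \;-\; \rho\,\lVert B\,(z^{k+1} - z^k) \rVert_2^2 .
```

Summing the decrease inequality over $k$ bounds $\sum_k \lVert r^{k+1} \rVert_2^2$ and $\sum_k \lVert B(z^{k+1} - z^k) \rVert_2^2$, which forces both residuals to vanish.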
Why It Matters
The theorem explains why ADMM is reliable for convex splitting even when $g$ is non-smooth. It also explains what to log: primal residual, dual residual, objective, and penalty parameter.
Failure Mode
Naive multi-block ADMM with three or more primal blocks can diverge without extra assumptions or modifications. Non-convex ADMM has problem-specific stationary-point guarantees, not generic global-optimality guarantees.
Core Example: Lasso
The Lasso problem is

$$\min_x \; \tfrac{1}{2}\lVert Ax - b \rVert_2^2 + \lambda \lVert x \rVert_1.$$

Split it as $f(x) = \tfrac{1}{2}\lVert Ax - b \rVert_2^2$ and $g(z) = \lambda \lVert z \rVert_1$ with constraint $x - z = 0$. Then the $x$-update is ridge regression:

$$x^{k+1} = (A^\top A + \rho I)^{-1}\,(A^\top b + \rho\,(z^k - u^k)),$$

and the $z$-update is soft-thresholding:

$$z^{k+1} = S_{\lambda/\rho}(x^{k+1} + u^k), \qquad S_\kappa(v) = \operatorname{sign}(v)\,\max(\lvert v \rvert - \kappa,\, 0).$$
This is why ADMM is loved in structured statistics: one step is linear algebra, one step is a proximal operator, and the dual variable enforces consistency.
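A compact numpy sketch of this Lasso loop (the synthetic data at the bottom is an illustrative assumption, not from the text):

```python
import numpy as np

def lasso_admm(A, b, lam, rho=1.0, iters=200):
    """Lasso via ADMM: min 0.5*||Ax - b||^2 + lam*||x||_1, split x - z = 0."""
    n = A.shape[1]
    z = np.zeros(n)
    u = np.zeros(n)
    # Factor-once opportunity: the x-update matrix never changes.
    H = A.T @ A + rho * np.eye(n)
    Atb = A.T @ b
    for _ in range(iters):
        x = np.linalg.solve(H, Atb + rho * (z - u))               # ridge step
        v = x + u
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)   # soft-threshold
        u = u + x - z                                             # scaled dual update
    return z

# Hypothetical usage on synthetic data with a sparse ground truth:
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
b = A @ (rng.standard_normal(20) * (rng.random(20) < 0.3))
x_hat = lasso_admm(A, b, lam=0.1)
```

In practice one would factor `H` once (e.g., a Cholesky factorization) and reuse it every iteration, which is why the linear-algebra step is cheap.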
Where ADMM Shines
- Moderate-accuracy structured convex problems: Lasso, graphical lasso variants, robust PCA, trend filtering, signal reconstruction.
- Distributed consensus: local workers solve local subproblems and agree through a global variable (see the sketch after this list).
- Proximal splitting: smooth and non-smooth terms can use different specialized solvers.
- Interpretability of logs: residual plots expose feasibility and stationarity separately.
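A minimal consensus sketch, assuming hypothetical quadratic local losses $\tfrac{1}{2}\lVert A_i x - b_i \rVert_2^2$ and $g = 0$, so the global $z$-update reduces to an average:

```python
import numpy as np

def consensus_admm(As, bs, rho=1.0, iters=100):
    """Consensus ADMM: each worker i fits its own (A_i, b_i); z is the global model."""
    n = As[0].shape[1]
    m = len(As)
    xs = [np.zeros(n) for _ in range(m)]
    us = [np.zeros(n) for _ in range(m)]
    z = np.zeros(n)
    for _ in range(iters):
        for i in range(m):  # embarrassingly parallel in a real deployment
            H = As[i].T @ As[i] + rho * np.eye(n)
            xs[i] = np.linalg.solve(H, As[i].T @ bs[i] + rho * (z - us[i]))
        z = np.mean([xs[i] + us[i] for i in range(m)], axis=0)  # gather + average
        for i in range(m):
            us[i] = us[i] + xs[i] - z  # each worker's disagreement price
    return z
```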
Where ADMM Is Not Magic
ADMM is often not the best choice for high-precision small convex problems; an interior point method may win. It is also not a replacement for SGD or Adam in most neural-network training. Non-convex ADMM can be useful, but it should be treated as a specialized splitting heuristic unless you have assumptions proving stationarity.
Common Confusions
Rho is not a model regularization parameter
The penalty parameter $\rho$ changes the optimization path, not the target constrained solution in the exact convex setup. The Lasso regularization strength is $\lambda$; ADMM's $\rho$ is a solver parameter.
Small objective value is not enough
For constrained splitting problems, a low objective with a large primal residual may be infeasible. Always inspect primal and dual residuals.
ADMM is not just nonlinear Gauss-Seidel
ADMM uses Gauss-Seidel-style alternating primal updates, but the augmented Lagrangian and dual update are essential. Removing the dual update turns it into a penalty-style alternating method with different behavior.
Two-block theory does not automatically extend to many blocks
The classical convergence theorem is for two primal blocks. More blocks require grouping, proximal regularization, strong convexity assumptions, or other safeguards.
Q&A For Mastery
What is the first ADMM debugging plot? Plot $\lVert r^k \rVert$, $\lVert s^k \rVert$, the objective, and $\rho$ on the same run. If only the objective is logged, the main ADMM story is missing.
Why does scaled form use $u = \lambda/\rho$ instead of $\lambda$? Because $u$ makes the quadratic terms cleaner and reveals the residual accumulation pattern $u^{k+1} = u^k + r^{k+1}$.
When should I adapt $\rho$? When primal and dual residuals are badly imbalanced for many iterations. Adaptation should be conservative and should rescale $u$ consistently if the implementation changes $\rho$.
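A sketch of the conservative residual-balancing rule from Boyd et al. (2011), Section 3.4.1, including the $u$ rescaling that keeps the unscaled multiplier $y = \rho u$ consistent (the constants $\mu = 10$ and $\tau = 2$ follow that reference):

```python
def adapt_rho(rho, u, r_norm, s_norm, mu=10.0, tau=2.0):
    """Rebalance rho when one residual dominates; rescale u so y = rho*u is unchanged."""
    if r_norm > mu * s_norm:    # primal residual dominates: push harder on feasibility
        return rho * tau, u / tau
    if s_norm > mu * r_norm:    # dual residual dominates: relax the penalty
        return rho / tau, u * tau
    return rho, u
```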
What To Remember
- Pure penalties need large $\rho$ and become ill-conditioned.
- The augmented Lagrangian adds dual feedback, so finite $\rho$ can work.
- ADMM splits $f(x) + g(z)$ into alternating subproblems plus a dual update driven by the primal residual.
- Primal residual means feasibility; dual residual means stationarity movement.
- Classical ADMM convergence is a two-block convex result. Be cautious with many blocks and non-convex objectives.
Exercises
Problem
For the split problem $\min_{x,z} f(x) + g(z)$ subject to $Ax + Bz = c$, write the scaled ADMM updates.
Problem
In Lasso ADMM, why is the $x$-update a ridge regression and the $z$-update a soft-thresholding step?
Problem
Why can a naive three-block ADMM fail even if every subproblem is convex?
References
Canonical:
- Hestenes, "Multiplier and Gradient Methods" (1969).
- Powell, "A Method for Nonlinear Constraints in Minimization Problems" (1969).
- Gabay and Mercier, "A Dual Algorithm for the Solution of Nonlinear Variational Problems via Finite Element Approximation" (1976).
- Boyd, Parikh, Chu, Peleato, and Eckstein, "Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers" (2011).
Further:
- Bertsekas, Constrained Optimization and Lagrange Multiplier Methods (1982).
- Eckstein and Bertsekas, "On the Douglas-Rachford Splitting Method and the Proximal Point Algorithm for Maximal Monotone Operators" (1992).
- Parikh and Boyd, "Proximal Algorithms" (2014).
- Hong, Luo, and Razaviyayn, "Convergence Analysis of Alternating Direction Method of Multipliers for a Family of Nonconvex Problems" (2016).
Next Topics
- Trust region methods: constrained local models where the constraint controls step reliability.
- Interior point methods: high-accuracy constrained convex optimization from a different geometric angle.
Last reviewed: April 25, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Convex Optimization Basics (layer 1 · tier 1)
- Convex Duality (layer 2 · tier 1)
- Projected Gradient Descent (layer 2 · tier 2)
- Nonlinear Gauss-Seidel (layer 3 · tier 3)
Derived topics
- Trust Region Methods (layer 2 · tier 2)
- Interior Point Methods (layer 3 · tier 3)