Numerical Optimization
Conjugate Gradient Methods
Conjugate gradient solves large symmetric positive definite linear systems using only matrix-vector products. Its core ingredients are Krylov subspaces, A-conjugate search directions, condition-number-dependent convergence rates, and preconditioning.
Prerequisites
- Matrix Norms
- Matrix Operations and Properties
- Numerical Linear Algebra
- Line Search Methods
Why This Matters
Many ML and numerical optimization problems require solving
$$Ax = b,$$
where $A$ is too large to factor. Direct Cholesky costs cubic time and requires explicit matrix storage. Conjugate gradient (CG) uses only matrix-vector products $v \mapsto Av$, so it works when $A$ is sparse, structured, implicit, or available through Hessian-vector products.
This is why CG appears in least squares, Gaussian processes, kernel methods, natural gradients, Hessian-free optimization, and TRPO-style policy optimization.
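A minimal matrix-free example (a sketch assuming NumPy and SciPy; the diagonal-plus-ridge operator is a made-up stand-in for a structured $A$):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

n = 10_000
d = np.linspace(1.0, 100.0, n)   # hypothetical structured operator: A = diag(d) + mu * I
mu = 0.1

def matvec(v):
    # One matrix-vector product; this is all CG ever asks of A.
    return d * v + mu * v

A = LinearOperator((n, n), matvec=matvec, dtype=np.float64)
b = np.random.default_rng(0).standard_normal(n)

x, info = cg(A, b)   # info == 0 means the default tolerance was reached
print(info, np.linalg.norm(matvec(x) - b))
```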
Quadratic View
For symmetric positive definite $A$, solving $Ax = b$ is equivalent to minimizing
$$f(x) = \tfrac{1}{2}\, x^\top A x - b^\top x.$$
The gradient is
$$\nabla f(x) = Ax - b.$$
The residual $r = b - Ax$ is therefore the negative gradient. CG is an optimization method for this special quadratic, but it is also a linear solver.
A-Conjugacy
A-Conjugate Directions
Nonzero vectors $p_i, p_j$ are $A$-conjugate when
$$p_i^\top A p_j = 0 \quad \text{for } i \ne j.$$
This is ordinary orthogonality under the inner product $\langle u, v \rangle_A = u^\top A v$.
Steepest descent uses directions that are orthogonal in the Euclidean sense only at consecutive steps. CG uses directions that are mutually orthogonal in the geometry induced by . That is the key reason it avoids the classic zigzag on elongated ellipses.
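A quick numerical check of the two notions of orthogonality, using a small random SPD matrix (an illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((5, 5))
A = B @ B.T + 5 * np.eye(5)              # SPD by construction

# One Gram-Schmidt step in the A-inner product: make p1 A-conjugate to p0.
p0 = rng.standard_normal(5)
v = rng.standard_normal(5)
p1 = v - (p0 @ A @ v) / (p0 @ A @ p0) * p0

print(p0 @ A @ p1)   # ~0: A-conjugate
print(p0 @ p1)       # generally nonzero: not Euclidean-orthogonal
```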
Krylov Subspaces
Krylov Subspace
Starting from the initial residual $r_0 = b - Ax_0$, the order-$k$ Krylov subspace is
$$\mathcal{K}_k(A, r_0) = \operatorname{span}\{r_0, Ar_0, A^2 r_0, \ldots, A^{k-1} r_0\}.$$
CG chooses $x_k$ from $x_0 + \mathcal{K}_k(A, r_0)$. Each iteration expands the search space by one matrix-vector product.
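In code, the subspace costs nothing beyond repeated applications of the operator (a sketch; `matvec` is assumed to apply $A$):

```python
import numpy as np

def krylov_basis(matvec, r0, k):
    """Return a matrix whose columns span K_k(A, r0)."""
    V = [r0]
    for _ in range(k - 1):
        V.append(matvec(V[-1]))   # one extra matrix-vector product per new dimension
    return np.column_stack(V)
```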
CG Variational Characterization
Statement
The $k$th CG iterate $x_k$ is the unique minimizer of $f$ over the affine Krylov space:
$$x_k = \arg\min_{x \in x_0 + \mathcal{K}_k(A, r_0)} f(x).$$
Equivalently, $x_k$ minimizes the $A$-norm error $\|x - x_\star\|_A$ over that same space, where $x_\star = A^{-1} b$.
Intuition
Each iteration adds one new direction. CG picks the best point available in all directions discovered so far, not merely the best point along the newest direction.
Proof Sketch
First-order optimality over the affine subspace requires the residual $r_k$ to be orthogonal to the Krylov subspace $\mathcal{K}_k(A, r_0)$. The CG recurrences maintain this orthogonality while generating an $A$-conjugate basis for the same Krylov space.
The Algorithm
For SPD $A$ and initial guess $x_0$:
- Set $r_0 = b - Ax_0$ and $p_0 = r_0$.
- Compute $\alpha_k = \dfrac{r_k^\top r_k}{p_k^\top A p_k}$.
- Set $x_{k+1} = x_k + \alpha_k p_k$.
- Set $r_{k+1} = r_k - \alpha_k A p_k$.
- Compute $\beta_k = \dfrac{r_{k+1}^\top r_{k+1}}{r_k^\top r_k}$.
- Set $p_{k+1} = r_{k+1} + \beta_k p_k$, then repeat from the $\alpha_k$ step until $\|r_{k+1}\|$ is small.
Each iteration uses one matrix-vector product, a few dot products, and vector saxpy operations. That is the computational win.
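A direct NumPy transcription of these recurrences (a sketch rather than a production solver; the helper name and tolerance handling are illustrative):

```python
import numpy as np

def conjugate_gradient(matvec, b, x0=None, tol=1e-8, maxiter=None):
    """Solve A x = b for SPD A, given only matvec(v) = A @ v."""
    x = np.zeros_like(b) if x0 is None else x0.astype(float)
    r = b - matvec(x)                  # r_0 = b - A x_0
    p = r.copy()                       # p_0 = r_0
    rs = r @ r
    for _ in range(maxiter or len(b)):
        Ap = matvec(p)                 # the one matrix-vector product per iteration
        alpha = rs / (p @ Ap)          # alpha_k = r_k^T r_k / p_k^T A p_k
        x = x + alpha * p              # x_{k+1} = x_k + alpha_k p_k
        r = r - alpha * Ap             # r_{k+1} = r_k - alpha_k A p_k
        rs_new = r @ r
        if np.sqrt(rs_new) < tol * np.linalg.norm(b):
            break
        beta = rs_new / rs             # beta_k = r_{k+1}^T r_{k+1} / r_k^T r_k
        p = r + beta * p               # p_{k+1} = r_{k+1} + beta_k p_k
        rs = rs_new
    return x

# Small dense check
rng = np.random.default_rng(0)
M = rng.standard_normal((50, 50))
A = M @ M.T + 50 * np.eye(50)          # SPD test matrix
b = rng.standard_normal(50)
x = conjugate_gradient(lambda v: A @ v, b)
print(np.linalg.norm(A @ x - b))       # should be near the tolerance
```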
Main Guarantees
Finite Termination In Exact Arithmetic
Statement
CG reaches the exact solution in at most $n$ iterations for an $n \times n$ SPD matrix. More sharply, it terminates in at most the degree of the minimal polynomial of $A$ relative to the initial residual.
Intuition
There can be at most $n$ mutually $A$-conjugate nonzero directions in $\mathbb{R}^n$. Once the search space contains the exact error direction, CG has no error left.
Proof Sketch
The CG directions are $A$-conjugate and span the Krylov spaces. The variational characterization says $x_k$ is optimal over the current Krylov space. When the Krylov space captures the whole invariant subspace generated by the initial residual, the error is zero.
Failure Mode
Floating-point arithmetic destroys exact conjugacy. Practical CG stops by tolerance, not by waiting for exactly $n$ iterations.
Condition-Number Convergence Bound
Statement
Let $\kappa = \kappa(A) = \lambda_{\max}(A) / \lambda_{\min}(A)$. Then
$$\|x_k - x_\star\|_A \le 2 \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^{k} \|x_0 - x_\star\|_A.$$
Intuition
CG depends on the square root of the condition number in this worst-case bound. That is a major improvement over steepest descent on ill-conditioned quadratics.
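To get a feel for the numbers, the sketch below evaluates the CG contraction factor next to the classical steepest-descent factor $((\kappa - 1)/(\kappa + 1))^k$; the steepest-descent comparison is standard theory brought in for contrast, not a claim from this page:

```python
import numpy as np

kappa, k = 1e4, 100
cg_factor = ((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)) ** k   # CG worst-case bound term
sd_factor = ((kappa - 1) / (kappa + 1)) ** k                     # steepest-descent analogue
print(cg_factor, sd_factor)   # roughly 0.13 vs 0.98 after 100 iterations
```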
Proof Sketch
The CG error can be written as $x_k - x_\star = q_k(A)(x_0 - x_\star)$, where $q_k$ is a degree-$k$ polynomial with $q_k(0) = 1$. The method implicitly chooses the best such polynomial for the spectrum of $A$. Chebyshev polynomials scaled to the spectral interval $[\lambda_{\min}, \lambda_{\max}]$ give the stated worst-case bound.
Why It Matters
The bound explains preconditioning: change the system so the eigenvalues cluster and the effective condition number shrinks.
Failure Mode
This bound can be pessimistic. Eigenvalue clustering often matters more than the raw condition number. If $A$ has only a few distinct eigenvalue clusters, CG can converge much faster than the bound predicts.
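A small experiment contrasting a spread-out spectrum with a clustered one at the same condition number (a sketch; the test matrices are synthetic):

```python
import numpy as np

def cg_iters(A, b, tol=1e-8):
    """Count CG iterations until the relative residual drops below tol."""
    x = np.zeros_like(b); r = b.copy(); p = r.copy(); rs = r @ r
    for k in range(1, len(b) + 1):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol * np.linalg.norm(b):
            return k
        p = r + (rs_new / rs) * p
        rs = rs_new
    return len(b)

n = 500
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))                 # random orthogonal basis

spread = Q @ np.diag(np.linspace(1.0, 1e3, n)) @ Q.T             # kappa = 1e3, spread spectrum
clustered = Q @ np.diag(rng.choice([1.0, 10.0, 1e3], n)) @ Q.T   # same kappa, 3 distinct eigenvalues

b = rng.standard_normal(n)
print(cg_iters(spread, b), cg_iters(clustered, b))   # the clustered case needs far fewer iterations
```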
Preconditioning
Preconditioning replaces the original system by one with a better spectrum. A left preconditioner uses
$$M^{-1} A x = M^{-1} b,$$
where $M$ is cheap to solve with and approximates $A$. For SPD systems, the preconditioned method is usually written in a symmetric form so the CG assumptions remain valid.
Good preconditioners reduce iterations. Bad preconditioners add solve cost without improving the spectrum enough.
| Preconditioner | Useful when | Tradeoff |
|---|---|---|
| diagonal / Jacobi | diagonal scaling is poor | cheap, weak |
| incomplete Cholesky | sparse SPD systems | stronger, setup cost |
| low-rank plus diagonal | kernel or Hessian approximations | uses structure |
| multigrid | PDE-like systems | very strong when geometry fits |
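As a concrete instance of the Jacobi row above, the sketch below solves a badly scaled SPD system with and without diagonal preconditioning through SciPy's `M` hook; the test matrix and its scales are invented for illustration:

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import LinearOperator, cg

n = 2000
s = np.logspace(0, 6, n)                        # row scales spanning six decades
off = -0.3 * np.sqrt(s[:-1] * s[1:])            # keeps the matrix symmetric and diagonally dominant
A = diags([s, off, off], [0, 1, -1], format="csr")
b = np.random.default_rng(0).standard_normal(n)

d = A.diagonal()
M = LinearOperator((n, n), matvec=lambda v: v / d)   # Jacobi: apply M^{-1} v = v / diag(A)

counts = {"plain": 0, "jacobi": 0}
def counter(key):
    def cb(xk):
        counts[key] += 1
    return cb

cg(A, b, callback=counter("plain"))
cg(A, b, M=M, callback=counter("jacobi"))
print(counts)   # Jacobi typically needs far fewer iterations on this kind of scaling
```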
Nonlinear Conjugate Gradient
For a general smooth objective $f$, nonlinear CG replaces residuals with negative gradients and uses a line search:
$$x_{k+1} = x_k + \alpha_k p_k, \qquad p_{k+1} = -\nabla f(x_{k+1}) + \beta_{k+1} p_k.$$
Two common formulas for $\beta_{k+1}$, writing $g_k = \nabla f(x_k)$:
- Fletcher-Reeves: $\beta_{k+1}^{\mathrm{FR}} = \dfrac{g_{k+1}^\top g_{k+1}}{g_k^\top g_k}$.
- Polak-Ribière: $\beta_{k+1}^{\mathrm{PR}} = \dfrac{g_{k+1}^\top (g_{k+1} - g_k)}{g_k^\top g_k}$.
Nonlinear CG no longer has exact finite termination. Strong Wolfe line search and periodic restarts matter because conjugacy is only approximate away from quadratics.
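A compact Fletcher-Reeves sketch with SciPy's Wolfe line search and the usual restart safeguards (the function name and restart rules are illustrative choices, not a canonical implementation):

```python
import numpy as np
from scipy.optimize import line_search, rosen, rosen_der

def nonlinear_cg_fr(f, grad, x0, maxiter=200, gtol=1e-6):
    """Fletcher-Reeves nonlinear CG with a Wolfe line search and periodic restarts."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    p = -g
    for k in range(maxiter):
        if np.linalg.norm(g) < gtol:
            break
        alpha = line_search(f, grad, x, p, gfk=g)[0]
        if alpha is None:                        # line search failed: restart along steepest descent
            p = -g
            alpha = line_search(f, grad, x, p, gfk=g)[0]
            if alpha is None:
                break
        x = x + alpha * p
        g_new = grad(x)
        beta = (g_new @ g_new) / (g @ g)         # Fletcher-Reeves beta
        p = -g_new + beta * p
        if (k + 1) % x.size == 0 or p @ g_new >= 0:   # periodic / non-descent restart
            p = -g_new
        g = g_new
    return x

print(nonlinear_cg_fr(rosen, rosen_der, [-1.2, 1.0]))   # should approach the minimizer [1, 1]
```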
ML Connections
- Least squares: CG solves normal equations or better-conditioned equivalent systems when direct factorization is too expensive.
- Gaussian processes: kernel systems can be solved by CG using fast kernel-vector products.
- Natural gradient and TRPO: CG approximately solves Fisher-vector systems without forming the Fisher matrix.
- Hessian-free optimization: CG solves Newton-like systems using Hessian-vector products (sketched below).
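A sketch of that matrix-free pattern, using a finite-difference Hessian-vector product and SciPy's CG. The damping value and toy problem are illustrative; real Hessian-free and TRPO code typically gets exact curvature-vector products from automatic differentiation:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def hvp(grad, x, v, eps=1e-6):
    """Approximate the Hessian-vector product H(x) @ v via central differences of the gradient."""
    return (grad(x + eps * v) - grad(x - eps * v)) / (2 * eps)

def newton_cg_step(grad, x, damping=1e-3):
    """Solve (H + damping * I) p = -g with CG, never forming H explicitly."""
    g = grad(x)
    n = x.size
    H = LinearOperator((n, n), matvec=lambda v: hvp(grad, x, v) + damping * v)
    p, _ = cg(H, -g, maxiter=50)
    return p

# Toy quadratic f(x) = 0.5 * x^T D x, so the Hessian is D = diag(1..5).
D = np.diag(np.arange(1.0, 6.0))
grad = lambda x: D @ x
x0 = np.ones(5)
print(x0 + newton_cg_step(grad, x0))   # one Newton-like step lands near the minimizer at 0
```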
Common Confusions
CG is not momentum
Momentum averages past gradients. CG constructs directions that are conjugate under $A$. For quadratics, that geometry gives finite termination in exact arithmetic; momentum does not.
Standard CG requires SPD matrices
If $A$ is not symmetric positive definite, standard CG is not the right method. Use MINRES for symmetric indefinite systems, or GMRES/BiCGSTAB for non-symmetric systems.
The residual you compute may not equal the true residual forever
Floating-point recurrence updates can drift from the true residual $b - Ax_k$. Robust implementations occasionally recompute the true residual.
Normal equations can square the condition number
Applying CG to the normal equations $A^\top A x = A^\top b$ may worsen conditioning, since $\kappa(A^\top A) = \kappa(A)^2$. When possible, use methods designed for least squares or a better formulation.
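A two-line check of the squaring effect on a random tall matrix (illustrative only):

```python
import numpy as np

A = np.random.default_rng(0).standard_normal((200, 50))
print(np.linalg.cond(A), np.linalg.cond(A.T @ A))   # the second is roughly the square of the first
```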
Q&A For Mastery
Why is CG fast when it uses no Hessian inverse? It builds a sequence of Krylov subspaces from repeated matrix-vector products. Those subspaces capture spectral information about $A$ without factorizing it.
What is the first debugging plot? Plot relative residual norm versus iteration on a log scale. A flat curve usually means poor conditioning, bad preconditioning, or loss of conjugacy.
When should I stop? Use a residual tolerance tied to the downstream task. Solving the linear system to machine precision is often wasted if the outer optimization or data noise dominates.
What To Remember
- CG solves SPD systems using matrix-vector products.
- It minimizes the quadratic over a growing Krylov subspace.
- $A$-conjugate directions prevent undoing earlier progress.
- Exact arithmetic gives finite termination; floating point gives tolerance-based convergence.
- Preconditioning is often the difference between useful CG and stalled CG.
Exercises
Problem
Why must $p_k^\top A p_k$ be positive in the CG step-size formula?
Problem
If $A$ has exactly $m$ distinct eigenvalues, explain why CG can terminate in at most $m$ iterations in exact arithmetic.
Problem
Why can preconditioning improve CG even if it makes each iteration more expensive?
References
Canonical:
- Hestenes and Stiefel, "Methods of Conjugate Gradients for Solving Linear Systems" (1952).
- Trefethen and Bau, Numerical Linear Algebra, Lectures 38-39.
- Nocedal and Wright, Numerical Optimization, 2nd ed., Chapter 5.
- Saad, Iterative Methods for Sparse Linear Systems, 2nd ed., Chapters 6 and 9.
Further:
- Shewchuk, An Introduction to the Conjugate Gradient Method Without the Agonizing Pain (1994).
- Greenbaum, Iterative Methods for Solving Linear Systems, Chapters 2-3.
- Martens, "Deep Learning via Hessian-Free Optimization" (2010).
Next Topics
- Preconditioned optimizers: how matrix geometry changes optimization speed.
- Second-order optimization methods: Newton, Hessian-free, and curvature-aware ML.
- Trust region methods: why CG often appears inside trust-region subproblems.
Last reviewed: April 25, 2026