Numerical Optimization
Newton's Method
The gold standard for fast local convergence: use second-order information (the Hessian) to take steps that exactly minimize a local quadratic model. Quadratic convergence when it works, catastrophic failure when it doesn't.
Why This Matters
Gradient descent uses only first-order information: the gradient tells you which direction is downhill, but not how far to go. Newton's method uses second-order information, the curvature of the function, to determine both the direction and the step size. When the function is smooth and you are close to a minimum, this yields quadratic convergence: the number of correct digits doubles at each step.
Newton's method is the theoretical gold standard for local convergence. Every optimization algorithm you use in ML is either Newton's method, an approximation to it, or motivated by understanding where Newton's method fails and why.
Mental Model
At the current iterate $x_k$, build a second-order Taylor approximation to the function. This gives you a quadratic model. Minimize the quadratic exactly; this has a closed-form solution. Move to that minimizer. Repeat.
If the function is actually quadratic, Newton's method converges in one step. If the function is "nearly quadratic" near the optimum (which smooth functions are), convergence is extremely fast.
Newton's Method for Root Finding
The original form of Newton's method finds roots of equations.
Newton's Method (Root Finding)
Given a differentiable function $f: \mathbb{R} \to \mathbb{R}$, Newton's method for finding a root of $f(x) = 0$ iterates:

$$x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)}$$

Geometrically: linearize $f$ at $x_k$ and find where the tangent line crosses zero.

In multiple dimensions, finding a root of $F: \mathbb{R}^n \to \mathbb{R}^n$ becomes:

$$x_{k+1} = x_k - J_F(x_k)^{-1} F(x_k)$$

where $J_F(x_k)$ is the Jacobian matrix of $F$ at $x_k$.
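The 1D iteration can be sketched in a few lines of Python (a minimal illustration; the names `newton_root`, `tol`, and `max_iter` are ours, not from the text):

```python
def newton_root(f, fprime, x0, tol=1e-12, max_iter=50):
    """Find a root of f by iterating x <- x - f(x) / f'(x)."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:          # residual small enough: stop
            break
        x = x - fx / fprime(x)     # jump to where the tangent line crosses zero
    return x

# Example: the positive root of f(x) = x^2 - 2 is sqrt(2).
root = newton_root(lambda x: x**2 - 2, lambda x: 2.0 * x, x0=1.0)
```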
Newton's Method for Optimization
To minimize $f: \mathbb{R}^n \to \mathbb{R}$, we want $\nabla f(x) = 0$. Apply Newton's root-finding method to $F(x) = \nabla f(x)$:
Newton's Method (Optimization)
The Newton step for minimizing $f$ is:

$$x_{k+1} = x_k - \nabla^2 f(x_k)^{-1} \nabla f(x_k)$$

where $\nabla^2 f(x_k)$ is the Hessian matrix of second derivatives and $\nabla f(x_k)$ is the gradient.
The Second-Order Taylor Motivation
At $x_k$, the second-order Taylor expansion of $f$ is:

$$f(x) \approx f(x_k) + \nabla f(x_k)^\top (x - x_k) + \frac{1}{2} (x - x_k)^\top \nabla^2 f(x_k) (x - x_k)$$

Call this quadratic model $m_k(x)$. Setting $\nabla m_k(x) = 0$:

$$\nabla f(x_k) + \nabla^2 f(x_k)(x - x_k) = 0 \quad \Longrightarrow \quad x = x_k - \nabla^2 f(x_k)^{-1} \nabla f(x_k)$$
This is exactly the Newton step. Newton's method minimizes the local quadratic model at each iteration.
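A minimal sketch of the multivariate step (names are illustrative), solving the linear system $\nabla^2 f(x_k)\, p = -\nabla f(x_k)$ rather than forming the inverse, and assuming a positive-definite Hessian:

```python
import numpy as np

def newton_step(grad, hess, x):
    """One Newton step: solve H p = -g instead of forming H^{-1}."""
    g = grad(x)
    H = hess(x)
    p = np.linalg.solve(H, -g)   # Newton direction p = -H^{-1} g
    return x + p

# On a quadratic f(x) = 0.5 x^T A x - b^T x, one step from anywhere
# lands exactly on the minimizer x* = A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x1 = newton_step(lambda x: A @ x - b, lambda x: A, np.zeros(2))
```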
Quadratic Convergence
Quadratic Convergence of Newton's Method
Statement
Let $f$ be twice continuously differentiable with Lipschitz continuous Hessian (constant $L$). Let $x^*$ be a local minimizer with $\nabla^2 f(x^*) \succ 0$. Then there exists $\delta > 0$ such that if $\|x_0 - x^*\| \le \delta$, Newton's method converges quadratically:

$$\|x_{k+1} - x^*\| \le C \, \|x_k - x^*\|^2$$

for some constant $C$ depending on $L$ and $\|\nabla^2 f(x^*)^{-1}\|$.
Intuition
Quadratic convergence means the error is squared at each step. If you start with error $10^{-1}$, after one step it is roughly $10^{-2}$, then $10^{-4}$, then $10^{-8}$. The number of correct digits doubles every iteration, much faster than the linear convergence of gradient descent, where each step multiplies the error by a constant factor.
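A quick numerical illustration of the doubling (our own example, using Newton root-finding on $f(x) = x^2 - 2$, whose root is $\sqrt{2}$):

```python
import numpy as np

# Newton iteration for the root of f(x) = x^2 - 2.
# Theory predicts e_{k+1} ~= e_k^2 / (2 sqrt(2)): the error squares each step.
x, x_star = 1.5, np.sqrt(2.0)
errors = []
for _ in range(4):
    errors.append(abs(x - x_star))
    x = x - (x**2 - 2.0) / (2.0 * x)
# errors come out near [8.6e-2, 2.5e-3, 2.1e-6, 1.6e-12]: digits double each step
```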
Proof Sketch
At $x^*$, $\nabla f(x^*) = 0$. The Newton step gives:

$$x_{k+1} - x^* = \nabla^2 f(x_k)^{-1} \left[ \nabla^2 f(x_k)(x_k - x^*) - \big( \nabla f(x_k) - \nabla f(x^*) \big) \right]$$

The term in brackets is the error in a first-order Taylor expansion of $\nabla f$ around $x_k$. By the Lipschitz condition on the Hessian, it is $O(\|x_k - x^*\|^2)$. Applying $\nabla^2 f(x_k)^{-1}$ (which stays bounded, since $\nabla^2 f(x_k)$ is close to $\nabla^2 f(x^*) \succ 0$ for $x_k$ near $x^*$) preserves the quadratic bound.
Why It Matters
Quadratic convergence is the gold standard. No first-order method can achieve it. This speed explains why Newton's method (and its approximations) dominate when applicable.
Failure Mode
The "sufficiently close" requirement is critical. Far from the optimum, pure Newton's method can diverge, overshoot, or converge to a maximum or saddle point (Newton finds stationary points, not necessarily minima). This motivates damped Newton and trust-region methods.
Interactive: tangent-line dynamics
Newton replaces $f$ with its tangent at $x_k$ and jumps to where that tangent crosses zero. Pick a preset and watch how the geometry decides the outcome. The quadratic converges in a handful of steps; arctan shoots off to infinity if you start beyond the inflection band; the cubic has three roots, and a small nudge to $x_0$ can flip which basin you land in.
Damped Newton (Newton with Line Search)
Pure Newton takes the full step $x_{k+1} = x_k - \nabla^2 f(x_k)^{-1} \nabla f(x_k)$. Far from the optimum, this step may be too large or even go uphill. The fix:

$$x_{k+1} = x_k + \alpha_k \, p_k, \qquad p_k = -\nabla^2 f(x_k)^{-1} \nabla f(x_k)$$

where $\alpha_k \in (0, 1]$ is chosen by a line search (e.g., backtracking line search with the Armijo condition). Far from the optimum, $\alpha_k$ may be small; close to the optimum, $\alpha_k = 1$ is accepted and you recover the full quadratic convergence rate.
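A sketch of damped Newton with Armijo backtracking (function and parameter names are ours; assumes the Hessian stays positive definite along the iterates):

```python
import numpy as np

def damped_newton(f, grad, hess, x, tol=1e-8, max_iter=100, c=1e-4, shrink=0.5):
    """Newton's method with backtracking (Armijo) line search.
    Assumes hess(x) stays positive definite along the iterates."""
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        p = np.linalg.solve(hess(x), -g)          # Newton direction
        alpha = 1.0                               # always try the full step first
        while f(x + alpha * p) > f(x) + c * alpha * (g @ p):
            alpha *= shrink                       # backtrack to sufficient decrease
        x = x + alpha * p
    return x

# f(x) = sum sqrt(1 + x_i^2): smooth, strictly convex, minimized at 0, but the
# full Newton step overshoots badly from far away (pure Newton diverges here).
f = lambda x: np.sum(np.sqrt(1.0 + x**2))
grad_f = lambda x: x / np.sqrt(1.0 + x**2)
hess_f = lambda x: np.diag((1.0 + x**2) ** -1.5)
x_min = damped_newton(f, grad_f, hess_f, np.array([10.0, -5.0]))
```

The example is the classic case where pure Newton diverges for starting points far from the minimizer, while the damped version shrinks the early steps and then accepts full steps near the optimum.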
The Newton direction $p_k = -\nabla^2 f(x_k)^{-1} \nabla f(x_k)$ is a descent direction whenever $\nabla^2 f(x_k) \succ 0$ (positive definite), because:

$$\nabla f(x_k)^\top p_k = -\nabla f(x_k)^\top \nabla^2 f(x_k)^{-1} \nabla f(x_k) < 0$$
What damping does and does not buy you. Damping plus an Armijo/Wolfe line search gives global convergence guarantees only under additional structural assumptions: typically (i) the Hessian is positive definite along the iterates (e.g. strictly convex $f$, or convex $f$ with bounded sublevel sets), (ii) the gradient and Hessian are Lipschitz, and (iii) the line search satisfies sufficient-decrease conditions. If the Hessian is indefinite — which is the generic case for nonconvex problems such as neural-network training — the raw direction $-\nabla^2 f(x_k)^{-1} \nabla f(x_k)$ need not be a descent direction at all, so no step size makes the line search succeed. Practical algorithms therefore modify the Hessian: either by adding a multiple of the identity ($\nabla^2 f(x_k) + \tau I$, Levenberg-Marquardt style), by performing a modified Cholesky factorization that bumps small or negative pivots (Gill-Murray-Wright; Nocedal-Wright 2006, §3.4), or by switching to a trust-region framework (Moré-Sorensen 1983; Conn-Gould-Toint 2000), which is globally convergent for nonconvex $f$ without requiring positive definiteness. Cubic-regularized Newton (Nesterov-Polyak 2006) gives global complexity bounds in the nonconvex setting under Hessian-Lipschitz assumptions. Damped Newton without one of these modifications is not globally convergent for general nonconvex objectives.
Why Newton Is Impractical for Large-Scale ML
The Hessian is the bottleneck:
- Storage: For a neural network with $n$ parameters, the Hessian has $n^2$ entries. At modern scales this does not fit in any memory.
- Hessian formation: There are $n^2$ second-partial entries to compute. The exact cost depends on the problem structure and the differentiation strategy; with reverse-mode automatic differentiation, forming the full Hessian typically costs on the order of $n$ gradient-like passes (one per row/column). For dense problems this is $O(n^2)$ to $O(n^3)$ work; for sparse problems it can be much cheaper. In modern ML, the explicit Hessian is almost never formed — practitioners instead use Hessian-vector products (Pearlmutter 1994), each costing about as much as one extra gradient evaluation via one additional forward/backward pass.
- Linear solve: Solving $\nabla^2 f(x_k) \, p_k = -\nabla f(x_k)$ takes $O(n^3)$ with direct dense factorization, or far less with conjugate gradient on Hessian-vector products (no full Hessian formed).
- Non-convexity: For non-convex problems (neural networks), the Hessian may have negative eigenvalues, making the Newton direction ascend rather than descend (see the damping discussion above).
This is why quasi-Newton methods (which approximate the Hessian cheaply) and first-order methods (SGD, Adam) dominate in practice.
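The Hessian-free route described above can be sketched as follows (an illustration with invented names; the finite-difference HVP stands in for the exact autodiff product an ML framework would provide):

```python
import numpy as np

def hvp(grad, x, v, eps=1e-6):
    """Hessian-vector product via central differences of the gradient.
    (ML frameworks compute this exactly with autodiff, after Pearlmutter 1994.)"""
    return (grad(x + eps * v) - grad(x - eps * v)) / (2.0 * eps)

def newton_cg_direction(grad, x, cg_iters=50, tol=1e-8):
    """Approximately solve H p = -g with conjugate gradient, touching the
    Hessian only through matrix-vector products; H is never formed."""
    g = grad(x)
    p = np.zeros_like(x)
    r = -g.copy()                 # residual of H p = -g at p = 0
    d = r.copy()
    for _ in range(cg_iters):
        if np.linalg.norm(r) < tol:
            break
        Hd = hvp(grad, x, d)
        alpha = (r @ r) / (d @ Hd)
        p = p + alpha * d
        r_new = r - alpha * Hd
        beta = (r_new @ r_new) / (r @ r)
        d = r_new + beta * d
        r = r_new
    return p

# On a quadratic with grad(x) = A x - b, the direction from x = 0 solves A p = b.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
p = newton_cg_direction(lambda x: A @ x - b, np.zeros(2))
```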
Common Confusions
Newton for optimization vs. Newton for root-finding are different algorithms
Newton's method for root-finding solves . Newton's method for optimization solves . They have the same structure (linearize and solve), but the function being linearized is different: itself in root-finding, in optimization. In 1D, Newton for optimization uses the second derivative (curvature), while root-finding uses the first derivative (slope). The convergence theory is analogous but the algorithms solve structurally different problems.
Newton's method can converge to maxima or saddle points
Newton's method finds stationary points where $\nabla f(x) = 0$. It does not distinguish between minima, maxima, and saddle points. If you start near a saddle point, Newton may converge to it happily. Checking that the Hessian is positive definite at convergence is necessary to confirm you found a local minimum. Modifications like adding a multiple of the identity to the Hessian ($\nabla^2 f + \tau I$ with $\tau$ large enough) ensure positive definiteness.
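The eigenvalue-shift repair can be sketched as (helper name is ours):

```python
import numpy as np

def make_positive_definite(H, min_eig=1e-6):
    """Return H + tau * I with tau chosen so the smallest eigenvalue is >= min_eig."""
    tau = max(0.0, min_eig - np.linalg.eigvalsh(H).min())
    return H + tau * np.eye(H.shape[0])

H = np.array([[1.0, 0.0], [0.0, -2.0]])   # indefinite: one negative eigenvalue
H_pd = make_positive_definite(H)          # now safe for a Newton-type solve
```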
Summary
- Newton's step: $x_{k+1} = x_k - \nabla^2 f(x_k)^{-1} \nabla f(x_k)$
- Motivated by minimizing the local second-order Taylor model
- Quadratic convergence near a strict local minimum with Lipschitz Hessian
- Converges in exactly one step for strictly convex quadratic functions
- Requires computing and inverting the Hessian: $O(n^2)$ storage, $O(n^3)$ dense solve; Hessian-vector products avoid forming the matrix explicitly
- Impractical for large $n$ in its pure form
- Damped Newton with line search gives global convergence under convexity / positive-definite (or modified) Hessian assumptions; for nonconvex problems, use modified-Hessian, trust-region, or cubic-regularized variants
Exercises
Problem
Let $f(x) = \frac{1}{2} x^\top A x - b^\top x$ where $A \succ 0$ (positive definite). Show that Newton's method converges to the minimizer $x^* = A^{-1} b$ in exactly one step from any starting point.
Problem
Show that gradient descent on the same quadratic with step size $\alpha = \frac{2}{\lambda_{\min}(A) + \lambda_{\max}(A)}$ (the optimal fixed step size) has convergence rate:

$$\|x_k - x^*\| \le \left( \frac{\kappa - 1}{\kappa + 1} \right)^k \|x_0 - x^*\|$$

where $\kappa = \lambda_{\max}(A) / \lambda_{\min}(A)$ is the condition number. Compare this to Newton's one-step convergence.
References
Canonical:
- Nocedal & Wright, Numerical Optimization (2006), Chapters 2-3 and Section 3.3 (Newton with line search)
- Boyd & Vandenberghe, Convex Optimization (2004), Section 9.5
- Dennis & Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations (1996), Chapter 5
Current:
- Nesterov, Lectures on Convex Optimization (2018), Chapter 1
- Conn, Gould, Toint, Trust-Region Methods (2000), Chapter 6 (trust-region Newton methods)
- Bottou, Curtis, Nocedal, "Optimization Methods for Large-Scale Machine Learning" (2018), SIAM Review. Section 4 on second-order methods at scale.
Next Topics
The natural next steps from Newton's method:
- Quasi-Newton methods: approximate the Hessian cheaply (BFGS, L-BFGS)
- Line search methods: choosing step sizes with Wolfe conditions
Last reviewed: April 26, 2026