
Numerical Optimization

Newton's Method

The gold standard for fast local convergence: use second-order information (the Hessian) to take steps that exactly minimize a local quadratic model. Quadratic convergence when it works, catastrophic failure when it doesn't.


Why This Matters

Gradient descent uses only first-order information: the gradient tells you which direction is downhill, but not how far to go. Newton's method uses second-order information (the curvature of the function) to determine both the direction and the step size. When the function is smooth and you are close to a minimum, this yields quadratic convergence: the number of correct digits doubles at each step.

Newton's method is the theoretical gold standard for local convergence. Every optimization algorithm you use in ML is either Newton's method, an approximation to it, or motivated by understanding where Newton's method fails and why.

[Figure: Newton's method on f(x) = (x-2)² + 1. Each step: x_{n+1} = x_n - f'(x_n)/f''(x_n). Quadratic convergence: errors square each step.]

Mental Model

At the current iterate x^{(t)}, build a second-order Taylor approximation to the function. This gives you a quadratic model. Minimize the quadratic exactly; this has a closed-form solution. Move to that minimizer. Repeat.

If the function is actually quadratic, Newton's method converges in one step. If the function is "nearly quadratic" near the optimum (which smooth functions are), convergence is extremely fast.

Newton's Method for Root Finding

The original form of Newton's method finds roots of equations.

Definition

Newton's Method (Root Finding)

Given a differentiable function f: \mathbb{R} \to \mathbb{R}, Newton's method for finding a root f(x^*) = 0 iterates:

x^{(t+1)} = x^{(t)} - \frac{f(x^{(t)})}{f'(x^{(t)})}

Geometrically: linearize f at x^{(t)} and find where the tangent line crosses zero.

In multiple dimensions, finding a root of F: \mathbb{R}^d \to \mathbb{R}^d becomes:

x^{(t+1)} = x^{(t)} - [J_F(x^{(t)})]^{-1} F(x^{(t)})

where J_F is the Jacobian matrix of F.
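As a quick sketch of the 1D iteration (the helper name `newton_root`, the tolerance, and the square-root example are my choices, not from the text):

```python
import math

def newton_root(f, fprime, x0, tol=1e-12, max_iter=50):
    """Find a root of f via the tangent-line iteration: x <- x - f(x)/f'(x)."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:      # step size tracks the remaining error
            break
    return x

# A root of f(x) = x^2 - 2 is sqrt(2); Newton recovers it in a few iterations.
root = newton_root(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0)
print(root)  # ≈ 1.4142135623730951
```

Starting from x = 1, the iterates are 1.5, 1.4167, 1.414216, ...: the number of correct digits roughly doubles each step.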

Newton's Method for Optimization

To minimize f: \mathbb{R}^d \to \mathbb{R}, we want \nabla f(x^*) = 0. Apply Newton's root-finding method to F = \nabla f:

Definition

Newton's Method (Optimization)

The Newton step for minimizing ff is:

x^{(t+1)} = x^{(t)} - [\nabla^2 f(x^{(t)})]^{-1} \nabla f(x^{(t)})

where \nabla^2 f(x^{(t)}) \in \mathbb{R}^{d \times d} is the Hessian matrix of second derivatives, and \nabla f(x^{(t)}) \in \mathbb{R}^d is the gradient.

The Second-Order Taylor Motivation

At x^{(t)}, the second-order Taylor expansion of f is:

f(x) \approx f(x^{(t)}) + \nabla f(x^{(t)})^\top (x - x^{(t)}) + \frac{1}{2}(x - x^{(t)})^\top \nabla^2 f(x^{(t)}) (x - x^{(t)})

Call this quadratic model m(x). Setting \nabla m(x) = 0:

\nabla f(x^{(t)}) + \nabla^2 f(x^{(t)})(x - x^{(t)}) = 0

x = x^{(t)} - [\nabla^2 f(x^{(t)})]^{-1} \nabla f(x^{(t)})

This is exactly the Newton step. Newton's method minimizes the local quadratic model at each iteration.
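A minimal multivariate sketch, using the classic Rosenbrock function as a test problem (the helper names and the starting point are my choices). It applies the pure Newton step by solving the linear system rather than forming the inverse Hessian:

```python
import numpy as np

def rosenbrock_grad(x):
    # f(x, y) = (1 - x)^2 + 100 (y - x^2)^2, minimized at (1, 1)
    return np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0] ** 2),
                     200 * (x[1] - x[0] ** 2)])

def rosenbrock_hess(x):
    return np.array([[2 - 400 * x[1] + 1200 * x[0] ** 2, -400 * x[0]],
                     [-400 * x[0], 200]])

x = np.array([0.9, 0.81])   # start close to the minimizer (pure Newton is local)
for _ in range(20):
    # Newton step: solve H dx = -g instead of computing H^{-1} explicitly
    dx = np.linalg.solve(rosenbrock_hess(x), -rosenbrock_grad(x))
    x = x + dx
print(x)  # → close to [1. 1.]
```

No line search or safeguards here: this is the raw iteration, which is only reliable near the optimum.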

Quadratic Convergence

Theorem

Quadratic Convergence of Newton's Method

Statement

Let f: \mathbb{R}^d \to \mathbb{R} be twice continuously differentiable with \nabla^2 f Lipschitz continuous (constant L). Let x^* be a local minimizer with \nabla^2 f(x^*) \succ 0. Then there exists \delta > 0 such that if \|x^{(0)} - x^*\| < \delta, Newton's method converges quadratically:

\|x^{(t+1)} - x^*\| \leq C \|x^{(t)} - x^*\|^2

for some constant C > 0 depending on L and \nabla^2 f(x^*).

Intuition

Quadratic convergence means the error is squared at each step. If you start with error 10^{-2}, after one step it is roughly 10^{-4}, then 10^{-8}, then 10^{-16}. The number of correct digits doubles every iteration, much faster than the linear convergence of gradient descent, where each step multiplies the error by a constant factor.
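The doubling is easy to see numerically. A small sketch, using f(x) = x - log(x) (my choice of example: its minimizer is x* = 1, and the Newton update simplifies to x ← 2x - x², so the error is squared exactly at each step):

```python
# Minimize f(x) = x - log(x): f'(x) = 1 - 1/x, f''(x) = 1/x^2, minimizer x* = 1.
# Newton update: x - f'(x)/f''(x) = x - x^2 (1 - 1/x) = 2x - x^2,
# so the error e = 1 - x obeys e_next = e^2 exactly.
x = 0.5
for t in range(6):
    print(f"iter {t}: error = {abs(1 - x):.3e}")
    x = x - (1 - 1 / x) / (1 / x ** 2)
```

The printed errors are 5.000e-01, 2.500e-01, 6.250e-02, 3.906e-03, 1.526e-05, 2.328e-10: each is the square of the previous one, and the count of correct digits doubles every step.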

Proof Sketch

At x^*, \nabla f(x^*) = 0. The Newton step gives:

x^{(t+1)} - x^* = x^{(t)} - x^* - [\nabla^2 f(x^{(t)})]^{-1} \nabla f(x^{(t)})

= [\nabla^2 f(x^{(t)})]^{-1} \left[\nabla^2 f(x^{(t)})(x^{(t)} - x^*) - (\nabla f(x^{(t)}) - \nabla f(x^*))\right]

The term in brackets is the error in a first-order Taylor expansion of \nabla f. By the Lipschitz condition on the Hessian, this is O(\|x^{(t)} - x^*\|^2). Applying [\nabla^2 f(x^{(t)})]^{-1}, whose norm stays bounded because \nabla^2 f(x^{(t)}) is close to the positive definite \nabla^2 f(x^*) for x^{(t)} near x^*, preserves the quadratic bound.

Why It Matters

Quadratic convergence is the gold standard. No first-order method can achieve it. This speed explains why Newton's method (and its approximations) dominate when applicable.

Failure Mode

The "sufficiently close" requirement is critical. Far from the optimum, pure Newton's method can diverge, overshoot, or converge to a maximum or saddle point (Newton finds stationary points, not necessarily minima). This motivates damped Newton and trust-region methods.

Interactive: tangent-line dynamics

Newton replaces f with its tangent at x_k and jumps to where that tangent crosses zero. Pick a preset and watch how the geometry decides the outcome. The quadratic converges in a handful of steps; arctan shoots off to infinity if you start beyond the inflection band; the cubic x^3 - x has three roots, and a small nudge to x_0 can flip which basin you land in.

Damped Newton (Newton with Line Search)

Pure Newton takes the full step -[\nabla^2 f]^{-1} \nabla f. Far from the optimum, this step may be too large or even go uphill. The fix:

x^{(t+1)} = x^{(t)} - \alpha_t [\nabla^2 f(x^{(t)})]^{-1} \nabla f(x^{(t)})

where \alpha_t \in (0, 1] is chosen by a line search (e.g., backtracking line search with the Armijo condition). Far from the optimum, \alpha_t may be small; close to the optimum, \alpha_t = 1 is accepted and you recover the full quadratic convergence rate.

The Newton direction d = -[\nabla^2 f]^{-1} \nabla f is a descent direction whenever \nabla^2 f \succ 0 (positive definite), because:

\nabla f^\top d = -\nabla f^\top [\nabla^2 f]^{-1} \nabla f < 0

What damping does and does not buy you. Damping plus an Armijo/Wolfe line search gives global convergence guarantees only under additional structural assumptions: typically (i) the Hessian is positive definite along the iterates (e.g., strictly convex f, or convex f with bounded sublevel sets), (ii) the gradient and Hessian are Lipschitz, and (iii) the line search satisfies sufficient-decrease conditions. If the Hessian is indefinite, which is the generic case for nonconvex problems such as neural-network training, the raw direction -[\nabla^2 f]^{-1}\nabla f need not be a descent direction at all, so no step size makes the line search succeed.

Practical algorithms therefore modify the Hessian: either by adding a multiple of the identity (\nabla^2 f + \lambda I, Levenberg-Marquardt style), by performing a modified Cholesky factorization that bumps small or negative pivots (Gill-Murray-Wright; Nocedal-Wright 2006, §3.4), or by switching to a trust-region framework (Moré-Sorensen 1983; Conn-Gould-Toint 2000), which is globally convergent for nonconvex f without requiring positive definiteness. Cubic-regularized Newton (Nesterov-Polyak 2006) gives global complexity bounds in the nonconvex setting under Hessian-Lipschitz assumptions. Damped Newton without one of these modifications is not globally convergent for general nonconvex objectives.

Why Newton Is Impractical for Large-Scale ML

The Hessian \nabla^2 f \in \mathbb{R}^{d \times d} is the bottleneck:

  1. Storage: For a neural network with d = 10^8 parameters, the Hessian has d^2 = 10^{16} entries (tens of petabytes at 4 bytes each). This does not fit in any memory.
  2. Hessian formation: There are d^2 second-partial entries to compute. The exact cost depends on the problem structure and the differentiation strategy; with reverse-mode automatic differentiation, forming the full Hessian typically costs on the order of d gradient-like passes (one per row/column), so total Hessian-formation cost is roughly O(d \cdot \mathrm{cost}(\nabla f)). For dense problems this is O(d^2) to O(d^3) work; for sparse problems it can be much cheaper. In modern ML, the explicit Hessian is almost never formed; practitioners instead use Hessian-vector products \nabla^2 f \cdot v (Pearlmutter 1994), each costing O(\mathrm{cost}(\nabla f)) via one extra forward/backward pass.
  3. Linear solve: Solving \nabla^2 f \cdot \Delta x = -\nabla f takes O(d^3) with direct dense factorization, or far less with conjugate gradient on Hessian-vector products (no full Hessian formed).
  4. Non-convexity: For non-convex problems (neural networks), the Hessian may have negative eigenvalues, making the Newton direction ascend rather than descend (see the damping discussion above).
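A matrix-free sketch of the conjugate-gradient idea from point 3, on a small quadratic where the Hessian-vector product can be written directly (in ML it would come from Pearlmutter's trick instead; the names here are my choices). The solver touches the Hessian only through `hvp`, never storing a d × d matrix itself:

```python
import numpy as np

def cg_solve(hvp, b, tol=1e-10, max_iter=100):
    """Conjugate gradient for H x = b, using H only through hvp(v) = H v."""
    x = np.zeros_like(b)
    r = b - hvp(x)            # residual
    p = r.copy()              # search direction
    rs = r @ r
    for _ in range(max_iter):
        Hp = hvp(p)
        alpha = rs / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Quadratic demo: f(x) = 1/2 x^T A x - b^T x, so H v = A v and grad = A x - b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
b = np.array([1.0, 1.0])
x0 = np.zeros(2)
grad = A @ x0 - b
newton_step = cg_solve(lambda v: A @ v, -grad)   # solve H d = -g matrix-free
print(x0 + newton_step)                          # lands on the minimizer A^{-1} b
```

For an SPD system in d dimensions, CG converges in at most d iterations in exact arithmetic, and far fewer when the eigenvalues cluster.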

This is why quasi-Newton methods (which approximate the Hessian cheaply) and first-order methods (SGD, Adam) dominate in practice.

Common Confusions

Watch Out

Newton for optimization vs. Newton for root-finding are different algorithms

Newton's method for root-finding solves f(x) = 0. Newton's method for optimization solves \nabla f(x) = 0. They have the same structure (linearize and solve), but the function being linearized is different: f itself in root-finding, \nabla f in optimization. In 1D, Newton for optimization uses the second derivative f'' (curvature), while root-finding uses the first derivative f' (slope). The convergence theory is analogous but the algorithms solve structurally different problems.

Watch Out

Newton's method can converge to maxima or saddle points

Newton's method finds stationary points where \nabla f = 0. It does not distinguish between minima, maxima, and saddle points. If you start near a saddle point, Newton may converge to it happily. Checking that the Hessian is positive definite at convergence is necessary to confirm you found a local minimum. Modifications like adding a multiple of the identity to the Hessian (\nabla^2 f + \lambda I) ensure positive definiteness.
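A minimal numerical illustration (the cubic f(x) = x³ - 3x and the helper names are my choices): the same Newton iteration converges to the local maximum or the local minimum depending only on the starting point, and only the sign of f'' at convergence tells them apart.

```python
def newton_opt_1d(x, fprime, fsecond, iters=50):
    """Pure 1D Newton for optimization: x <- x - f'(x)/f''(x)."""
    for _ in range(iters):
        x = x - fprime(x) / fsecond(x)
    return x

# f(x) = x^3 - 3x has a local maximum at x = -1 and a local minimum at x = +1.
fp = lambda x: 3 * x ** 2 - 3
fpp = lambda x: 6 * x

print(newton_opt_1d(-0.9, fp, fpp))  # → -1.0: converged to the MAXIMUM
print(newton_opt_1d(+0.9, fp, fpp))  # →  1.0: converged to the minimum
print(fpp(-1.0))                     # → -6.0: second derivative is negative there
```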

Summary

  • Newton's step: x^{(t+1)} = x^{(t)} - [\nabla^2 f(x^{(t)})]^{-1} \nabla f(x^{(t)})
  • Motivated by minimizing the local second-order Taylor model
  • Quadratic convergence near a strict local minimum with Lipschitz Hessian
  • Converges in exactly 1 step for quadratic functions
  • Requires computing and inverting the d \times d Hessian: O(d^2) storage, O(d^3) dense solve; Hessian-vector products avoid forming the matrix explicitly
  • Impractical for d > 10^4 in its pure form
  • Damped Newton with line search gives global convergence under convexity / positive-definite (or modified) Hessian assumptions; for nonconvex problems, use modified-Hessian, trust-region, or cubic-regularized variants

Exercises

ExerciseCore

Problem

Let f(x) = \frac{1}{2} x^\top A x - b^\top x where A \succ 0 (positive definite). Show that Newton's method converges to the minimizer x^* = A^{-1}b in exactly one step from any starting point.
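A numerical check of the claim, not a substitute for the algebraic proof (the random test matrix and seed are my choices):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M @ M.T + 4 * np.eye(4)        # random symmetric positive definite matrix
b = rng.standard_normal(4)

x0 = rng.standard_normal(4)        # arbitrary starting point
grad = A @ x0 - b                  # grad f(x) = A x - b; the Hessian is A
x1 = x0 - np.linalg.solve(A, grad) # single Newton step

print(np.allclose(x1, np.linalg.solve(A, b)))  # → True: one step reaches A^{-1} b
```

Algebraically the check is immediate: x1 = x0 - A^{-1}(A x0 - b) = A^{-1} b for any x0.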

ExerciseAdvanced

Problem

Show that gradient descent on the same quadratic f(x) = \frac{1}{2}x^\top Ax - b^\top x with step size \alpha = 2/(\lambda_{\max} + \lambda_{\min}) (optimal fixed step size) has convergence rate:

\|x^{(t)} - x^*\|_A \leq \left(\frac{\kappa - 1}{\kappa + 1}\right)^t \|x^{(0)} - x^*\|_A

where \kappa = \lambda_{\max}/\lambda_{\min} is the condition number. Compare this to Newton's one-step convergence.
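A numerical companion to the exercise (the diagonal test matrix is my choice): with A = diag(1, 10) and the optimal step size, both error components shrink in magnitude by exactly (κ - 1)/(κ + 1) = 9/11 per step, so the A-norm contracts at precisely the stated rate.

```python
import numpy as np

lam_min, lam_max = 1.0, 10.0
A = np.diag([lam_min, lam_max])     # condition number kappa = 10; minimizer x* = 0
b = np.zeros(2)
kappa = lam_max / lam_min
alpha = 2.0 / (lam_max + lam_min)   # optimal fixed step size

def a_norm(x):
    return np.sqrt(x @ A @ x)

rate = (kappa - 1) / (kappa + 1)    # = 9/11 ≈ 0.818
x = np.array([1.0, 1.0])
for _ in range(5):
    prev = a_norm(x)
    x = x - alpha * (A @ x - b)     # gradient step: grad f(x) = A x - b
    print(a_norm(x) / prev)         # each ratio ≈ 9/11
```

Compare the per-step factor 0.818 with Newton's single-step exact solve on the same problem.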

References

Canonical:

  • Nocedal & Wright, Numerical Optimization (2006), Chapters 2-3 and Section 3.3 (Newton with line search)
  • Boyd & Vandenberghe, Convex Optimization (2004), Section 9.5
  • Dennis & Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations (1996), Chapter 5

Current:

  • Nesterov, Lectures on Convex Optimization (2018), Chapter 1
  • Conn, Gould, Toint, Trust-Region Methods (2000), Chapter 6 (trust-region Newton methods)
  • Bottou, Curtis, Nocedal, "Optimization Methods for Large-Scale Machine Learning" (2018), SIAM Review. Section 4 on second-order methods at scale.

Next Topics

The natural next steps from Newton's method:

  • Quasi-Newton methods: approximate the Hessian cheaply (BFGS, L-BFGS)
  • Line search methods: choosing step sizes with Wolfe conditions

Last reviewed: April 26, 2026
