
Numerical Optimization

Newton's Method

The gold standard for fast local convergence: use second-order information (the Hessian) to take steps that exactly minimize a local quadratic model. Quadratic convergence when it works, catastrophic failure when it doesn't.


Why This Matters

Gradient descent uses only first-order information: the gradient tells you which direction is downhill, but not how far to go. Newton's method uses second-order information (the curvature of the function) to determine both the direction and the step size. When the function is smooth and you are close to a minimum, this yields quadratic convergence: the number of correct digits doubles at each step.

Newton's method is the theoretical gold standard for local convergence. Every optimization algorithm you use in ML is either Newton's method, an approximation to it, or motivated by understanding where Newton's method fails and why.

[Figure: Newton's method on f(x) = (x-2)² + 1. Each step: x_{n+1} = x_n - f'(x_n)/f''(x_n). Quadratic convergence: errors square each step.]

Mental Model

At the current iterate x^{(t)}, build a second-order Taylor approximation to the function. This gives you a quadratic model. Minimize the quadratic exactly; this has a closed-form solution. Move to that minimizer. Repeat.

If the function is actually quadratic, Newton's method converges in one step. If the function is "nearly quadratic" near the optimum (which smooth functions are), convergence is extremely fast.

Newton's Method for Root Finding

The original form of Newton's method finds roots of equations.

Definition

Newton's Method (Root Finding)

Given a differentiable function f: \mathbb{R} \to \mathbb{R}, Newton's method for finding a root f(x^*) = 0 iterates:

x^{(t+1)} = x^{(t)} - \frac{f(x^{(t)})}{f'(x^{(t)})}

Geometrically: linearize f at x^{(t)} and find where the tangent line crosses zero.

In multiple dimensions, finding a root of F: \mathbb{R}^d \to \mathbb{R}^d becomes:

x^{(t+1)} = x^{(t)} - [J_F(x^{(t)})]^{-1} F(x^{(t)})

where J_F is the Jacobian matrix of F.
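As a quick sketch of the 1D iteration (the helper name `newton_root`, the tolerance, and the square-root example are my choices, not from the text):

```python
import math

def newton_root(f, fprime, x0, tol=1e-12, max_iter=50):
    """Find a root of f via the tangent-line iteration: x <- x - f(x)/f'(x)."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:      # step size tracks the remaining error
            break
    return x

# A root of f(x) = x^2 - 2 is sqrt(2); Newton recovers it in a few iterations.
root = newton_root(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0)
print(root)  # ≈ 1.4142135623730951
```

Starting from x = 1, the iterates are 1.5, 1.4167, 1.414216, ...: the number of correct digits roughly doubles each step.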

Newton's Method for Optimization

To minimize f: \mathbb{R}^d \to \mathbb{R}, we want \nabla f(x^*) = 0. Apply Newton's root-finding method to F = \nabla f:

Definition

Newton's Method (Optimization)

The Newton step for minimizing ff is:

x^{(t+1)} = x^{(t)} - [\nabla^2 f(x^{(t)})]^{-1} \nabla f(x^{(t)})

where \nabla^2 f(x^{(t)}) \in \mathbb{R}^{d \times d} is the Hessian matrix of second derivatives, and \nabla f(x^{(t)}) \in \mathbb{R}^d is the gradient.

The Second-Order Taylor Motivation

At x^{(t)}, the second-order Taylor expansion of f is:

f(x) \approx f(x^{(t)}) + \nabla f(x^{(t)})^\top (x - x^{(t)}) + \frac{1}{2}(x - x^{(t)})^\top \nabla^2 f(x^{(t)}) (x - x^{(t)})

Call this quadratic model m(x). Setting \nabla m(x) = 0:

\nabla f(x^{(t)}) + \nabla^2 f(x^{(t)})(x - x^{(t)}) = 0

x = x^{(t)} - [\nabla^2 f(x^{(t)})]^{-1} \nabla f(x^{(t)})

This is exactly the Newton step. Newton's method minimizes the local quadratic model at each iteration.
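A minimal multivariate sketch, using the classic Rosenbrock function as a test problem (the helper names and the starting point are my choices). It applies the pure Newton step by solving the linear system rather than forming the inverse Hessian:

```python
import numpy as np

def rosenbrock_grad(x):
    # f(x, y) = (1 - x)^2 + 100 (y - x^2)^2, minimized at (1, 1)
    return np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0] ** 2),
                     200 * (x[1] - x[0] ** 2)])

def rosenbrock_hess(x):
    return np.array([[2 - 400 * x[1] + 1200 * x[0] ** 2, -400 * x[0]],
                     [-400 * x[0], 200]])

x = np.array([0.9, 0.81])   # start close to the minimizer (pure Newton is local)
for _ in range(20):
    # Newton step: solve H dx = -g instead of computing H^{-1} explicitly
    dx = np.linalg.solve(rosenbrock_hess(x), -rosenbrock_grad(x))
    x = x + dx
print(x)  # → close to [1. 1.]
```

No line search or safeguards here: this is the raw iteration, which is only reliable near the optimum.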

Quadratic Convergence

Theorem

Quadratic Convergence of Newton's Method

Statement

Let f: \mathbb{R}^d \to \mathbb{R} be twice continuously differentiable with \nabla^2 f Lipschitz continuous (constant L). Let x^* be a local minimizer with \nabla^2 f(x^*) \succ 0. Then there exists \delta > 0 such that if \|x^{(0)} - x^*\| < \delta, Newton's method converges quadratically:

\|x^{(t+1)} - x^*\| \leq C \|x^{(t)} - x^*\|^2

for some constant C > 0 depending on L and \nabla^2 f(x^*).

Intuition

Quadratic convergence means the error is squared at each step. If you start with error 10^{-2}, after one step it is roughly 10^{-4}, then 10^{-8}, then 10^{-16}. The number of correct digits doubles every iteration, much faster than the linear convergence of gradient descent, where each step multiplies the error by a constant factor.
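The doubling is easy to see numerically. A small sketch, using f(x) = x - log(x) (my choice of example: its minimizer is x* = 1, and the Newton update simplifies to x ← 2x - x², so the error is squared exactly at each step):

```python
# Minimize f(x) = x - log(x): f'(x) = 1 - 1/x, f''(x) = 1/x^2, minimizer x* = 1.
# Newton update: x - f'(x)/f''(x) = x - x^2 (1 - 1/x) = 2x - x^2,
# so the error e = 1 - x obeys e_next = e^2 exactly.
x = 0.5
for t in range(6):
    print(f"iter {t}: error = {abs(1 - x):.3e}")
    x = x - (1 - 1 / x) / (1 / x ** 2)
```

The printed errors are 5.000e-01, 2.500e-01, 6.250e-02, 3.906e-03, 1.526e-05, 2.328e-10: each is the square of the previous one, and the count of correct digits doubles every step.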

Proof Sketch

At x^*, \nabla f(x^*) = 0. The Newton step gives:

x^{(t+1)} - x^* = x^{(t)} - x^* - [\nabla^2 f(x^{(t)})]^{-1} \nabla f(x^{(t)})

= [\nabla^2 f(x^{(t)})]^{-1} \left[\nabla^2 f(x^{(t)})(x^{(t)} - x^*) - (\nabla f(x^{(t)}) - \nabla f(x^*))\right]

The term in brackets is the error in a first-order Taylor expansion of \nabla f. By the Lipschitz condition on the Hessian, this is O(\|x^{(t)} - x^*\|^2). Applying [\nabla^2 f(x^{(t)})]^{-1}, whose norm stays bounded because \nabla^2 f(x^{(t)}) is close to the positive definite \nabla^2 f(x^*) for x^{(t)} near x^*, preserves the quadratic bound.

Why It Matters

Quadratic convergence is the gold standard. No first-order method can achieve it. This speed explains why Newton's method (and its approximations) dominate when applicable.

Failure Mode

The "sufficiently close" requirement is critical. Far from the optimum, pure Newton's method can diverge, overshoot, or converge to a maximum or saddle point (Newton finds stationary points, not necessarily minima). This motivates damped Newton and trust-region methods.

Interactive: tangent-line dynamics

Newton replaces f with its tangent at x_k and jumps to where that tangent crosses zero. Pick a preset and watch how the geometry decides the outcome. The quadratic converges in a handful of steps; arctan shoots off to infinity if you start beyond the inflection band; the cubic x^3 - x has three roots, and a small nudge to x_0 can flip which basin you land in.

Damped Newton (Newton with Line Search)

Pure Newton takes the full step -[\nabla^2 f]^{-1} \nabla f. Far from the optimum, this step may be too large or even go uphill. The fix:

x^{(t+1)} = x^{(t)} - \alpha_t [\nabla^2 f(x^{(t)})]^{-1} \nabla f(x^{(t)})

where \alpha_t \in (0, 1] is chosen by a line search (e.g., backtracking line search with the Armijo condition). Far from the optimum, \alpha_t may be small; close to the optimum, \alpha_t = 1 is accepted and you recover the full quadratic convergence rate.

The Newton direction d = -[\nabla^2 f]^{-1} \nabla f is a descent direction whenever \nabla^2 f \succ 0 (positive definite), because:

\nabla f^\top d = -\nabla f^\top [\nabla^2 f]^{-1} \nabla f < 0

What damping does and does not buy you. Damping plus an Armijo/Wolfe line search gives global convergence guarantees only under additional structural assumptions: typically (i) the Hessian is positive definite along the iterates (e.g., strictly convex f, or convex f with bounded sublevel sets), (ii) the gradient and Hessian are Lipschitz, and (iii) the line search satisfies sufficient-decrease conditions. If the Hessian is indefinite, which is the generic case for nonconvex problems such as neural-network training, the raw direction -[\nabla^2 f]^{-1}\nabla f need not be a descent direction at all, so no step size makes the line search succeed.

Practical algorithms therefore modify the Hessian: either by adding a multiple of the identity (\nabla^2 f + \lambda I, Levenberg-Marquardt style), by performing a modified Cholesky factorization that bumps small or negative pivots (Gill-Murray-Wright; Nocedal-Wright 2006, §3.4), or by switching to a trust-region framework (Moré-Sorensen 1983; Conn-Gould-Toint 2000), which is globally convergent for nonconvex f without requiring positive definiteness. Cubic-regularized Newton (Nesterov-Polyak 2006) gives global complexity bounds in the nonconvex setting under Hessian-Lipschitz assumptions. Damped Newton without one of these modifications is not globally convergent for general nonconvex objectives.

Why Newton Is Impractical for Large-Scale ML

The Hessian \nabla^2 f \in \mathbb{R}^{d \times d} is the bottleneck:

  1. Storage: For a neural network with d = 10^8 parameters, the Hessian has d^2 = 10^{16} entries (tens of petabytes at 4 bytes each). This does not fit in any memory.
  2. Hessian formation: There are d^2 second-partial entries to compute. The exact cost depends on the problem structure and the differentiation strategy; with reverse-mode automatic differentiation, forming the full Hessian typically costs on the order of d gradient-like passes (one per row/column), so total Hessian-formation cost is roughly O(d \cdot \mathrm{cost}(\nabla f)). For dense problems this is O(d^2) to O(d^3) work; for sparse problems it can be much cheaper. In modern ML, the explicit Hessian is almost never formed; practitioners instead use Hessian-vector products \nabla^2 f \cdot v (Pearlmutter 1994), each costing O(\mathrm{cost}(\nabla f)) via one extra forward/backward pass.
  3. Linear solve: Solving \nabla^2 f \cdot \Delta x = -\nabla f takes O(d^3) with direct dense factorization, or far less with conjugate gradient on Hessian-vector products (no full Hessian formed).
  4. Non-convexity: For non-convex problems (neural networks), the Hessian may have negative eigenvalues, making the Newton direction ascend rather than descend (see the damping discussion above).
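A matrix-free sketch of the conjugate-gradient idea from point 3, on a small quadratic where the Hessian-vector product can be written directly (in ML it would come from Pearlmutter's trick instead; the names here are my choices). The solver touches the Hessian only through `hvp`, never storing a d × d matrix itself:

```python
import numpy as np

def cg_solve(hvp, b, tol=1e-10, max_iter=100):
    """Conjugate gradient for H x = b, using H only through hvp(v) = H v."""
    x = np.zeros_like(b)
    r = b - hvp(x)            # residual
    p = r.copy()              # search direction
    rs = r @ r
    for _ in range(max_iter):
        Hp = hvp(p)
        alpha = rs / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Quadratic demo: f(x) = 1/2 x^T A x - b^T x, so H v = A v and grad = A x - b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
b = np.array([1.0, 1.0])
x0 = np.zeros(2)
grad = A @ x0 - b
newton_step = cg_solve(lambda v: A @ v, -grad)   # solve H d = -g matrix-free
print(x0 + newton_step)                          # lands on the minimizer A^{-1} b
```

For an SPD system in d dimensions, CG converges in at most d iterations in exact arithmetic, and far fewer when the eigenvalues cluster.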

This is why quasi-Newton methods (which approximate the Hessian cheaply) and first-order methods (SGD, Adam) dominate in practice.

Common Confusions

Watch Out

Newton for optimization vs. Newton for root-finding are different algorithms

Newton's method for root-finding solves f(x) = 0. Newton's method for optimization solves \nabla f(x) = 0. They have the same structure (linearize and solve), but the function being linearized is different: f itself in root-finding, \nabla f in optimization. In 1D, Newton for optimization uses the second derivative f'' (curvature), while root-finding uses the first derivative f' (slope). The convergence theory is analogous but the algorithms solve structurally different problems.

Watch Out

Newton's method can converge to maxima or saddle points

Newton's method finds stationary points where \nabla f = 0. It does not distinguish between minima, maxima, and saddle points. If you start near a saddle point, Newton may converge to it happily. Checking that the Hessian is positive definite at convergence is necessary to confirm you found a local minimum. Modifications like adding a multiple of the identity to the Hessian (\nabla^2 f + \lambda I) ensure positive definiteness.
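A minimal numerical illustration (the cubic f(x) = x³ - 3x and the helper names are my choices): the same Newton iteration converges to the local maximum or the local minimum depending only on the starting point, and only the sign of f'' at convergence tells them apart.

```python
def newton_opt_1d(x, fprime, fsecond, iters=50):
    """Pure 1D Newton for optimization: x <- x - f'(x)/f''(x)."""
    for _ in range(iters):
        x = x - fprime(x) / fsecond(x)
    return x

# f(x) = x^3 - 3x has a local maximum at x = -1 and a local minimum at x = +1.
fp = lambda x: 3 * x ** 2 - 3
fpp = lambda x: 6 * x

print(newton_opt_1d(-0.9, fp, fpp))  # → -1.0: converged to the MAXIMUM
print(newton_opt_1d(+0.9, fp, fpp))  # →  1.0: converged to the minimum
print(fpp(-1.0))                     # → -6.0: second derivative is negative there
```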

Summary

  • Newton's step: x^{(t+1)} = x^{(t)} - [\nabla^2 f(x^{(t)})]^{-1} \nabla f(x^{(t)})
  • Motivated by minimizing the local second-order Taylor model
  • Quadratic convergence near a strict local minimum with Lipschitz Hessian
  • Converges in exactly 1 step for quadratic functions
  • Requires computing and inverting the d \times d Hessian: O(d^2) storage, O(d^3) dense solve; Hessian-vector products avoid forming the matrix explicitly
  • Impractical for d > 10^4 in its pure form
  • Damped Newton with line search gives global convergence under convexity / positive-definite (or modified) Hessian assumptions; for nonconvex problems, use modified-Hessian, trust-region, or cubic-regularized variants

Exercises

ExerciseCore

Problem

Let f(x) = \frac{1}{2} x^\top A x - b^\top x where A \succ 0 (positive definite). Show that Newton's method converges to the minimizer x^* = A^{-1}b in exactly one step from any starting point.
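A numerical check of the claim, not a substitute for the algebraic proof (the random test matrix and seed are my choices):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M @ M.T + 4 * np.eye(4)        # random symmetric positive definite matrix
b = rng.standard_normal(4)

x0 = rng.standard_normal(4)        # arbitrary starting point
grad = A @ x0 - b                  # grad f(x) = A x - b; the Hessian is A
x1 = x0 - np.linalg.solve(A, grad) # single Newton step

print(np.allclose(x1, np.linalg.solve(A, b)))  # → True: one step reaches A^{-1} b
```

Algebraically the check is immediate: x1 = x0 - A^{-1}(A x0 - b) = A^{-1} b for any x0.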

ExerciseAdvanced

Problem

Show that gradient descent on the same quadratic f(x) = \frac{1}{2}x^\top Ax - b^\top x with step size \alpha = 2/(\lambda_{\max} + \lambda_{\min}) (optimal fixed step size) has convergence rate:

\|x^{(t)} - x^*\|_A \leq \left(\frac{\kappa - 1}{\kappa + 1}\right)^t \|x^{(0)} - x^*\|_A

where \kappa = \lambda_{\max}/\lambda_{\min} is the condition number. Compare this to Newton's one-step convergence.
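A numerical companion to the exercise (the diagonal test matrix is my choice): with A = diag(1, 10) and the optimal step size, both error components shrink in magnitude by exactly (κ - 1)/(κ + 1) = 9/11 per step, so the A-norm contracts at precisely the stated rate.

```python
import numpy as np

lam_min, lam_max = 1.0, 10.0
A = np.diag([lam_min, lam_max])     # condition number kappa = 10; minimizer x* = 0
b = np.zeros(2)
kappa = lam_max / lam_min
alpha = 2.0 / (lam_max + lam_min)   # optimal fixed step size

def a_norm(x):
    return np.sqrt(x @ A @ x)

rate = (kappa - 1) / (kappa + 1)    # = 9/11 ≈ 0.818
x = np.array([1.0, 1.0])
for _ in range(5):
    prev = a_norm(x)
    x = x - alpha * (A @ x - b)     # gradient step: grad f(x) = A x - b
    print(a_norm(x) / prev)         # each ratio ≈ 9/11
```

Compare the per-step factor 0.818 with Newton's single-step exact solve on the same problem.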

References

Canonical:

  • Nocedal & Wright, Numerical Optimization (2006), Chapters 2-3 and Section 3.3 (Newton with line search)
  • Boyd & Vandenberghe, Convex Optimization (2004), Section 9.5
  • Dennis & Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations (1996), Chapter 5

Current:

  • Nesterov, Lectures on Convex Optimization (2018), Chapter 1
  • Conn, Gould, Toint, Trust-Region Methods (2000), Chapter 6 (trust-region Newton methods)
  • Bottou, Curtis, Nocedal, "Optimization Methods for Large-Scale Machine Learning" (2018), SIAM Review. Section 4 on second-order methods at scale.

Next Topics

The natural next steps from Newton's method:

  • Quasi-Newton methods: approximate the Hessian cheaply (BFGS, L-BFGS)
  • Line search methods: choosing step sizes with Wolfe conditions

Last reviewed: April 26, 2026
