
Foundations

Differentiation in Rⁿ

Partial derivatives, directional derivatives, gradients, total derivatives, and the multivariable chain rule. The point is not notation: differentiability means one linear map predicts all small directions.


Why This Matters

Machine learning optimization is built on one sentence:

Near the current parameters, replace the loss by a linear or quadratic approximation, then choose a step.

[Figure: five-panel infographic covering directional derivatives, partial derivatives, the total derivative as a linear map, the Jacobian and gradient, the chain rule, and applications (backpropagation, implicit function theorem, manifold optimization). Caption: The total derivative is the linear map that best approximates a function at a point. The Jacobian is its matrix; the gradient is the special case for scalar-valued functions.]

That sentence only makes sense if you understand differentiation in $\mathbb{R}^n$. Gradients tell SGD and Adam which way the loss changes. Jacobians compose through neural-network layers. Hessians explain curvature, Newton steps, conditioning, and why a learning rate can explode.

The trap is thinking that multivariable differentiation is just "take partial derivatives." It is stricter than that. The real object is the total derivative: one linear map that predicts the function for every small direction at once.

The Local Linear Model

For $f:\mathbb{R}^n\to\mathbb{R}^m$, differentiability at $a$ means there is a linear map $Df(a):\mathbb{R}^n\to\mathbb{R}^m$ such that

$$f(a+h)=f(a)+Df(a)h+r(h),\qquad \|r(h)\|/\|h\|\to 0.$$

The remainder $r(h)$ must be small compared with the step length $\|h\|$. This is the multivariable version of a tangent line, but now the tangent object is a linear map.
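A quick numerical check makes the definition concrete. This is a minimal sketch with a made-up map $f(x,y)=(x^2 y,\ \sin(x+y))$ and its hand-computed Jacobian; the printed ratio $\|r(h)\|/\|h\|$ should shrink with the step length:

```python
# Minimal sketch: check the defining limit of differentiability for the
# made-up map f(x, y) = (x^2 * y, sin(x + y)) at a = (1, 2).
import numpy as np

def f(p):
    x, y = p
    return np.array([x**2 * y, np.sin(x + y)])

a = np.array([1.0, 2.0])
# Hand-computed Jacobian at a: rows are outputs, columns are inputs.
Df = np.array([[2 * a[0] * a[1], a[0]**2],
               [np.cos(a[0] + a[1]), np.cos(a[0] + a[1])]])

rng = np.random.default_rng(0)
v = rng.standard_normal(2)
v /= np.linalg.norm(v)                   # fixed direction, shrinking length
for t in [1e-1, 1e-2, 1e-3, 1e-4]:
    h = t * v
    r = f(a + h) - f(a) - Df @ h         # remainder r(h)
    print(t, np.linalg.norm(r) / np.linalg.norm(h))   # ratio -> 0 with t
```

The ratios fall roughly in proportion to $t$, which is exactly the $o(\|h\|)$ behavior the definition demands.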

Coordinate Derivatives

Definition

Partial Derivative

For $f:\mathbb{R}^n\to\mathbb{R}$, the partial derivative with respect to coordinate $i$ at $a$ is

$$\frac{\partial f}{\partial x_i}(a)=\lim_{t\to 0}\frac{f(a+t e_i)-f(a)}{t}.$$

It measures change when only the $i$th coordinate moves.

Partial derivatives are coordinate probes. They are useful, but they do not by themselves prove the function has a good local linear approximation in every direction.
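The limit in the definition translates directly into a central difference. A sketch, with an arbitrary example function and an assumed step size of $10^{-6}$:

```python
# Sketch: partial derivatives as coordinate probes, via central differences.
# The function and the step size eps are illustrative choices.
import numpy as np

def f(p):
    x, y, z = p
    return x * y + np.exp(y * z)

def partial(f, a, i, eps=1e-6):
    e = np.zeros_like(a)
    e[i] = eps
    return (f(a + e) - f(a - e)) / (2 * eps)   # probe coordinate i only

a = np.array([1.0, 2.0, 0.5])
print([partial(f, a, i) for i in range(3)])
# Analytic check: df/dx = y, df/dy = x + z e^{yz}, df/dz = y e^{yz}.
print([a[1], a[0] + a[2] * np.exp(a[1] * a[2]), a[1] * np.exp(a[1] * a[2])])
```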

Definition

Directional Derivative

For a unit vector $v\in\mathbb{R}^n$, the directional derivative is

$$D_v f(a)=\lim_{t\to 0}\frac{f(a+tv)-f(a)}{t}.$$

It measures the rate of change along the line through $a$ in direction $v$.

Directional derivatives probe more directions than partial derivatives, but even the existence of every directional derivative is not the same as differentiability. Differentiability requires those directional rates to come from one linear map.
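When $f$ is differentiable, every directional derivative must agree with that single linear map: $D_v f(a)=\nabla f(a)^T v$. A minimal check on a toy polynomial (the function and directions are arbitrary):

```python
# Sketch: all directional derivatives of a differentiable f come from one
# linear map h -> grad(a) . h. Toy polynomial, random unit directions.
import numpy as np

def f(p):
    return p[0]**2 + 3 * p[0] * p[1]

a = np.array([1.0, -1.0])
grad = np.array([2 * a[0] + 3 * a[1], 3 * a[0]])   # analytic gradient at a

def directional(f, a, v, t=1e-6):
    return (f(a + t * v) - f(a)) / t               # forward difference

rng = np.random.default_rng(1)
for _ in range(3):
    v = rng.standard_normal(2)
    v /= np.linalg.norm(v)
    print(directional(f, a, v), grad @ v)          # columns should match
```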

Gradient And Total Derivative

Definition

Gradient

For a scalar-valued function $f:\mathbb{R}^n\to\mathbb{R}$, the gradient is the column vector

$$\nabla f(a)=\left(\frac{\partial f}{\partial x_1}(a),\ldots,\frac{\partial f}{\partial x_n}(a)\right)^T.$$

When $f$ is differentiable, $Df(a)h=\nabla f(a)^T h$.

The gradient is the vector representation of the total derivative for scalar outputs. The total derivative itself is a linear functional, often written as a row vector. This row-versus-column distinction matters when multiplying Jacobians in the chain rule.

Definition

Jacobian Matrix

For $f:\mathbb{R}^n\to\mathbb{R}^m$, the total derivative $Df(a)$ is represented by the $m \times n$ Jacobian matrix

$$[J_f(a)]_{ij}=\frac{\partial f_i}{\partial x_j}(a).$$

The Jacobian maps input perturbations $h$ to first-order output perturbations.
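A minimal sketch, assuming central differences are accurate enough for a smooth toy map: assemble $J_f$ column by column, confirm the $m \times n$ shape, and compare $J_f h$ against the actual output perturbation.

```python
# Sketch: numerical Jacobian of a toy map f: R^3 -> R^2, so J is 2 x 3.
import numpy as np

def f(p):
    x, y, z = p
    return np.array([x * y, y + z**2])

def jacobian(f, a, eps=1e-6):
    m, n = f(a).shape[0], a.shape[0]
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros_like(a)
        e[j] = eps
        J[:, j] = (f(a + e) - f(a - e)) / (2 * eps)  # column j = df/dx_j
    return J

a = np.array([1.0, 2.0, 3.0])
J = jacobian(f, a)
assert J.shape == (2, 3)            # m by n, never n by m
h = np.array([1e-3, -2e-3, 5e-4])
print(f(a + h) - f(a))              # actual output perturbation
print(J @ h)                        # first-order prediction
```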

Steepest Direction

Theorem

Gradient As Steepest Ascent Direction

Statement

If $\nabla f(a) \neq 0$, then among all unit vectors $v$, the directional derivative $D_v f(a) = \nabla f(a)^T v$ is maximized by $v = \nabla f(a)/\|\nabla f(a)\|$, and the maximum value is $\|\nabla f(a)\|$. The minimum is achieved in the negative-gradient direction with value $-\|\nabla f(a)\|$.

If $\nabla f(a) = 0$, the expression $\nabla f(a)/\|\nabla f(a)\|$ is undefined; in this case every directional derivative satisfies $D_v f(a) = 0$, every unit vector is simultaneously a maximizer and minimizer, and there is no first-order steepest direction. Determining whether $a$ is a min, max, or saddle requires second-order information (the Hessian).

Intuition

The dot product is largest when two vectors point in the same direction. The gradient points uphill fastest; the negative gradient points downhill fastest for an infinitesimal Euclidean step.

Proof Sketch

By Cauchy-Schwarz, $\nabla f(a)^T v\leq \|\nabla f(a)\|\,\|v\|=\|\nabla f(a)\|$. Equality holds exactly when $v$ is a positive scalar multiple of $\nabla f(a)$. The minimum follows by using the opposite direction.
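A brute-force check of the statement on an arbitrary quadratic: sample many unit vectors and confirm that none beats the normalized gradient.

```python
# Sketch: among sampled unit vectors, the directional derivative grad . v
# is maximized near v = grad / ||grad||, with maximum value ||grad||.
import numpy as np

a = np.array([2.0, 1.0])
g = np.array([2 * a[0], a[1]])        # gradient of x^2 + 0.5 y^2 at a

rng = np.random.default_rng(2)
vs = rng.standard_normal((10_000, 2))
vs /= np.linalg.norm(vs, axis=1, keepdims=True)   # project onto unit sphere
print((vs @ g).max())                 # best sampled directional derivative
print(np.linalg.norm(g))              # theoretical maximum ||grad f(a)||
```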

Why It Matters

This is the mathematical reason gradient descent, SGD, and Adam use $-\nabla_\theta \mathcal{L}(\theta)$ as their primary signal.

Failure Mode

This is an infinitesimal claim. A finite step can overshoot because curvature enters through higher-order terms. That is why line search, learning-rate schedules, and second-order methods exist.
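A one-dimensional sketch of the overshoot: for $f(x)=\tfrac12 k x^2$ the gradient step multiplies $x$ by $1-\mathrm{lr}\cdot k$, so any learning rate above $2/k$ diverges. The curvature $k$ below is a made-up knob.

```python
# Sketch: gradient descent on f(x) = 0.5 * k * x^2 contracts only when
# |1 - lr * k| < 1, i.e. lr < 2 / k. Here 2 / k = 0.02.
k = 100.0                      # curvature the linear model cannot see
for lr in (0.005, 0.019, 0.021):
    x = 1.0
    for _ in range(50):
        x -= lr * k * x        # gradient of 0.5 * k * x^2 is k * x
    print(lr, x)               # converges, barely converges, explodes
```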

When Partials Are Enough

Theorem

Continuous Partial Derivatives Imply Differentiability

Statement

If all first partial derivatives of $f:\mathbb{R}^n\to\mathbb{R}^m$ exist in a neighborhood of $a$ and are continuous at $a$, then $f$ is differentiable at $a$.

Intuition

Continuity of the partial derivatives prevents coordinate-wise rates from changing wildly as you approach the point from different directions. The coordinate probes assemble into one stable linear approximation.

Proof Sketch

Write $f(a+h)-f(a)$ as a telescoping sum that changes one coordinate at a time. Apply the one-variable mean value theorem to each coordinate change. Continuity of the partial derivatives turns each coefficient into the corresponding partial derivative at $a$ plus a small error. The sum becomes $Df(a)h+o(\|h\|)$.

Why It Matters

Most smooth ML losses satisfy this condition away from nonsmooth activations or constraints, so computing the Jacobian from partial derivatives is usually legitimate.

Chain Rule

Theorem

Multivariable Chain Rule

Statement

If $g:\mathbb{R}^n\to\mathbb{R}^k$ is differentiable at $a$ and $f:\mathbb{R}^k\to\mathbb{R}^m$ is differentiable at $g(a)$, then $f\circ g$ is differentiable at $a$ and

$$D(f\circ g)(a)=Df(g(a))\,Dg(a).$$

Intuition

Each differentiable map is locally linear. Composing functions locally composes their linear approximations, so the derivative of the composition is matrix multiplication.
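A numerical spot check of the statement, using small hypothetical maps $g:\mathbb{R}^2\to\mathbb{R}^2$ and $f:\mathbb{R}^2\to\mathbb{R}$ with analytic Jacobians:

```python
# Sketch: D(f o g)(a) = Df(g(a)) @ Dg(a), verified against a finite
# difference of the composition. Both maps are toy examples.
import numpy as np

def g(p):  return np.array([p[0] * p[1], p[0] + p[1]])
def Dg(p): return np.array([[p[1], p[0]], [1.0, 1.0]])   # 2 x 2

def f(q):  return q[0]**2 + np.sin(q[1])
def Df(q): return np.array([[2 * q[0], np.cos(q[1])]])   # 1 x 2 row

a = np.array([1.0, 2.0])
chain = Df(g(a)) @ Dg(a)                # claimed derivative of f o g at a

h = np.array([1e-6, -1e-6])
fd = f(g(a + h)) - f(g(a))              # actual first-order change
print(chain @ h, fd)                    # should agree to leading order
```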

Proof Sketch

Use $g(a+h)=g(a)+Dg(a)h+o(\|h\|)$. Then apply the differentiability of $f$ at $g(a)$ to the perturbation $Dg(a)h+o(\|h\|)$. Linearity of $Df(g(a))$ produces $Df(g(a))Dg(a)h$ plus a remainder that is $o(\|h\|)$.

Why It Matters

Backpropagation is the chain rule on a computational graph. A network is a composition of layers; reverse-mode automatic differentiation multiplies local derivatives in the direction that is efficient for scalar losses.
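The sketch below shows the reverse-mode pattern on a toy two-layer model. The shapes, layers, and variable names are illustrative, not any real framework's API; the point is that a row vector flows backward through each local Jacobian, so no full Jacobian is ever materialized.

```python
# Sketch: reverse-mode chain rule on loss = W2 @ relu(W1 @ x).
import numpy as np

rng = np.random.default_rng(3)
x  = rng.standard_normal(4)
W1 = rng.standard_normal((5, 4))
W2 = rng.standard_normal((1, 5))

# Forward pass, caching intermediates for the backward pass.
z1 = W1 @ x
h1 = np.maximum(z1, 0.0)          # ReLU
loss = (W2 @ h1)[0]               # scalar output

# Backward pass: each step is (row vector) @ (local Jacobian).
d_h1 = W2[0]                      # d loss / d h1, shape (5,)
d_z1 = d_h1 * (z1 > 0)            # ReLU's Jacobian is diagonal
d_x  = d_z1 @ W1                  # d loss / d x, shape (4,)
d_W1 = np.outer(d_z1, x)          # parameter gradient, shape (5, 4)
print(d_x.shape, d_W1.shape)
```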

Failure Mode

Nonsmooth points such as ReLU at zero break classical differentiability. Optimizers still work by using subgradients or implementation conventions, but theorem assumptions should not be silently ignored.

ML Translation Table

| Math object | ML interpretation | What to check |
| --- | --- | --- |
| $x\in\mathbb{R}^n$ | parameters, features, or activations | which variable is being differentiated |
| $f(x)$ | loss, layer map, score, or metric | scalar or vector output |
| $\nabla f(x)$ | first-order signal for update direction | norm, scale, noise, clipping |
| $Df(x)$ / Jacobian | local sensitivity of outputs to inputs | shape, conditioning, chain rule order |
| $D^2f(x)$ / Hessian | curvature of the scalar loss | positive definite, indefinite, ill-conditioned |
| $o(\|h\|)$ | ignored higher-order error | when finite steps make the linear model invalid |

Common Confusions

Watch Out

Partial derivatives existing is not enough

There are functions with all partial derivatives at a point that are not continuous there, and therefore not differentiable. Partial derivatives are coordinate checks; differentiability is an all-directions linear approximation.
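The standard counterexample is $f(x,y)=xy/(x^2+y^2)$ with $f(0,0)=0$: both partials at the origin exist and equal zero, yet along the diagonal $y=x$ the function is constantly $1/2$, so it is not even continuous there. A few evaluations make this visible:

```python
# Sketch: partials exist at the origin, yet f is discontinuous there.
def f(x, y):
    return 0.0 if x == y == 0 else x * y / (x**2 + y**2)

for t in (1e-1, 1e-3, 1e-6):
    print(f(t, 0.0), f(t, t))   # axis values are 0, diagonal values stay 0.5
```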

Watch Out

The gradient is not the derivative for vector outputs

For $f:\mathbb{R}^n\to\mathbb{R}$, the gradient represents the derivative. For $f:\mathbb{R}^n\to\mathbb{R}^m$, the derivative is a Jacobian matrix, not a single gradient vector.

Watch Out

Steepest descent depends on the norm

The Euclidean gradient is steepest for Euclidean step size. If you change the geometry, the steepest direction changes. Natural gradient and preconditioned methods exploit exactly this fact.
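A sketch under an assumed metric $\|h\|_P=\sqrt{h^T P h}$ with $P$ positive definite: the steepest-descent direction becomes $-P^{-1}\nabla f$ rather than $-\nabla f$. The matrix $P$ below is an arbitrary illustrative preconditioner.

```python
# Sketch: changing the norm changes the steepest direction.
import numpy as np

g = np.array([1.0, 10.0])                 # gradient at the current point
P = np.diag([1.0, 100.0])                 # assumed curvature-like metric

d_euclid = -g                             # steepest for the Euclidean norm
d_P = -np.linalg.solve(P, g)              # steepest for the P-norm
print(d_euclid / np.linalg.norm(d_euclid))
print(d_P / np.linalg.norm(d_P))          # a genuinely different direction
```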

Watch Out

Backprop is not magic

Backpropagation is not a separate calculus. It is the multivariable chain rule organized efficiently so that scalar loss gradients are computed without forming every full Jacobian.

Q&A For Mastery

Why does differentiability require an open neighborhood? You need to approach the point from many directions. On a boundary or constrained set, ordinary derivatives may need to be replaced by restricted, one-sided, or manifold derivatives.

What should I remember for neural networks? Each layer has a local derivative. Backprop multiplies those local derivatives in reverse order to get gradients with respect to parameters.

Why do gradients vanish or explode? Long products of Jacobians can shrink or amplify vectors. This is the calculus reason recurrent nets and deep networks need normalization, residual connections, initialization care, and gating.
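A rough illustration, with a made-up scale knob s controlling the typical Jacobian size: products over 50 layers shrink toward zero or grow geometrically.

```python
# Sketch: norms of products of 50 random Jacobians scale roughly like s^50.
import numpy as np

rng = np.random.default_rng(4)
for s in (0.9, 1.1):
    v = np.ones(10)
    for _ in range(50):                          # 50 composed layers
        J = s * rng.standard_normal((10, 10)) / np.sqrt(10)
        v = J @ v                                # backprop-style product
    print(s, np.linalg.norm(v))                  # tiny vs. huge
```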

What is the practical shape check? If $f:\mathbb{R}^n\to\mathbb{R}^m$, then $J_f$ is $m \times n$, and the product $J_f h$ must be a vector in $\mathbb{R}^m$.

What To Remember

  • Differentiability means one linear map approximates all small directions.
  • Partial derivatives are coordinate probes, not the whole story.
  • For scalar outputs, $Df(a)h=\nabla f(a)^T h$.
  • The negative gradient is the steepest infinitesimal descent direction in Euclidean geometry.
  • The chain rule is matrix multiplication of local linear maps.
  • Backpropagation is the chain rule arranged for computational efficiency.

Exercises

Exercise (Core)

Problem

Let $f(x,y)=x^2y+e^{xy}$. Compute $\nabla f(1,0)$ and the directional derivative in direction $v=(3/5,4/5)$.

Exercise (Advanced)

Problem

Suppose $f:\mathbb{R}^n\to\mathbb{R}$ is differentiable at $a$. Prove that if $\nabla f(a)=0$, then every directional derivative at $a$ equals zero.

Exercise (Research)

Problem

Explain why ReLU networks can be trained with gradient methods even though ReLU is not differentiable at zero.

References

Canonical:

  • Rudin, Principles of Mathematical Analysis, 3rd ed., Chapter 9.
  • Spivak, Calculus on Manifolds, Chapter 2.
  • Apostol, Mathematical Analysis, 2nd ed., Chapter 12.

ML-facing:

  • Deisenroth, Faisal, and Ong, Mathematics for Machine Learning, Chapter 5.
  • Goodfellow, Bengio, and Courville, Deep Learning, Chapter 4.
  • Boyd and Vandenberghe, Convex Optimization, Appendix A.4.

Last reviewed: April 26, 2026