
Mathematical Infrastructure

Vector Calculus Chain Rule

The chain rule for compositions of multivariable maps. Jacobians multiply when functions compose; gradients of scalar-valued compositions become vector-Jacobian products. The result that makes backpropagation a one-line theorem.

Core · Tier 1 · Stable · Core spine · ~25 min

Why This Matters

The single-variable chain rule says $(f \circ g)'(x) = f'(g(x)) \cdot g'(x)$. The multivariable version replaces those scalar derivatives with Jacobians, and the product becomes a matrix product. That one upgrade is enough to derive backpropagation in one line: a deep network is a composition $f_L \circ f_{L-1} \circ \cdots \circ f_1$, its Jacobian is the product $J_L J_{L-1} \cdots J_1$, and the gradient of a scalar loss is a vector-Jacobian product evaluated right-to-left.

This page states the rule with assumptions, gives the matrix form, derives the gradient corollary, and shows the implicit chain rule that handles constraints. It does not re-derive backprop end-to-end; that lives on its own page.

The Rule

Theorem

Multivariable Chain Rule

Statement

Let $g: U \to V$ be differentiable at $x_0 \in U \subset \mathbb{R}^n$, with $g(U) \subset V \subset \mathbb{R}^m$, and let $f: V \to \mathbb{R}^p$ be differentiable at $y_0 = g(x_0)$. Then the composition $h = f \circ g$ is differentiable at $x_0$ and its Jacobian satisfies

$$J_h(x_0) = J_f(g(x_0)) \cdot J_g(x_0),$$

where the right-hand side is matrix multiplication of a $p \times m$ matrix with an $m \times n$ matrix.
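To see the theorem in numbers, here is a minimal sanity check, assuming NumPy is available; the maps f and g below are arbitrary smooth examples chosen only for illustration, and the Jacobians are estimated by central differences.

```python
import numpy as np

def g(x):                                  # g: R^2 -> R^3
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

def f(y):                                  # f: R^3 -> R^2
    return np.array([y[0] + y[1] * y[2], np.exp(y[0])])

def jacobian_fd(func, x, eps=1e-6):
    """Central-difference Jacobian, one column per input coordinate."""
    y0 = func(x)
    J = np.zeros((y0.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (func(x + dx) - func(x - dx)) / (2 * eps)
    return J

x0 = np.array([0.7, -1.3])
J_chain = jacobian_fd(f, g(x0)) @ jacobian_fd(g, x0)   # J_f(g(x0)) · J_g(x0), 2 × 2
J_direct = jacobian_fd(lambda x: f(g(x)), x0)          # Jacobian of f ∘ g directly
print(np.allclose(J_chain, J_direct, atol=1e-4))       # True
```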

Intuition

Differentiability of $g$ at $x_0$ means $g$ admits a best linear approximation: $g(x_0 + h) = g(x_0) + J_g(x_0) h + o(\|h\|)$. Apply $f$ and use its own linear approximation at $g(x_0)$. The composed approximation is linear with matrix $J_f(g(x_0)) \cdot J_g(x_0)$. Because the linear approximation of a differentiable map is unique, this matrix is the Jacobian of $h$.

Proof Sketch

By assumption $g(x_0 + h) - g(x_0) = J_g(x_0) h + r_g(h)$ with $r_g(h) = o(\|h\|)$, and $f(y_0 + k) - f(y_0) = J_f(y_0) k + r_f(k)$ with $r_f(k) = o(\|k\|)$. Set $k = J_g(x_0) h + r_g(h)$ and substitute:

$$h(x_0 + h) - h(x_0) = J_f(y_0) J_g(x_0) h + J_f(y_0) r_g(h) + r_f(k).$$

The term $J_f(y_0) r_g(h)$ is $o(\|h\|)$ because $J_f(y_0)$ is a fixed linear map. The term $r_f(k)$ is $o(\|k\|)$ and $\|k\| \leq \|J_g(x_0)\| \|h\| + \|r_g(h)\|$, so $r_f(k) = o(\|h\|)$ as well. The remainder is $o(\|h\|)$, identifying $J_f(y_0) J_g(x_0)$ as the Jacobian of $h$.

Why It Matters

Every layer in a neural network is a function $f_\ell: \mathbb{R}^{d_{\ell-1}} \to \mathbb{R}^{d_\ell}$. A depth-$L$ network is the composition $h = f_L \circ \cdots \circ f_1$. The chain rule gives $J_h = J_{f_L} \cdots J_{f_1}$, a product of $L$ matrices. Forward-mode autodiff multiplies left-to-right; reverse-mode (backprop) multiplies right-to-left starting from the loss gradient. The two orderings are not equal-cost variants that differ only in memory: forward mode computes one directional derivative per pass and so requires $n_{\text{in}}$ passes to recover a full Jacobian (cheap when input dimension is small), while reverse mode computes one gradient per pass and so requires $n_{\text{out}}$ passes (cheap when output dimension is small). For a scalar loss, $n_{\text{out}} = 1$ and a single reverse pass yields the full gradient at cost comparable to one forward evaluation (the cheap-gradient principle). This input-vs-output-dimension scaling, not memory layout, is the actual reason backprop wins for deep nets with many parameters.
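A minimal sketch of that scaling argument, with random matrices standing in for the layer Jacobians (NumPy assumed; the dimensions are illustrative). One reverse sweep of vector-Jacobian products recovers the whole gradient of a scalar output, while one forward sweep of Jacobian-vector products recovers only a single directional derivative.

```python
import numpy as np

rng = np.random.default_rng(0)
dims = [1000, 500, 200, 1]                 # n_in = 1000, scalar output
Js = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(len(dims) - 1)]

# Reverse mode: start from dL/dy = 1 and multiply right-to-left (VJPs).
v = np.ones(1)
for J in reversed(Js):
    v = J.T @ v                            # each step is one matrix-vector product
grad_reverse = v                           # full gradient, shape (1000,), one sweep

# Forward mode: one JVP per input direction; a single direction e_0 shown here.
u = np.zeros(dims[0]); u[0] = 1.0
for J in Js:
    u = J @ u                              # directional derivative along e_0
print(np.allclose(grad_reverse[0], u[0]))  # True: same number, very different cost profile
```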

Failure Mode

Differentiability at the inner point is required, not just continuity. Composing a non-differentiable inner map with a differentiable outer one breaks the rule even when the composition happens to be differentiable. ReLU activations are not differentiable at zero, so neural-network practice uses subgradients there; modern autodiff frameworks pick a convention (typically $0$ or $1$) and document it.

Gradient Corollary

For scalar-valued compositions the chain rule has a familiar special case. Let $L: \mathbb{R}^p \to \mathbb{R}$ be differentiable and define $\ell(x) = L(g(x))$ for $g: \mathbb{R}^n \to \mathbb{R}^p$. By the theorem, $J_\ell(x_0) = J_L(g(x_0)) \cdot J_g(x_0)$. Since $L$ is scalar-valued, $J_L(y) = \nabla L(y)^T$ is a row vector of size $p$, and $J_g$ is $p \times n$. The product is a row vector of size $n$; transposing it gives

$$\nabla \ell(x_0) = J_g(x_0)^T \nabla L(g(x_0)).$$

This is a vector-Jacobian product (VJP): the upstream gradient $\nabla L(g(x_0))$ is propagated backward by left-multiplying by $J_g^T$. Backprop is exactly this identity applied recursively layer by layer.
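A small check of the corollary, assuming NumPy; the particular g and L below are placeholder choices made so the Jacobian can be written by hand.

```python
import numpy as np

def g(x):                                   # g: R^3 -> R^2
    return np.array([x[0] * x[1] + x[2], np.tanh(x[1])])

def L(y):                                   # scalar loss, L(y) = ||y||^2 / 2
    return 0.5 * np.sum(y ** 2)

x0 = np.array([0.2, -0.4, 1.1])
y0 = g(x0)

J_g = np.array([[x0[1], x0[0], 1.0],        # Jacobian of g at x0, written by hand
                [0.0, 1.0 - np.tanh(x0[1]) ** 2, 0.0]])
grad_L = y0                                 # ∇L(y) = y for this loss

grad_vjp = J_g.T @ grad_L                   # the vector-Jacobian product J_g^T ∇L

# Compare against central finite differences of the composition L(g(x)).
eps = 1e-6
grad_fd = np.array([(L(g(x0 + eps * e)) - L(g(x0 - eps * e))) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(grad_vjp, grad_fd, atol=1e-6))   # True
```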

Implicit Function Chain Rule

Sometimes a variable is defined implicitly: $F(x, y) = 0$ specifies $y$ as a function of $x$ near a point where $\partial F / \partial y$ is invertible. The implicit function theorem guarantees a smooth $y(x)$ exists locally, and differentiating $F(x, y(x)) = 0$ via the chain rule gives

$$\frac{\partial F}{\partial x} + \frac{\partial F}{\partial y} \cdot \frac{dy}{dx} = 0 \quad\Longrightarrow\quad \frac{dy}{dx} = -\left(\frac{\partial F}{\partial y}\right)^{-1} \frac{\partial F}{\partial x}.$$

Implicit differentiation underpins meta-learning by implicit gradients, deep-equilibrium models (DEQs), and the gradient of an optimization solution with respect to its parameters.
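As a concrete scalar instance (the equation below is an illustrative choice, not taken from any of the applications just named), the implicit derivative can be checked against an explicit solution:

```python
# F(x, y) = x^2 + y^3 - 2 = 0 defines y(x) near (1, 1); dy/dx = -(∂F/∂y)^{-1} ∂F/∂x.
def F(x, y):
    return x ** 2 + y ** 3 - 2.0

x0, y0 = 1.0, 1.0
assert abs(F(x0, y0)) < 1e-12              # (x0, y0) lies on the constraint

dF_dx = 2 * x0                             # ∂F/∂x at (x0, y0)
dF_dy = 3 * y0 ** 2                        # ∂F/∂y at (x0, y0), nonzero hence invertible
dy_dx_implicit = -dF_dx / dF_dy            # = -2/3

# The explicit solution y(x) = (2 - x^2)^(1/3) exists here, so compare directly.
eps = 1e-6
y = lambda x: (2.0 - x ** 2) ** (1.0 / 3.0)
dy_dx_fd = (y(x0 + eps) - y(x0 - eps)) / (2 * eps)
print(dy_dx_implicit, dy_dx_fd)            # both ≈ -0.6667
```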

Worked Example: Two-Layer Network

Let $g(x) = \sigma(W_1 x)$ and $h(x) = W_2\, g(x)$ for an activation $\sigma$ applied componentwise, with $x \in \mathbb{R}^n$, $W_1 \in \mathbb{R}^{m \times n}$, $W_2 \in \mathbb{R}^{p \times m}$. The Jacobian of the inner map is $J_g(x) = \mathrm{diag}(\sigma'(W_1 x)) \cdot W_1$, an $m \times n$ matrix. The Jacobian of the outer map at $y = g(x)$ is $W_2$. By the chain rule the composed Jacobian is

$$J_h(x) = W_2 \cdot \mathrm{diag}(\sigma'(W_1 x)) \cdot W_1.$$

For a scalar loss $L(h(x))$ the gradient with respect to $x$ is the VJP $\nabla_x L = W_1^T \cdot \mathrm{diag}(\sigma'(W_1 x)) \cdot W_2^T \cdot \nabla L$, evaluated right-to-left, which is precisely what backprop computes.
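The sketch below checks both identities numerically, with tanh standing in for the unspecified activation σ and random weights (NumPy assumed).

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, p = 4, 3, 2
W1, W2 = rng.standard_normal((m, n)), rng.standard_normal((p, m))
x = rng.standard_normal(n)

sigma = np.tanh
dsigma = lambda z: 1.0 - np.tanh(z) ** 2

def h(x):
    return W2 @ sigma(W1 @ x)

J_chain = W2 @ np.diag(dsigma(W1 @ x)) @ W1            # chain-rule Jacobian, p × n

# Finite-difference Jacobian of h for comparison.
eps = 1e-6
J_fd = np.column_stack([(h(x + eps * e) - h(x - eps * e)) / (2 * eps) for e in np.eye(n)])
print(np.allclose(J_chain, J_fd, atol=1e-5))           # True

# VJP for a scalar loss: propagate an upstream gradient backward through the same factors.
grad_L = rng.standard_normal(p)                        # stand-in for ∇L at h(x)
grad_x = W1.T @ (dsigma(W1 @ x) * (W2.T @ grad_L))
print(np.allclose(grad_x, J_chain.T @ grad_L))         # True
```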

Example

Chain rule on a single logistic neuron with numbers

Take a single logistic neuron $\hat{y} = \sigma(w^\top x + b)$ with $\sigma(z) = 1/(1 + e^{-z})$, weights $w = (0.3, 0.7)^\top$, bias $b = 0.1$, and a training point $x = (1.0, 0.5)^\top$ with target $y = 1$. The squared loss is $L = \tfrac{1}{2}(\hat{y} - y)^2$. Walk the chain rule one node at a time.

Forward pass. $z = 0.3 \cdot 1.0 + 0.7 \cdot 0.5 + 0.1 = 0.75$, so $\hat{y} = \sigma(0.75) = 1/(1 + e^{-0.75}) \approx 0.6792$. The loss is $L \approx \tfrac{1}{2}(0.6792 - 1)^2 \approx 0.0515$.

Backward pass. The chain rule gives

$$\frac{\partial L}{\partial w_j} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w_j}.$$

Each factor has a clean form on this example:

  • $\partial L / \partial \hat{y} = \hat{y} - y = 0.6792 - 1 = -0.3208$.
  • $\partial \hat{y} / \partial z = \sigma'(z) = \sigma(z)(1 - \sigma(z)) = 0.6792 \cdot 0.3208 \approx 0.2179$.
  • $\partial z / \partial w_j = x_j$, so $\partial z / \partial w_1 = 1.0$ and $\partial z / \partial w_2 = 0.5$.

Multiplying: $\partial L / \partial w_1 = (-0.3208)(0.2179)(1.0) \approx -0.0699$, $\partial L / \partial w_2 = (-0.3208)(0.2179)(0.5) \approx -0.0349$, $\partial L / \partial b = (-0.3208)(0.2179)(1.0) \approx -0.0699$.

A single SGD step with learning rate $\eta = 1.0$ gives $w \leftarrow w - \eta \nabla_w L = (0.3699,\, 0.7349)^\top$ and $b \leftarrow 0.1699$. Re-evaluating, $z' \approx 0.9073$ and $\hat{y}' \approx 0.7124$, closer to the target $y = 1$.

The intermediate cache $(\hat{y}, \sigma'(z), x)$ is exactly the activation state backprop stores during the forward pass to reuse on the backward pass. This sigmoid-of-affine composition is the building block for backpropagation in feedforward networks. Goodfellow, Bengio, and Courville, Deep Learning (2016), §6.5.2 walks through the same two-step composition with general activations.
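The same walkthrough fits in a few lines; this is a hand-rolled sketch of the computation above (NumPy assumed), not any framework's API.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

w, b = np.array([0.3, 0.7]), 0.1
x, y = np.array([1.0, 0.5]), 1.0

# Forward pass: cache z and y_hat for reuse on the backward pass.
z = w @ x + b                          # 0.75
y_hat = sigmoid(z)                     # ≈ 0.6792
loss = 0.5 * (y_hat - y) ** 2          # ≈ 0.0515

# Backward pass: multiply the three local derivatives along the chain.
dL_dyhat = y_hat - y                   # ≈ -0.3208
dyhat_dz = y_hat * (1.0 - y_hat)       # ≈ 0.2179
grad_w = dL_dyhat * dyhat_dz * x       # ≈ [-0.0699, -0.0349]
grad_b = dL_dyhat * dyhat_dz           # ≈ -0.0699

# One SGD step with learning rate 1.0, then re-evaluate.
w, b = w - grad_w, b - grad_b
print(sigmoid(w @ x + b))              # ≈ 0.7124, closer to the target 1
```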

Common Confusions

Watch Out

Order of multiplication matters

The product $J_f \cdot J_g$ is not the same as $J_g \cdot J_f$ in general, because matrix multiplication is non-commutative and even the shapes typically disagree. The outer Jacobian $J_f$ is evaluated at the inner output $g(x)$ and goes on the left. Reversing the order is the most common chain-rule mistake on multivariable calculus exams.

Watch Out

Gradients are not Jacobians

For a scalar-valued function $f: \mathbb{R}^n \to \mathbb{R}$ the gradient $\nabla f$ is a column vector and the Jacobian is the row vector $J_f = (\nabla f)^T$. Many sources collapse the distinction and write $\nabla(f \circ g) = J_g^T \nabla f$, hiding the transpose inside the notation. When you implement autodiff or read papers, track the orientation explicitly: the chain rule gives row vectors via $J_f J_g$ and gradients via $J_g^T \nabla f$.
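A toy shape check of that bookkeeping, assuming NumPy; the all-ones Jacobian and upstream gradient are placeholders.

```python
import numpy as np

n, p = 3, 2
J_g = np.ones((p, n))                   # Jacobian of g: R^n -> R^p
grad_f = np.ones(p)                     # ∇f at g(x), stored as a 1-D array

row_form = grad_f[None, :] @ J_g        # J_f · J_g : a 1 × n row vector (a Jacobian)
col_form = J_g.T @ grad_f               # J_g^T ∇f  : an n-vector (a gradient)
print(row_form.shape, col_form.shape)   # (1, 3) (3,)
print(np.allclose(row_form.ravel(), col_form))   # same numbers, different orientation
```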

Exercises

Exercise (Core)

Problem

Let $f(x, y) = x^2 + y^2$ and let $g(t) = (t \cos t, t \sin t)$ trace out a spiral. Compute $\frac{d}{dt}(f \circ g)(t)$ via the chain rule, then verify by direct computation.

Exercise (Advanced)

Problem

Use the chain rule to derive the gradient of the squared loss $L(W) = \frac{1}{2} \|W x - y\|^2$ with respect to $W \in \mathbb{R}^{m \times n}$ for fixed $x \in \mathbb{R}^n$, $y \in \mathbb{R}^m$. Show that $\nabla_W L = (W x - y) x^T$.
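As a sanity check rather than a derivation (that part is the exercise), the stated formula can be compared against finite differences; NumPy is assumed and the dimensions and random inputs below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 3, 4
W, x, y = rng.standard_normal((m, n)), rng.standard_normal(n), rng.standard_normal(m)

L = lambda W: 0.5 * np.sum((W @ x - y) ** 2)
grad_formula = np.outer(W @ x - y, x)          # the claimed gradient (Wx - y) xᵀ

# Central-difference gradient, one entry of W at a time.
eps = 1e-6
grad_fd = np.zeros_like(W)
for i in range(m):
    for j in range(n):
        E = np.zeros_like(W)
        E[i, j] = eps
        grad_fd[i, j] = (L(W + E) - L(W - E)) / (2 * eps)
print(np.allclose(grad_formula, grad_fd, atol=1e-5))   # True
```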

References

  • Walter Rudin. Principles of Mathematical Analysis (3rd ed.). McGraw-Hill, 1976. Theorem 9.15: the chain rule for differentiable maps between Banach spaces. The cleanest statement and proof in the literature.
  • Michael Spivak. Calculus on Manifolds. W. A. Benjamin, 1965. Chapter 2.5: chain rule via best linear approximation. Short, modern, coordinate-free presentation.
  • Tom M. Apostol. Calculus, Volume II (2nd ed.). Wiley, 1969. Sections 8.10-8.12: multivariable chain rule with worked examples and the implicit-function variant.
  • Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, Jeffrey Mark Siskind. Automatic Differentiation in Machine Learning: A Survey. JMLR 18, 2018. Section 3 develops forward and reverse mode as the two associativity orderings of chained Jacobians. arXiv:1502.05767
  • Andreas Griewank and Andrea Walther. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation (2nd ed.). SIAM, 2008. Chapter 3: matrix-form chain rule and the cheap-gradient principle.
  • Jan R. Magnus and Heinz Neudecker. Matrix Differential Calculus with Applications in Statistics and Econometrics (3rd ed.). Wiley, 2019. Chapter 5: matrix chain rule with the trace inner product, the working tool for matrix gradients in ML.
