Chain Rule of Derivatives
Every weight update in a neural network comes down to one question: how does a change in W affect the loss? The problem is that the loss doesn't mention W directly. It mentions the output of a sigmoid, which depends on a weighted sum, which depends on an activation from the previous layer, which finally depends on W. To differentiate through that stack of composed functions, you need the chain rule. Without it, backpropagation has no mathematical foundation.
The anchor example throughout this post uses the two-layer network from earlier posts:
- Input: x = [0.5, 0.1], label y = 0
- W1 = [[0.5, -0.2], [0.3, 0.8]], b1 = [0.1, -0.1]
- W2 = [[0.6], [-0.4]], b2 = [0.0]
- Forward pass results: z1₁ = 0.33, a1₁ = 0.33, z1₂ = 0.13, a1₂ = 0.13, z2 = 0.146, ŷ = 0.536
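The forward-pass values above can be reproduced in a few lines of NumPy. This is a minimal sketch, assuming each row of W1 holds the weights of one hidden neuron (so z1 = W1 · x + b1), which is the convention that matches the listed numbers. The variable names are illustrative.

```python
import numpy as np

# Anchor network values from the list above
x = np.array([0.5, 0.1])
y = 0.0
W1 = np.array([[0.5, -0.2], [0.3, 0.8]])  # assumption: row i = weights of hidden neuron i
b1 = np.array([0.1, -0.1])
W2 = np.array([0.6, -0.4])                # flattened from [[0.6], [-0.4]]
b2 = 0.0

z1 = W1 @ x + b1                 # hidden pre-activations
a1 = np.maximum(0.0, z1)         # ReLU
z2 = W2 @ a1 + b2                # output pre-activation
y_hat = 1 / (1 + np.exp(-z2))    # sigmoid

print(np.round(z1, 3), np.round(a1, 3), round(float(z2), 3), round(float(y_hat), 3))
```

Rounded to three decimals, these agree with the forward-pass results listed above.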
Why the Chain Rule Exists
A neural network is nothing but a composition of functions:
ŷ = sigmoid( W2 · ReLU( W1 · x + b1 ) + b2 )
L = BCE(ŷ, y)
Written as a chain: L = BCE( sigmoid( linear2( ReLU( linear1(x) ) ) ) )
Gradient descent needs ∂L/∂W1 — how much the loss changes when you nudge W1. But if you look at the BCE formula, W1 doesn't appear in it. BCE depends on ŷ. ŷ depends on z2. z2 depends on a1. a1 depends on z1. z1 depends on W1.
The loss does depend on W1, just indirectly — through a chain of intermediate variables. The chain rule formalizes exactly how to follow that chain backward.
Single-Variable Chain Rule
If y = f(g(x)), the derivative of y with respect to x is:
dy/dx = f'(g(x)) · g'(x)
You differentiate the outer function (evaluated at the inner function's output), then multiply by the derivative of the inner function. The key insight: the two derivatives multiply; they don't add.
DL example: A single neuron computes z = wx + b, then y = σ(z). What is dy/dw?
y = σ(z), z = wx + b
dy/dw = (dy/dz) · (dz/dw)
= σ'(z) · x
= σ(z)(1 − σ(z)) · x
Numerical trace with w = 0.6, x = 0.5, b = 0.1:
z = 0.6 × 0.5 + 0.1 = 0.40
y = σ(0.40) = 1/(1 + e^{-0.40}) = 0.599
σ'(0.40) = 0.599 × (1 − 0.599) = 0.599 × 0.401 = 0.240
dy/dw = 0.240 × 0.5 = 0.120
[Diagram: how derivatives flow backward through the composition]
Multivariable Chain Rule
When a function has multiple inputs, each one gets its own partial derivative. If L = f(z) and z = g(x₁, x₂):
∂L/∂x₁ = (∂L/∂z) · (∂z/∂x₁)
∂L/∂x₂ = (∂L/∂z) · (∂z/∂x₂)
DL example: The pre-activation of a neuron is z = w₁x₁ + w₂x₂ + b. The partial derivatives of z with respect to each weight are simply:
∂z/∂w₁ = x₁
∂z/∂w₂ = x₂
∂z/∂b = 1
Each weight multiplies exactly one input, so its partial is just that input. This is a recurring pattern: the partial derivative of a linear layer's pre-activation with respect to a weight is always the input (or activation) feeding into that weight.
Numerical trace using anchor values x₁ = 0.5, x₂ = 0.1:
z = w₁ · 0.5 + w₂ · 0.1 + b
∂z/∂w₁ = x₁ = 0.5
∂z/∂w₂ = x₂ = 0.1
If ∂L/∂z = 0.536 (from the loss gradient):
∂L/∂w₁ = 0.536 × 0.5 = 0.268
∂L/∂w₂ = 0.536 × 0.1 = 0.054
The structure is always the same: multiply the incoming gradient by the local partial. The incoming gradient is whatever arrived from later in the network.
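The same trace can be written in vectorized form. This is a minimal sketch in which the incoming gradient ∂L/∂z = 0.536 is taken as given; the names `upstream`, `dz_dw`, and `dz_db` are illustrative.

```python
import numpy as np

x = np.array([0.5, 0.1])   # inputs x1, x2 from the anchor
upstream = 0.536           # incoming gradient dL/dz, taken as given

# Local partials of z = w1*x1 + w2*x2 + b
dz_dw = x                  # dz/dw_i = x_i
dz_db = 1.0                # dz/db = 1

# Chain rule: incoming gradient times local partial
dL_dw = upstream * dz_dw
dL_db = upstream * dz_db

print(dL_dw, dL_db)
```

Note that none of the weight values are needed here: the local partials of a linear pre-activation depend only on the inputs.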
Chaining Through Multiple Layers
The full network from the anchor has four steps between the loss and W1:
L ← z2 ← a1₁ ← z1₁ ← W1₁₁
Unrolling every partial:
∂L/∂W1₁₁ = (∂L/∂z2) · (∂z2/∂a1₁) · (∂a1₁/∂z1₁) · (∂z1₁/∂W1₁₁)
Each factor has a precise meaning:
| Factor | Meaning | Value |
|---|---|---|
| ∂L/∂z2 | How much the loss changes per unit change in z2 | ŷ − y = 0.536 |
| ∂z2/∂a1₁ | How much z2 changes per unit change in a1₁ | W2₁ = 0.6 |
| ∂a1₁/∂z1₁ | How much the ReLU output changes per unit change in z1₁ | ReLU'(0.33) = 1.0 |
| ∂z1₁/∂W1₁₁ | How much z1₁ changes per unit change in W1₁₁ | x₁ = 0.5 |
Substituting all four:
∂L/∂W1₁₁ = 0.536 × 0.6 × 1.0 × 0.5 = 0.161
Computational Graph Representation
The computation graph makes the chain rule visual. Each node holds an operation; each edge carries a value (forward pass) and a gradient (backward pass).
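One way to make the node-and-edge picture concrete is a few lines of scalar reverse-mode differentiation. The sketch below is illustrative only: the `Node` class and the helpers (`mul`, `add`, `relu`, `sigmoid`, `bce`, `backward`) are hypothetical names, not any framework's API. Each node stores its forward value plus the local derivative along every incoming edge, and the backward pass multiplies and accumulates those local derivatives.

```python
import math

class Node:
    """Scalar graph node: holds a forward value and accumulates a gradient."""
    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self.parents = parents  # tuple of (parent_node, local_derivative)

def mul(a, b):
    return Node(a.value * b.value, ((a, b.value), (b, a.value)))

def add(a, b):
    return Node(a.value + b.value, ((a, 1.0), (b, 1.0)))

def relu(a):
    return Node(max(0.0, a.value), ((a, 1.0 if a.value > 0 else 0.0),))

def sigmoid(a):
    s = 1 / (1 + math.exp(-a.value))
    return Node(s, ((a, s * (1 - s)),))

def bce(p, y):
    value = -y * math.log(p.value) - (1 - y) * math.log(1 - p.value)
    local = (p.value - y) / (p.value * (1 - p.value))
    return Node(value, ((p, local),))

def backward(out):
    # Order nodes so that every node is processed after all of its consumers
    order, seen = [], set()
    def visit(n):
        if id(n) not in seen:
            seen.add(id(n))
            for parent, _ in n.parents:
                visit(parent)
            order.append(n)
    visit(out)
    out.grad = 1.0
    for n in reversed(order):
        for parent, local in n.parents:
            parent.grad += n.grad * local  # chain rule along one edge

# Anchor network, written out scalar by scalar
x1, x2 = Node(0.5), Node(0.1)
w11, w12, w21, w22 = Node(0.5), Node(-0.2), Node(0.3), Node(0.8)
b1_1, b1_2 = Node(0.1), Node(-0.1)
v1, v2, b2 = Node(0.6), Node(-0.4), Node(0.0)

z1_1 = add(add(mul(w11, x1), mul(w12, x2)), b1_1)
z1_2 = add(add(mul(w21, x1), mul(w22, x2)), b1_2)
z2 = add(add(mul(v1, relu(z1_1)), mul(v2, relu(z1_2))), b2)
y_hat = sigmoid(z2)
loss = bce(y_hat, 0.0)

backward(loss)
print(round(w11.grad, 3))  # gradient of the loss with respect to W1_11
```

The design choice worth noticing is that no node knows about the whole network. Each one only supplies its local derivative, and the traversal order does the rest.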
Common Derivatives in DL
Every backward pass reuses a short list of derivatives. Each one below is derived in at least two explicit algebraic steps.
Sigmoid
σ(z) = 1 / (1 + e^{−z})
Step 1 — quotient rule:
dσ/dz = e^{−z} / (1 + e^{−z})²
Step 2 — rewrite using σ:
e^{−z} / (1 + e^{−z})² = [1/(1+e^{−z})] · [e^{−z}/(1+e^{−z})]
= σ(z) · (1 − σ(z))
∴ σ'(z) = σ(z)(1 − σ(z))
Tanh
tanh(z) = (e^z − e^{−z}) / (e^z + e^{−z})
Step 1 — quotient rule:
d/dz tanh(z) = [(e^z + e^{−z})(e^z + e^{−z}) − (e^z − e^{−z})(e^z − e^{−z})]
/ (e^z + e^{−z})²
= [(e^z + e^{−z})² − (e^z − e^{−z})²] / (e^z + e^{−z})²
Step 2 — simplify using difference of squares:
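numerator = (e^z + e^{−z})² − (e^z − e^{−z})² = 4 (since (a+b)² − (a−b)² = 4ab, and here ab = e^z·e^{−z} = 1)
d/dz tanh(z) = 4 / (e^z + e^{−z})²
Equivalently, splitting the fraction from Step 1: 1 − (e^z − e^{−z})² / (e^z + e^{−z})² = 1 − tanh²(z)
∴ tanh'(z) = 1 − tanh²(z)
ReLU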
ReLU(z) = max(0, z) = { z if z > 0; 0 if z ≤ 0 }
Step 1 — differentiate each piece:
d/dz (z) = 1 for z > 0
d/dz (0) = 0 for z < 0
Step 2 — combine:
ReLU'(z) = { 1 if z > 0; 0 if z < 0 }
At exactly z = 0, ReLU is not differentiable (kink in the function).
In practice, frameworks use a subgradient of 0 there.
MSE Loss
MSE = (ŷ − y)² / n (for one sample, n = 1)
Step 1 — power rule:
d/dŷ [(ŷ − y)²] = 2(ŷ − y)
Step 2 — divide by n:
∂MSE/∂ŷ = 2(ŷ − y) / n
For anchor: 2(0.536 − 0) / 1 = 1.072
Binary Cross-Entropy Loss
BCE = −y·log(ŷ) − (1 − y)·log(1 − ŷ)
Step 1 — differentiate each term:
d/dŷ [−y·log(ŷ)] = −y / ŷ
d/dŷ [−(1−y)·log(1−ŷ)] = (1−y) / (1−ŷ)
Step 2 — combine over common denominator:
∂BCE/∂ŷ = −y/ŷ + (1−y)/(1−ŷ)
= [−y(1−ŷ) + (1−y)ŷ] / [ŷ(1−ŷ)]
= (ŷ − y) / [ŷ(1−ŷ)]
For anchor (y=0): (0.536 − 0) / (0.536 × 0.464) = 0.536 / 0.249 ≈ 2.153
| Function | Definition | Derivative |
|---|---|---|
| Sigmoid | 1/(1+e^{−z}) | σ(z)(1−σ(z)) |
| Tanh | (e^z−e^{−z})/(e^z+e^{−z}) | 1−tanh²(z) |
| ReLU | max(0,z) | 1 if z>0, else 0 |
| MSE | (ŷ−y)²/n | 2(ŷ−y)/n (w.r.t. ŷ) |
| BCE | −y·log(ŷ)−(1−y)·log(1−ŷ) | (ŷ−y)/(ŷ(1−ŷ)) (w.r.t. ŷ) |
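The table translates directly into code. The sketch below implements each closed-form derivative and compares it with a centered finite difference at anchor-like points (z = 0.33, ŷ = 0.536, y = 0). The function names and the `centered_diff` helper are illustrative, and `relu_grad` assumes the value 0 at z = 0.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Closed-form derivatives from the table
def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)

def tanh_grad(z):
    return 1 - np.tanh(z) ** 2

def relu_grad(z):
    return np.where(z > 0, 1.0, 0.0)   # assumption: 0 at z = 0

def mse_grad(y_hat, y, n=1):
    return 2 * (y_hat - y) / n

def bce_grad(y_hat, y):
    return (y_hat - y) / (y_hat * (1 - y_hat))

def centered_diff(f, v, eps=1e-6):
    """Numerical slope of f at v from two nearby evaluations."""
    return (f(v + eps) - f(v - eps)) / (2 * eps)

z, y_hat, y = 0.33, 0.536, 0.0
mse = lambda p: (p - y) ** 2
bce = lambda p: -y * np.log(p) - (1 - y) * np.log(1 - p)

# Each entry pairs the analytical derivative with its numerical estimate
checks = {
    "sigmoid": (sigmoid_grad(z), centered_diff(sigmoid, z)),
    "tanh": (tanh_grad(z), centered_diff(np.tanh, z)),
    "relu": (float(relu_grad(z)), centered_diff(lambda v: np.maximum(0.0, v), z)),
    "mse": (mse_grad(y_hat, y), centered_diff(mse, y_hat)),
    "bce": (bce_grad(y_hat, y), centered_diff(bce, y_hat)),
}
for name, (analytic, numeric) in checks.items():
    print(f"{name:8s} analytic={analytic:.6f} numeric={numeric:.6f}")
```

The ReLU check is only meaningful away from z = 0; at the kink, a centered difference would report 0.5, which matches neither one-sided derivative.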
Gradient Trace: L to W1₁₁
Walking every step with anchor values:
| Step | Expression | Values | Result |
|---|---|---|---|
| ∂L/∂ŷ | (ŷ − y) / (ŷ(1−ŷ)) | (0.536−0) / (0.536×0.464) | 2.153 |
| ∂L/∂z2 | ∂L/∂ŷ · σ'(z2) = ∂L/∂ŷ · ŷ(1−ŷ) | 2.153 × 0.536 × 0.464 | 0.536 |
| ∂z2/∂a1₁ | W2₁ | — | 0.6 |
| ∂L/∂a1₁ | ∂L/∂z2 × ∂z2/∂a1₁ | 0.536 × 0.6 | 0.322 |
| ∂a1₁/∂z1₁ | ReLU'(0.33) | z1₁ = 0.33 > 0 | 1.0 |
| ∂L/∂z1₁ | ∂L/∂a1₁ × ∂a1₁/∂z1₁ | 0.322 × 1.0 | 0.322 |
| ∂z1₁/∂W1₁₁ | x₁ | — | 0.5 |
| ∂L/∂W1₁₁ | ∂L/∂z1₁ × ∂z1₁/∂W1₁₁ | 0.322 × 0.5 | 0.161 |
Row 2 shows a useful simplification: when you compose BCE with sigmoid, the ugly division cancels and you're left with just ŷ − y. This is why BCE and sigmoid are nearly always paired — the gradient is as clean as it gets.
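The table can be replayed row by row in code. This is a sketch that starts from the rounded forward values given for the anchor (z1₁ = 0.33, z2 = 0.146); the variable names are illustrative. It also checks the BCE-plus-sigmoid cancellation numerically.

```python
import numpy as np

# Forward values from the anchor (rounded as in the post)
x1, W2_1, z1_1, z2, y = 0.5, 0.6, 0.33, 0.146, 0.0
y_hat = 1 / (1 + np.exp(-z2))

# Backward pass, one table row at a time
dL_dyhat = (y_hat - y) / (y_hat * (1 - y_hat))   # BCE derivative
dL_dz2 = dL_dyhat * y_hat * (1 - y_hat)          # times sigmoid'(z2)
dL_da1 = dL_dz2 * W2_1                           # times dz2/da1_1
dL_dz1 = dL_da1 * (1.0 if z1_1 > 0 else 0.0)     # times ReLU'(z1_1)
dL_dW1_11 = dL_dz1 * x1                          # times dz1_1/dW1_11

print(f"dL/dz2 = {dL_dz2:.3f}, y_hat - y = {y_hat - y:.3f}")
print(f"dL/dW1_11 = {dL_dW1_11:.3f}")
```

In a real implementation you would skip the first two lines of the backward pass and set `dL_dz2 = y_hat - y` directly, which also avoids dividing by ŷ(1 − ŷ) when the sigmoid saturates.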
Verifying with Finite Differences
The chain rule gives an analytical gradient. Finite differences give a numerical gradient. When the two agree to within a small tolerance, the derivation is almost certainly correct.
import numpy as np

# Verify the chain-rule gradient dy/dw of y = sigmoid(w*x + b) with finite differences
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)

w, x, b = 0.6, 0.5, 0.1
z = w * x + b  # 0.40

# Analytical: dy/dw = sigmoid'(z) * x
dy_dw_analytical = sigmoid_grad(z) * x

# Numerical: centered difference; nudging w by eps shifts z by eps * x
eps = 1e-5
dy_dw_numerical = (sigmoid(z + eps*x) - sigmoid(z - eps*x)) / (2*eps)

print(f"Analytical: {dy_dw_analytical:.6f}")
print(f"Numerical: {dy_dw_numerical:.6f}")
print(f"Match: {np.isclose(dy_dw_analytical, dy_dw_numerical)}")

Output:
Analytical: 0.120130
Numerical: 0.120130
Match: True

The match confirms the chain rule derivation. The finite difference approximation perturbs w by a tiny amount and measures the slope of y over that interval, with no calculus required.
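The same check extends to the full anchor network. The sketch below perturbs only W1₁₁ and measures the change in the BCE loss, which should land close to the 0.161 obtained from the chain rule. It assumes the row-per-neuron layout of W1 (z1 = W1 · x + b1); the `loss` function and `bump` matrix are illustrative names.

```python
import numpy as np

x = np.array([0.5, 0.1])
y = 0.0
W1 = np.array([[0.5, -0.2], [0.3, 0.8]])  # assumption: row i = weights of hidden neuron i
b1 = np.array([0.1, -0.1])
W2 = np.array([0.6, -0.4])
b2 = 0.0

def loss(W1_candidate):
    """Full forward pass: linear -> ReLU -> linear -> sigmoid -> BCE."""
    a1 = np.maximum(0.0, W1_candidate @ x + b1)
    y_hat = 1 / (1 + np.exp(-(W2 @ a1 + b2)))
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

eps = 1e-5
bump = np.zeros_like(W1)
bump[0, 0] = eps   # perturb W1_11 only

numerical = (loss(W1 + bump) - loss(W1 - bump)) / (2 * eps)
print(f"Numerical dL/dW1_11: {numerical:.3f}")
```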
Limitations
Finite differences are a debugging tool, not a training tool. Checking one parameter's gradient numerically costs two forward passes, so a network with a million parameters needs two million forward passes to verify all of its gradients. Gradient checking is used during development to validate a new layer's backward implementation, never during training.
ReLU is not differentiable at z = 0. The chain rule requires every function in the chain to be differentiable at the point you evaluate it. ReLU has a sharp corner at z = 0: the left derivative is 0 and the right derivative is 1. In practice, frameworks assign a fixed subgradient there (typically 0) and training proceeds without issue, because the probability of hitting exactly 0 is vanishingly small for real-valued pre-activations.
Very deep chains multiply many small numbers together. If each activation's derivative is less than 1 — which sigmoid's always is (maximum 0.25 at z = 0) — then a product of 50 such values collapses toward zero. The gradient arriving at early layers becomes numerically negligible and those weights stop learning. This is the vanishing gradient problem, a direct consequence of the chain rule operating over many layers with saturating activations.
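A rough sense of scale, as a sketch: the loop below computes an upper bound on the product of sigmoid derivatives alone, assuming every layer sits at the maximum slope of 0.25 and ignoring the weight factors that also appear in the chain. The layer counts are arbitrary illustrations.

```python
# Upper bound on the product of sigmoid derivatives across n layers.
# Assumes every layer is at the maximum slope (0.25) and ignores weight factors.
max_sigmoid_grad = 0.25

for n_layers in (5, 20, 50):
    bound = max_sigmoid_grad ** n_layers
    print(f"{n_layers:2d} layers: activation-derivative product <= {bound:.3e}")
```

Real activations rarely sit at z = 0, so the true product is usually smaller still.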
Related Concepts
The chain rule rests on two things from basic calculus: the notion of a partial derivative (what changes when you move one variable while holding others fixed) and function composition (the output of one function fed as input to another). If either of those is shaky, the multivariable chain rule won't click.
What the chain rule unlocks is backpropagation — the algorithm that applies it systematically to every weight in a network in a single backward pass. Beyond that, automatic differentiation systems (PyTorch's autograd, TensorFlow's gradient tape) implement the chain rule at the level of operator primitives, so any composed function you write is automatically differentiable. Understanding the chain rule is what makes those systems legible rather than magical.
Test Your Understanding
- A network uses y = tanh(z), z = wx + b. Write out dy/dw using the chain rule, then evaluate it at w = 0.3, x = 0.5, b = 0.0.
- The anchor network has a second hidden neuron with z1₂ = 0.13, a1₂ = 0.13, and W2₂ = -0.4. Compute ∂L/∂W1₂₁ (the gradient of the loss with respect to W1 row 2, column 1), using the same 4-step chain structure.
- Suppose you replace ReLU in the anchor with sigmoid. Recompute ∂a1₁/∂z1₁ at z1₁ = 0.33 and explain how this changes ∂L/∂W1₁₁.
- The finite difference formula uses (f(z + ε·x) − f(z − ε·x)) / (2ε) rather than (f(z + ε) − f(z)) / ε. Why does the symmetric (centered) version give a more accurate approximation, and what order of error does each have in terms of ε?
- Consider a 10-layer network where every activation is sigmoid with maximum derivative 0.25. Without any other changes, estimate the order of magnitude of ∂L/∂W1 relative to ∂L/∂W10. What does this imply for learning speed in early versus late layers?