
Chain Rule of Derivatives

Jun 29, 2026 · 12 min read · By Mohammed Vasim
Tags: deep-learning, neural-networks, machine-learning, representation-learning

Every weight update in a neural network comes down to one question: how does a change in W affect the loss? The problem is that the loss doesn't mention W directly. It depends on the output of a sigmoid, which depends on a weighted sum, which depends on an activation from the previous layer, which finally depends on W. To differentiate through that stack of composed functions, you need the chain rule. Without it, backpropagation has no mathematical foundation.

The anchor example throughout this post uses the two-layer network from earlier posts:

  • Input: x = [0.5, 0.1], label y = 0
  • W1 = [[0.5, -0.2], [0.3, 0.8]], b1 = [0.1, -0.1]
  • W2 = [[0.6], [-0.4]], b2 = [0.0]
  • Forward pass results: z1₁ = 0.33, a1₁ = 0.33, z1₂ = 0.13, a1₂ = 0.13, z2 = 0.146, ŷ = 0.536
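
The forward-pass values above can be reproduced with a few lines of NumPy. This is a minimal sketch, assuming each row of W1 holds the weights of one hidden neuron (so z1 = W1 · x + b1, which is the convention that matches the listed results) and that W2 is applied as W2ᵀ · a1. The variable names are illustrative.

```python
import numpy as np

# Anchor network parameters and input, copied from the list above.
x = np.array([0.5, 0.1])
y = 0.0
W1 = np.array([[0.5, -0.2],
               [0.3,  0.8]])
b1 = np.array([0.1, -0.1])
W2 = np.array([[0.6],
               [-0.4]])
b2 = np.array([0.0])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumption: row i of W1 feeds hidden neuron i, so z1 = W1 @ x + b1.
z1 = W1 @ x + b1
a1 = np.maximum(0.0, z1)   # ReLU
z2 = W2.T @ a1 + b2        # shape (1,)
y_hat = sigmoid(z2)

print("z1    =", np.round(z1, 3))
print("a1    =", np.round(a1, 3))
print("z2    =", np.round(z2, 3))
print("y_hat =", np.round(y_hat, 3))
```

Rounded to three decimals, the printed values should agree with the forward pass results listed above.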

Why the Chain Rule Exists

A neural network is nothing but a composition of functions:

text
ŷ = sigmoid( W2 · ReLU( W1 · x + b1 ) + b2 )
L = BCE(ŷ, y)

Written as a chain: L = BCE( sigmoid( linear2( ReLU( linear1(x) ) ) ) )

Gradient descent needs ∂L/∂W1 — how much the loss changes when you nudge W1. But if you look at the BCE formula, W1 doesn't appear in it. BCE depends on ŷ. ŷ depends on z2. z2 depends on a1. a1 depends on z1. z1 depends on W1.

The loss does depend on W1, just indirectly — through a chain of intermediate variables. The chain rule formalizes exactly how to follow that chain backward.


Single-Variable Chain Rule

If y = f(g(x)), the derivative of y with respect to x is:

text
dy/dx = f'(g(x)) · g'(x)

You differentiate the outer function (evaluated at the inner function's output), then multiply by the derivative of the inner function. The key insight: the two derivatives multiply — they don't add.

DL example: A single neuron computes z = wx + b, then y = σ(z). What is dy/dw?

text
y = σ(z),   z = wx + b

dy/dw = (dy/dz) · (dz/dw)
      = σ'(z) · x
      = σ(z)(1 − σ(z)) · x

Numerical trace with w = 0.6, x = 0.5, b = 0.1:

text
z = 0.6 × 0.5 + 0.1 = 0.40
y = σ(0.40) = 1/(1 + e^{-0.40}) = 0.599

σ'(0.40) = 0.599 × (1 − 0.599) = 0.599 × 0.401 = 0.240

dy/dw = 0.240 × 0.5 = 0.120

The diagram below shows how derivatives flow backward through the composition:

[Figure: w = 0.6 → z = wx + b = 0.40 → y = σ(z) = 0.599, with local derivatives dz/dw = x = 0.5 and dy/dz = σ'(z) = 0.240. Backward: dy/dw = 0.240 × 0.5 = 0.120.]

Multivariable Chain Rule

When a function has multiple inputs, each one gets its own partial derivative. If L = f(z) and z = g(x₁, x₂):

text
∂L/∂x₁ = (∂L/∂z) · (∂z/∂x₁)
∂L/∂x₂ = (∂L/∂z) · (∂z/∂x₂)

DL example: The pre-activation of a neuron is z = w₁x₁ + w₂x₂ + b. The partial derivatives of z with respect to each weight are simply:

text
∂z/∂w₁ = x₁
∂z/∂w₂ = x₂
∂z/∂b  = 1

Each weight multiplies exactly one input, so its partial is just that input. This pattern recurs throughout backpropagation: the local derivative of a linear layer's pre-activation with respect to a weight is always the input (or previous-layer activation) that the weight multiplies.

Numerical trace using anchor values x₁ = 0.5, x₂ = 0.1:

text
z = w₁ · 0.5 + w₂ · 0.1 + b

∂z/∂w₁ = x₁ = 0.5
∂z/∂w₂ = x₂ = 0.1

If ∂L/∂z = 0.536 (from the loss gradient):
  ∂L/∂w₁ = 0.536 × 0.5 = 0.268
  ∂L/∂w₂ = 0.536 × 0.1 = 0.054

The structure is always the same: multiply the incoming gradient by the local partial. The incoming gradient is whatever arrived from later in the network.
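
In vector form, that rule is a single multiplication. The sketch below takes the incoming gradient ∂L/∂z = 0.536 from the trace above as given and scales it by each local partial. The variable names are illustrative.

```python
import numpy as np

x = np.array([0.5, 0.1])   # inputs x1, x2 from the anchor
dL_dz = 0.536              # incoming gradient, taken as given from the trace

# Local partials of z = w1*x1 + w2*x2 + b
dz_dw = x                  # dz/dw_i = x_i
dz_db = 1.0                # dz/db = 1

# Chain rule: incoming gradient times local partial
dL_dw = dL_dz * dz_dw
dL_db = dL_dz * dz_db

print("dL/dw =", dL_dw)
print("dL/db =", dL_db)
```

The bias gradient equals the incoming gradient unchanged, because its local partial is 1.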


Chaining Through Multiple Layers

The full network from the anchor has four steps between the loss and W1:

text
L  ←  z2  ←  a1₁  ←  z1₁  ←  W1₁₁

Unrolling every partial:

text
∂L/∂W1₁₁ = (∂L/∂z2) · (∂z2/∂a1₁) · (∂a1₁/∂z1₁) · (∂z1₁/∂W1₁₁)

Each factor has a precise meaning:

| Factor | Meaning | Value |
|---|---|---|
| ∂L/∂z2 | How much the loss changes per unit change in z2 | ŷ − y = 0.536 |
| ∂z2/∂a1₁ | How much z2 changes per unit change in a1₁ | W2₁ = 0.6 |
| ∂a1₁/∂z1₁ | How much the ReLU output changes per unit change in z1₁ | ReLU'(0.33) = 1.0 |
| ∂z1₁/∂W1₁₁ | How much z1₁ changes per unit change in W1₁₁ | x₁ = 0.5 |

Substituting all four:

text
∂L/∂W1₁₁ = 0.536 × 0.6 × 1.0 × 0.5 = 0.161

[Figure: the chain L (BCE loss) ← z2 = 0.146 ← a1₁ = 0.33 ← z1₁ = 0.33 ← W1₁₁ = 0.5, with edges labeled ∂L/∂z2 = 0.536, ∂z2/∂a1₁ = 0.6, ∂a1₁/∂z1₁ = 1.0, ∂z1₁/∂W1₁₁ = 0.5. Product: ∂L/∂W1₁₁ = 0.161.]
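
The same product can be assembled in code, one local derivative at a time. This is a sketch using the rounded anchor values (ŷ = 0.536, W2₁ = 0.6, z1₁ = 0.33, x₁ = 0.5); the names are illustrative.

```python
# Local derivatives along the path L <- z2 <- a1_1 <- z1_1 <- W1_11,
# using the rounded anchor values.
y_hat, y = 0.536, 0.0
W2_1 = 0.6
z1_1 = 0.33
x_1 = 0.5

dL_dz2 = y_hat - y                    # BCE and sigmoid combined
dz2_da1 = W2_1                        # z2 is linear in a1_1
da1_dz1 = 1.0 if z1_1 > 0 else 0.0    # ReLU derivative
dz1_dW11 = x_1                        # z1_1 is linear in W1_11

dL_dW11 = dL_dz2 * dz2_da1 * da1_dz1 * dz1_dW11
print(round(dL_dW11, 3))
```

If z1₁ were negative, the ReLU factor would be 0 and the whole product would vanish, no matter how large the other three factors are.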

Computational Graph Representation

The computation graph makes the chain rule visual. Each node holds an operation; each edge carries a value (forward pass) and a gradient (backward pass).

[Figure: computation graph for the anchor network.
Forward pass (values): x = [0.5, 0.1] → linear (W1·x + b1) → z1 = [0.33, 0.13] → ReLU → a1 = [0.33, 0.13] → linear → z2 = 0.146 → sigmoid → ŷ = 0.536.
Backward pass (gradients): ∂L/∂z2 = 0.536 → (×0.6) → ∂L/∂a1₁ = 0.322 → ∂L/∂z1₁ = 0.322 → ∂L/∂W1₁₁ = 0.161.]
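
To make the node-and-edge picture concrete, here is a minimal scalar computation graph in plain Python. It is a sketch, not how any particular framework is implemented: the `Node` class and the helper functions are hypothetical names introduced for illustration. Each node stores its forward value plus the local derivative toward each parent, and `backward` walks the graph from the loss toward the inputs, multiplying as it goes.

```python
import math

class Node:
    """Scalar node in a computation graph (illustrative helper, not a library API)."""
    def __init__(self, value, parents=()):
        self.value = value        # set during the forward pass
        self.grad = 0.0           # dL/d(this node), set during the backward pass
        self.parents = parents    # tuple of (parent_node, local_derivative)

def add(a, b):
    return Node(a.value + b.value, ((a, 1.0), (b, 1.0)))

def mul(a, b):
    return Node(a.value * b.value, ((a, b.value), (b, a.value)))

def relu(a):
    return Node(max(0.0, a.value), ((a, 1.0 if a.value > 0 else 0.0),))

def sigmoid(a):
    s = 1.0 / (1.0 + math.exp(-a.value))
    return Node(s, ((a, s * (1.0 - s)),))

def bce(y_hat, y):
    p = y_hat.value
    loss = -y * math.log(p) - (1 - y) * math.log(1 - p)
    return Node(loss, ((y_hat, (p - y) / (p * (1 - p))),))

def backward(output):
    """Visit nodes from the output toward the inputs, applying the chain rule."""
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)
    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local   # incoming gradient times local derivative

# Anchor network, built one scalar at a time.
x1, x2 = Node(0.5), Node(0.1)
W1 = [[Node(0.5), Node(-0.2)], [Node(0.3), Node(0.8)]]
b1 = [Node(0.1), Node(-0.1)]
W2 = [Node(0.6), Node(-0.4)]
b2 = Node(0.0)

z1 = [add(add(mul(W1[i][0], x1), mul(W1[i][1], x2)), b1[i]) for i in range(2)]
a1 = [relu(z) for z in z1]
z2 = add(add(mul(W2[0], a1[0]), mul(W2[1], a1[1])), b2)
y_hat = sigmoid(z2)
loss = bce(y_hat, 0.0)

backward(loss)
print("y_hat     =", round(y_hat.value, 3))
print("dL/dz2    =", round(z2.grad, 3))
print("dL/da1_1  =", round(a1[0].grad, 3))
print("dL/dW1_11 =", round(W1[0][0].grad, 3))
```

The `+=` in `backward` matters when a node feeds more than one downstream node: contributions from each path are summed, while factors along a single path are multiplied.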

Common Derivatives in DL

Every backward pass reuses a short list of derivatives. Each one below is derived from first principles in two short steps.

Sigmoid

text
σ(z) = 1 / (1 + e^{−z})

Step 1 — quotient rule:
  dσ/dz = e^{−z} / (1 + e^{−z})²

Step 2 — rewrite using σ:
  e^{−z} / (1 + e^{−z})² = [1/(1+e^{−z})] · [e^{−z}/(1+e^{−z})]
                          = σ(z) · (1 − σ(z))

∴ σ'(z) = σ(z)(1 − σ(z))

Tanh

text
tanh(z) = (e^z − e^{−z}) / (e^z + e^{−z})

Step 1 — quotient rule:
  d/dz tanh(z) = [(e^z + e^{−z})(e^z + e^{−z}) − (e^z − e^{−z})(e^z − e^{−z})]
                 / (e^z + e^{−z})²
               = [(e^z + e^{−z})² − (e^z − e^{−z})²] / (e^z + e^{−z})²

Step 2 — simplify using difference of squares:
  (a+b)² − (a−b)² = 4ab with a = e^z, b = e^{−z}, so ab = 1 and the numerator is 4
  d/dz tanh(z) = 4 / (e^z + e^{−z})²
               = [(e^z + e^{−z})² − (e^z − e^{−z})²] / (e^z + e^{−z})²
               = 1 − tanh²(z)

∴ tanh'(z) = 1 − tanh²(z)

ReLU

text
ReLU(z) = max(0, z) = { z if z > 0; 0 if z ≤ 0 }

Step 1 — differentiate each piece:
  d/dz (z)  = 1   for z > 0
  d/dz (0)  = 0   for z < 0

Step 2 — combine:
  ReLU'(z) = { 1 if z > 0; 0 if z < 0 }

At exactly z = 0, ReLU is not differentiable (kink in the function).
In practice, frameworks use a subgradient of 0 there.

MSE Loss

text
MSE = (ŷ − y)² / n    (for one sample, n = 1)

Step 1 — power rule:
  d/dŷ [(ŷ − y)²] = 2(ŷ − y)

Step 2 — divide by n:
  ∂MSE/∂ŷ = 2(ŷ − y) / n

For anchor: 2(0.536 − 0) / 1 = 1.072

Binary Cross-Entropy Loss

text
BCE = −y·log(ŷ) − (1 − y)·log(1 − ŷ)

Step 1 — differentiate each term:
  d/dŷ [−y·log(ŷ)]           = −y / ŷ
  d/dŷ [−(1−y)·log(1−ŷ)]    = (1−y) / (1−ŷ)

Step 2 — combine over common denominator:
  ∂BCE/∂ŷ = −y/ŷ + (1−y)/(1−ŷ)
           = [−y(1−ŷ) + (1−y)ŷ] / [ŷ(1−ŷ)]
           = (ŷ − y) / [ŷ(1−ŷ)]

For anchor (y=0): (0.536 − 0) / (0.536 × 0.464) = 1 / 0.464 ≈ 2.155

| Function | Definition | Derivative |
|---|---|---|
| Sigmoid | 1/(1+e^−z) | σ(z)(1−σ(z)) |
| Tanh | (e^z−e^−z)/(e^z+e^−z) | 1−tanh²(z) |
| ReLU | max(0,z) | 1 if z>0, else 0 |
| MSE | (ŷ−y)²/n | 2(ŷ−y)/n |
| BCE | −y·log(ŷ)−(1−y)·log(1−ŷ) | (ŷ−y)/(ŷ(1−ŷ)) |
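
Each of the five derivatives can be sanity-checked against a centered finite difference. The sketch below assumes n = 1 for MSE and the anchor label y = 0 for both losses; the evaluation points are anchor values, and the `centered_diff` helper is an illustrative name.

```python
import numpy as np

def centered_diff(f, v, eps=1e-5):
    """Centered finite-difference estimate of df/dv at v."""
    return (f(v + eps) - f(v - eps)) / (2 * eps)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = 0.0  # anchor label, used by the two losses

# name -> (function, claimed derivative, evaluation point)
checks = {
    "sigmoid": (sigmoid, lambda z: sigmoid(z) * (1 - sigmoid(z)), 0.146),
    "tanh": (np.tanh, lambda z: 1 - np.tanh(z) ** 2, 0.146),
    "relu": (lambda z: max(0.0, z), lambda z: 1.0 if z > 0 else 0.0, 0.33),
    "mse": (lambda p: (p - y) ** 2, lambda p: 2 * (p - y), 0.536),
    "bce": (lambda p: -y * np.log(p) - (1 - y) * np.log(1 - p),
            lambda p: (p - y) / (p * (1 - p)), 0.536),
}

results = {}
for name, (f, df, point) in checks.items():
    results[name] = (df(point), centered_diff(f, point))
    print(f"{name:8s} analytical={results[name][0]:.6f}  numerical={results[name][1]:.6f}")
```

The ReLU check is deliberately evaluated away from z = 0, where a finite difference would straddle the kink and report a value between 0 and 1.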

Gradient Trace: L to W1₁₁

Walking every step with anchor values:

| Step | Expression | Values | Result |
|---|---|---|---|
| ∂L/∂ŷ | (ŷ − y) / (ŷ(1−ŷ)) | (0.536−0) / (0.536×0.464) | 2.155 |
| ∂L/∂z2 | ∂L/∂ŷ · σ'(z2) = ∂L/∂ŷ · ŷ(1−ŷ) | 2.155 × 0.536 × 0.464 | 0.536 |
| ∂z2/∂a1₁ | W2₁ | | 0.6 |
| ∂L/∂a1₁ | ∂L/∂z2 × ∂z2/∂a1₁ | 0.536 × 0.6 | 0.322 |
| ∂a1₁/∂z1₁ | ReLU'(0.33) | z1₁ = 0.33 > 0 | 1.0 |
| ∂L/∂z1₁ | ∂L/∂a1₁ × ∂a1₁/∂z1₁ | 0.322 × 1.0 | 0.322 |
| ∂z1₁/∂W1₁₁ | x₁ | | 0.5 |
| ∂L/∂W1₁₁ | ∂L/∂z1₁ × ∂z1₁/∂W1₁₁ | 0.322 × 0.5 | 0.161 |

Row 2 shows a useful simplification: when you compose BCE with sigmoid, the ugly division cancels and you're left with just ŷ − y. This is why BCE and sigmoid are nearly always paired — the gradient is as clean as it gets.
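
The cancellation is easy to confirm numerically. This sketch multiplies the BCE derivative by the sigmoid derivative for a few pre-activations (the anchor's z2 = 0.146 plus two arbitrary values chosen for illustration) and both labels, then compares the product with ŷ − y.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z2 = np.array([-2.0, 0.146, 3.0])   # anchor value plus two arbitrary ones
for y in (0.0, 1.0):
    y_hat = sigmoid(z2)
    dL_dyhat = (y_hat - y) / (y_hat * (1 - y_hat))   # BCE derivative
    dyhat_dz2 = y_hat * (1 - y_hat)                  # sigmoid derivative
    dL_dz2 = dL_dyhat * dyhat_dz2                    # chain rule
    print(f"y={y}: matches y_hat - y -> {np.allclose(dL_dz2, y_hat - y)}")
```

Computing ŷ − y directly also sidesteps the division by ŷ(1−ŷ), which becomes numerically fragile when the sigmoid saturates.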


Verifying with Finite Differences

The chain rule gives an analytical gradient. Finite differences give a numerical gradient. When they match, the derivation is correct.

python
import numpy as np

# Verify the chain-rule gradient dy/dw against a centered finite difference.
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)

w, x, b = 0.6, 0.5, 0.1
z = w * x + b  # 0.40

# Analytical: dy/dw = sigmoid'(z) * x
dy_dw_analytical = sigmoid_grad(z) * x

# Numerical: nudging w by +/- eps shifts z by +/- eps * x
eps = 1e-5
dy_dw_numerical = (sigmoid(z + eps * x) - sigmoid(z - eps * x)) / (2 * eps)

print(f"Analytical: {dy_dw_analytical:.6f}")
print(f"Numerical:  {dy_dw_numerical:.6f}")
print(f"Match: {np.isclose(dy_dw_analytical, dy_dw_numerical)}")

text
Analytical: 0.120130
Numerical:  0.120130
Match: True

The match confirms the chain rule derivation. The finite difference approximation perturbs w by a tiny amount and measures the slope of y over that interval — no calculus required. The fact that both methods produce the same number means the analytical chain-rule application was correct.
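
The same check extends to the full anchor network. The sketch below perturbs the single entry W1₁₁ and compares the resulting slope of the loss with the four-factor chain-rule value of roughly 0.161. It assumes the z1 = W1 · x + b1 convention implied by the anchor's forward-pass values; `loss_fn` is an illustrative helper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_fn(W1, b1, W2, b2, x, y):
    """Forward pass of the anchor network followed by BCE (assumes z1 = W1 @ x + b1)."""
    a1 = np.maximum(0.0, W1 @ x + b1)
    y_hat = sigmoid(W2.T @ a1 + b2)[0]
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

x, y = np.array([0.5, 0.1]), 0.0
W1 = np.array([[0.5, -0.2], [0.3, 0.8]])
b1 = np.array([0.1, -0.1])
W2 = np.array([[0.6], [-0.4]])
b2 = np.array([0.0])

# Centered difference on the single entry W1[0, 0]
eps = 1e-5
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numerical = (loss_fn(W1_plus, b1, W2, b2, x, y)
             - loss_fn(W1_minus, b1, W2, b2, x, y)) / (2 * eps)

# Analytical value from the chain: (y_hat - y) * W2_1 * ReLU'(z1_1) * x_1
z1 = W1 @ x + b1
y_hat = sigmoid(W2.T @ np.maximum(0.0, z1) + b2)[0]
analytical = (y_hat - y) * W2[0, 0] * float(z1[0] > 0) * x[0]

print(f"Analytical: {analytical:.6f}")
print(f"Numerical:  {numerical:.6f}")
```

Here the numerical side never touches a derivative formula: it only runs the forward pass twice, which is what makes it a useful independent check.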


Limitations

Finite differences are a debugging tool, not a training tool. Checking one gradient numerically with a centered difference costs two forward passes, so a network with a million parameters needs two million forward passes to check every gradient. Gradient checking is used during development to validate a new layer's backward implementation, never during training.

ReLU is not differentiable at z = 0. The chain rule requires every function in the chain to be differentiable at the point where you evaluate it. ReLU has a sharp corner at z = 0: the left derivative is 0 and the right derivative is 1. In practice, frameworks assign a fixed subgradient there (commonly 0, though any value between 0 and 1 is valid), and training proceeds without issue because a pre-activation rarely lands on exactly 0.

Very deep chains multiply many small numbers together. If each activation's derivative is less than 1 — which sigmoid's always is (maximum 0.25 at z = 0) — then a product of 50 such values collapses toward zero. The gradient arriving at early layers becomes numerically negligible and those weights stop learning. This is the vanishing gradient problem, a direct consequence of the chain rule operating over many layers with saturating activations.
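
A quick calculation shows how fast that product shrinks. The sketch below looks only at the activation-derivative factors and ignores the weight factors that also appear in the real product. The standard-normal pre-activations in the second half are an assumption made purely for illustration.

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

depth = 50

# Best case: every sigmoid sits exactly at z = 0, where its derivative peaks at 0.25.
upper_bound = 0.25 ** depth
print(f"0.25^{depth} = {upper_bound:.3e}")

# More typical case: pre-activations scattered around 0
# (standard normal draws are an illustrative assumption).
rng = np.random.default_rng(0)
z = rng.standard_normal(depth)
product = np.prod(sigmoid_grad(z))
print(f"product of {depth} sampled sigmoid derivatives = {product:.3e}")
```

Even the best case is around 10⁻³⁰, and any pre-activation away from 0 makes the product smaller still.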


The chain rule rests on two things from basic calculus: the notion of a partial derivative (what changes when you move one variable while holding others fixed) and function composition (the output of one function fed as input to another). If either of those is shaky, the multivariable chain rule won't click.

What the chain rule unlocks is backpropagation — the algorithm that applies it systematically to every weight in a network in a single backward pass. Beyond that, automatic differentiation systems (PyTorch's autograd, TensorFlow's gradient tape) implement the chain rule at the level of operator primitives, so any composed function you write is automatically differentiable. Understanding the chain rule is what makes those systems legible rather than magical.


Test Your Understanding

  1. A network uses y = tanh(z), z = wx + b. Write out dy/dw using the chain rule, then evaluate it at w = 0.3, x = 0.5, b = 0.0.

  2. The anchor network has a second hidden neuron with z1₂ = 0.13, a1₂ = 0.13, and W2₂ = -0.4. Compute ∂L/∂W1₂₁ (the gradient of the loss with respect to W1 row 2, column 1), using the same 4-step chain structure.

  3. Suppose you replace ReLU in the anchor with sigmoid. Recompute ∂a1₁/∂z1₁ at z1₁ = 0.33 and explain how this changes ∂L/∂W1₁₁.

  4. The finite difference formula uses (f(z + ε·x) − f(z − ε·x)) / (2ε) rather than (f(z + ε) − f(z)) / ε. Why does the symmetric (centered) version give a more accurate approximation, and what order of error does each have in terms of ε?

  5. Consider a 10-layer network where every activation is sigmoid with maximum derivative 0.25. Without any other changes, estimate the order of magnitude of ∂L/∂W1 relative to ∂L/∂W10. What does this imply for learning speed in early versus late layers?
