
Backpropagation and Weight Updates

Jun 29, 2026 · 14 min read · By Mohammed Vasim
Tags: deep-learning, neural-networks, machine-learning, representation-learning

The network in the previous post made a prediction of ŷ ≈ 0.536 for a sample whose true label is 0. The binary cross-entropy loss came out to 0.767. That is a large error for a sample that should produce output close to zero. The question is: which weights caused this, and by exactly how much should each one change?

Backpropagation answers that question systematically. It is the chain rule of calculus applied layer by layer, moving from the loss backward through every weight in the network. Each weight receives a gradient — a signed number that says "increase me and the loss goes up by this much per unit." Subtract a fraction of that gradient and the loss decreases. Do this for every weight simultaneously and the network improves.

The same network and sample from the previous post carry through the entire walkthrough:

  • Input: x = [0.5, 0.1], label: y = 0
  • W1 = [[0.5, −0.2], [0.3, 0.8]], b1 = [0.1, −0.1]
  • W2 = [[0.6], [−0.4]], b2 = [0.0]
  • Hidden activations after ReLU: a1₁ = 0.33, a1₂ = 0.13
  • Output after sigmoid: ŷ = 0.536, loss L = 0.767

Why Backpropagation Works

A neural network is a composition of functions: loss ∘ activation ∘ linear ∘ activation ∘ linear ∘ … applied to the input. The chain rule says that the derivative of a composition f(g(x)) is f′(g(x)) · g′(x). Backpropagation is nothing more than applying this rule repeatedly, starting from the outermost function (the loss) and working inward (toward the first layer).

Every weight W sits somewhere in this chain. The gradient ∂L/∂W is the slope of the loss surface with respect to W: how much L would change if W were nudged slightly. To compute it, the chain rule decomposes the path from W to L into individual local derivatives, each easy to compute, then multiplies them together.

The reason this is efficient is that intermediate results — called error signals or deltas — are shared. The gradient that flows back through the output layer is reused when computing gradients for every weight feeding into that layer. Without backpropagation, computing each gradient independently would be prohibitively expensive.


Phase 1 — Output Layer Gradient

The loss is binary cross-entropy and the output activation is sigmoid. Computing ∂L/∂z2 naively requires two chain rule steps: ∂L/∂ŷ and ∂ŷ/∂z2. But these two factors simplify to a single clean expression.

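Step 1 — ∂L/∂ŷ:

  L = −[y · ln(ŷ) + (1 − y) · ln(1 − ŷ)]
  ∂L/∂ŷ = −y/ŷ + (1 − y)/(1 − ŷ)

Step 2 — ∂ŷ/∂z2:

  ŷ = σ(z2) = 1 / (1 + e^(−z2))
  ∂ŷ/∂z2 = ŷ(1 − ŷ)

Step 3 — Combine:

  ∂L/∂z2 = ∂L/∂ŷ × ∂ŷ/∂z2 = [−y/ŷ + (1 − y)/(1 − ŷ)] × ŷ(1 − ŷ)

Expanding the product:

  ∂L/∂z2 = −y(1 − ŷ) + (1 − y)ŷ = −y + yŷ + ŷ − yŷ = ŷ − y

The ŷ(1−ŷ) term from the sigmoid derivative cancels perfectly against the denominators from the BCE derivative. The combined gradient is simply:

  δ2 = ∂L/∂z2 = ŷ − y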

Substituting the numbers: δ2 = 0.536 − 0 = 0.536.

The positive sign is informative. The network predicted 0.536 for a sample labeled 0, so it leaned too far toward the positive class. Because ∂L/∂z2 is positive, lowering z2 lowers the loss, and gradient descent adjusts the weights and bias feeding z2 to do exactly that, pulling the output down toward 0.
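
The cancellation can be checked numerically. The sketch below assumes the walkthrough's values (z2 = 0.146, y = 0); the variable names are illustrative. It multiplies the two local derivatives and compares the product with ŷ − y.

```python
import math

z2, y = 0.146, 0.0                      # pre-activation and label from the walkthrough
y_hat = 1.0 / (1.0 + math.exp(-z2))     # sigmoid output

dL_dyhat = -y / y_hat + (1 - y) / (1 - y_hat)   # BCE derivative w.r.t. y_hat
dyhat_dz2 = y_hat * (1 - y_hat)                 # sigmoid derivative w.r.t. z2

chain = dL_dyhat * dyhat_dz2            # two-step chain rule
shortcut = y_hat - y                    # simplified form

print(f"{chain:.3f} {shortcut:.3f}")    # 0.536 0.536
```

Both routes give the same number, so later code can use the one-line form without computing either factor separately.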

[Figure: output-layer backward step. Forward path: hidden layer (a1₁ = 0.33, a1₂ = 0.13) → z₂ = 0.146 → output node ŷ = σ(z₂) = 0.536 → loss L = 0.767. Backward: BCE and sigmoid cancel, so δ₂ = ∂L/∂z₂ = ŷ − y = 0.536.]

Phase 2 — Weight Gradients at the Output Layer

With δ2 = 0.536 in hand, the gradients for W2 and b2 follow directly from the chain rule through the linear operation z2 = a1 · W2 + b2.

The gradient with respect to a weight is the error signal multiplied by the activation that fed through that weight:

  ∂L/∂W2₁ = δ2 × a1₁ = 0.536 × 0.33 = 0.177
  ∂L/∂W2₂ = δ2 × a1₂ = 0.536 × 0.13 = 0.070

The bias gradient is always equal to the error signal because the bias enters z2 with a coefficient of 1, so there is no activation to multiply through:

  ∂L/∂b2 = δ2 × 1 = 0.536

Notice the asymmetry: W2₁ gets a gradient about 2.5 times larger than W2₂ (0.177 versus 0.070). This is because hidden neuron 1 (a1₁ = 0.33) was more active than hidden neuron 2 (a1₂ = 0.13). A weight connected to a more active neuron carries more responsibility for the error.
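
As a minimal sketch of this phase, assuming the rounded values from the walkthrough (the array names are illustrative), the output-layer gradients reduce to one multiplication:

```python
import numpy as np

a1 = np.array([0.33, 0.13])   # hidden activations from the forward pass
delta2 = 0.536                # output error signal, y_hat - y

dW2 = delta2 * a1             # error signal times the activation feeding each weight
db2 = delta2                  # the bias enters z2 with coefficient 1

print(f"dW2 = [{dW2[0]:.3f}, {dW2[1]:.3f}], db2 = {db2:.3f}")  # dW2 = [0.177, 0.070], db2 = 0.536
print(f"ratio = {dW2[0] / dW2[1]:.2f}")                         # ratio = 2.54
```

The ratio of the two weight gradients equals the ratio of the two activations, because δ2 is common to both.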


Phase 3 — Hidden Layer Gradient

To compute gradients for W1, the error signal must travel backward through W2 into the hidden layer. Two things happen:

  1. Distribute the error through W2 — the fraction of δ2 that hidden neuron j is responsible for is proportional to its weight W2ⱼ.
  2. Gate by the ReLU derivative — if the hidden neuron's pre-activation Z1 was ≤ 0, ReLU produced 0, meaning that neuron contributed nothing to z2. Its gradient is zeroed out.

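The combined formula:

  δ1ⱼ = W2ⱼ × δ2 × ReLU′(Z1ⱼ)

where ReLU′(z) = 1 if z > 0, else 0.

For hidden neuron 1 (Z1₁ = 0.33 > 0, so ReLU′ = 1):

  δ1₁ = W2₁ × δ2 × ReLU′(Z1₁) = 0.6 × 0.536 × 1 = 0.322

For hidden neuron 2 (Z1₂ = 0.13 > 0, so ReLU′ = 1):

  δ1₂ = W2₂ × δ2 × ReLU′(Z1₂) = −0.4 × 0.536 × 1 = −0.214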

The sign flip on δ1₂ matters. W2₂ = −0.4 is negative: increasing a1₂ would actually decrease z2, pulling ŷ downward. Since ŷ needs to come down (the prediction was too high), a1₂ going up is helpful. The negative δ1₂ signals that the weights feeding hidden neuron 2 (W1₂₁ and W1₂₂) should change in a direction that increases a1₂.

Both hidden neurons had positive pre-activations here, so neither is dead. If either Z1 had been negative, ReLU would have zeroed the activation and zeroed the gradient — that neuron's weights would receive no update at all.
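
The gating can be sketched in a few lines. The helper name `hidden_delta` and the second input, where neuron 2's pre-activation is set to −0.05, are hypothetical and are there only to show the gate closing.

```python
import numpy as np

W2 = np.array([0.6, -0.4])      # one outgoing weight per hidden neuron
delta2 = 0.536                  # output error signal

def hidden_delta(Z1):
    """Error signal per hidden neuron: W2 * delta2, gated by ReLU'(Z1)."""
    relu_grad = (Z1 > 0).astype(float)   # 1 where the neuron was active, else 0
    return W2 * delta2 * relu_grad

active = hidden_delta(np.array([0.33, 0.13]))    # pre-activations from the walkthrough
gated = hidden_delta(np.array([0.33, -0.05]))    # hypothetical: neuron 2 inactive

print(np.round(active, 3))   # neuron 1 is about 0.322, neuron 2 about -0.214
print(np.round(gated, 3))    # neuron 2's error signal is now zero
```

With a zero error signal, every weight gradient computed from it in the next phase is zero as well.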

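[Figure: error signals flowing backward through the network. Inputs: x₁ = 0.5, x₂ = 0.1. Output (σ): ŷ = 0.536, δ₂ = +0.536. H₁ (ReLU): a1₁ = 0.33, δ1₁ = +0.322 via W2₁ = 0.6. H₂ (ReLU): a1₂ = 0.13, δ1₂ = −0.214 via W2₂ = −0.4. Red arrows show the backward gradient direction; dashed lines show the forward connections for reference.]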

Phase 4 — Weight Gradients at the Hidden Layer

With δ1₁ = 0.322 and δ1₂ = −0.214, the hidden layer weight gradients follow the same pattern: error signal times the input that fed that weight.

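Weights feeding hidden neuron 1 (δ1₁ = 0.322):

  ∂L/∂W1₁₁ = δ1₁ × x₁ = 0.322 × 0.5 = 0.161
  ∂L/∂W1₁₂ = δ1₁ × x₂ = 0.322 × 0.1 = 0.032

Weights feeding hidden neuron 2 (δ1₂ = −0.214):

  ∂L/∂W1₂₁ = δ1₂ × x₁ = −0.214 × 0.5 = −0.107
  ∂L/∂W1₂₂ = δ1₂ × x₂ = −0.214 × 0.1 = −0.021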

x₁ = 0.5 contributes five times more to each gradient than x₂ = 0.1. This is why large input features dominate gradient updates — a strong argument for normalizing inputs before training.
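
All four hidden-layer gradients form an outer product of the error signals and the inputs. The sketch below assumes the rounded values above and the W1 layout used in this post (rows are hidden neurons, columns are inputs).

```python
import numpy as np

x = np.array([0.5, 0.1])            # input features
delta1 = np.array([0.322, -0.214])  # hidden error signals from Phase 3

# Row j holds the gradients for the weights feeding hidden neuron j,
# so dW1 has the same shape and layout as W1.
dW1 = np.outer(delta1, x)

print(np.round(dW1, 3))             # matches the four values derived above
print(dW1[:, 0] / dW1[:, 1])        # each row's ratio equals x1 / x2
```

The second print shows the input-scale effect directly: within each row, the gradient for the x₁ weight is five times the gradient for the x₂ weight.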


Phase 5 — Weight Update

Gradient descent updates each weight by stepping opposite to the gradient. With learning rate η = 0.1:

  W_new = W_old − η × ∂L/∂W

Output layer:

| Weight | Old value | Gradient | Update | New value |
| --- | --- | --- | --- | --- |
| W2₁ | 0.6 | 0.177 | −0.1 × 0.177 | 0.5823 |
| W2₂ | −0.4 | 0.070 | −0.1 × 0.070 | −0.4070 |
| b2 | 0.0 | 0.536 | −0.1 × 0.536 | −0.0536 |

Hidden layer:

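| Weight | Old value | Gradient | Update | New value |
| --- | --- | --- | --- | --- |
| W1₁₁ | 0.5 | 0.161 | −0.1 × 0.161 | 0.4839 |
| W1₁₂ | −0.2 | 0.032 | −0.1 × 0.032 | −0.2032 |
| W1₂₁ | 0.3 | −0.107 | −0.1 × (−0.107) | 0.3107 |
| W1₂₂ | 0.8 | −0.021 | −0.1 × (−0.021) | 0.8021 |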

W2₁ decreased from 0.6 to 0.5823: the connection from the more active hidden neuron was pulling the output too high, so it gets reduced. W1₂₁ and W1₂₂ both increased because their gradients were negative. Raising them makes hidden neuron 2 more active, and since W2₂ is negative, a larger a1₂ pulls z2, and with it ŷ, downward.
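
To see what the update does to the prediction, the sketch below applies the tabulated gradients and reruns the forward pass on the same sample. The `forward` helper is illustrative, and b1 is left unchanged because the tables above do not update it.

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    a1 = np.maximum(0, W1 @ x + b1)            # hidden layer with ReLU
    return 1 / (1 + np.exp(-(W2 @ a1 + b2)))   # sigmoid output

x = np.array([0.5, 0.1])
W1 = np.array([[0.5, -0.2], [0.3, 0.8]]); b1 = np.array([0.1, -0.1])
W2 = np.array([0.6, -0.4]);               b2 = 0.0

# Gradients copied from the trace (rounded to three decimals).
dW1 = np.array([[0.161, 0.032], [-0.107, -0.021]])
dW2 = np.array([0.177, 0.070]); db2 = 0.536

lr = 0.1
y_before = forward(x, W1, b1, W2, b2)
W1_new, W2_new, b2_new = W1 - lr * dW1, W2 - lr * dW2, b2 - lr * db2
y_after = forward(x, W1_new, b1, W2_new, b2_new)

print(np.round(W2_new, 4), round(b2_new, 4))             # new output-layer parameters
print(f"{float(y_before):.3f} -> {float(y_after):.3f}")  # 0.536 -> 0.520
```

One step moves the prediction toward the label 0, which is the behaviour the signs of the gradients promised.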

[Figure: full backward pass. Inputs: x₁ = 0.5, x₂ = 0.1. H₁ (ReLU): a1₁ = 0.33, Z1₁ = 0.33, δ1₁ = +0.322. H₂ (ReLU): a1₂ = 0.13, Z1₂ = 0.13, δ1₂ = −0.214. Output (σ): ŷ = 0.536, y = 0, δ₂ = +0.536. Loss: 0.767. Weight gradients: ∂L/∂W2₁ = 0.177, ∂L/∂W2₂ = 0.070, ∂L/∂W1₁₁ = 0.161, ∂L/∂W1₁₂ = 0.032, ∂L/∂W1₂₁ = −0.107, ∂L/∂W1₂₂ = −0.021. Gradients (red) flow right to left; each arrow carries the derivative of L with respect to that weight.]

Full Backward Pass Trace

| Step | Symbol | Formula | Values | Result |
| --- | --- | --- | --- | --- |
| Output error signal | δ₂ | ŷ − y | 0.536 − 0 | 0.536 |
| Output weight grad 1 | ∂L/∂W2₁ | δ₂ × a1₁ | 0.536 × 0.33 | 0.177 |
| Output weight grad 2 | ∂L/∂W2₂ | δ₂ × a1₂ | 0.536 × 0.13 | 0.070 |
| Bias gradient | ∂L/∂b₂ | δ₂ × 1 | 0.536 × 1 | 0.536 |
| Hidden error signal 1 | δ1₁ | W2₁ × δ₂ × ReLU′(Z1₁) | 0.6 × 0.536 × 1 | 0.322 |
| Hidden error signal 2 | δ1₂ | W2₂ × δ₂ × ReLU′(Z1₂) | −0.4 × 0.536 × 1 | −0.214 |
| Hidden weight grad 1 | ∂L/∂W1₁₁ | δ1₁ × x₁ | 0.322 × 0.5 | 0.161 |
| Hidden weight grad 2 | ∂L/∂W1₁₂ | δ1₁ × x₂ | 0.322 × 0.1 | 0.032 |
| Hidden weight grad 3 | ∂L/∂W1₂₁ | δ1₂ × x₁ | −0.214 × 0.5 | −0.107 |
| Hidden weight grad 4 | ∂L/∂W1₂₂ | δ1₂ × x₂ | −0.214 × 0.1 | −0.021 |
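
A common sanity check for a hand-derived trace like this is to compare it against finite differences: nudge one weight, rerun the forward pass, and measure the slope of the loss. The sketch below is one way to do that for W1 and W2; the helper `numeric_grad` and the step size `eps` are illustrative choices, not part of the original walkthrough.

```python
import numpy as np

x, y = np.array([0.5, 0.1]), 0.0

W1 = np.array([[0.5, -0.2], [0.3, 0.8]]); b1 = np.array([0.1, -0.1])
W2 = np.array([0.6, -0.4]);               b2 = 0.0

def loss():
    """Forward pass plus binary cross-entropy, using the current global weights."""
    a1 = np.maximum(0, W1 @ x + b1)
    y_hat = 1 / (1 + np.exp(-(W2 @ a1 + b2)))
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def numeric_grad(param, eps=1e-6):
    """Central-difference estimate of dL/dparam, one entry at a time."""
    grad = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        old = param[idx]
        param[idx] = old + eps; up = loss()      # nudge the entry up
        param[idx] = old - eps; down = loss()    # nudge the entry down
        param[idx] = old                         # restore the original value
        grad[idx] = (up - down) / (2 * eps)
    return grad

print(np.round(numeric_grad(W2), 3))   # compare with 0.177 and 0.070
print(np.round(numeric_grad(W1), 3))   # compare with 0.161, 0.032, -0.107, -0.021
```

The numerical slopes agree with the table to within rounding. This approach needs two forward passes per weight, which is why it is useful for checking gradients but not for training.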

Implementation

The single-sample trace above is instructive, but in practice backpropagation runs on batches. The code below runs one full forward+backward pass over all five samples and updates the weights by averaging gradients across the batch.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Five samples (rows) with two features each, and their binary labels
X = np.array([[0.5, 0.1], [0.9, 0.8], [0.2, 0.3], [0.7, 0.6], [0.4, 0.2]])
y = np.array([0, 1, 0, 1, 0])

# Rows of W1 are hidden neurons, columns are inputs (hence W1.T in the forward pass)
W1 = np.array([[0.5, -0.2], [0.3, 0.8]])
b1 = np.array([0.1, -0.1])
W2 = np.array([[0.6], [-0.4]])
b2 = np.array([0.0])

# Forward pass
Z1 = X @ W1.T + b1
A1 = relu(Z1)
Z2 = A1 @ W2 + b2
y_hat = sigmoid(Z2).flatten()

# Backward pass
delta_out = y_hat - y  # BCE+sigmoid combined gradient, one value per sample

dW2 = A1.T @ delta_out.reshape(-1, 1) / len(y)
db2 = delta_out.mean()

delta_hidden = (delta_out.reshape(-1, 1) @ W2.T) * (Z1 > 0)  # gate by the ReLU derivative
dW1 = delta_hidden.T @ X / len(y)  # shape (neurons, inputs), the same layout as W1
db1 = delta_hidden.mean(axis=0)

# Gradient descent step
lr = 0.1
W2 -= lr * dW2
b2 -= lr * db2
W1 -= lr * dW1
b1 -= lr * db1

print("Updated W2:", np.round(W2.flatten(), 4))
print("Updated W1:", np.round(W1, 4))
```

```text
Updated W2: [ 0.5998 -0.3908]
Updated W1: [[ 0.503  -0.195 ]
 [ 0.298   0.7967]]
```

Most batch-averaged updates are smaller in magnitude than the single-sample ones, and some reverse direction. Sample 2 (x=[0.9, 0.8], y=1) and sample 4 (x=[0.7, 0.6], y=1) push gradients in the opposite direction from samples 1, 3, and 5; this tension is exactly what averaging is meant to capture. The single-sample update moves W2₁ from 0.6 to 0.5823, while the batch update barely shifts it, to 0.5998, because the samples labeled 1 cancel most of the push from the samples labeled 0. For W2₂ the positive samples win outright: the batch update raises it from −0.4 to −0.3908, whereas sample 1 alone lowered it to −0.4070.


Effect of Learning Rate

The learning rate η controls how large each step is. Too small and training stalls; too large and the updates overshoot — the loss actually increases.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def bce(y_hat, y):
    # The small epsilon guards against log(0)
    return -np.mean(y * np.log(y_hat + 1e-9) + (1 - y) * np.log(1 - y_hat + 1e-9))

X = np.array([[0.5, 0.1], [0.9, 0.8], [0.2, 0.3], [0.7, 0.6], [0.4, 0.2]])
y = np.array([0, 1, 0, 1, 0])

def forward(W1, b1, W2, b2):
    Z1 = X @ W1.T + b1
    A1 = relu(Z1)
    y_hat = sigmoid(A1 @ W2 + b2).flatten()
    return Z1, A1, y_hat

def one_step(lr):
    # Start from the same initial weights for every learning rate
    W1 = np.array([[0.5, -0.2], [0.3, 0.8]])
    b1 = np.array([0.1, -0.1])
    W2 = np.array([[0.6], [-0.4]])
    b2 = np.array([0.0])

    Z1, A1, y_hat = forward(W1, b1, W2, b2)
    loss_before = bce(y_hat, y)

    # Backward pass with batch-averaged gradients
    delta_out = (y_hat - y).reshape(-1, 1)
    dW2 = A1.T @ delta_out / len(y)
    db2 = delta_out.mean()
    delta_hidden = (delta_out @ W2.T) * (Z1 > 0)
    dW1 = delta_hidden.T @ X / len(y)  # shape (neurons, inputs), the same layout as W1
    db1 = delta_hidden.mean(axis=0)

    # Gradient descent step, then re-evaluate the loss
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
    _, _, y_hat = forward(W1, b1, W2, b2)
    return loss_before, bce(y_hat, y)

for lr in [0.001, 0.01, 0.1, 1.0, 10.0]:
    before, after = one_step(lr)
    direction = "↓" if after < before else "↑ (overshoot)"
    print(f"lr={lr:5.3f} | loss before={before:.4f} | loss after={after:.4f} | {direction}")
```

```text
lr=0.001 | loss before=0.7303 | loss after=0.7303 | ↓
lr=0.010 | loss before=0.7303 | loss after=0.7300 | ↓
lr=0.100 | loss before=0.7303 | loss after=0.7273 | ↓
lr=1.000 | loss before=0.7303 | loss after=0.7021 | ↓
lr=10.000 | loss before=0.7303 | loss after=0.6070 | ↓
```

At η = 0.001 the loss barely moves: the step is too timid to register at four decimals. Each tenfold increase buys a larger one-step drop, and in this run even η = 10.0 lowers the loss, because the gradient at these starting weights is small. The overshoot branch never fires here; it takes a considerably larger rate before a single step raises the loss. One step is a weak test of stability, though. At η = 10.0 the update already drives hidden neuron 1's pre-activation below zero for three of the five samples, so ReLU switches that neuron off for them. Between η = 0.1 and η = 1.0, each step makes meaningful progress while every hidden unit stays active.


Where This Breaks

Very deep networks (more than ~20 layers): Gradients for early layers are products of many local derivatives, and those factors are often less than 1. With ReLU, each active neuron contributes a factor of 1, but a sigmoid contributes a factor of at most 0.25, and a tanh contributes less than 1 everywhere except at zero. After 20 sigmoid layers the activation derivatives alone scale the gradient by at most 0.25²⁰ ≈ 10⁻¹², so early weights barely update. This is the vanishing gradient problem. ResNet-style skip connections short-circuit the gradient path, and careful weight initialization (He, Glorot) keeps layer-wise variance in a safe range at the start.
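
A rough feel for the scale comes from the bound itself. The sketch below uses only the fact that the sigmoid derivative never exceeds 0.25; it ignores the weight matrices, which also multiply into the real gradient, so treat the numbers as an upper bound on the activation factors alone.

```python
# Upper bound on the product of sigmoid derivatives across a stack of layers.
bounds = {depth: 0.25 ** depth for depth in (5, 10, 20)}

for depth, bound in bounds.items():
    print(f"{depth:2d} layers: factor <= {bound:.1e}")
# prints 9.8e-04, 9.5e-07 and 9.1e-13 for depths 5, 10 and 20
```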

Single-sample backpropagation (this walkthrough): The gradient from one sample is a noisy estimate of the true gradient over the full dataset. Sample 1 says "W2₁ should decrease." Sample 2 says something different. Training on single samples causes the loss to jitter rather than decrease smoothly. Mini-batch averaging (batch size 32–256 in practice) smooths the gradient signal and stabilizes the update direction.
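
The disagreement between samples is easy to see by computing one weight's gradient per sample instead of averaging. The sketch below assumes the five-sample dataset and initial weights from the Implementation section and looks only at ∂L/∂W2₁; the name `per_sample` is illustrative.

```python
import numpy as np

X = np.array([[0.5, 0.1], [0.9, 0.8], [0.2, 0.3], [0.7, 0.6], [0.4, 0.2]])
y = np.array([0, 1, 0, 1, 0])
W1 = np.array([[0.5, -0.2], [0.3, 0.8]]); b1 = np.array([0.1, -0.1])
W2 = np.array([0.6, -0.4]);               b2 = 0.0

A1 = np.maximum(0, X @ W1.T + b1)             # hidden activations, one row per sample
y_hat = 1 / (1 + np.exp(-(A1 @ W2 + b2)))     # predictions, one per sample
per_sample = (y_hat - y) * A1[:, 0]           # dL/dW2_1 computed separately for each sample

print(np.round(per_sample, 3))        # positive for label 0, negative for label 1
print(round(per_sample.mean(), 4))    # the batch average is far smaller than any single entry
```

Updating on one sample at a time would push W2₁ back and forth between these values, while the average keeps only the small net direction.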

Dead ReLU neurons: If a hidden neuron's pre-activation Z1 is negative for every sample in the dataset, ReLU produces 0 for all of them, and the ReLU derivative is 0 for all of them. That neuron's incoming weights receive zero gradient — they never update. This "dead neuron" can happen after a large weight update overshoots. Leaky ReLU (which passes a small fraction of negative values through) avoids permanent death by keeping a non-zero gradient in the negative region.
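
The difference between the two activations shows up in their derivatives. In the sketch below, the function names and the negative-side slope of 0.01 are illustrative choices rather than values taken from this post.

```python
import numpy as np

def relu_grad(z):
    return (z > 0).astype(float)              # 0 for every non-positive input

def leaky_relu_grad(z, slope=0.01):           # slope is an illustrative choice
    return np.where(z > 0, 1.0, slope)        # small but non-zero on the negative side

z = np.array([-2.0, -0.05, 0.13])             # two inactive pre-activations, one active
print(relu_grad(z))         # zeros for the two negative inputs
print(leaky_relu_grad(z))   # 0.01 for the two negative inputs
```

Because the leaky variant never returns exactly zero, a neuron stuck in the negative region still receives a small gradient and can recover.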


Backpropagation assumes you can compute partial derivatives of the loss with respect to each operation in the network. That requires the chain rule of derivatives, which handles compositions of functions — covered in the next post. The loss function used here (binary cross-entropy) and the activations (sigmoid, ReLU) were introduced in the ANN post; their gradient formulas appear here in context. The weight update rule applied here is vanilla gradient descent. The optimizers series (gradient descent, SGD, momentum, Adam) builds on exactly this update rule, adding momentum terms, adaptive learning rates, and per-parameter scaling. Regularization techniques like dropout interact directly with backprop: dropout randomly zeros activations during the forward pass, which zeros the corresponding gradients in the backward pass, preventing any single neuron from dominating.


Test Your Understanding

  1. The combined gradient ∂L/∂z2 = ŷ − y arises because BCE and sigmoid are used together. If the output activation were linear instead of sigmoid (so z2 = ŷ directly), what would ∂L/∂z2 look like for the BCE loss?

  2. In the trace above, W1₂₁ received a negative gradient (−0.107) and its value was increased after the update. Explain in terms of the network's prediction error why increasing this weight is the correct direction.

  3. Suppose hidden neuron 2's pre-activation Z1₂ had been −0.05 instead of +0.13. Walk through what happens to δ1₂, ∂L/∂W1₂₁, and ∂L/∂W1₂₂. Does the label y = 0 affect this outcome?

  4. The code averages gradients across 5 samples before updating. If you instead updated after every sample (pure online learning), in what scenario would that actually converge faster than mini-batch averaging?

  5. A network has 50 hidden layers, all using sigmoid activation. After backpropagating through all 50 layers, estimate the order of magnitude of the gradient reaching layer 1, given that each sigmoid layer contributes at most a factor of 0.25 to the gradient. What does this imply for training the earliest layers?
