~/blog
Gradient Descent
Training a neural network is an optimization problem. You have a cost function J(w, b) — a surface defined over all model parameters. You want to find the lowest point on that surface. Gradient descent is the algorithm that does it: compute the slope at the current position, take a step downhill, repeat.
The idea is simple enough that it sounds obvious in hindsight. What makes it non-trivial is choosing the step size (learning rate), knowing when to stop, and understanding why the naive version fails at scale — which is why the next five posts build on top of it.
Anchor: predict salary from years of experience.
X = [1, 2, 4, 6, 8] # years experience
y = [40000, 50000, 70000, 90000, 110000] # salary (dollars)Model: ŷ = w·x + b — single weight, single bias. Normalize before training.
The Optimization Problem
The cost function is MSE over all 5 training samples:
J(w, b) = (1/n) Σ (ŷᵢ − yᵢ)²
J is a function of w and b. For our linear model, this surface is a bowl — one global minimum. For neural networks, the surface is high-dimensional and non-convex, but the same algorithm applies.
The Update Rule
Why subtract the gradient? The gradient ∂J/∂w points in the direction of steepest ascent — the direction that increases J. To minimize J, we move in the opposite direction.
w ← w − η · ∂J/∂w
b ← b − η · ∂J/∂b
η (eta) is the learning rate — the step size. Too large: we overshoot the minimum. Too small: convergence is slow. Choosing η is the most critical hyperparameter in training.
Full Epoch 1 Computation
Normalize the data first:
X_n = [-1.414, -0.707, 0.000, 0.707, 1.414]
y_n = [-1.414, -0.707, 0.000, 0.707, 1.414]Step 1 — Forward pass (w=0, b=0): ŷ = w·x + b = 0 for all samples → ŷ = [0, 0, 0, 0, 0]
Step 2 — Cost (MSE):
J = (1/5) × [(0−(−1.414))² + (0−(−0.707))² + (0−0)² + (0−0.707)² + (0−1.414)²] = (1/5) × [2.0 + 0.5 + 0 + 0.5 + 2.0] = (1/5) × 5.0 = 1.000
Step 3 — Gradients:
∂J/∂w = (2/n) Σ (ŷᵢ − yᵢ)·xᵢ
= (2/5) × [(0−(−1.414))(−1.414) + (0−(−0.707))(−0.707) + 0 + (0−0.707)(0.707) + (0−1.414)(1.414)] = (2/5) × [(1.414)(−1.414) + (0.707)(−0.707) + 0 + (−0.707)(0.707) + (−1.414)(1.414)] = (2/5) × [−2.0 − 0.5 + 0 − 0.5 − 2.0] = (2/5) × (−5.0) = −2.000
∂J/∂b = (2/n) Σ (ŷᵢ − yᵢ) = (2/5) × [1.414 + 0.707 + 0 − 0.707 − 1.414] = 0.000
Step 4 — Update (η = 0.1):
w_new = 0 − 0.1 × (−2.000) = 0.200 b_new = 0 − 0.1 × 0.000 = 0.000
The weight moves from 0 to 0.2 in the first step — the model starts learning the slope.
Types of Gradient Descent
This post covers full-batch gradient descent: every sample is used to compute the gradient before each weight update. The next posts cover the variants.
| Type | Samples per update | Pros | Cons |
|---|---|---|---|
| Batch GD | All n | Stable, exact gradient | Slow per epoch, high memory |
| SGD | 1 | Fast, online learning possible | Noisy gradient, high variance |
| Mini-batch SGD | k (32–256) | Balanced — GPU-efficient | Batch size hyperparameter |
Learning Rate Sensitivity
- η = 0.001: the cost decreases but very slowly. After 100 epochs it is still near its starting value.
- η = 0.1: clean descent toward the minimum.
- η = 1.0: the update overshoots the minimum each step, bouncing to a higher cost. If the learning rate is large enough, this oscillation grows and the cost diverges.
Convergence Criteria
Training stops when one of three conditions is met:
- Tolerance: |J(t) − J(t−1)| < ε (cost barely changes between steps — at the flat bottom)
- Max iterations: training has run for max_epochs regardless of convergence
- Gradient norm: ||∇J||₂ < threshold (gradient is small — we are near a stationary point)
In practice, frameworks use max_epochs as the primary stopping criterion and monitor the validation loss separately to detect overfitting.
Trace Table
| Epoch | w | b | J | ∂J/∂w | ∂J/∂b |
|---|---|---|---|---|---|
| 0 | 0.0000 | 0.0000 | 1.0000 | −2.0000 | 0.0000 |
| 1 | 0.2000 | 0.0000 | 0.6000 | −1.5492 | 0.0000 |
| 2 | 0.3549 | 0.0000 | 0.3601 | −1.1985 | 0.0000 |
| 3 | 0.4748 | 0.0000 | 0.2161 | −0.9278 | 0.0000 |
| 4 | 0.5675 | 0.0000 | 0.1297 | −0.7183 | 0.0000 |
The bias stays 0 because the data is symmetric and zero-centered after normalization — ∂J/∂b = (2/n)Σ(ŷ−y) sums to zero when the predictions are symmetric around zero.
Code
import numpy as np
X = np.array([1, 2, 4, 6, 8], dtype=float)
y = np.array([40000, 50000, 70000, 90000, 110000], dtype=float)
X_n = (X - X.mean()) / X.std()
y_n = (y - y.mean()) / y.std()
w, b, lr = 0.0, 0.0, 0.1
for epoch in range(5):
y_hat = w * X_n + b
loss = np.mean((y_hat - y_n) ** 2)
dw = (2 / len(y_n)) * np.sum((y_hat - y_n) * X_n)
db = (2 / len(y_n)) * np.sum(y_hat - y_n)
w -= lr * dw
b -= lr * db
print(f"Epoch {epoch+1}: w={w:.4f}, b={b:.4f}, loss={loss:.4f}, dw={dw:.4f}")Epoch 1: w=0.2000, b=0.0000, loss=1.0000, dw=-2.0000
Epoch 2: w=0.3549, b=0.0000, loss=0.6000, dw=-1.5492
Epoch 3: w=0.4748, b=0.0000, loss=0.3601, dw=-1.1985
Epoch 4: w=0.5675, b=0.0000, loss=0.2161, dw=-0.9278
Epoch 5: w=0.6393, b=0.0000, loss=0.1297, dw=-0.7183w converges toward 1.0 — the normalized slope of the salary vs experience relationship. Each epoch, the gradient magnitude (dw column) shrinks because the model is getting closer to the minimum and the slope of the loss surface is flattening.
Related Concepts
Where this builds from: The cost function and loss/cost distinction (section 4, post 01) explained what J is. Backpropagation and the chain rule (section 2, posts 04–05) showed how ∂J/∂w is computed for neural networks. Gradient descent is the algorithm that uses those gradients.
Where this leads: SGD (next post) processes one sample at a time. Mini-batch SGD processes k samples. Momentum, Adagrad, RMSProp, and Adam each modify how the gradient is used in the update step — not the gradient computation itself, but how the step is taken.
Honest Limitations
Batch gradient descent does not scale. With 1 million training samples, computing ∂J/∂w requires a full pass over all samples before a single weight update. At 100 epochs, that is 100 million sample evaluations before convergence. Mini-batch SGD is used in practice precisely because it amortizes the cost.
The cost landscape for neural networks is not a bowl. The convex quadratic surface in this post is specific to linear regression with MSE. Neural networks have non-convex, saddle-point-dense surfaces. Gradient descent converges to a local minimum, not necessarily the global one. In practice, large models have so many parameters that saddle points and flat regions are bigger concerns than local minima.
Gradient descent requires differentiable loss. MAE is non-differentiable at zero. Hinge loss is non-differentiable at the margin boundary. Subgradient methods or smooth approximations (Huber) are required for these losses.
Test Your Understanding
-
The gradient ∂J/∂w = −2.000 after epoch 0 with w=0, b=0. Without running any code, predict the sign of ∂J/∂w after epoch 1 (w=0.2). Explain why the magnitude shrinks and why the sign is still negative.
-
In the trace table, ∂J/∂b = 0 throughout training. Prove that this must be zero for any w and b when X_n and y_n are zero-centered. Use the formula ∂J/∂b = (2/n)Σ(ŷᵢ − yᵢ).
-
Learning rate η = 1.0 causes divergence on this dataset. Derive the maximum stable learning rate for gradient descent on MSE with n=5 samples and the given normalized X_n values. Hint: the update rule is stable when |1 − η × (2/n) × Σxᵢ²| < 1.
-
If batch gradient descent takes 1 second per epoch and needs 1000 epochs to converge, and SGD (batch size=1) takes 0.001 seconds per sample and takes 5000 epochs to converge (each epoch = 5 samples = 5 updates), which is faster in wall-clock time? What assumption is hidden in this comparison?
-
Consider a non-convex loss surface with one local minimum at J=0.5 and one global minimum at J=0.1. Starting from a random point, standard gradient descent converges to one of them. What factor determines which minimum it converges to, and how does the learning rate affect this?