~/blog

Gradient Descent

Jul 1, 20268 min readBy Mohammed Vasim
deep-learningneural-networksmachine-learningrepresentation-learning

Training a neural network is an optimization problem. You have a cost function J(w, b) — a surface defined over all model parameters. You want to find the lowest point on that surface. Gradient descent is the algorithm that does it: compute the slope at the current position, take a step downhill, repeat.

The idea is simple enough that it sounds obvious in hindsight. What makes it non-trivial is choosing the step size (learning rate), knowing when to stop, and understanding why the naive version fails at scale — which is why the next five posts build on top of it.

Anchor: predict salary from years of experience.

python
X = [1, 2, 4, 6, 8]  # years experience
y = [40000, 50000, 70000, 90000, 110000]  # salary (dollars)

Model: ŷ = w·x + b — single weight, single bias. Normalize before training.


The Optimization Problem

The cost function is MSE over all 5 training samples:

J(w, b) = (1/n) Σ (ŷᵢ − yᵢ)²

J is a function of w and b. For our linear model, this surface is a bowl — one global minimum. For neural networks, the surface is high-dimensional and non-convex, but the same algorithm applies.

Loss Surface J(w,b) — Descend Toward Minimum minimum start step 1: −η·∇J Each step: w ← w − η · ∂J/∂w. Gradient points uphill; we subtract to go downhill. w axis → J(w,b)

The Update Rule

Why subtract the gradient? The gradient ∂J/∂w points in the direction of steepest ascent — the direction that increases J. To minimize J, we move in the opposite direction.

w ← w − η · ∂J/∂w

b ← b − η · ∂J/∂b

η (eta) is the learning rate — the step size. Too large: we overshoot the minimum. Too small: convergence is slow. Choosing η is the most critical hyperparameter in training.


Full Epoch 1 Computation

Normalize the data first:

text
X_n = [-1.414, -0.707, 0.000, 0.707, 1.414]
y_n = [-1.414, -0.707, 0.000, 0.707, 1.414]

Step 1 — Forward pass (w=0, b=0): ŷ = w·x + b = 0 for all samples → ŷ = [0, 0, 0, 0, 0]

Step 2 — Cost (MSE):

J = (1/5) × [(0−(−1.414))² + (0−(−0.707))² + (0−0)² + (0−0.707)² + (0−1.414)²] = (1/5) × [2.0 + 0.5 + 0 + 0.5 + 2.0] = (1/5) × 5.0 = 1.000

Step 3 — Gradients:

∂J/∂w = (2/n) Σ (ŷᵢ − yᵢ)·xᵢ

= (2/5) × [(0−(−1.414))(−1.414) + (0−(−0.707))(−0.707) + 0 + (0−0.707)(0.707) + (0−1.414)(1.414)] = (2/5) × [(1.414)(−1.414) + (0.707)(−0.707) + 0 + (−0.707)(0.707) + (−1.414)(1.414)] = (2/5) × [−2.0 − 0.5 + 0 − 0.5 − 2.0] = (2/5) × (−5.0) = −2.000

∂J/∂b = (2/n) Σ (ŷᵢ − yᵢ) = (2/5) × [1.414 + 0.707 + 0 − 0.707 − 1.414] = 0.000

Step 4 — Update (η = 0.1):

w_new = 0 − 0.1 × (−2.000) = 0.200 b_new = 0 − 0.1 × 0.000 = 0.000

The weight moves from 0 to 0.2 in the first step — the model starts learning the slope.

Gradient Flow — Epoch 1 (w=0, b=0) MSE = 1.000 ∂J/∂w = −2.000 ∂J/∂b = 0.000 w ← 0 − 0.1×(−2)= 0.2 b ← 0 − 0.1×0 = 0.0 Loss computed over all 5 samples Chain rule: ∂J/∂w = (2/n)Σ(ŷ−y)x η = 0.1 Summary: gradient −2.0 on w means increasing w increases J — wrong direction. Subtract gradient × lr → w grows from 0 to 0.2 — the model learns the positive slope.

Types of Gradient Descent

This post covers full-batch gradient descent: every sample is used to compute the gradient before each weight update. The next posts cover the variants.

TypeSamples per updateProsCons
Batch GDAll nStable, exact gradientSlow per epoch, high memory
SGD1Fast, online learning possibleNoisy gradient, high variance
Mini-batch SGDk (32–256)Balanced — GPU-efficientBatch size hyperparameter

Learning Rate Sensitivity

Learning Rate Effect — The Goldilocks Zone Epochs Cost J 0 100 η=0.001 (too slow) η=0.1 (good) η=1.0 (diverges)
  • η = 0.001: the cost decreases but very slowly. After 100 epochs it is still near its starting value.
  • η = 0.1: clean descent toward the minimum.
  • η = 1.0: the update overshoots the minimum each step, bouncing to a higher cost. If the learning rate is large enough, this oscillation grows and the cost diverges.

Convergence Criteria

Training stops when one of three conditions is met:

  1. Tolerance: |J(t) − J(t−1)| < ε (cost barely changes between steps — at the flat bottom)
  2. Max iterations: training has run for max_epochs regardless of convergence
  3. Gradient norm: ||∇J||₂ < threshold (gradient is small — we are near a stationary point)

In practice, frameworks use max_epochs as the primary stopping criterion and monitor the validation loss separately to detect overfitting.


Trace Table

EpochwbJ∂J/∂w∂J/∂b
00.00000.00001.0000−2.00000.0000
10.20000.00000.6000−1.54920.0000
20.35490.00000.3601−1.19850.0000
30.47480.00000.2161−0.92780.0000
40.56750.00000.1297−0.71830.0000

The bias stays 0 because the data is symmetric and zero-centered after normalization — ∂J/∂b = (2/n)Σ(ŷ−y) sums to zero when the predictions are symmetric around zero.


Code

python
import numpy as np

X = np.array([1, 2, 4, 6, 8], dtype=float)
y = np.array([40000, 50000, 70000, 90000, 110000], dtype=float)
X_n = (X - X.mean()) / X.std()
y_n = (y - y.mean()) / y.std()

w, b, lr = 0.0, 0.0, 0.1

for epoch in range(5):
    y_hat = w * X_n + b
    loss  = np.mean((y_hat - y_n) ** 2)
    dw    = (2 / len(y_n)) * np.sum((y_hat - y_n) * X_n)
    db    = (2 / len(y_n)) * np.sum(y_hat - y_n)
    w -= lr * dw
    b -= lr * db
    print(f"Epoch {epoch+1}: w={w:.4f}, b={b:.4f}, loss={loss:.4f}, dw={dw:.4f}")
text
Epoch 1: w=0.2000, b=0.0000, loss=1.0000, dw=-2.0000
Epoch 2: w=0.3549, b=0.0000, loss=0.6000, dw=-1.5492
Epoch 3: w=0.4748, b=0.0000, loss=0.3601, dw=-1.1985
Epoch 4: w=0.5675, b=0.0000, loss=0.2161, dw=-0.9278
Epoch 5: w=0.6393, b=0.0000, loss=0.1297, dw=-0.7183

w converges toward 1.0 — the normalized slope of the salary vs experience relationship. Each epoch, the gradient magnitude (dw column) shrinks because the model is getting closer to the minimum and the slope of the loss surface is flattening.


Where this builds from: The cost function and loss/cost distinction (section 4, post 01) explained what J is. Backpropagation and the chain rule (section 2, posts 04–05) showed how ∂J/∂w is computed for neural networks. Gradient descent is the algorithm that uses those gradients.

Where this leads: SGD (next post) processes one sample at a time. Mini-batch SGD processes k samples. Momentum, Adagrad, RMSProp, and Adam each modify how the gradient is used in the update step — not the gradient computation itself, but how the step is taken.


Honest Limitations

Batch gradient descent does not scale. With 1 million training samples, computing ∂J/∂w requires a full pass over all samples before a single weight update. At 100 epochs, that is 100 million sample evaluations before convergence. Mini-batch SGD is used in practice precisely because it amortizes the cost.

The cost landscape for neural networks is not a bowl. The convex quadratic surface in this post is specific to linear regression with MSE. Neural networks have non-convex, saddle-point-dense surfaces. Gradient descent converges to a local minimum, not necessarily the global one. In practice, large models have so many parameters that saddle points and flat regions are bigger concerns than local minima.

Gradient descent requires differentiable loss. MAE is non-differentiable at zero. Hinge loss is non-differentiable at the margin boundary. Subgradient methods or smooth approximations (Huber) are required for these losses.


Test Your Understanding

  1. The gradient ∂J/∂w = −2.000 after epoch 0 with w=0, b=0. Without running any code, predict the sign of ∂J/∂w after epoch 1 (w=0.2). Explain why the magnitude shrinks and why the sign is still negative.

  2. In the trace table, ∂J/∂b = 0 throughout training. Prove that this must be zero for any w and b when X_n and y_n are zero-centered. Use the formula ∂J/∂b = (2/n)Σ(ŷᵢ − yᵢ).

  3. Learning rate η = 1.0 causes divergence on this dataset. Derive the maximum stable learning rate for gradient descent on MSE with n=5 samples and the given normalized X_n values. Hint: the update rule is stable when |1 − η × (2/n) × Σxᵢ²| < 1.

  4. If batch gradient descent takes 1 second per epoch and needs 1000 epochs to converge, and SGD (batch size=1) takes 0.001 seconds per sample and takes 5000 epochs to converge (each epoch = 5 samples = 5 updates), which is faster in wall-clock time? What assumption is hidden in this comparison?

  5. Consider a non-convex loss surface with one local minimum at J=0.5 and one global minimum at J=0.1. Starting from a random point, standard gradient descent converges to one of them. What factor determines which minimum it converges to, and how does the learning rate affect this?

Comments (0)

No comments yet. Be the first to comment!

Leave a comment