~/blog

Gradient Descent

Jul 1, 2026•8 min read•By Mohammed Vasim

deep-learningneural-networksmachine-learningrepresentation-learning

Training a neural network is an optimization problem. You have a cost function J(w, b) — a surface defined over all model parameters. You want to find the lowest point on that surface. Gradient descent is the algorithm that does it: compute the slope at the current position, take a step downhill, repeat.

The idea is simple enough that it sounds obvious in hindsight. What makes it non-trivial is choosing the step size (learning rate), knowing when to stop, and understanding why the naive version fails at scale — which is why the next five posts build on top of it.

Anchor: predict salary from years of experience.

python

X = [1, 2, 4, 6, 8]  # years experience
y = [40000, 50000, 70000, 90000, 110000]  # salary (dollars)

Model: ŷ = w·x + b — single weight, single bias. Normalize before training.

The Optimization Problem

The cost function is MSE over all 5 training samples:

J(w, b) = (1/n) Σ (ŷᵢ − yᵢ)²

J is a function of w and b. For our linear model, this surface is a bowl — one global minimum. For neural networks, the surface is high-dimensional and non-convex, but the same algorithm applies.

The Update Rule

Why subtract the gradient? The gradient ∂J/∂w points in the direction of steepest ascent — the direction that increases J. To minimize J, we move in the opposite direction.

w ← w − η · ∂J/∂w

b ← b − η · ∂J/∂b

η (eta) is the learning rate — the step size. Too large: we overshoot the minimum. Too small: convergence is slow. Choosing η is the most critical hyperparameter in training.

Full Epoch 1 Computation

Normalize the data first:

text

X_n = [-1.414, -0.707, 0.000, 0.707, 1.414]
y_n = [-1.414, -0.707, 0.000, 0.707, 1.414]

Step 1 — Forward pass (w=0, b=0): ŷ = w·x + b = 0 for all samples → ŷ = [0, 0, 0, 0, 0]

Step 2 — Cost (MSE):

J = (1/5) × [(0−(−1.414))² + (0−(−0.707))² + (0−0)² + (0−0.707)² + (0−1.414)²] = (1/5) × [2.0 + 0.5 + 0 + 0.5 + 2.0] = (1/5) × 5.0 = 1.000

Step 3 — Gradients:

∂J/∂w = (2/n) Σ (ŷᵢ − yᵢ)·xᵢ

= (2/5) × [(0−(−1.414))(−1.414) + (0−(−0.707))(−0.707) + 0 + (0−0.707)(0.707) + (0−1.414)(1.414)] = (2/5) × [(1.414)(−1.414) + (0.707)(−0.707) + 0 + (−0.707)(0.707) + (−1.414)(1.414)] = (2/5) × [−2.0 − 0.5 + 0 − 0.5 − 2.0] = (2/5) × (−5.0) = −2.000

∂J/∂b = (2/n) Σ (ŷᵢ − yᵢ) = (2/5) × [1.414 + 0.707 + 0 − 0.707 − 1.414] = 0.000

Step 4 — Update (η = 0.1):

w_new = 0 − 0.1 × (−2.000) = 0.200 b_new = 0 − 0.1 × 0.000 = 0.000

The weight moves from 0 to 0.2 in the first step — the model starts learning the slope.

Types of Gradient Descent

This post covers full-batch gradient descent: every sample is used to compute the gradient before each weight update. The next posts cover the variants.

Type	Samples per update	Pros	Cons
Batch GD	All n	Stable, exact gradient	Slow per epoch, high memory
SGD	1	Fast, online learning possible	Noisy gradient, high variance
Mini-batch SGD	k (32–256)	Balanced — GPU-efficient	Batch size hyperparameter

Learning Rate Sensitivity

η = 0.001: the cost decreases but very slowly. After 100 epochs it is still near its starting value.
η = 0.1: clean descent toward the minimum.
η = 1.0: the update overshoots the minimum each step, bouncing to a higher cost. If the learning rate is large enough, this oscillation grows and the cost diverges.

Convergence Criteria

Training stops when one of three conditions is met:

Tolerance: |J(t) − J(t−1)| < ε (cost barely changes between steps — at the flat bottom)
Max iterations: training has run for max_epochs regardless of convergence
Gradient norm: ||∇J||₂ < threshold (gradient is small — we are near a stationary point)

In practice, frameworks use max_epochs as the primary stopping criterion and monitor the validation loss separately to detect overfitting.

Trace Table

Epoch	w	J	∂J/∂w
0	0.0000	1.0000	−2.0000
1	0.2000	0.6000	−1.5492
2	0.3549	0.3601	−1.1985
3	0.4748	0.2161	−0.9278
4	0.5675	0.1297	−0.7183

The bias stays 0 because the data is symmetric and zero-centered after normalization — ∂J/∂b = (2/n)Σ(ŷ−y) sums to zero when the predictions are symmetric around zero.

Code

python

import numpy as np

X = np.array([1, 2, 4, 6, 8], dtype=float)
y = np.array([40000, 50000, 70000, 90000, 110000], dtype=float)
X_n = (X - X.mean()) / X.std()
y_n = (y - y.mean()) / y.std()

w, b, lr = 0.0, 0.0, 0.1

for epoch in range(5):
    y_hat = w * X_n + b
    loss  = np.mean((y_hat - y_n) ** 2)
    dw    = (2 / len(y_n)) * np.sum((y_hat - y_n) * X_n)
    db    = (2 / len(y_n)) * np.sum(y_hat - y_n)
    w -= lr * dw
    b -= lr * db
    print(f"Epoch {epoch+1}: w={w:.4f}, b={b:.4f}, loss={loss:.4f}, dw={dw:.4f}")

text

Epoch 1: w=0.2000, b=0.0000, loss=1.0000, dw=-2.0000
Epoch 2: w=0.3549, b=0.0000, loss=0.6000, dw=-1.5492
Epoch 3: w=0.4748, b=0.0000, loss=0.3601, dw=-1.1985
Epoch 4: w=0.5675, b=0.0000, loss=0.2161, dw=-0.9278
Epoch 5: w=0.6393, b=0.0000, loss=0.1297, dw=-0.7183

w converges toward 1.0 — the normalized slope of the salary vs experience relationship. Each epoch, the gradient magnitude (dw column) shrinks because the model is getting closer to the minimum and the slope of the loss surface is flattening.

Where this builds from: The cost function and loss/cost distinction (section 4, post 01) explained what J is. Backpropagation and the chain rule (section 2, posts 04–05) showed how ∂J/∂w is computed for neural networks. Gradient descent is the algorithm that uses those gradients.

Where this leads: SGD (next post) processes one sample at a time. Mini-batch SGD processes k samples. Momentum, Adagrad, RMSProp, and Adam each modify how the gradient is used in the update step — not the gradient computation itself, but how the step is taken.

Honest Limitations

Batch gradient descent does not scale. With 1 million training samples, computing ∂J/∂w requires a full pass over all samples before a single weight update. At 100 epochs, that is 100 million sample evaluations before convergence. Mini-batch SGD is used in practice precisely because it amortizes the cost.

The cost landscape for neural networks is not a bowl. The convex quadratic surface in this post is specific to linear regression with MSE. Neural networks have non-convex, saddle-point-dense surfaces. Gradient descent converges to a local minimum, not necessarily the global one. In practice, large models have so many parameters that saddle points and flat regions are bigger concerns than local minima.

Gradient descent requires differentiable loss. MAE is non-differentiable at zero. Hinge loss is non-differentiable at the margin boundary. Subgradient methods or smooth approximations (Huber) are required for these losses.

Test Your Understanding

The gradient ∂J/∂w = −2.000 after epoch 0 with w=0, b=0. Without running any code, predict the sign of ∂J/∂w after epoch 1 (w=0.2). Explain why the magnitude shrinks and why the sign is still negative.
In the trace table, ∂J/∂b = 0 throughout training. Prove that this must be zero for any w and b when X_n and y_n are zero-centered. Use the formula ∂J/∂b = (2/n)Σ(ŷᵢ − yᵢ).
Learning rate η = 1.0 causes divergence on this dataset. Derive the maximum stable learning rate for gradient descent on MSE with n=5 samples and the given normalized X_n values. Hint: the update rule is stable when |1 − η × (2/n) × Σxᵢ²| < 1.
If batch gradient descent takes 1 second per epoch and needs 1000 epochs to converge, and SGD (batch size=1) takes 0.001 seconds per sample and takes 5000 epochs to converge (each epoch = 5 samples = 5 updates), which is faster in wall-clock time? What assumption is hidden in this comparison?
Consider a non-convex loss surface with one local minimum at J=0.5 and one global minimum at J=0.1. Starting from a random point, standard gradient descent converges to one of them. What factor determines which minimum it converges to, and how does the learning rate affect this?

Gradient Descent

The Optimization Problem

The Update Rule

Full Epoch 1 Computation

Types of Gradient Descent

Learning Rate Sensitivity

Convergence Criteria

Trace Table

Code

Honest Limitations

Test Your Understanding

Comments (0)

Leave a comment

Gradient Descent

The Optimization Problem

The Update Rule

Full Epoch 1 Computation

Types of Gradient Descent

Learning Rate Sensitivity

Convergence Criteria

Trace Table

Code

Related Concepts

Honest Limitations

Test Your Understanding

Comments (0)

Leave a comment