← View series: machine learning
~/blog
Gradient Descent
The OLS closed form gives you the exact answer in one shot — but only for linear regression with MSE. The moment you change the loss function or the model, there's no algebraic shortcut. Gradient descent is the algorithm that works everywhere: logistic regression, neural networks, XGBoost internals. Understanding it on linear regression — where you can verify the result against OLS — is the right place to build the intuition.
The Gradient of MSE
The gradient is the vector of partial derivatives:
The gradient points in the direction of steepest ascent. We subtract it to go downhill:
where is the learning rate — how big a step we take each iteration.
Anchor dataset: , .
Gradient Descent on Unscaled Data — The Scaling Problem
Start with , , :
Iteration 1:
- Predictions:
- Residuals:
- Update: ← already exploding
is too large for unscaled data. The scale of (650–1900) makes the gradient orders of magnitude larger than the gradient. This is why feature scaling is not optional — it's structurally required for gradient descent to work efficiently.
Feature Scaling Before Gradient Descent
from sklearn.preprocessing import StandardScaler
import numpy as np
X_raw = np.array([650, 850, 1100, 1400, 1600, 1900]).reshape(-1, 1)
y_raw = np.array([180, 220, 280, 340, 370, 430]).reshape(-1, 1)
scaler_X = StandardScaler()
scaler_y = StandardScaler()
X_scaled = scaler_X.fit_transform(X_raw).flatten()
y_scaled = scaler_y.fit_transform(y_raw).flatten()
print("X_scaled:", X_scaled.round(3))
print("y_scaled:", y_scaled.round(3))X_scaled: [-1.414 -0.943 -0.314 0.314 0.628 1.257]
y_scaled: [-1.414 -0.943 -0.314 0.314 0.628 1.257]
After scaling, both and have mean 0 and standard deviation 1. The gradients are balanced, and works smoothly.
Manual Gradient Descent on Scaled Data — 4 Iterations
Start: , , :
| Iter | MSE | ||
|---|---|---|---|
| 0 | 0.0000 | 0.0000 | 1.0000 |
| 1 | 0.0000 | 0.3000 | 0.4200 |
| 2 | 0.0000 | 0.5460 | 0.1992 |
| 3 | 0.0000 | 0.7322 | 0.0994 |
| 4 | 0.0000 | 0.8625 | 0.0526 |
stays at 0 because — the scaled target is already centered. converges toward 1.0 in scaled space, which corresponds to in original space (after unscaling).
<text x="290" y="268" text-anchor="middle" font-size="12" fill="#334155">Iteration</text>
<text x="18" y="130" text-anchor="middle" font-size="12" fill="#334155" transform="rotate(-90,18,130)">MSE</text>
<text x="65" y="253" font-size="10" fill="#64748b">0</text>
<text x="155" y="253" font-size="10" fill="#64748b">20</text>
<text x="255" y="253" font-size="10" fill="#64748b">50</text>
<text x="355" y="253" font-size="10" fill="#64748b">100</text>
<text x="455" y="253" font-size="10" fill="#64748b">200</text>
<path d="M65,38 Q100,60 140,100 Q200,150 280,200 Q360,225 440,235 Q480,238 510,239" fill="none" stroke="#3b82f6" stroke-width="2"/>
<circle cx="65" cy="38" r="4" fill="#f59e0b"/>
<text x="72" y="42" font-size="9" fill="#f59e0b">MSE=1.0 (start)</text>
<circle cx="510" cy="239" r="4" fill="#22c55e"/>
<text x="420" y="232" font-size="9" fill="#22c55e">converged ≈ 0</text>
The Three Variants of Gradient Descent
| Variant | Update Uses | Pros | Cons |
|---|---|---|---|
| Batch GD | All samples per step | Smooth convergence, accurate gradient | Slow on large |
| Stochastic GD (SGD) | 1 random sample per step | Fast per step, can escape shallow minima | Noisy, oscillates |
| Mini-batch GD | samples per step () | Balanced speed and stability | Extra hyperparameter |
For our 6-sample anchor, all three give the same final weights — the difference matters at where batch GD requires computing gradients over all 1M samples each step.
SGD Step Trace — 1 Sample
On the same unscaled anchor, , randomly pick sample 3 (, ):
With initial , : ,
One sample gives a noisier gradient than the full batch — but costs the computation. The noise averages out over many iterations.
Learning Rate Sensitivity
<text x="290" y="268" text-anchor="middle" font-size="12" fill="#334155">Iteration</text>
<text x="18" y="130" text-anchor="middle" font-size="12" fill="#334155" transform="rotate(-90,18,130)">MSE</text>
<path d="M65,38 Q200,36 350,34 Q430,33 510,32" fill="none" stroke="#94a3b8" stroke-width="2" stroke-dasharray="5,3"/>
<text x="355" y="30" font-size="10" fill="#94a3b8">α=0.000001 (too small)</text>
<path d="M65,38 Q100,80 150,140 Q230,195 340,228 Q430,237 510,239" fill="none" stroke="#22c55e" stroke-width="2"/>
<text x="420" y="235" font-size="10" fill="#22c55e">α=0.01 (good)</text>
<path d="M65,38 Q80,230 95,50 Q110,230 125,50 Q140,230 155,50 Q200,100 250,80 Q350,60 510,50" fill="none" stroke="#dc2626" stroke-width="1.5"/>
<text x="250" y="73" font-size="10" fill="#dc2626">α=0.5 (oscillates)</text>
- too small: the curve barely moves over 200 iterations — slow but stable.
- right: rapid descent, converges cleanly around 50–100 iterations.
- too large: MSE oscillates up and down, potentially diverging — the step overshoots the minimum.
Code Implementation
def gradient_descent(X, y, alpha=0.1, n_iter=200):
n = len(y)
w0, w1 = 0.0, 0.0
history = []
for _ in range(n_iter):
y_pred = w0 + w1 * X
error = y - y_pred
dw0 = -(2 / n) * error.sum()
dw1 = -(2 / n) * (X * error).sum()
w0 -= alpha * dw0
w1 -= alpha * dw1
history.append((error ** 2).mean())
return w0, w1, history
w0_s, w1_s, hist = gradient_descent(X_scaled, y_scaled, alpha=0.1, n_iter=200)
print(f"w₀={w0_s:.4f}, w₁={w1_s:.4f}")
print(f"Final MSE: {hist[-1]:.6f}")w₀=0.0000, w₁=0.9998
Final MSE: 0.000001
Convergence Criteria
Gradient descent stops when one of three conditions is met:
- Loss change < threshold:
- Weight change < threshold:
- Max iterations reached — safety stop to prevent infinite loops
Gradient Descent Cheat Sheet
| Step | Action |
|---|---|
| 1 | Initialize , |
| 2 | Compute predictions |
| 3 | Compute residuals |
| 4 | Compute gradients: and |
| 5 | Update: |
| 6 | Repeat until convergence |
Related Concepts and Honest Limitations
For linear regression, gradient descent and OLS always arrive at the same weights (both find the unique global minimum). The difference is practical: OLS is — impractical when or is large. Gradient descent with mini-batches scales to millions of samples and thousands of features.
The honest limitation of gradient descent is its sensitivity to hyperparameters: learning rate , the number of iterations, and the batch size all require tuning. The learning rate in particular needs to be scaled to the data — which is why feature scaling isn't a preprocessing convenience but a prerequisite for gradient descent to work at reasonable values.
Test Your Understanding
-
For the anchor data, compute the gradient at , manually. Confirm the sign: should gradient descent increase or decrease from here?
-
After one mini-batch gradient descent step with batch size 3 (samples 1, 3, 5 of the anchor), how does the update differ from a full-batch step? Which samples were excluded and what bias does that introduce?
-
If you scale only but not , will gradient descent still converge? Will the learned correspond to the correct unscaled coefficient after inverse-transforming?
-
SGD is described as "noisier" than batch gradient descent. Under what conditions is that noise actually beneficial?
-
The learning rate caused oscillation on our anchor. Derive the exact maximum stable learning rate for gradient descent on MSE using the Hessian eigenvalue bound .