Back to blog
← View series: machine learning

~/blog

Gradient Descent

Jun 25, 20266 min readBy Mohammed Vasim
Machine LearningAIData Science

The OLS closed form gives you the exact answer in one shot — but only for linear regression with MSE. The moment you change the loss function or the model, there's no algebraic shortcut. Gradient descent is the algorithm that works everywhere: logistic regression, neural networks, XGBoost internals. Understanding it on linear regression — where you can verify the result against OLS — is the right place to build the intuition.

The Gradient of MSE

The gradient is the vector of partial derivatives:

The gradient points in the direction of steepest ascent. We subtract it to go downhill:

where is the learning rate — how big a step we take each iteration.

Anchor dataset: , .

Gradient Descent on Unscaled Data — The Scaling Problem

Start with , , :

Iteration 1:

  • Predictions:
  • Residuals:
  • Update: ← already exploding

is too large for unscaled data. The scale of (650–1900) makes the gradient orders of magnitude larger than the gradient. This is why feature scaling is not optional — it's structurally required for gradient descent to work efficiently.

Feature Scaling Before Gradient Descent

python
from sklearn.preprocessing import StandardScaler
import numpy as np

X_raw = np.array([650, 850, 1100, 1400, 1600, 1900]).reshape(-1, 1)
y_raw = np.array([180, 220, 280, 340, 370, 430]).reshape(-1, 1)

scaler_X = StandardScaler()
scaler_y = StandardScaler()

X_scaled = scaler_X.fit_transform(X_raw).flatten()
y_scaled = scaler_y.fit_transform(y_raw).flatten()

print("X_scaled:", X_scaled.round(3))
print("y_scaled:", y_scaled.round(3))
X_scaled: [-1.414 -0.943 -0.314 0.314 0.628 1.257] y_scaled: [-1.414 -0.943 -0.314 0.314 0.628 1.257]

After scaling, both and have mean 0 and standard deviation 1. The gradients are balanced, and works smoothly.

Manual Gradient Descent on Scaled Data — 4 Iterations

Start: , , :

IterMSE
00.00000.00001.0000
10.00000.30000.4200
20.00000.54600.1992
30.00000.73220.0994
40.00000.86250.0526

stays at 0 because — the scaled target is already centered. converges toward 1.0 in scaled space, which corresponds to in original space (after unscaling).

<text x="290" y="268" text-anchor="middle" font-size="12" fill="#334155">Iteration</text> <text x="18" y="130" text-anchor="middle" font-size="12" fill="#334155" transform="rotate(-90,18,130)">MSE</text> <text x="65" y="253" font-size="10" fill="#64748b">0</text> <text x="155" y="253" font-size="10" fill="#64748b">20</text> <text x="255" y="253" font-size="10" fill="#64748b">50</text> <text x="355" y="253" font-size="10" fill="#64748b">100</text> <text x="455" y="253" font-size="10" fill="#64748b">200</text> <path d="M65,38 Q100,60 140,100 Q200,150 280,200 Q360,225 440,235 Q480,238 510,239" fill="none" stroke="#3b82f6" stroke-width="2"/> <circle cx="65" cy="38" r="4" fill="#f59e0b"/> <text x="72" y="42" font-size="9" fill="#f59e0b">MSE=1.0 (start)</text> <circle cx="510" cy="239" r="4" fill="#22c55e"/> <text x="420" y="232" font-size="9" fill="#22c55e">converged ≈ 0</text>

The Three Variants of Gradient Descent

VariantUpdate UsesProsCons
Batch GDAll samples per stepSmooth convergence, accurate gradientSlow on large
Stochastic GD (SGD)1 random sample per stepFast per step, can escape shallow minimaNoisy, oscillates
Mini-batch GD samples per step ()Balanced speed and stabilityExtra hyperparameter

For our 6-sample anchor, all three give the same final weights — the difference matters at where batch GD requires computing gradients over all 1M samples each step.

SGD Step Trace — 1 Sample

On the same unscaled anchor, , randomly pick sample 3 (, ):

With initial , : ,

One sample gives a noisier gradient than the full batch — but costs the computation. The noise averages out over many iterations.

Learning Rate Sensitivity

<text x="290" y="268" text-anchor="middle" font-size="12" fill="#334155">Iteration</text> <text x="18" y="130" text-anchor="middle" font-size="12" fill="#334155" transform="rotate(-90,18,130)">MSE</text> <path d="M65,38 Q200,36 350,34 Q430,33 510,32" fill="none" stroke="#94a3b8" stroke-width="2" stroke-dasharray="5,3"/> <text x="355" y="30" font-size="10" fill="#94a3b8">α=0.000001 (too small)</text> <path d="M65,38 Q100,80 150,140 Q230,195 340,228 Q430,237 510,239" fill="none" stroke="#22c55e" stroke-width="2"/> <text x="420" y="235" font-size="10" fill="#22c55e">α=0.01 (good)</text> <path d="M65,38 Q80,230 95,50 Q110,230 125,50 Q140,230 155,50 Q200,100 250,80 Q350,60 510,50" fill="none" stroke="#dc2626" stroke-width="1.5"/> <text x="250" y="73" font-size="10" fill="#dc2626">α=0.5 (oscillates)</text>
  • too small: the curve barely moves over 200 iterations — slow but stable.
  • right: rapid descent, converges cleanly around 50–100 iterations.
  • too large: MSE oscillates up and down, potentially diverging — the step overshoots the minimum.

Code Implementation

python
def gradient_descent(X, y, alpha=0.1, n_iter=200):
    n = len(y)
    w0, w1 = 0.0, 0.0
    history = []

    for _ in range(n_iter):
        y_pred = w0 + w1 * X
        error = y - y_pred
        dw0 = -(2 / n) * error.sum()
        dw1 = -(2 / n) * (X * error).sum()
        w0 -= alpha * dw0
        w1 -= alpha * dw1
        history.append((error ** 2).mean())

    return w0, w1, history

w0_s, w1_s, hist = gradient_descent(X_scaled, y_scaled, alpha=0.1, n_iter=200)
print(f"w₀={w0_s:.4f}, w₁={w1_s:.4f}")
print(f"Final MSE: {hist[-1]:.6f}")
w₀=0.0000, w₁=0.9998 Final MSE: 0.000001

Convergence Criteria

Gradient descent stops when one of three conditions is met:

  1. Loss change < threshold:
  2. Weight change < threshold:
  3. Max iterations reached — safety stop to prevent infinite loops

Gradient Descent Cheat Sheet

StepAction
1Initialize ,
2Compute predictions
3Compute residuals
4Compute gradients: and
5Update:
6Repeat until convergence

For linear regression, gradient descent and OLS always arrive at the same weights (both find the unique global minimum). The difference is practical: OLS is — impractical when or is large. Gradient descent with mini-batches scales to millions of samples and thousands of features.

The honest limitation of gradient descent is its sensitivity to hyperparameters: learning rate , the number of iterations, and the batch size all require tuning. The learning rate in particular needs to be scaled to the data — which is why feature scaling isn't a preprocessing convenience but a prerequisite for gradient descent to work at reasonable values.

Test Your Understanding

  1. For the anchor data, compute the gradient at , manually. Confirm the sign: should gradient descent increase or decrease from here?

  2. After one mini-batch gradient descent step with batch size 3 (samples 1, 3, 5 of the anchor), how does the update differ from a full-batch step? Which samples were excluded and what bias does that introduce?

  3. If you scale only but not , will gradient descent still converge? Will the learned correspond to the correct unscaled coefficient after inverse-transforming?

  4. SGD is described as "noisier" than batch gradient descent. Under what conditions is that noise actually beneficial?

  5. The learning rate caused oscillation on our anchor. Derive the exact maximum stable learning rate for gradient descent on MSE using the Hessian eigenvalue bound .

Comments (0)

No comments yet. Be the first to comment!

Leave a comment