Gradient Descent


The OLS closed form gives you the exact answer in one shot, but only for linear regression with MSE. Once you change the loss function or the model, the algebraic shortcut usually disappears. Gradient descent needs only a differentiable loss, which is why it (or a close relative) drives logistic regression, neural networks, and the boosting steps inside XGBoost. Understanding it on linear regression, where you can verify the result against OLS, is the right place to build the intuition.

The Gradient of MSE

$J (w_{0}, w_{1}) = \frac{1}{n} \sum_{i = 1}^{n} (y_{i} - w_{0} - w_{1} x_{i})^{2}$

The gradient is the vector of partial derivatives:

$\frac{\partial J}{\partial w _{0}} = - \frac{2}{n} \sum_{i = 1}^{n} (y_{i} - w_{0} - w_{1} x_{i}) = - \frac{2}{n} \sum ε_{i}$

$\frac{\partial J}{\partial w _{1}} = - \frac{2}{n} \sum_{i = 1}^{n} x_{i} (y_{i} - w_{0} - w_{1} x_{i}) = - \frac{2}{n} \sum x_{i} ε_{i}$

The gradient points in the direction of steepest ascent. We subtract it to go downhill:

$w \leftarrow w - α \cdot \nabla J (w)$

where $α$ is the learning rate — how big a step we take each iteration.

Anchor dataset: $X = [650, 850, 1100, 1400, 1600, 1900]$ , $y = [180, 220, 280, 340, 370, 430]$ .
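A quick way to trust the two partial derivatives is to compare them with finite differences on the anchor data. The sketch below is only an illustration: the function names, the evaluation point $(0, 0)$, and the step size `h` are assumptions of this example, not part of the original derivation.

```python
# Anchor dataset from the text.
X = [650, 850, 1100, 1400, 1600, 1900]
y = [180, 220, 280, 340, 370, 430]


def mse(w0, w1, xs, ys):
    """Mean squared error J(w0, w1) of the line w0 + w1 * x."""
    return sum((yi - w0 - w1 * xi) ** 2 for xi, yi in zip(xs, ys)) / len(ys)


def mse_gradient(w0, w1, xs, ys):
    """Analytic partial derivatives (dJ/dw0, dJ/dw1) from the formulas above."""
    n = len(ys)
    residuals = [yi - w0 - w1 * xi for xi, yi in zip(xs, ys)]
    dw0 = -(2 / n) * sum(residuals)
    dw1 = -(2 / n) * sum(xi * ei for xi, ei in zip(xs, residuals))
    return dw0, dw1


def numeric_gradient(w0, w1, xs, ys, h=1e-4):
    """Central finite differences; h is an arbitrary small step (assumption)."""
    dw0 = (mse(w0 + h, w1, xs, ys) - mse(w0 - h, w1, xs, ys)) / (2 * h)
    dw1 = (mse(w0, w1 + h, xs, ys) - mse(w0, w1 - h, xs, ys)) / (2 * h)
    return dw0, dw1


analytic = mse_gradient(0.0, 0.0, X, y)
numeric = numeric_gradient(0.0, 0.0, X, y)
print("analytic:", analytic)
print("numeric: ", numeric)
```

Because $J$ is quadratic in the weights, the two results should agree up to floating-point noise; a visible gap would point to a sign or indexing slip in the analytic formula.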

Gradient Descent on Unscaled Data — The Scaling Problem

Start with $w_{0} = 0$ , $w_{1} = 0$ , $α = 0.0001$ :

Iteration 1:

Predictions: $\hat{y} = [0, 0, 0, 0, 0, 0]$
Residuals: $ε = [180, 220, 280, 340, 370, 430]$
$\partial J / \partial w_{0} = - (2/6) (180 + 220 + 280 + 340 + 370 + 430) = - (2/6) (1820) = - 606.67$
$\partial J / \partial w_{1} = - (2/6) (650 \times 180 + 850 \times 220 + 1100 \times 280 + 1400 \times 340 + 1600 \times 370 + 1900 \times 430)$ $= - (2/6) (117000 + 187000 + 308000 + 476000 + 592000 + 817000) = - (2/6) (2497000) = - 832333$
Update: $w_{1} = 0 - 0.0001 \times (- 832333) = 83.23$ ← already exploding

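$α = 0.0001$ is too large for unscaled data: the least-squares slope is about 0.20, so a single step to $w_{1} = 83.23$ overshoots it by more than two orders of magnitude. The scale of $x$ (650–1900) makes the $w_{1}$ gradient roughly 1,400 times larger than the $w_{0}$ gradient, so no single learning rate suits both weights. A much smaller $α$ would keep the updates stable but would crawl along $w_{0}$, which is why feature scaling is effectively required for gradient descent to work efficiently.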
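The arithmetic of iteration 1 can be replayed in a few lines, and letting the loop run a little longer shows what "exploding" means in practice. This is a minimal sketch in plain Python; running three steps is an arbitrary choice for illustration, and the variable names are this example's own.

```python
X = [650, 850, 1100, 1400, 1600, 1900]
y = [180, 220, 280, 340, 370, 430]
alpha = 0.0001  # learning rate used in the text
n = len(y)

w0, w1 = 0.0, 0.0
mse_history = []  # MSE measured before each update
trace = []        # (w0, w1) after each update
first_gradient = None

for step in range(1, 4):  # three steps, chosen only for illustration
    residuals = [yi - w0 - w1 * xi for xi, yi in zip(X, y)]
    mse_history.append(sum(e * e for e in residuals) / n)
    dw0 = -(2 / n) * sum(residuals)
    dw1 = -(2 / n) * sum(xi * ei for xi, ei in zip(X, residuals))
    if first_gradient is None:
        first_gradient = (dw0, dw1)
    w0 -= alpha * dw0
    w1 -= alpha * dw1
    trace.append((w0, w1))
    print(f"step {step}: dJ/dw0={dw0:.2f}, dJ/dw1={dw1:.2f}, w0={w0:.4f}, w1={w1:.4f}")

print("MSE before each step:", [f"{m:.3g}" for m in mse_history])
```

The MSE grows from one step to the next instead of shrinking, which is the signature of a learning rate above the stability limit for this data.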

Feature Scaling Before Gradient Descent

python

from sklearn.preprocessing import StandardScaler
import numpy as np

X_raw = np.array([650, 850, 1100, 1400, 1600, 1900]).reshape(-1, 1)
y_raw = np.array([180, 220, 280, 340, 370, 430]).reshape(-1, 1)

scaler_X = StandardScaler()
scaler_y = StandardScaler()

X_scaled = scaler_X.fit_transform(X_raw).flatten()
y_scaled = scaler_y.fit_transform(y_raw).flatten()

print("X_scaled:", X_scaled.round(3))
print("y_scaled:", y_scaled.round(3))

X_scaled: [-1.395 -0.93  -0.349  0.349  0.814  1.511]
y_scaled: [-1.432 -0.967 -0.271  0.426  0.774  1.47 ]

After scaling, both $X$ and $y$ have mean 0 and standard deviation 1. The gradients are balanced, and $α = 0.1$ works smoothly.
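To make the transformation less of a black box, here is a rough standard-library equivalent of what the scaler does to each column: subtract the mean and divide by the population standard deviation (the divide-by-$n$ version, which is what `StandardScaler` uses). The helper name `standardize` is made up for this sketch.

```python
from statistics import fmean, pstdev

X_raw = [650, 850, 1100, 1400, 1600, 1900]
y_raw = [180, 220, 280, 340, 370, 430]


def standardize(values):
    """Return z-scores plus the mean and population std used to compute them."""
    mu = fmean(values)
    sigma = pstdev(values)  # population std (divide by n)
    return [(v - mu) / sigma for v in values], mu, sigma


X_scaled, x_mean, x_std = standardize(X_raw)
y_scaled, y_mean, y_std = standardize(y_raw)

print("x mean, std:", x_mean, round(x_std, 3))
print("y mean, std:", round(y_mean, 3), round(y_std, 3))
print("X_scaled:", [round(v, 3) for v in X_scaled])
print("y_scaled:", [round(v, 3) for v in y_scaled])
```

Keeping the means and standard deviations around matters: they are needed later to map the learned weights back to the original units.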

Manual Gradient Descent on Scaled Data — 4 Iterations

Start: $w_{0} = 0.0$ , $w_{1} = 0.0$ , $α = 0.1$ :

Iter	$w_{1}$	MSE
0	0.0000	1.0000
1	0.1997	0.6411
2	0.3595	0.4114
3	0.4873	0.2644
4	0.5895	0.1703

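$w_{0}$ stays at 0 because both scaled variables are centered ( $\bar{x}_{scaled} = \bar{y}_{scaled} \approx 0$ ), so its gradient is zero at every step. $w_{1}$ converges toward the correlation between $X$ and $y$, about 0.9985 in scaled space, which corresponds to $w_{1} = 0.20$ in original space after unscaling (multiply by $σ_{y} / σ_{x} \approx 86.15 / 430.12$ ).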
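The unscaling step is easy to get wrong, so it helps to spell it out. The sketch below runs batch gradient descent in scaled space, converts the weights back to original units, and compares them with the one-feature OLS closed form. It uses plain Python rather than the scikit-learn scaler, and the conversion formulas follow from substituting the two standardizations into $\hat{y}_{scaled} = w_{0} + w_{1} x_{scaled}$; all variable names are this example's own.

```python
from statistics import fmean, pstdev

X = [650, 850, 1100, 1400, 1600, 1900]
y = [180, 220, 280, 340, 370, 430]

# Standardize both variables (population std, matching StandardScaler).
x_mean, x_std = fmean(X), pstdev(X)
y_mean, y_std = fmean(y), pstdev(y)
xs = [(v - x_mean) / x_std for v in X]
ys = [(v - y_mean) / y_std for v in y]

# Batch gradient descent in scaled space, alpha = 0.1 as in the text.
n, alpha = len(ys), 0.1
w0_s, w1_s = 0.0, 0.0
for _ in range(200):
    errors = [yi - w0_s - w1_s * xi for xi, yi in zip(xs, ys)]
    w0_s += alpha * (2 / n) * sum(errors)
    w1_s += alpha * (2 / n) * sum(xi * ei for xi, ei in zip(xs, errors))

# Undo the scaling: y = y_mean + y_std * (w0_s + w1_s * (x - x_mean) / x_std).
w1_orig = w1_s * y_std / x_std
w0_orig = y_mean + y_std * w0_s - w1_orig * x_mean

# OLS closed form for a single feature, used as the reference answer.
sxy = sum((a - x_mean) * (b - y_mean) for a, b in zip(X, y))
sxx = sum((a - x_mean) ** 2 for a in X)
w1_ols = sxy / sxx
w0_ols = y_mean - w1_ols * x_mean

print(f"scaled:   w0={w0_s:.4f}, w1={w1_s:.4f}")
print(f"unscaled: w0={w0_orig:.4f}, w1={w1_orig:.4f}")
print(f"OLS:      w0={w0_ols:.4f}, w1={w1_ols:.4f}")
```

If the unscaled weights and the OLS weights disagree beyond rounding, the usual culprit is forgetting the intercept correction `- w1_orig * x_mean`.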

[Figure: MSE (vertical axis) against iteration (horizontal axis, 0 to 200) for gradient descent on the scaled data. The curve starts at MSE = 1.0 and flattens out near 0 once the weights have converged.]

The Three Variants of Gradient Descent

Variant	Update Uses	Pros	Cons
Batch GD	All $n$ samples per step	Smooth convergence, accurate gradient	Slow on large $n$
Stochastic GD (SGD)	1 random sample per step	Fast per step, can escape shallow minima	Noisy, oscillates
Mini-batch GD	$k$ samples per step ( $k = 32/64/128$ )	Balanced speed and stability	Extra hyperparameter $k$

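For our 6-sample anchor, all three end up at essentially the same weights (SGD and mini-batch GD with a fixed $α$ hover close to the minimum rather than landing on it exactly). The difference matters at $n = 1{,}000{,}000$, where batch GD has to compute gradients over all 1M samples for every step.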
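The three variants differ only in how many samples feed each update, so one loop with a `batch_size` argument can cover all of them. The following is a sketch under stated assumptions: it works on the standardized anchor, and the learning rate, epoch count, seed, and function name are illustrative choices rather than values from the text.

```python
import random
from statistics import fmean, pstdev

X = [650, 850, 1100, 1400, 1600, 1900]
y = [180, 220, 280, 340, 370, 430]
xs = [(v - fmean(X)) / pstdev(X) for v in X]  # standardized feature
ys = [(v - fmean(y)) / pstdev(y) for v in y]  # standardized target


def minibatch_gd(xs, ys, batch_size, alpha=0.05, epochs=300, seed=0):
    """Mini-batch gradient descent for y ~ w0 + w1 * x.

    batch_size=len(ys) gives batch GD and batch_size=1 gives SGD.
    alpha, epochs and seed are illustrative defaults, not tuned values.
    """
    rng = random.Random(seed)
    indices = list(range(len(ys)))
    w0, w1 = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(indices)  # new sample order every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errors = [ys[i] - w0 - w1 * xs[i] for i in batch]
            dw0 = -(2 / len(batch)) * sum(errors)
            dw1 = -(2 / len(batch)) * sum(xs[i] * e for i, e in zip(batch, errors))
            w0 -= alpha * dw0
            w1 -= alpha * dw1
    return w0, w1


results = {}
for name, k in [("batch", 6), ("mini-batch", 2), ("SGD", 1)]:
    results[name] = minibatch_gd(xs, ys, batch_size=k)
    w0, w1 = results[name]
    print(f"{name:>10} (k={k}): w0={w0:+.4f}, w1={w1:+.4f}")
```

With a fixed learning rate the stochastic variants keep jittering around the minimum, so expect their weights to match the batch result only to a couple of decimal places.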

SGD Step Trace — 1 Sample

On the same unscaled anchor, $α = 0.0001$ , randomly pick sample 3 ( $x = 1100$ , $y = 280$ ):

With initial $w_{0} = 0$ , $w_{1} = 0$ : $\hat{y} = 0$ , $ε = 280$

$\frac{\partial J _{SGD}}{\partial w _{0}} = - 2 \times 280 = - 560$

$\frac{\partial J _{SGD}}{\partial w _{1}} = - 2 \times 1100 \times 280 = - 616000$

$w_{0} \leftarrow 0 + 0.0001 \times 560 = 0.056, w_{1} \leftarrow 0 + 0.0001 \times 616000 = 61.6$

One sample gives a noisier gradient than the full batch — but costs $1/6$ the computation. The noise averages out over many iterations.
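The single-sample trace above can be checked, and set against the full-batch gradient at the same starting point, with a short script. Treat it as a sketch: the variable names are this example's own, and sample 3 in the text's 1-based numbering is index 2 in Python.

```python
X = [650, 850, 1100, 1400, 1600, 1900]
y = [180, 220, 280, 340, 370, 430]
alpha = 0.0001
w0, w1 = 0.0, 0.0

# Sample 3 in the text's 1-based numbering is index 2 here.
x_i, y_i = X[2], y[2]
residual = y_i - (w0 + w1 * x_i)

# Single-sample gradient: the batch formula with n = 1.
sgd_dw0 = -2 * residual
sgd_dw1 = -2 * x_i * residual

# Full-batch gradient at the same point, for comparison.
n = len(y)
residuals = [b - (w0 + w1 * a) for a, b in zip(X, y)]
batch_dw0 = -(2 / n) * sum(residuals)
batch_dw1 = -(2 / n) * sum(a * e for a, e in zip(X, residuals))

w0_new = w0 - alpha * sgd_dw0
w1_new = w1 - alpha * sgd_dw1

print(f"SGD gradient:   ({sgd_dw0:.2f}, {sgd_dw1:.2f})")
print(f"batch gradient: ({batch_dw0:.2f}, {batch_dw1:.2f})")
print(f"after SGD step: w0={w0_new:.3f}, w1={w1_new:.1f}")
```

The two gradients share a sign but differ in size, which is the "noise" in question: each sample gives its own estimate of the batch gradient.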

Learning Rate Sensitivity

[Figure: MSE against iteration for three learning rates. α = 0.000001 (too small) stays almost flat near the starting MSE; α = 0.01 (good) descends smoothly toward 0; α = 0.5 (oscillates) swings up and down in the early iterations.]

$α$ too small: the curve barely moves over 200 iterations — slow but stable.
$α$ right: rapid descent, converges cleanly around 50–100 iterations.
$α$ too large: MSE oscillates up and down, potentially diverging — the step overshoots the minimum.
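These three regimes can be observed directly by running the same loop with different learning rates and looking at the final MSE. The sketch below uses the standardized anchor; the specific rates are picked for this example to bracket the behaviours on that data and are not the values from the plot.

```python
from statistics import fmean, pstdev

X = [650, 850, 1100, 1400, 1600, 1900]
y = [180, 220, 280, 340, 370, 430]
xs = [(v - fmean(X)) / pstdev(X) for v in X]
ys = [(v - fmean(y)) / pstdev(y) for v in y]


def final_mse(alpha, n_iter=200):
    """Run batch gradient descent from (0, 0) and return the MSE at the end."""
    n = len(ys)
    w0, w1 = 0.0, 0.0
    for _ in range(n_iter):
        errors = [yi - w0 - w1 * xi for xi, yi in zip(xs, ys)]
        w0 += alpha * (2 / n) * sum(errors)
        w1 += alpha * (2 / n) * sum(xi * ei for xi, ei in zip(xs, errors))
    return fmean([(yi - w0 - w1 * xi) ** 2 for xi, yi in zip(xs, ys)])


# Learning rates picked for this sketch, not taken from the plot.
for alpha in [1e-6, 0.01, 0.1, 0.95, 1.05]:
    print(f"alpha={alpha:<8g} MSE after 200 iterations: {final_mse(alpha):.6g}")
```

The smallest rate leaves the MSE close to its starting value, the middle rates settle near the minimum, and the largest one ends far above where it started.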

Code Implementation

python

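def gradient_descent(X, y, alpha=0.1, n_iter=200):
    """Batch gradient descent for y ≈ w0 + w1 * X with MSE loss.

    X and y are 1-D NumPy arrays; returns the final weights and the
    MSE recorded at each iteration (before that iteration's update).
    """
    n = len(y)
    w0, w1 = 0.0, 0.0
    history = []

    for _ in range(n_iter):
        y_pred = w0 + w1 * X
        error = y - y_pred                    # residuals ε
        dw0 = -(2 / n) * error.sum()          # ∂J/∂w0
        dw1 = -(2 / n) * (X * error).sum()    # ∂J/∂w1
        w0 -= alpha * dw0                     # step against the gradient
        w1 -= alpha * dw1
        history.append((error ** 2).mean())   # MSE before this update

    return w0, w1, history

# X_scaled and y_scaled come from the feature-scaling block above
w0_s, w1_s, hist = gradient_descent(X_scaled, y_scaled, alpha=0.1, n_iter=200)
print(f"w₀={w0_s:.4f}, w₁={w1_s:.4f}")
print(f"Final MSE: {hist[-1]:.6f}")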

w₀=0.0000, w₁=0.9985
Final MSE: 0.002994

Convergence Criteria

Gradient descent stops when one of three conditions is met:

Loss change < threshold: $|J(t) - J(t - 1)| < 10^{-6}$
Weight change < threshold: $\|w(t) - w(t - 1)\| < 10^{-8}$
Max iterations reached — safety stop to prevent infinite loops
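The three criteria above can be wired into the loop itself so that it reports which one fired. This is a minimal sketch on the standardized anchor: the thresholds are the ones quoted in the list, while the iteration cap of 10,000 and the function name are assumptions of this example.

```python
from statistics import fmean, pstdev

X = [650, 850, 1100, 1400, 1600, 1900]
y = [180, 220, 280, 340, 370, 430]
xs = [(v - fmean(X)) / pstdev(X) for v in X]
ys = [(v - fmean(y)) / pstdev(y) for v in y]


def gd_with_stopping(xs, ys, alpha=0.1, loss_tol=1e-6, weight_tol=1e-8, max_iter=10_000):
    """Batch gradient descent that reports which stopping rule fired."""
    n = len(ys)
    w0, w1 = 0.0, 0.0
    prev_loss = None
    for t in range(1, max_iter + 1):
        errors = [yi - w0 - w1 * xi for xi, yi in zip(xs, ys)]
        loss = sum(e * e for e in errors) / n
        # Criterion 1: loss change between consecutive iterations.
        if prev_loss is not None and abs(loss - prev_loss) < loss_tol:
            return w0, w1, t, "loss change"
        prev_loss = loss
        step0 = alpha * (2 / n) * sum(errors)
        step1 = alpha * (2 / n) * sum(xi * ei for xi, ei in zip(xs, errors))
        w0, w1 = w0 + step0, w1 + step1
        # Criterion 2: Euclidean norm of the weight update.
        if (step0 ** 2 + step1 ** 2) ** 0.5 < weight_tol:
            return w0, w1, t, "weight change"
    # Criterion 3: iteration budget exhausted.
    return w0, w1, max_iter, "max iterations"


w0, w1, iterations, reason = gd_with_stopping(xs, ys)
print(f"stopped after {iterations} iterations ({reason}): w0={w0:.4f}, w1={w1:.4f}")
```

Which rule fires first depends on the tolerances: a loose loss threshold stops the loop while the weights are still moving slightly, so the two thresholds are worth choosing together.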

Gradient Descent Cheat Sheet

Step	Action
1	Initialize $w_{0} = 0$ , $w_{1} = 0$
2	Compute predictions $\hat{y} = w_{0} + w_{1} X$
3	Compute residuals $ε = y - \hat{y}$
4	Compute gradients: $\partial J / \partial w_{0}$ and $\partial J / \partial w_{1}$
5	Update: $w \leftarrow w - α \nabla J$
6	Repeat until convergence

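Related Concepts and Honest Limitations

For linear regression, gradient descent and OLS arrive at the same weights, provided the learning rate is stable and the loop runs long enough: the MSE surface is convex, so both find the same global minimum (unique when the features are not collinear). The difference is practical: OLS costs $O(n p^{2} + p^{3})$, which becomes impractical when $p$ is large or when the data no longer fits in memory. Gradient descent with mini-batches scales to millions of samples and thousands of features.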

The honest limitation of gradient descent is its sensitivity to hyperparameters: learning rate $α$ , the number of iterations, and the batch size all require tuning. The learning rate in particular needs to be scaled to the data — which is why feature scaling isn't a preprocessing convenience but a prerequisite for gradient descent to work at reasonable $α$ values.

Test Your Understanding

For the anchor data, compute the gradient $\partial J / \partial w_{1}$ at $w_{0} = 50$ , $w_{1} = 0.15$ manually. Confirm the sign: should gradient descent increase or decrease $w_{1}$ from here?
After one mini-batch gradient descent step with batch size 3 (samples 1, 3, 5 of the anchor), how does the update differ from a full-batch step? Which samples were excluded and what bias does that introduce?
If you scale only $X$ but not $y$ , will gradient descent still converge? Will the learned $w_{1}$ correspond to the correct unscaled coefficient after inverse-transforming?
SGD is described as "noisier" than batch gradient descent. Under what conditions is that noise actually beneficial?
The learning-rate plot labels $α = 0.5$ as oscillating. Derive the exact maximum stable learning rate for gradient descent on MSE using the Hessian eigenvalue bound $α < 2 / λ_{\max}$, and check which side of it $α = 0.5$ falls on for the scaled anchor.
