~/blog
Stochastic Gradient Descent (SGD)
Batch gradient descent computes the exact gradient — but it requires processing all n training samples before taking a single weight update. With 1 million samples, that means 1 million forward and backward passes to move the weights once. SGD makes a completely different trade: use one sample, compute an approximation of the gradient, update immediately. Then move to the next sample.
The gradient estimate from one sample is noisy. That noise turns out to be useful, not just a problem to tolerate.
Same anchor: salary from experience, normalized.
X_n = [-1.414, -0.707, 0.000, 0.707, 1.414] # normalized years experience
y_n = [-1.414, -0.707, 0.000, 0.707, 1.414] # normalized salaryStart: w=0, b=0, η=0.1.
What Changes from Batch GD
| Batch GD | SGD | |
|---|---|---|
| Samples per update | All n | 1 |
| Updates per epoch | 1 | n |
| Gradient | Exact: ∂J/∂w | Approximate: ∂L⁽ⁱ⁾/∂w |
| Computation per step | Expensive | Cheap |
For 5 samples, batch GD takes 1 update per epoch; SGD takes 5. For 1 million samples, the ratio is 1:1,000,000 per epoch.
The SGD update uses the single-sample loss gradient, not the full cost gradient:
∂L⁽ⁱ⁾/∂w = 2(ŷᵢ − yᵢ) · xᵢ (no 1/n averaging — just one sample)
w ← w − η · ∂L⁽ⁱ⁾/∂w
Epoch 1 — All 5 Update Steps
Starting with w=0, b=0. Each sample updates the weights before the next sample is processed.
Step 1 — Sample 1 (x₁=−1.414, y₁=−1.414): ŷ₁ = 0×(−1.414) + 0 = 0.000 L₁ = (0 − (−1.414))² = 2.000 ∂L/∂w = 2(0 − (−1.414))(−1.414) = 2 × 1.414 × (−1.414) = −4.000 ∂L/∂b = 2(0 − (−1.414)) = 2.828 w = 0 − 0.1 × (−4.000) = 0.400, b = 0 − 0.1 × 2.828 = −0.283
Step 2 — Sample 2 (x₂=−0.707, y₂=−0.707): ŷ₂ = 0.400×(−0.707) + (−0.283) = −0.283 − 0.283 = −0.566 L₂ = (−0.566 − (−0.707))² = (0.141)² = 0.020 ∂L/∂w = 2 × (−0.566 − (−0.707)) × (−0.707) = 2 × 0.141 × (−0.707) = −0.199 ∂L/∂b = 2 × 0.141 = 0.282 w = 0.400 − 0.1 × (−0.199) = 0.420, b = −0.283 − 0.1 × 0.282 = −0.311
Step 3 — Sample 3 (x₃=0.000, y₃=0.000): ŷ₃ = 0.420×0 + (−0.311) = −0.311 L₃ = (−0.311 − 0)² = 0.097 ∂L/∂w = 2 × (−0.311) × 0 = 0.000 ∂L/∂b = 2 × (−0.311) = −0.622 w = 0.420 (unchanged), b = −0.311 − 0.1 × (−0.622) = −0.249
Step 4 — Sample 4 (x₄=0.707, y₄=0.707): ŷ₄ = 0.420×0.707 + (−0.249) = 0.297 − 0.249 = 0.048 L₄ = (0.048 − 0.707)² = 0.434 ∂L/∂w = 2 × (0.048 − 0.707) × 0.707 = 2 × (−0.659) × 0.707 = −0.932 ∂L/∂b = 2 × (−0.659) = −1.318 w = 0.420 − 0.1 × (−0.932) = 0.513, b = −0.249 − 0.1 × (−1.318) = −0.117
Step 5 — Sample 5 (x₅=1.414, y₅=1.414): ŷ₅ = 0.513×1.414 + (−0.117) = 0.725 − 0.117 = 0.608 L₅ = (0.608 − 1.414)² = 0.650 ∂L/∂w = 2 × (0.608 − 1.414) × 1.414 = 2 × (−0.806) × 1.414 = −2.281 ∂L/∂b = 2 × (−0.806) = −1.612 w = 0.513 − 0.1 × (−2.281) = 0.741, b = −0.117 − 0.1 × (−1.612) = 0.044
After one epoch (5 updates): w = 0.741, b ≈ 0. Compare to batch GD after one epoch: w = 0.200. SGD is much further along the loss surface after the same number of samples because it takes 5 update steps instead of 1.
Trace Table
| Step | Sample | x_norm | y_norm | ŷ | L | w_new | b_new |
|---|---|---|---|---|---|---|---|
| 1 | 1 | −1.414 | −1.414 | 0.000 | 2.000 | 0.400 | −0.283 |
| 2 | 2 | −0.707 | −0.707 | −0.566 | 0.020 | 0.420 | −0.311 |
| 3 | 3 | 0.000 | 0.000 | −0.311 | 0.097 | 0.420 | −0.249 |
| 4 | 4 | 0.707 | 0.707 | 0.048 | 0.434 | 0.513 | −0.117 |
| 5 | 5 | 1.414 | 1.414 | 0.608 | 0.650 | 0.741 | 0.044 |
Why SGD is Noisy (and Why That's Useful)
The gradient from one sample is an estimator of the true gradient ∂J/∂w. It is unbiased (correct in expectation) but has high variance. This means the loss curve bounces rather than descending smoothly.
The noise is not purely bad. It helps for two reasons:
1. Escapes saddle points. In deep neural networks, the main obstacle is not local minima but saddle points — points where some directions are uphill and others are downhill. The exact gradient at a saddle point is zero, so batch GD stops. A noisy gradient from one sample has a random component that kicks the optimizer off the saddle.
2. Finds wider minima. Wide minima (shallow, broad valleys) generalize better than sharp minima (narrow, deep peaks). Noisy gradient updates tend to overshoot sharp minima and settle into broader ones. Batch GD finds the nearest minimum — which may be sharp.
Online Learning Property
SGD can process samples as they arrive — no need to store or see all data before updating. This enables:
- Fraud detection: each new transaction triggers a gradient update, keeping the model current
- Recommendation systems: user interactions immediately adjust the embedding weights
- Very large datasets: the model trains on a stream, not a stored collection
Learning Rate Sensitivity
SGD is more sensitive to η than batch GD because each gradient estimate has higher variance. A large η combined with a noisy gradient is more likely to overshoot.
- η = 0.01: stable but slow — the variance of single-sample gradients is under control
- η = 0.1: good convergence for this dataset
- η = 0.5: often diverges within 2–3 epochs, especially when a single sample has an extreme gradient
Code
import numpy as np
X_n = np.array([-1.414, -0.707, 0.000, 0.707, 1.414])
y_n = np.array([-1.414, -0.707, 0.000, 0.707, 1.414])
w, b, lr = 0.0, 0.0, 0.1
print(f"{'Step':>4} | {'x':>6} | {'y':>6} | {'ŷ':>6} | {'L':>6} | {'w':>6} | {'b':>7}")
print("-" * 60)
for i, (xi, yi) in enumerate(zip(X_n, y_n)):
yhat = w * xi + b
loss = (yhat - yi) ** 2
dw = 2 * (yhat - yi) * xi
db = 2 * (yhat - yi)
w -= lr * dw
b -= lr * db
print(f"{i+1:>4} | {xi:>6.3f} | {yi:>6.3f} | {yhat:>6.3f} | {loss:>6.4f} | {w:>6.4f} | {b:>7.4f}")Step | x | y | ŷ | L | w | b
------------------------------------------------------------
1 | -1.414 | -1.414 | 0.000 | 2.0000 | 0.4000 | -0.2828
2 | -0.707 | -0.707 | -0.566 | 0.0200 | 0.4200 | -0.3110
3 | 0.000 | 0.000 | -0.311 | 0.0967 | 0.4200 | -0.2486
4 | 0.707 | 0.707 | 0.048 | 0.4330 | 0.5132 | -0.1168
5 | 1.414 | 1.414 | 0.608 | 0.6497 | 0.7414 | 0.0444Related Concepts
Where this builds from: Batch gradient descent (01) defined the update rule using the full cost gradient ∂J/∂w. SGD replaces that with the single-sample gradient ∂L⁽ⁱ⁾/∂w — same formula, different denominator.
Where this leads: Mini-batch SGD (next post) is the practical middle ground between batch GD and SGD. Momentum (04) smooths SGD's noisy gradient by accumulating a velocity over past gradients — it directly addresses the oscillation visible in the loss curve above.
Honest Limitations
SGD requires a decreasing learning rate schedule to converge exactly. The theoretical guarantee for SGD to converge to a stationary point requires η to decrease over time (Σηₜ = ∞ and Σηₜ² < ∞). In practice, constant η is used with early stopping — the model is stopped before it can diverge — but this means SGD doesn't strictly converge to the minimum.
The order of samples matters. If samples are sorted by label value (all class 0 first, then all class 1), the gradients in early steps are biased toward class 0 and the model will oscillate. Shuffling before each epoch removes this bias. This is why the PyTorch DataLoader shuffles by default.
Online learning prevents batched GPU computation. Processing one sample at a time does not utilize the parallelism of modern GPUs, which are optimized for matrix operations over batches. Pure SGD is slower than mini-batch SGD on GPU hardware even if it takes fewer epochs.
Test Your Understanding
-
In Step 1, the gradient ∂L/∂w = −4.000 for a single sample is twice as large as the full-batch gradient ∂J/∂w = −2.000. Explain why the single-sample gradient is larger in this case. Would this always be true? Give a counterexample where a single-sample gradient is smaller than the batch gradient.
-
After Step 3, w=0.420 is unchanged from Step 2. This is because sample 3 has x₃=0. Mathematically, show why a sample at x=0 can never update w (regardless of ŷ and y values). What does this imply about the informativeness of zero-feature samples for weight learning?
-
SGD processes 5 samples per epoch, taking 5 weight updates. Batch GD processes 5 samples per epoch, taking 1 weight update. After 10 epochs, SGD has taken 50 weight updates; batch GD has taken 10. If each update costs ε time regardless of batch size, which algorithm reaches w=0.9 first, and why?
-
The "escaping saddle points" argument for SGD noise assumes the noise is zero-mean. Verify this for our dataset: what is the mean of the 5 single-sample gradients (∂L⁽ⁱ⁾/∂w) computed at w=0, b=0? Is the mean equal to the batch gradient ∂J/∂w = −2.000?
-
A fraud detection system uses SGD to update a model on each incoming transaction. The model starts with good weights from a batch training run. After 10,000 transactions (all legitimate), the model has never seen a fraud example. What happens to the fraud-detection weights, and why? How would you redesign the update rule to prevent this?