Cost Function in Linear Regression
The cost function is not just a number to minimize — it's the landscape the optimizer navigates. The shape of that landscape determines whether gradient descent converges, how fast it converges, and whether it can get stuck. For linear regression, that shape is a bowl, and understanding why it's a bowl is what gives you confidence that optimization will always succeed.
From Loss to Cost
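A loss function measures error for a single sample: $L(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$.
A cost function averages that loss over all $n$ training samples:
$$J(w_0, w_1) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad \hat{y}_i = w_0 + w_1 x_i$$
This is the Mean Squared Error (MSE). Some textbooks use $\frac{1}{2n}$ to cancel the factor of 2 from the derivative; this changes the scale but not the minimizer.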
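To make the loss/cost distinction concrete, here is a minimal sketch that computes per-sample squared losses, averages them into a cost, and checks that the 1/(2n) convention picks the same slope as 1/n. The function names `squared_loss` and `cost`, the `scale` parameter, and the slope grid are illustrative choices, not part of any library.

```python
import numpy as np

# Anchor dataset used throughout this post.
X = np.array([650, 850, 1100, 1400, 1600, 1900])
y = np.array([180, 220, 280, 340, 370, 430])

def squared_loss(y_true, y_pred):
    """Per-sample loss: one squared error per training example."""
    return (y_true - y_pred) ** 2

def cost(w0, w1, scale=1.0):
    """Cost: the per-sample losses averaged over the dataset.

    scale=1.0 is the 1/n convention (MSE); scale=0.5 is the 1/(2n) convention.
    """
    return scale * np.mean(squared_loss(y, w0 + w1 * X))

# Sweep the slope with the intercept held at 53.33 and locate the best
# slope under each convention.
w1_grid = np.linspace(0.05, 0.35, 301)
mse = np.array([cost(53.33, w1) for w1 in w1_grid])
half_mse = np.array([cost(53.33, w1, scale=0.5) for w1 in w1_grid])

best_mse = w1_grid[np.argmin(mse)]
best_half = w1_grid[np.argmin(half_mse)]
print(best_mse, best_half)  # both conventions select the same slope
```

Halving the cost rescales every point on the curve by the same factor, so the location of the minimum cannot move; only the gradient magnitudes change.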
Why MSE and not MAE? MSE is differentiable everywhere. MAE has a non-differentiable kink at zero. Differentiability is what allows gradient descent and the OLS closed form to work.
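Anchor dataset: $X = [650, 850, 1100, 1400, 1600, 1900]$, $y = [180, 220, 280, 340, 370, 430]$. True OLS solution: $w_0 = 53.33$, $w_1 = 0.20$.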
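The stated OLS solution can be checked directly. The sketch below uses the textbook closed form for one-feature regression and cross-checks it against `np.linalg.lstsq`; variable names such as `x_centered` and `w_lstsq` are just illustrative.

```python
import numpy as np

X = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)

# Closed-form simple linear regression:
#   w1 = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)**2)
#   w0 = mean_y - w1 * mean_x
x_centered = X - X.mean()
w1 = np.sum(x_centered * (y - y.mean())) / np.sum(x_centered ** 2)
w0 = y.mean() - w1 * X.mean()

# Cross-check with NumPy's least-squares solver on the design matrix [1, x].
A = np.column_stack([np.ones_like(X), X])
w_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

print(round(w0, 2), round(w1, 2))
```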
MSE for Three Candidate Models
Hold $w_0 = 53.33$ fixed and vary $w_1$ to see how MSE changes:
Slope below the optimum (under-estimating the slope):
SSE ≈ 120,000. MSE ≈ 20,000.
$w_1 = 0.20$ (optimal slope):
Predictions: 183.3, 223.3, 273.3, 333.3, 373.3, 433.3. Residuals: −3.3, −3.3, 6.7, 6.7, −3.3, −3.3. SSE = 133.3. MSE = 22.2.
Slope above the optimum (over-estimating the slope):
SSE ≈ 120,000. MSE ≈ 20,000.
The minimum at $w_1 = 0.20$ gives MSE = 22.2. Deviating in either direction increases MSE rapidly: the parabola is steep.
[Figure: MSE as a function of the slope w₁ (x-axis: w₁ from 0.05 to 0.25; y-axis: MSE from 0 to 20k). The curve is bowl-shaped, with its minimum at w₁ = 0.20, MSE = 22.2, which is where gradient descent converges.]
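Why the curve is an exact parabola can be checked numerically. With the intercept fixed at its OLS value, the residuals at the optimum are uncorrelated with $x$, so the cross term vanishes and MSE grows as the squared slope error times the mean of $x^2$. The sketch below assumes `w0 = 160 / 3` (the unrounded form of 53.33, which follows from the anchor data); `mse_at_slope` and the offsets are illustrative.

```python
import numpy as np

X = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)
w0 = 160 / 3   # unrounded intercept (53.33...), an assumption derived from the data
w1_opt = 0.20

def mse_at_slope(w1):
    """MSE with the intercept held at w0 and only the slope varying."""
    return np.mean((y - (w0 + w1 * X)) ** 2)

min_mse = mse_at_slope(w1_opt)
curvature = np.mean(X ** 2)  # coefficient of (w1 - w1_opt)**2 in the parabola

# Compare the direct computation with the parabola formula
#   MSE(w1) = MSE(w1_opt) + (w1 - w1_opt)**2 * mean(x**2)
for offset in (0.01, 0.05, 0.10):
    below = mse_at_slope(w1_opt - offset)
    above = mse_at_slope(w1_opt + offset)
    predicted = min_mse + offset ** 2 * curvature
    print(f"offset {offset:.2f}: below={below:.1f} above={above:.1f} formula={predicted:.1f}")
```

Because the feature values are in the hundreds and thousands, the mean of $x^2$ is large, which is why even a small slope error produces a large MSE.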
The Full Loss Surface — 3D View (Both $w_0$ and $w_1$)
When you free both parameters, the cost function is a paraboloid: a bowl that curves upward in every direction from a single minimum at $(w_0, w_1) = (53.33, 0.20)$.
[Figure: 3D view of the MSE surface over w₀ and w₁ (vertical axis: MSE), drawn as contour rings around the minimum at w₀ = 53.33, w₁ = 0.20.]
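This shape is guaranteed because MSE is a sum of squares, which makes it quadratic in the weights $\mathbf{w} = (w_0, w_1)$. A convex quadratic has no local minima other than its global minimum, and for linear regression with non-redundant features that minimum is unique. That's what makes gradient descent safe here: no local minima to get trapped in.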
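A brute-force way to see the single minimum is to evaluate MSE on a grid of $(w_0, w_1)$ pairs and find the lowest cell. This is only a sketch: the grid ranges and resolutions below are arbitrary choices, picked so that the OLS solution lies on a grid point.

```python
import numpy as np

X = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)

# Candidate intercepts and slopes (ranges are arbitrary illustrative choices).
w0_grid = np.linspace(0.0, 100.0, 301)
w1_grid = np.linspace(0.10, 0.30, 201)
W0, W1 = np.meshgrid(w0_grid, w1_grid, indexing="ij")

# Broadcasting gives shape (301, 201, 6): one prediction per grid cell per sample.
pred = W0[..., None] + W1[..., None] * X
mse_surface = np.mean((y - pred) ** 2, axis=-1)

# Locate the lowest cell on the surface.
i, j = np.unravel_index(np.argmin(mse_surface), mse_surface.shape)
print(w0_grid[i], w1_grid[j], mse_surface[i, j])
```

Passing `mse_surface` to a contour or surface plot would reproduce the bowl; the numeric argmin is enough to confirm where its bottom sits.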
Why MSE Is Convex — The Key Property
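A function $f$ is convex if $f(\lambda a + (1 - \lambda) b) \le \lambda f(a) + (1 - \lambda) f(b)$ for all points $a, b$ and all $\lambda \in [0, 1]$. For a twice-differentiable function, this is equivalent to the Hessian being positive semidefinite everywhere.
MSE is quadratic in $\mathbf{w}$, so its Hessian is the constant matrix $\frac{2}{n} X^\top X$. If $X$ has full column rank (no redundant features), $X^\top X$ is positive definite, meaning the surface curves upward in every direction, and MSE is strictly convex with a unique global minimum.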
Practical implication: gradient descent on MSE with linear regression converges to the optimal weights from any starting point, provided the step size is small enough (below $2/\lambda_{\max}$, where $\lambda_{\max}$ is the largest eigenvalue of the Hessian).
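The role of the step size can be seen in a small experiment. The sketch below builds the Hessian for the anchor data, checks that its eigenvalues are positive, and runs plain gradient descent from zero with one step size below $2/\lambda_{\max}$ and one above it. The helper `run_gd`, the starting point, and the number of steps are illustrative assumptions.

```python
import numpy as np

X = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)
n = len(X)

A = np.column_stack([np.ones(n), X])  # design matrix with a bias column
H = (2.0 / n) * A.T @ A               # Hessian of MSE, constant in w
eigvals = np.linalg.eigvalsh(H)       # ascending; all > 0 means positive definite
lam_max = eigvals[-1]

def mse(w):
    return np.mean((y - A @ w) ** 2)

def run_gd(step, n_steps=20):
    """Plain gradient descent on MSE from w = 0; returns the cost after each step."""
    w = np.zeros(2)
    costs = [mse(w)]
    for _ in range(n_steps):
        grad = (2.0 / n) * A.T @ (A @ w - y)
        w = w - step * grad
        costs.append(mse(w))
    return np.array(costs)

safe = run_gd(step=1.0 / lam_max)    # below the 2 / lam_max threshold
unsafe = run_gd(step=2.5 / lam_max)  # above the threshold

print("eigenvalues:", eigvals)
print("safe step:  ", safe[0], "->", safe[-1])      # cost falls
print("unsafe step:", unsafe[0], "->", unsafe[-1])  # cost grows
```

On the raw, unscaled feature the safe run makes most of its progress in the first step and then improves slowly, because the two eigenvalues differ by several orders of magnitude. Standardizing the feature is the usual remedy, though it is outside the scope of this sketch.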
Cost Functions for Other Scenarios
| Name | Formula | When to Use |
|---|---|---|
| MSE (L2 Loss) | $\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2$ | Standard regression; penalizes outliers heavily |
| MAE (L1 Loss) | $\frac{1}{n}\sum_i \lvert y_i - \hat{y}_i \rvert$ | Regression with outliers; robust but not differentiable at zero |
| Huber Loss | Quadratic for $\lvert r \rvert \le \delta$, linear beyond | Best of both: robust + differentiable |
| RMSE | $\sqrt{\text{MSE}}$ | Same minimum as MSE; interpretable in original units |
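Computing MAE and RMSE at the optimal weights:
Residuals: $-3.33, -3.33, 6.67, 6.67, -3.33, -3.33$
$$\text{MAE} = \frac{4(3.33) + 2(6.67)}{6} \approx 4.44, \qquad \text{RMSE} = \sqrt{22.2} \approx 4.71$$
RMSE and MAE are close here (4.71 vs 4.44, ratio ≈ 1.06) because the residuals are nearly uniform in size, with only two distinct magnitudes: 3.33 and 6.67. When a model has a few large outlier-driven errors, this ratio grows significantly.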
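The comparison can be reproduced in a few lines. The sketch below defines the four cost functions as functions of the residual vector and then inflates one residual to mimic an outlier. The Huber threshold `delta=5.0` and the outlier value 60 are hypothetical choices for illustration only.

```python
import numpy as np

def mse(r):
    return np.mean(r ** 2)

def mae(r):
    return np.mean(np.abs(r))

def rmse(r):
    return np.sqrt(mse(r))

def huber(r, delta=5.0):
    """Huber loss: 0.5 * r**2 for |r| <= delta, delta * (|r| - 0.5 * delta) beyond.

    delta=5.0 is an arbitrary illustrative threshold.
    """
    abs_r = np.abs(r)
    return np.mean(np.where(abs_r <= delta, 0.5 * r ** 2, delta * (abs_r - 0.5 * delta)))

X = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)
residuals = y - (160 / 3 + 0.20 * X)  # residuals at the OLS weights (unrounded w0)

ratio_uniform = rmse(residuals) / mae(residuals)

# Hypothetical variant: replace one residual with a large error.
with_outlier = residuals.copy()
with_outlier[-1] = 60.0
ratio_outlier = rmse(with_outlier) / mae(with_outlier)

print(f"RMSE={rmse(residuals):.2f} MAE={mae(residuals):.2f} ratio={ratio_uniform:.2f}")
print(f"with one outlier: ratio={ratio_outlier:.2f}, Huber={huber(with_outlier):.2f}")
```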
Plotting the Loss Curve
```python
import numpy as np
import matplotlib.pyplot as plt

# Anchor dataset
X = np.array([650, 850, 1100, 1400, 1600, 1900])
y = np.array([180, 220, 280, 340, 370, 430])

w0 = 53.33  # intercept held fixed at its OLS value
w1_values = np.linspace(0.05, 0.35, 100)  # candidate slopes to sweep

# MSE at each candidate slope
mse_values = [np.mean((y - (w0 + w1 * X)) ** 2) for w1 in w1_values]

plt.plot(w1_values, mse_values)
plt.axvline(x=0.20, color='red', linestyle='--', label='Optimal w₁=0.20')
plt.xlabel('w₁ (slope)')
plt.ylabel('MSE')
plt.title('Loss Landscape — MSE vs w₁')
plt.legend()
plt.show()
# Output: parabola with its minimum at w₁=0.20, MSE≈22.2
# Steep rise in both directions confirms unique global minimum
```
MSE Trace at the Optimal Weights
| $x$ | $y$ | $\hat{y}$ | Residual $y - \hat{y}$ | Squared residual |
|---|---|---|---|---|
| 650 | 180 | 183.3 | −3.3 | 11.1 |
| 850 | 220 | 223.3 | −3.3 | 11.1 |
| 1100 | 280 | 273.3 | 6.7 | 44.4 |
| 1400 | 340 | 333.3 | 6.7 | 44.4 |
| 1600 | 370 | 373.3 | −3.3 | 11.1 |
| 1900 | 430 | 433.3 | −3.3 | 11.1 |
| SSE | | | | 133.3 |
| MSE | | | | 22.2 |
Related Concepts and Honest Limitations
The bowl shape is specific to MSE on a linear model. Pass the linear output through a sigmoid and keep MSE as the loss, and the cost is no longer convex in the weights; this is one reason logistic regression is trained with cross-entropy, which is convex, rather than MSE. Stack nonlinear layers such as a ReLU network and the loss landscape gains saddle points and potentially many local minima. Gradient descent is then no longer guaranteed to find the global optimum; it typically finds a good local one, which is usually sufficient in practice but theoretically weaker.
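The loss of convexity can be demonstrated with the chord test on a single toy sample. The sketch below compares squared error on a linear prediction with squared error on a sigmoid of the same prediction. The sample `x = 1, y = 1` and the test points are hypothetical, chosen only to expose the violation, and the example concerns MSE applied to a sigmoid output specifically.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def linear_mse(w, x=1.0, y=1.0):
    """Squared error of the linear prediction w * x on one sample."""
    return (y - w * x) ** 2

def sigmoid_mse(w, x=1.0, y=1.0):
    """Squared error after passing w * x through a sigmoid."""
    return (y - sigmoid(w * x)) ** 2

def violates_convexity(f, a, b, lam=0.5):
    """True if f at the blend of a and b lies above the chord joining f(a) and f(b)."""
    blend = lam * a + (1 - lam) * b
    return bool(f(blend) > lam * f(a) + (1 - lam) * f(b) + 1e-12)

a, b = -10.0, 0.0
print(violates_convexity(linear_mse, a, b))   # False: the parabola stays below its chord
print(violates_convexity(sigmoid_mse, a, b))  # True: the flat plateau bulges above the chord
```

A single violated chord is enough to rule out convexity; it does not by itself say how many local minima a full model will have.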
MAE's non-differentiability at zero isn't catastrophic: subgradient methods and linear programming can minimize it. But they're slower and more complex than the OLS closed form or standard gradient descent for MSE. Huber loss is the practical compromise when you want robustness to outliers without giving up differentiability.
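The difference between the two losses shows up in their derivatives near zero. In this sketch, `abs_subgradient` returns one valid subgradient of the absolute error and `huber_gradient` returns the derivative of the Huber loss; both function names and `delta=1.0` are illustrative assumptions.

```python
import numpy as np

def abs_subgradient(r):
    """A subgradient of |r|: sign(r), with 0 chosen at the kink r = 0."""
    return np.sign(r)

def huber_gradient(r, delta=1.0):
    """Derivative of the Huber loss: r inside [-delta, delta], clipped to +/-delta outside."""
    return np.clip(r, -delta, delta)

# Residuals on either side of zero, plus two large ones.
r = np.array([-2.0, -1e-6, 1e-6, 2.0])
g_abs = abs_subgradient(r)
g_huber = huber_gradient(r)

print(g_abs)    # jumps from -1 to +1 as r crosses zero
print(g_huber)  # passes through zero continuously, saturates at +/-delta
```

The jump in the first output is why a fixed-step gradient method on MAE tends to oscillate around the minimum, while the Huber gradient shrinks smoothly as the residual shrinks and stays bounded for large residuals.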
Test Your Understanding
- For the under-estimated slope from the three-candidate comparison (holding $w_0 = 53.33$), compute the MSE manually by listing each residual and squaring it. Verify it's approximately 20,000.
- The Hessian of MSE is $\frac{2}{n} X^\top X$. For the 1-feature anchor, compute $X^\top X$ (a 2×2 matrix with the bias column included) and verify it's positive definite.
- Why does MSE penalize outliers more than MAE? Sketch two error scenarios (one with uniform small errors, one with a single large error) and compare MSE vs MAE for each.
- If you use $\frac{1}{2n}$ instead of $\frac{1}{n}$, do the optimal $w_0$ and $w_1$ change? Does the gradient change?
- Huber loss transitions from quadratic to linear at $\lvert r \rvert = \delta$. What would the loss curve look like as a function of $w_1$? Would it still have a unique minimum?