Cost Function in Linear Regression

Jun 25, 2026•6 min read•By Mohammed Vasim

Machine Learning, AI, Data Science

The cost function is not just a number to minimize; it is the landscape the optimizer navigates. The shape of that landscape determines whether gradient descent converges, how fast it converges, and whether it can get stuck. For linear regression with squared error, that shape is a bowl, and understanding why it is a bowl is what tells you the optimizer cannot get trapped away from the best fit.

From Loss to Cost

A loss function measures error for a single sample: $L(\hat{y}_{i}, y_{i}) = (y_{i} - \hat{y}_{i})^{2}$.

A cost function averages that loss over all $n$ training samples:

$J (w_{0}, w_{1}) = \frac{1}{n} \sum_{i = 1}^{n} (y_{i} - w_{0} - w_{1} x_{i})^{2}$

This is the Mean Squared Error (MSE). Some textbooks use $\frac{1}{2 n}$ to cancel the factor of 2 from the derivative — this changes the scale but not the minimizer.
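To make the formula concrete, here is a minimal sketch of $J(w_{0}, w_{1})$ and its $\frac{1}{2n}$ variant. The function names and the toy data (points lying exactly on $y = 1 + 2x$) are illustrative assumptions, not part of the article's dataset. Scanning candidate slopes shows that both versions pick the same minimizer.

```python
import numpy as np

def mse_cost(w0, w1, x, y):
    """Mean squared error J(w0, w1) for the line y_hat = w0 + w1 * x."""
    residuals = y - (w0 + w1 * x)
    return np.mean(residuals ** 2)

def half_mse_cost(w0, w1, x, y):
    """The 1/(2n) variant: exactly half of the MSE at every point."""
    return 0.5 * mse_cost(w0, w1, x, y)

# Illustrative toy data (an assumption for this sketch): points on y = 1 + 2x.
x_toy = np.array([0.0, 1.0, 2.0, 3.0])
y_toy = 1.0 + 2.0 * x_toy

# Scan candidate slopes with the intercept held at its true value, 1.0.
slopes = np.linspace(0.0, 4.0, 41)
j_full = np.array([mse_cost(1.0, s, x_toy, y_toy) for s in slopes])
j_half = np.array([half_mse_cost(1.0, s, x_toy, y_toy) for s in slopes])

best_full = slopes[np.argmin(j_full)]
best_half = slopes[np.argmin(j_half)]
print(best_full, best_half)  # both scans select the same slope
```

Halving the cost rescales every value on the curve by the same factor, so the location of the lowest point cannot move.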

Why MSE and not MAE? MSE is differentiable everywhere. MAE has a non-differentiable kink at zero. Differentiability is what allows gradient descent and the OLS closed form to work.

Anchor dataset: $X = [650, 850, 1100, 1400, 1600, 1900]$ , $y = [180, 220, 280, 340, 370, 430]$ . True OLS solution: $w_{0} = 53.33$ , $w_{1} = 0.20$ .
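The quoted OLS solution can be checked with the textbook closed form for one feature (slope = covariance over variance, intercept from the means). This is a quick verification sketch, not necessarily how the values were originally obtained.

```python
import numpy as np

X = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)

# Closed form for simple linear regression:
#   w1 = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
#   w0 = y_mean - w1 * x_mean
x_mean, y_mean = X.mean(), y.mean()
w1 = np.sum((X - x_mean) * (y - y_mean)) / np.sum((X - x_mean) ** 2)
w0 = y_mean - w1 * x_mean

print(f"w0 = {w0:.2f}, w1 = {w1:.2f}")  # prints: w0 = 53.33, w1 = 0.20
```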

MSE for Three Candidate Models

Hold $w_{0} = 53.33$ fixed and vary $w_{1}$ to see how MSE changes:

$w_{1} = 0.10$ (under-estimating the slope):

Predictions: $[118.3, 138.3, 163.3, 193.3, 213.3, 243.3]$. Residuals: $[61.7, 81.7, 116.7, 146.7, 156.7, 186.7]$. SSE ≈ 105,000. MSE ≈ 17,500.

$w_{1} = 0.20$ (optimal slope):

Predictions: $[183.3, 223.3, 273.3, 333.3, 373.3, 433.3]$. Residuals: $[-3.3, -3.3, 6.7, 6.7, -3.3, -3.3]$. SSE = 133.3. MSE = 22.2.

$w_{1} = 0.30$ (over-estimating the slope):

Predictions: $[248.3, 308.3, 383.3, 473.3, 533.3, 623.3]$. Residuals: $[-68.3, -88.3, -103.3, -133.3, -163.3, -193.3]$. SSE ≈ 105,000. MSE ≈ 17,500.

The minimum at $w_{1} = 0.20$ gives MSE = 22.2. Deviating in either direction increases MSE rapidly — the parabola is steep.
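The three candidates above can be recomputed in a few lines. This sketch uses the rounded intercept $w_{0} = 53.33$ as the text does; the `results` dictionary is just a convenience for this example.

```python
import numpy as np

X = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)
w0 = 53.33  # intercept held fixed, as in the text

results = {}
for w1 in (0.10, 0.20, 0.30):
    residuals = y - (w0 + w1 * X)      # y_i minus predicted value
    sse = np.sum(residuals ** 2)       # sum of squared errors
    mse = sse / len(X)                 # mean squared error
    results[w1] = mse
    print(f"w1 = {w1:.2f}: SSE = {sse:,.1f}, MSE = {mse:,.1f}")
```

Because the cost is a parabola in $w_{1}$ centered on the optimum, moving the slope by the same amount in either direction raises the MSE by almost exactly the same amount.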

[Figure: MSE versus w₁ (slope). The curve is bowl-shaped, with its minimum marked at w₁ = 0.20, MSE = 22.2, the point gradient descent converges to.]

The Full Loss Surface — 3D View (Both $w_{0}$ and $w_{1}$ )

When you free both parameters, the cost function is a paraboloid — a bowl that curves upward in every direction from a single minimum at $(w_{0} = 53.33, w_{1} = 0.20)$ .

[Figure: the MSE surface over w₀ and w₁, drawn as nested contour rings around the minimum at w₀ = 53.33, w₁ = 0.20.]

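This shape is guaranteed because MSE is a sum of squared terms that are each linear in $w$, which makes it a convex quadratic in $w$. A convex quadratic has no local minima other than the global one, and when the features are not redundant that minimum is unique. That's what makes gradient descent safe here: there is nowhere else to get trapped.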
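One way to see the bowl numerically is to evaluate the cost on a grid of $(w_{0}, w_{1})$ pairs and locate the lowest cell. The grid ranges and resolution below are arbitrary choices for illustration; the grid minimum should land next to the OLS solution.

```python
import numpy as np

X = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)

# Grid of candidate weights (ranges chosen only for this sketch).
w0_grid = np.linspace(0.0, 100.0, 201)   # step 0.5
w1_grid = np.linspace(0.10, 0.30, 201)   # step 0.001
W0, W1 = np.meshgrid(w0_grid, w1_grid, indexing="ij")

# Broadcast to shape (201, 201, 6): one residual per grid cell and sample.
residuals = y - (W0[..., None] + W1[..., None] * X)
J = np.mean(residuals ** 2, axis=-1)

# Index of the lowest cost on the grid.
i, j = np.unravel_index(np.argmin(J), J.shape)
print(f"grid minimum near w0 = {w0_grid[i]:.1f}, w1 = {w1_grid[j]:.3f}, MSE = {J[i, j]:.2f}")
```

Passing `W0`, `W1`, and `J` to a contour or surface plot would reproduce the nested rings around the minimum.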

Why MSE Is Convex — The Key Property

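A function $f$ is convex if $f(λa + (1 - λ)b) \leq λ f(a) + (1 - λ) f(b)$ for all points $a, b$ in its domain and all $λ \in [0, 1]$. Geometrically, the chord between any two points on the graph never dips below the graph.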
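The inequality can be spot-checked on the anchor dataset by sampling random weight pairs and random $λ$. This is a numerical sanity check under arbitrary sampling ranges, not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)

def J(w):
    """MSE for weights w = (w0, w1)."""
    return np.mean((y - (w[0] + w[1] * X)) ** 2)

violations = 0
for _ in range(1000):
    # Random (w0, w1) pairs; the sampling box is an arbitrary choice.
    a = rng.uniform([-200.0, -1.0], [200.0, 1.0])
    b = rng.uniform([-200.0, -1.0], [200.0, 1.0])
    lam = rng.uniform(0.0, 1.0)
    lhs = J(lam * a + (1 - lam) * b)
    rhs = lam * J(a) + (1 - lam) * J(b)
    # Small tolerance guards against floating-point round-off.
    if lhs > rhs * (1 + 1e-12) + 1e-9:
        violations += 1

print("violations of the convexity inequality:", violations)
```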

MSE is quadratic in $w$, so its Hessian is the constant matrix $H = \frac{2}{n} X^{⊤} X$, where $X$ is the design matrix (including the column of ones for $w_{0}$). If $X$ has full column rank (no redundant features), $H$ is positive definite, meaning the surface curves upward in every direction, and MSE is strictly convex with a unique global minimum.
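For the anchor dataset, the Hessian and its eigenvalues can be computed directly. In this sketch `X_design` is a name chosen here for the design matrix with a leading column of ones for the intercept.

```python
import numpy as np

x = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
n = len(x)

# Design matrix: column of ones (for w0) next to the feature column (for w1).
X_design = np.column_stack([np.ones(n), x])

# Hessian of J(w) = (1/n) * ||y - X w||^2 with respect to w = (w0, w1).
H = (2.0 / n) * X_design.T @ X_design

# eigvalsh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigenvalues = np.linalg.eigvalsh(H)
print("H =\n", H)
print("eigenvalues:", eigenvalues)
print("positive definite:", bool(np.all(eigenvalues > 0)))
print("condition number:", eigenvalues[-1] / eigenvalues[0])
```

Both eigenvalues are positive, but they differ by several orders of magnitude because the feature is unscaled. The bowl is therefore a long, narrow valley rather than a round dish.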

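Practical implication: gradient descent on MSE with linear regression converges to the optimal weights from any starting point, provided the step size is small enough (for a quadratic, below $2 / λ_{\max}(H)$, where $λ_{\max}(H)$ is the largest eigenvalue of the Hessian). How quickly it converges still depends on how well the features are scaled.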
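The step-size condition can be illustrated with plain batch gradient descent on the anchor dataset. This sketch runs 50 steps from the origin with two step sizes on either side of $2 / λ_{\max}(H)$, a standard threshold for quadratic objectives. The helper names and the 50-step budget are assumptions made for this example.

```python
import numpy as np

x = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)
n = len(x)
X_design = np.column_stack([np.ones(n), x])  # ones column for the intercept

def mse(w):
    return np.mean((y - X_design @ w) ** 2)

def gradient(w):
    # Gradient of J(w) = (1/n) * ||y - X w||^2
    return (2.0 / n) * X_design.T @ (X_design @ w - y)

def run_gd(step_size, n_steps=50):
    w = np.zeros(2)  # arbitrary starting point
    for _ in range(n_steps):
        w = w - step_size * gradient(w)
    return mse(w)

H = (2.0 / n) * X_design.T @ X_design
lam_max = np.linalg.eigvalsh(H)[-1]  # largest eigenvalue of the Hessian

initial = mse(np.zeros(2))
loss_small = run_gd(1.9 / lam_max)   # just under the threshold
loss_large = run_gd(2.1 / lam_max)   # just over the threshold

print(f"initial MSE:        {initial:,.0f}")
print(f"step 1.9 / lam_max: {loss_small:,.0f}")
print(f"step 2.1 / lam_max: {loss_large:,.0f}")
```

The smaller step reduces the loss while the larger one makes it grow. Even the safe step remains well above the minimum after 50 iterations on this unscaled data, because the intercept direction has very little curvature. This is the usual argument for standardizing features before running gradient descent.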

Cost Functions for Other Scenarios

Name	Formula	When to Use
MSE (L2 Loss)	$(1/n) \sum (y_{i} - \hat{y}_{i})^{2}$	Standard regression; penalizes outliers heavily
MAE (L1 Loss)	$(1/n) \sum |y_{i} - \hat{y}_{i}|$	Regression with outliers; robust but not differentiable at zero
Huber Loss	Quadratic for $|ε| \leq δ$, linear beyond	Best of both: robust + differentiable
RMSE	$\sqrt{MSE}$	Same minimizer as MSE; interpretable in original units

Computing MAE and RMSE at the optimal weights:

Residuals: $[- 3.3, - 3.3, 6.7, 6.7, - 3.3, - 3.3]$

$MAE = \frac{3.33 + 3.33 + 6.67 + 6.67 + 3.33 + 3.33}{6} = \frac{26.66}{6} \approx 4.44$

$RMSE = \sqrt{22.2} \approx 4.71$

RMSE and MAE are close here (4.71 vs 4.44, ratio ≈ 1.06) because the residuals are similar in magnitude, taking only two distinct absolute values: 3.3 and 6.7. When a model has a few large outlier-driven errors, this ratio grows significantly.
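The sketch below recomputes both metrics at the optimal weights, then repeats the calculation after corrupting one target value. The corrupted point is a hypothetical scenario added for illustration, and the weights are deliberately not refit, so the last residual simply becomes large.

```python
import numpy as np

X = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)
w0, w1 = 53.33, 0.20

def mae_rmse(targets):
    """Return (MAE, RMSE) of the fixed line w0 + w1 * X against the given targets."""
    residuals = targets - (w0 + w1 * X)
    return np.mean(np.abs(residuals)), np.sqrt(np.mean(residuals ** 2))

mae, rmse = mae_rmse(y)
ratio = rmse / mae
print(f"clean data:   MAE = {mae:.2f}, RMSE = {rmse:.2f}, ratio = {ratio:.2f}")

# Hypothetical outlier: shift the last target by +100 and keep the same weights.
y_outlier = y.copy()
y_outlier[-1] += 100.0
mae_out, rmse_out = mae_rmse(y_outlier)
ratio_out = rmse_out / mae_out
print(f"with outlier: MAE = {mae_out:.2f}, RMSE = {rmse_out:.2f}, ratio = {ratio_out:.2f}")
```

Squaring lets the single large residual dominate RMSE, while MAE counts it only in proportion to its size, so the ratio moves away from 1.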

Plotting the Loss Curve

python

import numpy as np
import matplotlib.pyplot as plt

X = np.array([650, 850, 1100, 1400, 1600, 1900])
y = np.array([180, 220, 280, 340, 370, 430])
w0 = 53.33

w1_values = np.linspace(0.05, 0.35, 100)
mse_values = [np.mean((y - (w0 + w1 * X)) ** 2) for w1 in w1_values]

plt.plot(w1_values, mse_values)
plt.axvline(x=0.20, color='red', linestyle='--', label='Optimal w₁=0.20')
plt.xlabel('w₁ (slope)')
plt.ylabel('MSE')
plt.title('Loss Landscape — MSE vs w₁')
plt.legend()
plt.show()

# Output: parabola curve with minimum at w₁=0.20, MSE≈22.2
# Steep rise in both directions confirms unique global minimum

MSE Trace at the Optimal Weights

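$x_{i}$	$y_{i}$	$\hat{y}_{i}$	$ε_{i}$	$ε_{i}^{2}$
650	180	183.3	−3.3	11.11
850	220	223.3	−3.3	11.11
1100	280	273.3	6.7	44.44
1400	340	333.3	6.7	44.44
1600	370	373.3	−3.3	11.11
1900	430	433.3	−3.3	11.11
			SSE	133.3
			MSE	22.2

Squared errors are computed from the unrounded residuals (−3.33 and 6.67), which is why they differ slightly from squaring the one-decimal values shown.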

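Related Concepts and Honest Limitations

The bowl shape is specific to squared error on a model that is linear in its weights. Pass the linear output through a sigmoid and keep squared error, or stack ReLU layers on top of it, and the loss landscape is no longer convex: it can have saddle points, flat regions, and many local minima. (Logistic regression stays convex only because it swaps squared error for log loss.) For neural networks, gradient descent is no longer guaranteed to find the global optimum. It finds a good local one, which is usually sufficient in practice but theoretically weaker.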
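A single-sample check shows how the bowl is lost once a sigmoid sits between the linear output and the squared error. The sample ($x = 1$, $y = 1$) and the two weights compared are hypothetical values picked for this sketch; the check applies the midpoint form of the convexity inequality.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def squared_error_through_sigmoid(w, x=1.0, y=1.0):
    """Squared error for one hypothetical sample when the prediction is sigmoid(w * x)."""
    return (y - sigmoid(w * x)) ** 2

# Convexity requires f(midpoint) <= average of the endpoint values for every pair (a, b).
a, b = -6.0, 0.0
midpoint_value = squared_error_through_sigmoid(0.5 * (a + b))
chord_value = 0.5 * (squared_error_through_sigmoid(a) + squared_error_through_sigmoid(b))

print(f"f(midpoint) = {midpoint_value:.4f}, chord = {chord_value:.4f}")
print("convexity inequality violated:", bool(midpoint_value > chord_value))
```

One violating pair is enough to rule out convexity, because the definition must hold for every pair of points.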

MAE's non-differentiability at zero isn't catastrophic: subgradient methods and linear programming can minimize it. But they're slower and more complex than the OLS closed form or standard gradient descent for MSE. Huber loss is the practical compromise when you want robustness to outliers without giving up differentiability.
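As a rough sketch of that compromise, here is the usual form of Huber loss (quadratic inside $|ε| \leq δ$, linear outside, with a $\frac{1}{2}$ factor so the two pieces join smoothly). The residual values and the choice $δ = 5$ are hypothetical, picked only to show how one outlier affects each metric.

```python
import numpy as np

def huber_loss(residuals, delta=1.0):
    """Mean Huber loss: 0.5 * r^2 where |r| <= delta, delta * (|r| - 0.5 * delta) elsewhere."""
    abs_r = np.abs(residuals)
    quadratic = 0.5 * residuals ** 2
    linear = delta * (abs_r - 0.5 * delta)
    return np.mean(np.where(abs_r <= delta, quadratic, linear))

# Hypothetical residuals: five small errors and one large outlier.
r = np.array([-3.0, 2.0, -1.0, 4.0, -2.0, 100.0])

mse_value = np.mean(r ** 2)
mae_value = np.mean(np.abs(r))
huber_value = huber_loss(r, delta=5.0)

print(f"MSE   = {mse_value:.2f}")
print(f"MAE   = {mae_value:.2f}")
print(f"Huber = {huber_value:.2f}")
```

The outlier enters MSE as its square but enters Huber only linearly, so Huber ends up far closer to MAE's scale while staying differentiable at zero.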

Test Your Understanding

For $w_{1} = 0.10$ (holding $w_{0} = 53.33$), compute the MSE manually by listing each residual and squaring it. Verify it's approximately 17,500.
The Hessian of MSE is $H = \frac{2}{n} X^{⊤} X$ . For the 1-feature anchor, compute $X^{⊤} X$ (a 2×2 matrix with the bias column included) and verify it's positive definite.
Why does MSE penalize outliers more than MAE? Sketch two error scenarios — one with uniform small errors, one with one large error — and compare MSE vs MAE for each.
If you use $J = \frac{1}{2 n} \sum ε_{i}^{2}$ instead of $\frac{1}{n} \sum ε_{i}^{2}$, do the optimal $w_{0}$ and $w_{1}$ change? Does the gradient change?
Huber loss transitions from quadratic to linear at $∣ ε ∣ = δ$ . What would the loss curve shape look like as a function of $w_{1}$ ? Would it still have a unique minimum?
