Regression Cost Functions

Jul 1, 2026 · 9 min read · By Mohammed Vasim
Tags: deep-learning, neural-networks, machine-learning, representation-learning

Regression outputs a continuous number. The loss function for regression measures how far that number is from the true value — but how you measure "far" determines what the model learns to prioritize. MSE penalizes large errors heavily. MAE treats all error magnitudes proportionally. Huber combines both. The choice depends on your data and what kinds of mistakes are acceptable.

Anchor dataset: house price prediction, 5 samples.

python
y_true = [300000, 180000, 450000, 120000, 350000]
y_pred = [320000, 165000, 480000, 135000, 340000]
errors = [-20000, 15000, -30000, -15000, 10000]  # y_true - y_pred

MSE — Mean Squared Error

J = (1/n) Σ(yᵢ − ŷᵢ)²

Computing MSE for the Anchor

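| Sample | y_true | y_pred | error | error² |
|--------|--------|--------|-------|--------|
| 1 | 300,000 | 320,000 | −20,000 | 4.00 × 10⁸ |
| 2 | 180,000 | 165,000 | 15,000 | 2.25 × 10⁸ |
| 3 | 450,000 | 480,000 | −30,000 | 9.00 × 10⁸ |
| 4 | 120,000 | 135,000 | −15,000 | 2.25 × 10⁸ |
| 5 | 350,000 | 340,000 | 10,000 | 1.00 × 10⁸ |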

J_MSE = (4.00 + 2.25 + 9.00 + 2.25 + 1.00) × 10⁸ / 5 = 18.50 × 10⁸ / 5 = 3.70 × 10⁸

Gradient of MSE

∂J/∂ŷᵢ = −2(yᵢ − ŷᵢ)/n

For Sample 1 (error = −20,000, so y − ŷ = −20,000):

∂J/∂ŷ₁ = −2 × (−20,000) / 5 = +8,000

Positive gradient means increasing ŷ₁ increases the loss — so the update will decrease ŷ₁ (reduce the prediction), which is correct since ŷ₁ = 320,000 > y₁ = 300,000.
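As a quick check of the formula, the sketch below evaluates the MSE gradient for all five anchor samples and compares it with a central finite-difference estimate. The helper name `mse` and the step size `h` are illustrative choices, not part of the derivation above.

```python
import numpy as np

y_true = np.array([300000, 180000, 450000, 120000, 350000], dtype=float)
y_pred = np.array([320000, 165000, 480000, 135000, 340000], dtype=float)
n = len(y_true)

# Analytic gradient of MSE with respect to each prediction: -2 (y - yhat) / n
grad_mse = -2.0 * (y_true - y_pred) / n

def mse(yhat):
    """Mean squared error against the anchor targets."""
    return np.mean((y_true - yhat) ** 2)

# Central finite differences, nudging one prediction at a time
h = 1.0  # step size, an arbitrary choice for this check
grad_fd = np.zeros(n)
for i in range(n):
    step = np.zeros(n)
    step[i] = h
    grad_fd[i] = (mse(y_pred + step) - mse(y_pred - step)) / (2 * h)

print("analytic:          ", grad_mse)
print("finite difference: ", grad_fd)
print("match:", np.allclose(grad_mse, grad_fd))
```

The first entry is +8,000, matching the hand calculation for Sample 1. Sample 3, with the largest error, gets the largest gradient magnitude.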

[Figure: MSE Loss, Quadratic Penalty for Errors. x-axis: error (y − ŷ); y-axis: MSE loss. Annotations: large error → high loss; quadratic growth, 2× error → 4× loss.]

Key property: MSE penalizes large errors disproportionately. An error of 30,000 (Sample 3) contributes 9.0 × 10⁸ to the sum — four times the contribution of an error of 15,000. This makes the model prioritize reducing the largest errors first, which is correct when large prediction errors are more damaging than small ones.

Outlier sensitivity: Replace Sample 1's prediction with 800,000 (a bad prediction). Error becomes 500,000. Contribution: (500,000)² = 2.5 × 10¹¹ — this one sample now dominates the entire cost. The model will focus almost entirely on this one prediction while ignoring the other four.
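To put a number on "dominates", here is a minimal sketch that swaps in the 800,000 prediction for Sample 1 and reports each sample's share of the total squared error, along with its MSE gradient. The variable names (`y_pred_outlier`, `share`, `grad`) are illustrative.

```python
import numpy as np

y_true = np.array([300000, 180000, 450000, 120000, 350000], dtype=float)
y_pred = np.array([320000, 165000, 480000, 135000, 340000], dtype=float)

# Replace Sample 1's prediction with the bad value used in the text
y_pred_outlier = y_pred.copy()
y_pred_outlier[0] = 800000

sq_err = (y_true - y_pred_outlier) ** 2
share = sq_err / sq_err.sum()  # fraction of the total squared error per sample
grad = -2.0 * (y_true - y_pred_outlier) / len(y_true)  # MSE gradient per prediction

for i, (s, g) in enumerate(zip(share, grad), start=1):
    print(f"sample {i}: share of squared error = {s:.4f}, gradient = {g:+.0f}")
```

Sample 1 accounts for more than 99% of the summed squared error, and its gradient (200,000) is far larger than any other sample's.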


MAE — Mean Absolute Error

J = (1/n) Σ|yᵢ − ŷᵢ|

Computing MAE for the Anchor

text
| Sample | y_true | y_pred | |error| |
|--------|--------|--------|--------|
| 1 | 300,000 | 320,000 | 20,000 |
| 2 | 180,000 | 165,000 | 15,000 |
| 3 | 450,000 | 480,000 | 30,000 |
| 4 | 120,000 | 135,000 | 15,000 |
| 5 | 350,000 | 340,000 | 10,000 |

J_MAE = (20,000 + 15,000 + 30,000 + 15,000 + 10,000) / 5 = 90,000 / 5 = 18,000

RMSE (below) = √(3.70 × 10⁸) ≈ 19,235. Both are in dollars — directly comparable to y.

Gradient of MAE

∂J/∂ŷᵢ = −sign(yᵢ − ŷᵢ) / n

For Sample 1 (y − ŷ = −20,000, so sign = −1):

∂J/∂ŷ₁ = −(−1) / 5 = +0.2

Every sample contributes exactly ±0.2 regardless of how large the error is. A prediction that is off by 10 and a prediction that is off by 100,000 both get the same gradient magnitude. MAE is outlier-robust because outliers don't get extra weight in the gradient.
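The constant-magnitude property is easy to see numerically. The sketch below computes the MAE gradient for the anchor, then again after making every error ten times larger (a hypothetical scaling used only for illustration). The helper name `mae_grad` is an assumption of this example.

```python
import numpy as np

y_true = np.array([300000, 180000, 450000, 120000, 350000], dtype=float)
y_pred = np.array([320000, 165000, 480000, 135000, 340000], dtype=float)

def mae_grad(y, yhat):
    """Gradient of MAE with respect to each prediction: -sign(y - yhat) / n.

    np.sign returns 0 where the error is exactly 0.
    """
    return -np.sign(y - yhat) / len(y)

grad_small = mae_grad(y_true, y_pred)

# Hypothetical predictions whose errors are 10x larger, with the same signs
y_pred_worse = y_true - 10 * (y_true - y_pred)
grad_large = mae_grad(y_true, y_pred_worse)

print("anchor errors:     ", grad_small)
print("10x larger errors: ", grad_large)
print("identical:", np.array_equal(grad_small, grad_large))
```

Both gradients are ±0.2 per sample. Only the sign of the error matters, not its size.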

[Figure: MAE Loss, Linear Penalty (V-shape). x-axis: error (y − ŷ). Annotations: kink at 0 (not differentiable); slope = −1 (constant) on one arm, slope = +1 (constant) on the other.]

Non-differentiable at zero. The gradient of MAE is undefined when error = 0 (the kink). In practice, subgradient methods are used — the gradient is set to 0 at exactly error = 0. This is not a practical problem since the probability of a floating-point prediction hitting exactly 0 error is essentially zero.

Same outlier experiment: Replace Sample 1's prediction with 800,000. Its absolute error becomes 500,000, so the new MAE is (500,000 + 15,000 + 30,000 + 15,000 + 10,000)/5 = 570,000/5 = 114,000, up from 18,000 (about 6.3×). Under MSE, the same outlier contributes a squared-error term of 2.5 × 10¹¹ that dominates everything. MAE increases linearly; MSE increases quadratically.


RMSE — Root Mean Squared Error

RMSE = √J_MSE = √(3.70 × 10⁸) ≈ 19,235

RMSE is not a separate loss function — it is MSE scaled to the original units (dollars instead of dollars²). Minimizing MSE is identical to minimizing RMSE because √ is a monotone transformation.

RMSE is used for reporting and interpretation, not for training. The gradient of RMSE is:

∂RMSE/∂ŷᵢ = −(yᵢ − ŷᵢ) / (n × RMSE)

This is the MSE gradient scaled by 1/(2 × RMSE). Since you are minimizing RMSE when you minimize MSE, frameworks use MSE directly (no sqrt required per step).
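The relationship between the two gradients can be checked on the anchor data. This is a small sketch: the helper `rmse` and the finite-difference step `h` are illustrative assumptions.

```python
import numpy as np

y_true = np.array([300000, 180000, 450000, 120000, 350000], dtype=float)
y_pred = np.array([320000, 165000, 480000, 135000, 340000], dtype=float)
n = len(y_true)
err = y_true - y_pred

def rmse(yhat):
    """Root mean squared error against the anchor targets."""
    return np.sqrt(np.mean((y_true - yhat) ** 2))

rmse_val = rmse(y_pred)
grad_mse = -2.0 * err / n            # MSE gradient
grad_rmse = -err / (n * rmse_val)    # RMSE gradient from the formula above

# Central finite-difference estimate of the RMSE gradient
h = 1.0
eye = np.eye(n)
grad_fd = np.array([
    (rmse(y_pred + h * eye[i]) - rmse(y_pred - h * eye[i])) / (2 * h)
    for i in range(n)
])

print(f"RMSE = {rmse_val:.2f}")
print("chain rule holds:", np.allclose(grad_rmse, grad_mse / (2 * rmse_val)))
print("matches finite differences:", np.allclose(grad_rmse, grad_fd))
```

The RMSE gradient points in the same direction as the MSE gradient and differs only by the positive factor 1/(2 × RMSE).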


Huber Loss

Lδ(y, ŷ) = { (1/2)(y−ŷ)² if |y−ŷ| ≤ δ ; δ(|y−ŷ| − δ/2) if |y−ŷ| > δ }

Huber is quadratic for small errors and linear for large ones. The transition point δ controls where "small" becomes "large."

For δ = 25,000 with the anchor's absolute errors [20000, 15000, 30000, 15000, 10000]:

text
| Sample | |error| | vs δ=25000 | Regime | Huber contribution |
|--------|--------|-----------|--------|-------------------|
| 1 | 20,000 | < 25,000 | Quadratic | ½×(20000)² = 2.00×10⁸ |
| 2 | 15,000 | < 25,000 | Quadratic | ½×(15000)² = 1.125×10⁸ |
| 3 | 30,000 | > 25,000 | Linear | 25000×(30000−12500) = 4.375×10⁸ |
| 4 | 15,000 | < 25,000 | Quadratic | ½×(15000)² = 1.125×10⁸ |
| 5 | 10,000 | < 25,000 | Quadratic | ½×(10000)² = 0.500×10⁸ |

J_Huber = (2.00 + 1.125 + 4.375 + 1.125 + 0.500) × 10⁸ / 5 = 9.125 × 10⁸ / 5 = 1.825 × 10⁸

Sample 3 (largest error) hits the linear regime and is treated proportionally, not quadratically. If Sample 1's prediction were 800,000 (a massive outlier with error 500,000), its squared-error term under MSE would be 2.5 × 10¹¹. Its Huber term would be 25,000 × (500,000 − 12,500) ≈ 1.22 × 10¹⁰, roughly 20× smaller, so Huber contains the damage.
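The same regime split shows up in the gradient. Differentiating the piecewise definition gives −(y − ŷ)/n in the quadratic regime and −δ·sign(y − ŷ)/n in the linear regime, which can be written as a single clipped error. The sketch below assumes δ = 25,000 as in the table; `huber_loss` mirrors the definition above and is used only for a finite-difference check.

```python
import numpy as np

y_true = np.array([300000, 180000, 450000, 120000, 350000], dtype=float)
y_pred = np.array([320000, 165000, 480000, 135000, 340000], dtype=float)
n = len(y_true)
delta = 25000.0
err = y_true - y_pred

# Clipping the error to [-delta, delta] covers both regimes in one expression
grad_huber = -np.clip(err, -delta, delta) / n
grad_half_mse = -err / n  # gradient of mean((1/2) * err**2), for comparison

def huber_loss(yhat):
    """Mean Huber loss against the anchor targets for the chosen delta."""
    a = np.abs(y_true - yhat)
    return np.mean(np.where(a <= delta, 0.5 * a ** 2, delta * (a - 0.5 * delta)))

# Central finite-difference check
h = 1.0
eye = np.eye(n)
grad_fd = np.array([
    (huber_loss(y_pred + h * eye[i]) - huber_loss(y_pred - h * eye[i])) / (2 * h)
    for i in range(n)
])

print("Huber gradient:     ", grad_huber)
print("(1/2) MSE gradient: ", grad_half_mse)
print("matches finite differences:", np.allclose(grad_huber, grad_fd))
```

Only Sample 3 differs between the two gradients: its error of −30,000 is clipped to −25,000 before being divided by n.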

[Figure: Huber Loss, Quadratic for Small Errors, Linear for Large. x-axis: error (y − ŷ), from −3δ to +3δ. Regions: linear, quadratic (between −δ and +δ), linear.]

Comparison Table

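| Loss | Formula | Outlier robust | Differentiable | Units | Best for |
|------|---------|----------------|----------------|-------|----------|
| MSE | Σ(y−ŷ)²/n | ✗ (quadratic) | ✓ everywhere | y² | Clean data, penalize large errors |
| MAE | Σ|y−ŷ|/n | ✓ (linear) | ✗ at error=0 | y | Outlier-heavy data, median regression |
| RMSE | √MSE | | | y | Reporting and interpretability |
| Huber | hybrid | ✓ (for |e| > δ) | ✓ everywhere | | |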

Code

python
import numpy as np

y_true = np.array([300000, 180000, 450000, 120000, 350000], dtype=float)
y_pred = np.array([320000, 165000, 480000, 135000, 340000], dtype=float)
errors = y_true - y_pred

mse  = np.mean(errors ** 2)
mae  = np.mean(np.abs(errors))
rmse = np.sqrt(mse)

def huber_loss(y, yhat, delta=25000):
    err = np.abs(y - yhat)
    quad = 0.5 * err ** 2
    lin  = delta * (err - 0.5 * delta)
    return np.mean(np.where(err <= delta, quad, lin))

huber = huber_loss(y_true, y_pred)

print(f"Errors:    {errors.astype(int)}")
print(f"MSE:       {mse:.3e}")
print(f"RMSE:      {rmse:.2f}")
print(f"MAE:       {mae:.2f}")
print(f"Huber(δ=25k): {huber:.3e}")

# Outlier sensitivity
y_pred_outlier = y_pred.copy()
y_pred_outlier[0] = 800000
print(f"\n--- With outlier (sample 1 pred=800,000) ---")
print(f"MSE:  {np.mean((y_true - y_pred_outlier)**2):.3e}")
print(f"MAE:  {np.mean(np.abs(y_true - y_pred_outlier)):.2f}")
print(f"Huber:{huber_loss(y_true, y_pred_outlier):.3e}")
text
Errors:    [-20000  15000 -30000 -15000  10000]
MSE:       3.700e+08
RMSE:      19235.38
MAE:       18000.00
Huber(δ=25k): 1.825e+08

--- With outlier (sample 1 pred=800,000) ---
MSE:  5.029e+10
MAE:  114000.00
Huber:2.580e+09

MSE jumps 136× with the outlier. MAE jumps about 6.3×. Huber sits between them at roughly 14×: the outlier is contained but not ignored.


Where this builds from: Loss vs cost (01) established that the cost is the mean of individual losses. Each function here is a formula for computing that individual loss. The gradient computed here is what backpropagation uses to update weights.

Where this leads: Classification losses cover BCE, CCE, and hinge loss — the equivalents of MSE/MAE for discrete outputs. Huber loss is used in object detection (SSD, Faster R-CNN bounding box regression) because bounding box coordinate errors can be large outliers when the initial anchor is far from the target.


Honest Limitations

MSE assumes Gaussian noise. Minimizing MSE is equivalent to maximum likelihood estimation when errors are normally distributed. If errors follow a heavier-tailed distribution (common in financial or sensor data), MAE (which corresponds to Laplace noise) or Huber will often generalize better.
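One concrete consequence of the Gaussian versus Laplace view: for a model that can only output a single constant, the MSE-optimal constant is the mean of the targets and the MAE-optimal constant is the median. The sketch below confirms this on the anchor targets with a brute-force search. The candidate grid is a hypothetical choice for illustration.

```python
import numpy as np

y_true = np.array([300000, 180000, 450000, 120000, 350000], dtype=float)

# Hypothetical grid of constant predictions, in steps of 1,000
candidates = np.arange(100000, 500001, 1000, dtype=float)

mse_per_c = np.array([np.mean((y_true - c) ** 2) for c in candidates])
mae_per_c = np.array([np.mean(np.abs(y_true - c)) for c in candidates])

best_mse_c = candidates[np.argmin(mse_per_c)]
best_mae_c = candidates[np.argmin(mae_per_c)]

print(f"MSE-optimal constant: {best_mse_c:.0f}  (mean   = {np.mean(y_true):.0f})")
print(f"MAE-optimal constant: {best_mae_c:.0f}  (median = {np.median(y_true):.0f})")
```

The mean is pulled toward extreme targets while the median is not, which is the same robustness difference seen in the outlier experiments.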

MAE's non-differentiability at zero is a minor practical concern, but its lack of curvature matters for second-order optimizers. Newton-type and quasi-Newton methods such as L-BFGS rely on curvature (Hessian) information, and MAE's second derivative is zero wherever it exists, so these methods are a poor fit for MAE. First-order optimizers (Adam, SGD) handle it through the subgradient without modification.

Huber δ is a hyperparameter you must tune. If δ is too small, Huber behaves like MAE everywhere. If δ is too large, Huber behaves like MSE everywhere. The right δ depends on the scale and distribution of errors in your specific dataset.
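The two limits can be checked on the anchor data. In this sketch the δ values 1 and 1,000,000 are arbitrary stand-ins for "too small" and "too large", and `huber_loss` follows the piecewise definition given earlier.

```python
import numpy as np

y_true = np.array([300000, 180000, 450000, 120000, 350000], dtype=float)
y_pred = np.array([320000, 165000, 480000, 135000, 340000], dtype=float)

def huber_loss(y, yhat, delta):
    """Mean Huber loss: quadratic for |error| <= delta, linear beyond it."""
    err = np.abs(y - yhat)
    quad = 0.5 * err ** 2
    lin = delta * (err - 0.5 * delta)
    return np.mean(np.where(err <= delta, quad, lin))

mse = np.mean((y_true - y_pred) ** 2)
mae = np.mean(np.abs(y_true - y_pred))

tiny = huber_loss(y_true, y_pred, delta=1.0)   # every sample in the linear regime
huge = huber_loss(y_true, y_pred, delta=1e6)   # every sample in the quadratic regime

print(f"delta = 1:   Huber = {tiny:.1f}, delta * (MAE - delta/2) = {1.0 * (mae - 0.5):.1f}")
print(f"delta = 1e6: Huber = {huge:.3e}, MSE / 2 = {0.5 * mse:.3e}")
```

With a tiny δ the loss is MAE scaled by δ (minus a constant), so its gradients match MAE's up to that scale. With a huge δ it equals half the MSE, because of the ½ in the quadratic branch.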


Test Your Understanding

  1. Your model predicts house prices with RMSE = 25,000 on the validation set. A stakeholder asks "how wrong are your predictions on average?" Is RMSE the right metric to quote? What about MAE? Under what conditions would they be very different from each other?

  2. Add a new sample to the anchor: y_true = 200,000, y_pred = 900,000. Compute the new MSE and MAE by hand (just compute the new squared error and absolute error, then average all 6 values). Which metric changes more dramatically?

  3. The Huber loss gradient for |error| > δ is ±δ/n — a constant. The MSE gradient for the same large error is −2(y−ŷ)/n — proportional to the error. Why does the constant gradient help with outliers? What is the maximum gradient magnitude Huber will ever produce for a single sample, regardless of how wrong the prediction is?

  4. A model is trained with MAE loss and achieves excellent MAE on the test set. Its RMSE is much higher than expected. What does this tell you about the error distribution? Draw or describe what the residual histogram probably looks like.

  5. You use MSE to train a house price model. After training, you analyze the residuals and find that 95% of errors are within ±10,000 but one outlier has an error of 500,000. Describe what happened during training: which sample dominated the gradient, and what does the trained model likely do when it encounters a house similar to the outlier?
