Performance Metrics for Regression

Jun 25, 2026 · 6 min read · By Mohammed Vasim
Machine Learning · AI · Data Science

A regression model without a metric is a black box. Metrics translate the abstract concept of "good fit" into a number you can compare, track, and argue about with stakeholders. The catch is that each metric penalizes errors differently — choosing the wrong one for your problem can make a genuinely good model look bad or hide a genuinely bad model's failure modes.

Baseline: The Naive Model

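Before evaluating any metric, establish what a trivially dumb model would score. The naive model always predicts the mean of the targets, \(\bar{y}\), regardless of input.

For our anchor: \(\bar{y} = \frac{180 + 220 + 280 + 340 + 370 + 430}{6} = 303.33\).

Naive residuals: \(y_i - \bar{y} = [-123.33, -83.33, -23.33, 36.67, 66.67, 126.67]\)

SST (Total Sum of Squares): \(\text{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = 44533.33\)

Your model should do significantly better than this baseline on every metric below. If your model has \(R^2 < 0\), it doesn't even beat predicting the mean.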
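As a quick check of the baseline numbers, the sketch below computes the naive prediction and SST with NumPy. It assumes the six anchor prices from the article's verification code (in $k); the variable names are illustrative only.

```python
import numpy as np

# Actual prices in $k, taken from the article's anchor dataset.
y_true = np.array([180, 220, 280, 340, 370, 430], dtype=float)

# The naive model predicts the mean for every sample.
y_bar = y_true.mean()
naive_pred = np.full_like(y_true, y_bar)

# SST is simply the naive model's sum of squared errors.
naive_residuals = y_true - naive_pred
sst = np.sum(naive_residuals ** 2)

# The naive model's MAE gives a baseline in the original units.
naive_mae = np.mean(np.abs(naive_residuals))

print(f"mean:      {y_bar:.2f}")      # 303.33
print(f"SST:       {sst:.2f}")        # 44533.33
print(f"naive MAE: {naive_mae:.2f}")  # 76.67
```

Any fitted model whose MAE is not well below the naive MAE is adding little beyond the mean.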

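Anchor (OLS fit): actual prices \(y = [180, 220, 280, 340, 370, 430]\) (in $k), predictions \(\hat{y} = [183.33, 223.33, 273.33, 333.33, 373.33, 433.33]\)

Residuals: \(e_i = y_i - \hat{y}_i\), \(e = [-3.33, -3.33, 6.67, 6.67, -3.33, -3.33]\)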

MAE — Mean Absolute Error

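\[ \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \]

Absolute residuals: \([3.33, 3.33, 6.67, 6.67, 3.33, 3.33]\). Sum = 26.67.

\[ \text{MAE} = \frac{26.67}{6} = 4.44 \]

Interpretation: on average, predictions are off by $4.44k. This is in the original units of \(y\), so it is immediately interpretable.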

Outlier sensitivity: low. A single sample with residual 200 contributes 200 to the sum. A sample with residual 0.1 contributes 0.1. No squaring means no disproportionate penalty for large errors.
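To make the linear penalty concrete, here is a minimal sketch that computes MAE by hand and then injects one hypothetical residual of 200. The outlier is an assumption for illustration; it is not part of the anchor data.

```python
import numpy as np

y_true = np.array([180, 220, 280, 340, 370, 430], dtype=float)
y_pred = np.array([183.33, 223.33, 273.33, 333.33, 373.33, 433.33])

def mae(y, y_hat):
    """Mean absolute error: the average of |y_i - y_hat_i|."""
    return np.mean(np.abs(y - y_hat))

base_mae = mae(y_true, y_pred)

# Hypothetical outlier: move the last prediction so its residual is 200.
y_pred_outlier = y_pred.copy()
y_pred_outlier[-1] = y_true[-1] - 200.0
outlier_mae = mae(y_true, y_pred_outlier)

print(f"MAE without outlier: {base_mae:.2f}")     # 4.44
print(f"MAE with outlier:    {outlier_mae:.2f}")  # 37.22
```

The increase is exactly the change in that one absolute residual divided by n, which is what "no disproportionate penalty" means in practice.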

MSE — Mean Squared Error

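\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]

Squared residuals: \([11.11, 11.11, 44.44, 44.44, 11.11, 11.11]\). Sum = 133.33.

\[ \text{MSE} = \frac{133.33}{6} = 22.22 \]

Units: squared dollars ($k²), which are not directly interpretable in the original scale. You can't tell a stakeholder "our average error is 22.22."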

Outlier sensitivity: high. That same residual of 200 contributes 40,000 to the squared-error sum, 200 times what it contributes to MAE's absolute-error sum. Under MSE, one large error hurts much more than many small ones: it takes 400 residuals of size 10 to match it.
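One way to see the difference is to measure how much of each metric's total comes from a single bad sample. The sketch below reuses the anchor data and adds a hypothetical residual of 200, an assumption made only for illustration.

```python
import numpy as np

y_true = np.array([180, 220, 280, 340, 370, 430], dtype=float)
y_pred = np.array([183.33, 223.33, 273.33, 333.33, 373.33, 433.33])

# Hypothetical outlier: the last prediction misses by 200 ($k).
y_pred_outlier = y_pred.copy()
y_pred_outlier[-1] = y_true[-1] - 200.0

abs_err = np.abs(y_true - y_pred_outlier)
sq_err = abs_err ** 2

base_mse = np.mean((y_true - y_pred) ** 2)
outlier_mse = np.mean(sq_err)

# Fraction of each metric's total that comes from the single outlier.
share_mae = abs_err[-1] / abs_err.sum()
share_mse = sq_err[-1] / sq_err.sum()

print(f"MSE without outlier: {base_mse:.2f}")
print(f"MSE with outlier:    {outlier_mse:.2f}")
print(f"outlier share of absolute error: {share_mae:.1%}")
print(f"outlier share of squared error:  {share_mse:.1%}")
```

Under these assumptions the outlier accounts for roughly 90% of the absolute error but more than 99% of the squared error.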

Why use it? Differentiable everywhere, connects directly to the OLS derivation, and the gradient is clean. It's the loss function used to train the model.
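The "clean gradient" claim can be checked numerically. The sketch below takes the gradient of MSE with respect to each prediction, \(-2(y_i - \hat{y}_i)/n\), and compares it with a central finite difference. Differentiating with respect to predictions rather than model coefficients is a simplification chosen here because the article does not list the feature values.

```python
import numpy as np

y_true = np.array([180, 220, 280, 340, 370, 430], dtype=float)
y_pred = np.array([183.33, 223.33, 273.33, 333.33, 373.33, 433.33])
n = len(y_true)

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

# Analytic gradient of MSE with respect to each prediction.
analytic_grad = -2.0 * (y_true - y_pred) / n

# Central finite-difference approximation of the same gradient.
eps = 1e-6
numeric_grad = np.zeros(n)
for i in range(n):
    step = np.zeros(n)
    step[i] = eps
    numeric_grad[i] = (mse(y_true, y_pred + step) - mse(y_true, y_pred - step)) / (2 * eps)

# MAE's subgradient keeps only the sign of each residual, not its size,
# and is undefined where a residual is exactly zero.
mae_subgrad = -np.sign(y_true - y_pred) / n

print(np.round(analytic_grad, 4))
print(np.round(numeric_grad, 4))
print(np.round(mae_subgrad, 4))
```

The MSE gradient scales with the size of each residual, so larger misses pull harder during optimization, while the MAE subgradient has the same magnitude for every sample.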

RMSE — Root Mean Squared Error

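\[ \text{RMSE} = \sqrt{\text{MSE}} = \sqrt{22.22} = 4.71 \]

RMSE is in the same units as \(y\), making it interpretable. It's always \(\geq\) MAE, and the gap between them indicates whether large errors dominate.

\[ \frac{\text{RMSE}}{\text{MAE}} = \frac{4.71}{4.44} = 1.06 \]

A ratio close to 1.0 means the errors have similar magnitudes across samples. The ratio is bounded above by \(\sqrt{n}\), so for a six-sample model with one catastrophically large error and five tiny ones it approaches \(\sqrt{6} \approx 2.45\). Ratios of 5–10× only become possible with larger samples (at least 25 to 100 points).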
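The RMSE/MAE ratio is easy to compute directly. The sketch below evaluates it for the anchor residuals and for two hypothetical residual vectors (equal errors, and one dominant error). For n samples the ratio can never exceed \(\sqrt{n}\), which is about 2.45 when n = 6.

```python
import numpy as np

def rmse_mae_ratio(residuals):
    """Return RMSE / MAE for a vector of residuals."""
    residuals = np.asarray(residuals, dtype=float)
    rmse = np.sqrt(np.mean(residuals ** 2))
    mae = np.mean(np.abs(residuals))
    return rmse / mae

anchor = [-3.33, -3.33, 6.67, 6.67, -3.33, -3.33]  # residuals from the article
equal = [5, 5, 5, 5, 5, 5]                         # hypothetical: identical errors
one_big = [200, 0.1, 0.1, 0.1, 0.1, 0.1]           # hypothetical: one dominant error

ratio_anchor = rmse_mae_ratio(anchor)
ratio_equal = rmse_mae_ratio(equal)
ratio_one_big = rmse_mae_ratio(one_big)
upper_bound = np.sqrt(len(one_big))

print(f"anchor:  {ratio_anchor:.2f}")   # 1.06
print(f"equal:   {ratio_equal:.2f}")    # 1.00
print(f"one big: {ratio_one_big:.2f}")  # 2.44
print(f"bound:   {upper_bound:.2f}")    # 2.45
```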

R² — Coefficient of Determination

\[ R^2 = 1 - \frac{\text{SSR}}{\text{SST}} \]

  • SSR (residual sum of squares) = 133.33 (our model's SSE)
  • SST = 44533.33 (baseline SSE)

\[ R^2 = 1 - \frac{133.33}{44533.33} = 0.997 \]

Our model explains 99.7% of the variance in house prices. For 6 points on a near-linear relationship, this is expected. \(R^2 = 1\) means perfect fit; \(R^2 = 0\) means no better than the naive mean model; \(R^2 < 0\) means your model is worse than predicting the mean.

\(R^2\) is unitless and scale-invariant: converting prices from $k to dollars leaves it unchanged. MSE and RMSE change with the units of the target, so their raw values cannot be compared across targets on different scales.
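The three reference points for \(R^2\) (near 1, exactly 0, below 0) and its invariance to rescaling can be reproduced in a few lines. In this sketch the "bad" model, which predicts the prices in reverse order, is a hypothetical constructed only to produce a negative score.

```python
import numpy as np

def r2(y, y_hat):
    """Coefficient of determination: 1 - SSR / SST."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([180, 220, 280, 340, 370, 430], dtype=float)
y_pred = np.array([183.33, 223.33, 273.33, 333.33, 373.33, 433.33])

model_r2 = r2(y_true, y_pred)
mean_r2 = r2(y_true, np.full_like(y_true, y_true.mean()))  # naive mean model
bad_r2 = r2(y_true, y_true[::-1])                          # hypothetical bad model

# Converting $k to dollars multiplies MSE by 1000^2 but leaves R^2 unchanged.
scaled_r2 = r2(y_true * 1000, y_pred * 1000)
mse_k = np.mean((y_true - y_pred) ** 2)
mse_dollars = np.mean((y_true * 1000 - y_pred * 1000) ** 2)

print(f"model R^2:      {model_r2:.4f}")  # 0.9970
print(f"mean-model R^2: {mean_r2:.4f}")   # 0.0000
print(f"bad-model R^2:  {bad_r2:.4f}")    # negative
print(f"rescaled R^2:   {scaled_r2:.4f}")
print(f"MSE ratio after rescaling: {mse_dollars / mse_k:.0f}")
```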

Adjusted R²

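Adding any feature, even random noise, can only maintain or increase \(R^2\) on the training data. Adjusted \(R^2\) corrects for this:

\[ R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1} \]

where \(n\) is the number of samples and \(p\) is the number of features.

Simple regression (\(p = 1\)): \(R^2_{\text{adj}} = 1 - \frac{(1 - 0.997)(6 - 1)}{6 - 1 - 1} = 0.996\)

Multiple regression (\(p = 2\)): \(R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(6 - 1)}{6 - 2 - 1} = 0.995\)

Adding bedrooms as a second feature slightly decreased adjusted \(R^2\) from 0.996 to 0.995: bedrooms didn't improve the model enough to justify the extra parameter. This is adjusted \(R^2\) doing its job.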
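The adjustment is a one-line function. The sketch below assumes the standard formula \(1 - (1 - R^2)(n - 1)/(n - p - 1)\). Because the article does not state the raw \(R^2\) of the two-feature model, the second call holds \(R^2\) at 0.997 purely as an illustrative assumption.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n samples and p features (intercept not counted in p)."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1 for adjusted R^2 to be defined")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

n = 6
simple = adjusted_r2(0.997, n, p=1)

# Assumption: the second feature leaves raw R^2 unchanged at 0.997.
multi_same_r2 = adjusted_r2(0.997, n, p=2)

print(f"p=1: {simple:.5f}")         # 0.99625
print(f"p=2: {multi_same_r2:.5f}")  # 0.99500
```

With only six samples, each extra feature removes a large share of the remaining degrees of freedom, so the penalty grows quickly.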

MAPE — Mean Absolute Percentage Error

\[ \text{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \]

| \(y_i\) | \(\hat{y}_i\) | Absolute error | % error |
|---|---|---|---|
| 180 | 183.33 | 3.33 | 1.85% |
| 220 | 223.33 | 3.33 | 1.51% |
| 280 | 273.33 | 6.67 | 2.38% |
| 340 | 333.33 | 6.67 | 1.96% |
| 370 | 373.33 | 3.33 | 0.90% |
| 430 | 433.33 | 3.33 | 0.77% |
| MAPE | | | 1.56% |

Predictions are within 1.56% of actual price on average. Business stakeholders understand percentages better than abstract MSE values.

Warning: MAPE is undefined when \(y_i = 0\). For demand forecasting where zero demand is common, use sMAPE or WMAPE instead.
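For reference, here is a minimal sketch of the two alternatives. sMAPE has several competing definitions. This one uses \(2|y - \hat{y}| / (|y| + |\hat{y}|)\) and treats a 0/0 term as zero error, both of which are assumptions. The demand series is hypothetical.

```python
import numpy as np

def smape(y, y_hat):
    """Symmetric MAPE in percent (one common definition; 0/0 counts as 0)."""
    num = 2.0 * np.abs(y - y_hat)
    denom = np.abs(y) + np.abs(y_hat)
    ratio = np.divide(num, denom, out=np.zeros_like(num), where=denom != 0)
    return np.mean(ratio) * 100

def wmape(y, y_hat):
    """Weighted MAPE in percent: total absolute error over total |actual|."""
    return np.sum(np.abs(y - y_hat)) / np.sum(np.abs(y)) * 100

# Hypothetical daily demand with zero-demand days, where plain MAPE breaks.
demand = np.array([0.0, 5.0, 0.0, 12.0, 8.0])
forecast = np.array([1.0, 4.0, 0.0, 10.0, 9.0])

demand_smape = smape(demand, forecast)
demand_wmape = wmape(demand, forecast)

print(f"sMAPE: {demand_smape:.2f}%")  # 50.43%
print(f"WMAPE: {demand_wmape:.2f}%")  # 20.00%
```

The two numbers differ sharply because sMAPE assigns the maximum per-sample penalty (200%) to the day with zero demand and a nonzero forecast, while WMAPE weights every error by total demand.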

Code Verification

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# Actual and predicted house prices, in $k
y_true = np.array([180, 220, 280, 340, 370, 430])
y_pred = np.array([183.33, 223.33, 273.33, 333.33, 373.33, 433.33])

mae  = mean_absolute_error(y_true, y_pred)
mse  = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # back in the units of y
r2   = r2_score(y_true, y_pred)
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100  # in percent

print(f"MAE:  {mae:.2f}")
print(f"MSE:  {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R²:   {r2:.4f}")
print(f"MAPE: {mape:.2f}%")
```

Output:

```
MAE:  4.44
MSE:  22.22
RMSE: 4.71
R²:   0.9970
MAPE: 1.56%
```

Metrics Reference

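| Metric | Value | Units | Outlier Sensitive? | Scale-Dependent? |
|---|---|---|---|---|
| MAE | 4.44 | $k | No | Yes |
| MSE | 22.22 | $k² | Yes | Yes |
| RMSE | 4.71 | $k | Yes (less than MSE) | Yes |
| \(R^2\) | 0.997 | unitless | No | No |
| Adjusted \(R^2\) | 0.996 | unitless | No | No |
| MAPE | 1.56% | % | No | No |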

When to Use Which Metric

  • MAE: Use when outliers should not be heavily penalized and interpretability in original units matters. Demand forecasting, where occasional spikes are acceptable.
  • RMSE: Standard for most ML benchmarks. Use when large errors are particularly costly — autonomous vehicle trajectory prediction, financial risk modeling.
  • \(R^2\): Use to compare models on the same dataset. Never compare across different datasets or targets, since \(R^2\) depends on the variance of each target.
  • MAPE: Use for business stakeholders who need a percentage, such as "our model is within 2% of actual price." Fails when \(y_i\) can be zero.

\(R^2 = 0.997\) on this 6-sample dataset is misleadingly high: six points on a near-linear relationship will almost always show a high \(R^2\). On a real dataset with 10,000 samples, a considerably lower \(R^2\) might be excellent. Never use \(R^2\) to compare models across different datasets or different targets.

The deeper limitation: all metrics here are average-case measures. A model that's highly accurate for 95% of samples but catastrophically wrong for 5% can still show a low RMSE. For high-stakes applications, examine the worst-case error distribution — not just the mean.
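A simple way to look past the averages is to report tail statistics of the absolute errors next to MAE and RMSE. The error vector below is hypothetical (95 small misses and 5 large ones), chosen only to show how the summaries diverge.

```python
import numpy as np

# Hypothetical absolute errors in $k: 95 small misses and 5 catastrophic ones.
abs_errors = np.concatenate([np.full(95, 1.0), np.full(5, 50.0)])

# Average-case summaries.
mae = abs_errors.mean()
rmse = np.sqrt(np.mean(abs_errors ** 2))

# Tail summaries expose what the averages smooth over.
median_err = np.median(abs_errors)
p99_err = np.percentile(abs_errors, 99)
max_err = abs_errors.max()

print(f"MAE:    {mae:.2f}")         # 3.45
print(f"RMSE:   {rmse:.2f}")        # 11.22
print(f"median: {median_err:.2f}")  # 1.00
print(f"p99:    {p99_err:.2f}")     # 50.00
print(f"max:    {max_err:.2f}")     # 50.00
```

Neither average reveals that one sample in twenty is off by 50; the 99th percentile and the maximum do.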

Test Your Understanding

  1. If you added a constant 10 to every prediction (\(\hat{y}' = \hat{y} + 10\)), how would MAE, MSE, and \(R^2\) each change?

  2. Two models: Model A has MAE = 5.0, RMSE = 5.1. Model B has MAE = 3.0, RMSE = 9.0. Which model has larger individual errors, and when would you prefer Model B despite its higher RMSE?

  3. Adjusted \(R^2\) decreased when we added bedrooms (0.996 → 0.995). How many additional features could you add before adjusted \(R^2\) would reach zero?

  4. MAPE can be undefined. Construct a dataset where MAPE gives a misleading evaluation even when \(y_i \neq 0\) for all \(i\).

  5. Suppose the test set has one house listed at $2000k (a data entry error). How does this single outlier affect RMSE vs MAE? Which metric should you trust for model evaluation, and why?
