
Performance Metrics for Regression

Jun 25, 2026 • 6 min read • By Mohammed Vasim

Machine Learning • AI • Data Science

A regression model without a metric is a black box. Metrics translate the abstract concept of "good fit" into a number you can compare, track, and argue about with stakeholders. The catch is that each metric penalizes errors differently — choosing the wrong one for your problem can make a genuinely good model look bad or hide a genuinely bad model's failure modes.

Baseline: The Naive Model

Before evaluating any metric, establish what a trivially dumb model would score. The naive model always predicts $\bar{y}$, the mean of the targets, regardless of input.

For our anchor: $\bar{y} = 303.33$.

Naive residuals: $[180 - 303.33, 220 - 303.33, 280 - 303.33, 340 - 303.33, 370 - 303.33, 430 - 303.33]$ $= [- 123.33, - 83.33, - 23.33, 36.67, 66.67, 126.67]$

SST (Total Sum of Squares): $SST = \sum (y_{i} - \bar{y})^{2} = 123.33^{2} + 83.33^{2} + 23.33^{2} + 36.67^{2} + 66.67^{2} + 126.67^{2}$ $= 15210.3 + 6943.9 + 544.3 + 1344.7 + 4444.9 + 16045.3 = 44533.4$

On every metric below, a useful model should score clearly better than this baseline. If your model has $R^{2} < 0$, it doesn't even beat predicting the mean.
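The baseline numbers above can be reproduced in a few lines. This is a minimal sketch using only the six anchor prices from the text; the variable names are illustrative. At full precision SST comes out as 44533.33, slightly below the hand-rounded 44533.4.

```python
import numpy as np

# Anchor targets from the text (house prices in $k).
y_true = np.array([180, 220, 280, 340, 370, 430], dtype=float)

# The naive model predicts the mean for every sample.
y_mean = y_true.mean()
naive_pred = np.full_like(y_true, y_mean)

# Residuals of the naive model and the total sum of squares (SST).
naive_residuals = y_true - naive_pred
sst = np.sum(naive_residuals ** 2)

# Baseline MAE and RMSE: the scores a real model has to beat.
naive_mae = np.mean(np.abs(naive_residuals))
naive_rmse = np.sqrt(np.mean(naive_residuals ** 2))

print(f"mean:       {y_mean:.2f}")
print(f"SST:        {sst:.2f}")
print(f"naive MAE:  {naive_mae:.2f}")
print(f"naive RMSE: {naive_rmse:.2f}")
```

Computing the naive MAE and RMSE alongside SST gives a reference point for each metric, not only for $R^{2}$.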

Anchor (OLS fit): $y_{true} = [180, 220, 280, 340, 370, 430]$, $\hat{y} = [183.33, 223.33, 273.33, 333.33, 373.33, 433.33]$

Residuals ($y_{i} - \hat{y}_{i}$): $[-3.33, -3.33, 6.67, 6.67, -3.33, -3.33]$, $\bar{y} = 303.33$

MAE — Mean Absolute Error

$MAE = \frac{1}{n} \sum \lvert y_{i} - \hat{y}_{i} \rvert$

Absolute residuals: $[3.33, 3.33, 6.67, 6.67, 3.33, 3.33]$ . Sum = 26.67.

$MAE = \frac{26.67}{6} = 4.44$

Interpretation: on average, predictions are off by \$4.44k. This is in the original units of $y$, so it is immediately interpretable.

Outlier sensitivity: low. A single sample with residual 200 contributes 200 to the sum. A sample with residual 0.1 contributes 0.1. No squaring means no disproportionate penalty for large errors.

MSE — Mean Squared Error

$MSE = \frac{1}{n} \sum (y_{i} - \hat{y}_{i})^{2}$

Squared residuals: $[11.09, 11.09, 44.49, 44.49, 11.09, 11.09]$ . Sum = 133.33.

$MSE = \frac{133.33}{6} = 22.22$

Units: squared dollars — not directly interpretable in the original scale. You can't tell a stakeholder "our average error is 22.22."

Outlier sensitivity: high. That same residual of 200 contributes 40,000 to the squared-error sum, 200 times what it contributes to the absolute-error sum in MAE. One large error therefore outweighs many small ones: a single residual of 200 counts as much as 400 residuals of 10.

Why use it? Differentiable everywhere, connects directly to the OLS derivation, and the gradient is clean. It's the loss function used to train the model.
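To make the difference in outlier sensitivity concrete, the sketch below swaps one anchor residual for a hypothetical residual of 200 and measures how much of each metric's total that single sample accounts for. The outlier value and the helper names `mae` and `mse` are assumptions for illustration, not part of the original dataset.

```python
import numpy as np

# Absolute residuals from the anchor fit (in $k).
abs_residuals = np.array([3.33, 3.33, 6.67, 6.67, 3.33, 3.33])

# Hypothetical variant: the last residual is replaced by an outlier of 200.
with_outlier = abs_residuals.copy()
with_outlier[-1] = 200.0

def mae(r):
    """Mean absolute error computed directly from residuals."""
    return np.mean(np.abs(r))

def mse(r):
    """Mean squared error computed directly from residuals."""
    return np.mean(np.square(r))

for name, r in [("anchor", abs_residuals), ("with outlier", with_outlier)]:
    print(f"{name:>12}: MAE = {mae(r):8.2f}   MSE = {mse(r):10.2f}")

# Fraction of each error sum that comes from the single outlier.
mae_share = 200.0 / np.sum(with_outlier)
mse_share = 200.0 ** 2 / np.sum(with_outlier ** 2)
print(f"outlier share of absolute-error sum: {mae_share:.1%}")
print(f"outlier share of squared-error sum:  {mse_share:.1%}")
```

The outlier dominates both sums, but squaring pushes its share of the MSE total much closer to 100% than its share of the MAE total.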

RMSE — Root Mean Squared Error

$RMSE = \sqrt{MSE} = \sqrt{22.22} = 4.71$

RMSE is in the same units as $y$ , making it interpretable. It's always $\geq$ MAE — the gap between them indicates whether large errors dominate.

$\frac{RMSE}{MAE} = \frac{4.71}{4.44} = 1.06$

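A ratio close to 1.0 means the errors are of similar magnitude across samples. The ratio can never exceed $\sqrt{n}$, so for a model with one catastrophically large error and five tiny ones it approaches $\sqrt{6} \approx 2.45$; on larger datasets with a few extreme errors it can reach 5–10×.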
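The ratio is easy to check numerically. The sketch below compares the anchor residuals with a hypothetical set of five tiny errors and one large one, and prints the $\sqrt{n}$ ceiling that the ratio cannot exceed for $n$ samples. The function name `rmse_mae_ratio` and the second residual list are illustrative assumptions.

```python
import numpy as np

def rmse_mae_ratio(residuals):
    """Return RMSE / MAE for a set of residuals."""
    r = np.asarray(residuals, dtype=float)
    return np.sqrt(np.mean(r ** 2)) / np.mean(np.abs(r))

# Residuals of the anchor fit from the text.
anchor = [-3.33, -3.33, 6.67, 6.67, -3.33, -3.33]

# Hypothetical residuals: five tiny errors and one large one.
one_big = [0.1, 0.1, 0.1, 0.1, 0.1, 200.0]

ratio_anchor = rmse_mae_ratio(anchor)
ratio_one_big = rmse_mae_ratio(one_big)

# With n samples the ratio is bounded above by sqrt(n).
upper_bound = np.sqrt(len(one_big))

print(f"anchor ratio:      {ratio_anchor:.2f}")
print(f"one-big-error:     {ratio_one_big:.2f}")
print(f"ceiling sqrt(n):   {upper_bound:.2f}")
```

Because the ceiling grows with sample size, the same ratio means different things on 6 samples and on 10,000, so it is best read relative to $\sqrt{n}$.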

R² — Coefficient of Determination

$R^{2} = 1 - \frac{SSR}{SST} = 1 - \frac{\sum (y_{i} - \hat{y}_{i})^{2}}{\sum (y_{i} - \bar{y})^{2}}$

SSR (residual sum of squares) = 133.33 (our model's SSE)
SST = 44533.4 (baseline SSE)

$R^{2} = 1 - \frac{133.33}{44533.4} = 1 - 0.003 = 0.997$

Our model explains 99.7% of the variance in house prices. For 6 points on a near-linear relationship, this is expected. $R^{2} = 1$ means perfect fit; $R^{2} = 0$ means no better than the naive mean model; $R^{2} < 0$ means your model is worse than predicting the mean.

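$R^{2}$ is unitless and unchanged when $y$ is rescaled (dollars versus thousands of dollars, for example), whereas MAE, MSE, and RMSE change with the units of $y$ and cannot be compared across targets on different scales. $R^{2}$ still depends on how much variance the target has in a particular dataset, so compare it across datasets with caution.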
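The three reference points for $R^{2}$ (a good fit, the naive mean model, and a model worse than the mean) can be reproduced from the SSR/SST definition. This is a sketch with a hand-written helper, `r2_manual`, and a made-up "bad" model that always predicts 500; both are assumptions for illustration.

```python
import numpy as np

y_true = np.array([180, 220, 280, 340, 370, 430], dtype=float)
y_pred = np.array([183.33, 223.33, 273.33, 333.33, 373.33, 433.33])

def r2_manual(y, y_hat):
    """R^2 = 1 - SSR / SST, following the formula in the text."""
    ssr = np.sum((y - y_hat) ** 2)      # residual sum of squares
    sst = np.sum((y - y.mean()) ** 2)   # total sum of squares
    return 1.0 - ssr / sst

# The OLS fit from the text.
r2_fit = r2_manual(y_true, y_pred)

# The naive model: predicting the mean gives R^2 = 0.
r2_naive = r2_manual(y_true, np.full_like(y_true, y_true.mean()))

# Hypothetical bad model: a constant far from the mean gives R^2 < 0.
r2_bad = r2_manual(y_true, np.full_like(y_true, 500.0))

# Rescaling targets and predictions from $k to $ leaves R^2 unchanged.
r2_dollars = r2_manual(y_true * 1000, y_pred * 1000)

print(f"fit:        {r2_fit:.4f}")
print(f"naive mean: {r2_naive:.4f}")
print(f"bad model:  {r2_bad:.4f}")
print(f"in dollars: {r2_dollars:.4f}")
```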

Adjusted R²

Adding any feature — even random noise — can only maintain or increase $R^{2}$ . Adjusted $R^{2}$ corrects for this:

$R_{adj}^{2} = 1 - (1 - R^{2}) \cdot \frac{n - 1}{n - p - 1}$

Simple regression ( $p = 1$ ): $R_{adj}^{2} = 1 - (1 - 0.997) \times \frac{5}{4} = 1 - 0.003 \times 1.25 = 0.996$

Multiple regression ( $p = 2$ ): $R_{adj}^{2} = 1 - (1 - 0.997) \times \frac{5}{3} = 1 - 0.005 = 0.995$

Adding bedrooms as a second feature left $R^{2}$ at 0.997 after rounding but lowered adjusted $R^{2}$ from 0.996 to 0.995: bedrooms didn't improve the model enough to justify the extra parameter. This is adjusted $R^{2}$ doing its job.
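The adjustment is a one-line formula, so it is simple to wrap in a helper. The function below is a sketch; its name and the guard against $n \leq p + 1$ (where the denominator is zero or negative) are choices made here, not something from the original post.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n samples and p features."""
    if n - p - 1 <= 0:
        raise ValueError("adjusted R^2 requires n > p + 1")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Same inputs as the worked examples: R^2 = 0.997, n = 6.
adj_simple = adjusted_r2(0.997, n=6, p=1)
adj_multiple = adjusted_r2(0.997, n=6, p=2)

print(f"p = 1: {adj_simple:.3f}")    # 0.996
print(f"p = 2: {adj_multiple:.3f}")  # 0.995
```

With only six samples the penalty factor $\frac{n - 1}{n - p - 1}$ grows quickly as $p$ increases, which is why a single extra feature is already visible in the third decimal.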

MAPE — Mean Absolute Percentage Error

$MAPE = \frac{1}{n} \sum \frac{\lvert y_{i} - \hat{y}_{i} \rvert}{y_{i}} \times 100\%$

| $y_{i}$ | $\hat{y}_{i}$ | $\lvert \varepsilon_{i} \rvert$ | $\lvert \varepsilon_{i} \rvert / y_{i} \times 100$ |
|---|---|---|---|
| 180 | 183.33 | 3.33 | 1.85% |
| 220 | 223.33 | 3.33 | 1.51% |
| 280 | 273.33 | 6.67 | 2.38% |
| 340 | 333.33 | 6.67 | 1.96% |
| 370 | 373.33 | 3.33 | 0.90% |
| 430 | 433.33 | 3.33 | 0.77% |
| MAPE | | | 1.56% |

Predictions are within 1.56% of actual price on average. Business stakeholders understand percentages better than abstract MSE values.

Warning: MAPE is undefined when $y_{i} = 0$ . For demand forecasting where zero demand is common, use sMAPE or WMAPE instead.
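For reference, here is a sketch of the two alternatives. sMAPE has several definitions in the literature; this version divides by the average of $\lvert y_{i} \rvert$ and $\lvert \hat{y}_{i} \rvert$ and treats a 0/0 term as zero error, which is an assumption made here. WMAPE divides total absolute error by total absolute actuals. The demand and forecast arrays are made up for illustration.

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric MAPE in percent (one common definition among several)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    abs_err = np.abs(y_true - y_pred)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    # Where actual and forecast are both zero, count the term as zero error.
    terms = np.divide(abs_err, denom, out=np.zeros_like(abs_err), where=denom != 0)
    return terms.mean() * 100.0

def wmape(y_true, y_pred):
    """Weighted MAPE in percent: total absolute error / total absolute actuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true)) * 100.0

# Hypothetical demand series containing zeros, where plain MAPE is undefined.
demand = [0, 5, 0, 12, 8]
forecast = [1, 4, 0, 10, 9]

smape_value = smape(demand, forecast)
wmape_value = wmape(demand, forecast)
print(f"sMAPE: {smape_value:.2f}%")
print(f"WMAPE: {wmape_value:.2f}%")
```

WMAPE is still undefined if every actual value is zero, and sMAPE assigns the maximum penalty of 200% to any nonzero forecast of a zero actual, so neither is a free lunch.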

Code Verification

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# Anchor data: actual prices and OLS predictions (in $k)
y_true = np.array([180, 220, 280, 340, 370, 430])
y_pred = np.array([183.33, 223.33, 273.33, 333.33, 373.33, 433.33])

mae  = mean_absolute_error(y_true, y_pred)
mse  = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # RMSE is the square root of MSE
r2   = r2_score(y_true, y_pred)
# MAPE computed by hand; undefined if any y_true is zero
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100

print(f"MAE:  {mae:.2f}")
print(f"MSE:  {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R²:   {r2:.4f}")
print(f"MAPE: {mape:.2f}%")
```

```
MAE:  4.44
MSE:  22.22
RMSE: 4.71
R²:   0.9970
MAPE: 1.56%
```

Metrics Reference

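| Metric | Value | Units | Outlier Sensitive? | Scale-Dependent? |
|---|---|---|---|---|
| MAE | 4.44 | \$k | No | Yes |
| MSE | 22.22 | \$k² | Yes | Yes |
| RMSE | 4.71 | \$k | Yes (less than MSE) | Yes |
| $R^{2}$ | 0.997 | none | Yes (built on squared errors) | No |
| Adjusted $R^{2}$ | 0.996 | none | Yes (built on squared errors) | No |
| MAPE | 1.56% | % | No | No |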

When to Use Which Metric

MAE: Use when outliers should not be heavily penalized and interpretability in original units matters. Example: demand forecasting, where occasional spikes are acceptable.
RMSE: Standard for most ML benchmarks. Use when large errors are particularly costly, as in autonomous vehicle trajectory prediction or financial risk modeling.
$R^{2}$: Use to compare models on the same dataset. Avoid comparing it across different datasets or targets, because its value depends on the variance of $y$ in each one.
MAPE: Use for business stakeholders who need a percentage, such as "our model is within 2% of actual price." Fails when $y$ can be zero.

Related Concepts and Honest Limitations

$R^{2} = 0.997$ on this 6-sample dataset is misleadingly high: six points on a near-linear relationship will always show a high $R^{2}$. On a real dataset with 10,000 samples, $R^{2} = 0.60$ might be excellent. Never use $R^{2}$ to compare models across different datasets or different $y$ targets.

The deeper limitation: all metrics here are average-case measures. A model that's highly accurate for 95% of samples but catastrophically wrong for 5% can still show a low RMSE. For high-stakes applications, examine the worst-case error distribution — not just the mean.
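One way to look past the averages is to report tail statistics next to MAE and RMSE. The sketch below uses synthetic residuals (950 small errors and 50 large ones, generated with a fixed seed); the data and the choice of the 95th and 99th percentiles are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic residuals: 95% small errors, 5% catastrophic ones.
small = rng.normal(0.0, 2.0, size=950)
large = rng.normal(0.0, 60.0, size=50)
residuals = np.concatenate([small, large])
abs_err = np.abs(residuals)

# Average-case metrics next to tail statistics of the absolute error.
summary = {
    "MAE": abs_err.mean(),
    "RMSE": np.sqrt(np.mean(residuals ** 2)),
    "median": np.median(abs_err),
    "p95": np.percentile(abs_err, 95),
    "p99": np.percentile(abs_err, 99),
    "max": abs_err.max(),
}

for name, value in summary.items():
    print(f"{name:>6}: {value:8.2f}")
```

If the upper percentiles and the maximum sit far above MAE and RMSE, the averages are hiding a tail that deserves separate inspection.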

Test Your Understanding

If you added a constant 10 to every prediction ($\hat{y}_{i} \leftarrow \hat{y}_{i} + 10$), how would MAE, MSE, and $R^{2}$ each change?
Two models: Model A has MAE = 5.0, RMSE = 5.1. Model B has MAE = 3.0, RMSE = 9.0. Which model has larger individual errors, and when would you prefer Model B despite its higher RMSE?
Adjusted $R^{2}$ decreased when we added bedrooms (0.996 → 0.995). How many additional features could you add before adjusted $R^{2}$ would reach zero?
MAPE can be undefined. Construct a dataset where MAPE gives a misleading evaluation even when $y_{i} \neq 0$ for all $i$.
Suppose the test set has one house listed at $2000k (a data entry error). How does this single outlier affect RMSE vs MAE? Which metric should you trust for model evaluation, and why?
