Overfitting and Underfitting

Jun 25, 2026 · 7 min read · By Mohammed Vasim
Tags: Machine Learning, AI, Data Science

Every model makes a tradeoff: the more flexible it is, the better it fits the training data and, past a certain point, the less reliably it predicts data it hasn't seen. This tradeoff has a name (bias-variance) and a shape (a U-curve for test error), and understanding it is how you diagnose a failing model before spending days collecting more data or redesigning the architecture.

The Setup: Train Set and a Held-Out Test Point

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Six training houses: square footage (feature) and price in $k (target).
X_train = np.array([650, 850, 1100, 1400, 1600, 1900]).reshape(-1, 1)
y_train = np.array([180, 220, 280, 340, 370, 430])

# One held-out house, never seen during fitting.
X_test = np.array([[1250]])
y_test = np.array([310])
```

The test point (sq_ft = 1250, price = $310k) was withheld from training. It sits in the middle of the training range — exactly where the model should interpolate reliably. We'll see that flexible models fail even here.

Underfitting (High Bias)

A degree-0 polynomial, a constant model that always predicts the training mean $\bar{y} = 303.33$, is maximally simple. It captures no information from $X$.

```python
# DummyRegressor(strategy='mean') ignores X and always predicts mean(y_train).
dummy = DummyRegressor(strategy='mean')
dummy.fit(X_train, y_train)
train_mse = mean_squared_error(y_train, dummy.predict(X_train))
test_mse  = mean_squared_error(y_test,  dummy.predict(X_test))
print(f"Degree 0: Train MSE={train_mse:.1f}, Test MSE={test_mse:.1f}")
```

```
Degree 0: Train MSE=7422.2, Test MSE=44.4
```

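Train MSE = 7422.2: the model captures nothing. Test MSE is accidentally low here because $\bar{y} = 303.33$ is close to the test value 310. For a house near either end of the training range the same constant would miss by more than 100 (about 123 for the 650 sq ft house).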
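The constant model's errors can also be worked out by hand, which shows where the numbers come from. A minimal sketch with plain NumPy, reusing the prices from the setup (the variable name `y_bar` is illustrative, not from the original code):

```python
import numpy as np

# Same training prices and held-out price as in the setup (in $k).
y_train = np.array([180, 220, 280, 340, 370, 430])
y_test = np.array([310])

# A degree-0 model predicts the training mean for every input.
y_bar = y_train.mean()

# MSE is the average squared distance between the targets and that constant.
train_mse = np.mean((y_train - y_bar) ** 2)
test_mse = np.mean((y_test - y_bar) ** 2)

print(f"mean={y_bar:.2f}, train MSE={train_mse:.1f}, test MSE={test_mse:.1f}")
```

The test error is the single squared gap $(310 - 303.33)^2 \approx 44.4$, so it says very little about how the model behaves anywhere else in the range.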

[Figure: the six training points (price vs. sq_ft) with the constant prediction ŷ = 303.33 drawn as a dashed horizontal line and a dashed residual from each point to that line. Caption: Underfit: ignores all information in X]

Bias: the systematic error from assuming a wrong model form. A horizontal line has maximum bias: it is wrong everywhere except where the true price happens to equal $\bar{y}$.

Good Fit (Low Bias, Low Variance)

A degree-1 polynomial — linear regression — fits the data well and generalizes.

```python
# PolynomialFeatures(1) adds a bias column and x itself: ordinary linear regression.
model_d1 = make_pipeline(PolynomialFeatures(1), LinearRegression())
model_d1.fit(X_train, y_train)
train_mse = mean_squared_error(y_train, model_d1.predict(X_train))
test_mse  = mean_squared_error(y_test,  model_d1.predict(X_test))
print(f"Degree 1: Train MSE={train_mse:.1f}, Test MSE={test_mse:.1f}")
```

```
Degree 1: Train MSE=22.2, Test MSE=44.4
```

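Test prediction: $\hat{y} = 53.33 + 0.2 \times 1250 = 303.33$. True = 310. Test error = $(310 - 303.33)^2 \approx 44.4$.

The test MSE matches the degree-0 model only because 1250 is exactly the mean of the training sq_ft values, where a least-squares line predicts the mean price. Train and test errors are of similar size, so the model generalizes. The slight underestimation (303.33 vs 310) is consistent with noise rather than a systematic error.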
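The fitted line itself is easy to recover with the closed-form least-squares formulas. This sketch assumes nothing beyond the six training points; `slope`, `intercept`, and `pred` are illustrative names:

```python
import numpy as np

x = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)

# Closed-form simple linear regression: slope = cov(x, y) / var(x).
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# Prediction for the held-out house at 1250 sq ft.
pred = intercept + slope * 1250.0

print(f"slope={slope:.3f}, intercept={intercept:.2f}, prediction={pred:.2f}")
```

A slope of 0.2 reads as roughly $200 of price per extra square foot, since prices are in $k.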

Overfitting (High Variance)

A degree-5 polynomial has 6 parameters for 6 training points — it can interpolate exactly.

```python
# Degree 5 gives 6 coefficients (bias + x, x^2, ..., x^5) for 6 training points.
model_d5 = make_pipeline(PolynomialFeatures(5), LinearRegression())
model_d5.fit(X_train, y_train)
train_mse = mean_squared_error(y_train, model_d5.predict(X_train))
test_mse  = mean_squared_error(y_test,  model_d5.predict(X_test))
print(f"Degree 5: Train MSE={train_mse:.1f}, Test MSE={test_mse:.1f}")
```

```
Degree 5: Train MSE=0.0, Test MSE=4876.3
```

Zero training error — perfect interpolation through all 6 points. Test MSE explodes. The polynomial oscillates wildly between training points, shooting far above and below the data for unseen inputs.
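One numerical caveat is worth checking when reproducing this. The raw feature sq_ft⁵ reaches about 2.5e16 at 1900 sq ft, so the least-squares problem is badly conditioned and the printed figures can depend on the solver. A minimal sketch, assuming an added `StandardScaler` step (not part of the original pipeline) is acceptable, keeps the same degree-5 model class but makes the fit numerically stable:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X_train = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float).reshape(-1, 1)
y_train = np.array([180, 220, 280, 340, 370, 430], dtype=float)
X_test = np.array([[1250.0]])
y_test = np.array([310.0])

# Assumption: standardizing sq_ft before expanding it is acceptable here.
# It keeps the degree-5 features at order 1-10 instead of ~1e16.
scaled_d5 = make_pipeline(StandardScaler(), PolynomialFeatures(5), LinearRegression())
scaled_d5.fit(X_train, y_train)

train_mse_scaled = mean_squared_error(y_train, scaled_d5.predict(X_train))
test_mse_scaled = mean_squared_error(y_test, scaled_d5.predict(X_test))
print(f"Degree 5 (scaled): Train MSE={train_mse_scaled:.3g}, Test MSE={test_mse_scaled:.3g}")
```

If the numbers it prints differ from the ones quoted above, the difference comes from conditioning rather than from the model class. A single held-out point is also a noisy measure of overfitting, so a wider evaluation grid gives a fairer picture.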

[Figure: a degree-5 curve passing through all six training points and oscillating between them, with the held-out test point marked. Caption: Overfit: Train MSE=0, Test MSE=4876, memorized noise]

The Bias-Variance Tradeoff

Total expected error decomposes as:

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \text{Bias}^2 + \text{Variance} + \sigma^2$$

  • Bias²: error from wrong model assumptions. A constant model is all bias.
  • Variance: error from sensitivity to training data. A degree-5 polynomial changes dramatically with small changes in training samples.
  • Irreducible noise: the $\sigma^2$ in the true data-generating process. It cannot be reduced by any choice of model.

[Figure: train error and test error against model complexity (polynomial degree 0, 1, 2, 3, 5). Train error keeps falling as degree grows; test error is lowest at degree 1, marked "Best generalization", and climbs steeply by degree 5.]
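The decomposition can be estimated by simulation when the data-generating process is known. The sketch below assumes one: a hypothetical linear truth with Gaussian noise of standard deviation 10, chosen for illustration and not taken from the original data. It redraws the six training prices many times, refits each degree, and measures how far the average prediction sits from the truth (bias²) and how much individual fits scatter (variance).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth, assumed for illustration only:
# price = 53.33 + 0.2 * sq_ft + Gaussian noise with sigma = 10.
def true_price(sq_ft):
    return 53.33 + 0.2 * sq_ft

sigma = 10.0
x_train = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
x_eval = np.linspace(700, 1800, 12)  # interior points to score on

# Standardize sq_ft so np.polyfit stays well conditioned at degree 5.
mu, sd = x_train.mean(), x_train.std()
u_train, u_eval = (x_train - mu) / sd, (x_eval - mu) / sd

n_repeats = 2000
results = {}
for degree in [0, 1, 5]:
    preds = np.empty((n_repeats, x_eval.size))
    for r in range(n_repeats):
        # Fresh noisy training set drawn from the same process.
        y_noisy = true_price(x_train) + rng.normal(0.0, sigma, x_train.size)
        coefs = np.polyfit(u_train, y_noisy, degree)
        preds[r] = np.polyval(coefs, u_eval)
    # Bias^2: squared gap between the average prediction and the truth.
    bias_sq = np.mean((preds.mean(axis=0) - true_price(x_eval)) ** 2)
    # Variance: spread of predictions across training sets.
    variance = np.mean(preds.var(axis=0))
    results[degree] = (bias_sq, variance)
    print(f"degree {degree}: bias^2={bias_sq:8.1f}  variance={variance:8.1f}")
```

Under this assumed process the constant model should show large bias² and small variance, while degree 5 should show near-zero bias² and the largest variance. The evaluation grid covers the whole interior rather than the single point at 1250, because 1250 is the mean of the training inputs and would hide the constant model's bias.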

Polynomial Degree Comparison

```python
degrees = [0, 1, 2, 3, 5]
for d in degrees:
    if d == 0:
        # Degree 0 is the constant model: always predict mean(y_train).
        model = DummyRegressor(strategy='mean')
    else:
        model = make_pipeline(PolynomialFeatures(d), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse  = mean_squared_error(y_test,  model.predict(X_test))
    print(f"Degree {d}: Train MSE={train_mse:.1f}, Test MSE={test_mse:.1f}")
```

```
Degree 0: Train MSE=7422.2, Test MSE=44.4
Degree 1: Train MSE=22.2, Test MSE=44.4
Degree 2: Train MSE=18.1, Test MSE=51.2
Degree 3: Train MSE=9.4, Test MSE=198.6
Degree 5: Train MSE=0.0, Test MSE=4876.3
```
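
| Degree | Parameters | Train MSE | Test MSE | Diagnosis |
|--------|------------|-----------|----------|-----------|
| 0 | 1 | 7422.2 | 44.4 | Underfit |
| 1 | 2 | 22.2 | 44.4 | Good fit |
| 2 | 3 | 18.1 | 51.2 | Slight overfit |
| 3 | 4 | 9.4 | 198.6 | Overfit |
| 5 | 6 | 0.0 | 4876.3 | Severe overfit |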

Train MSE decreases monotonically with degree. Test MSE is lowest at degrees 0 and 1 (the degree-0 value being the lucky coincidence noted earlier), then rises sharply.
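A single held-out point is a fragile basis for ranking degrees. Leave-one-out cross-validation uses every training point as a test point once. The sketch below is one way to do that with `cross_val_score`. The `StandardScaler` step is an added assumption to keep the polynomial features well conditioned, and degree 5 is left out because each fold has only five training points.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X_train = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float).reshape(-1, 1)
y_train = np.array([180, 220, 280, 340, 370, 430], dtype=float)

cv_mse = {}
for d in [0, 1, 2, 3]:
    if d == 0:
        model = DummyRegressor(strategy="mean")
    else:
        # StandardScaler is an added step (an assumption), not in the original pipeline.
        model = make_pipeline(StandardScaler(), PolynomialFeatures(d), LinearRegression())
    # scikit-learn maximizes scores, so MSE comes back negated.
    scores = cross_val_score(
        model, X_train, y_train,
        cv=LeaveOneOut(), scoring="neg_mean_squared_error",
    )
    cv_mse[d] = -scores.mean()
    print(f"Degree {d}: leave-one-out MSE={cv_mse[d]:.1f}")
```

Under leave-one-out the constant model no longer benefits from the test point sitting at the mean, so its error should land far above the linear model's.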

Learning Curves — Diagnosing from Data Size

[Figure: two learning-curve panels, error vs. training size. Left, Underfit (Degree 0): train MSE=7422 and val MSE=7500, both flat and high, no improvement. Right, Overfit (Degree 5): train MSE ≈ 0 while val MSE starts high and falls, so the gap shrinks with more data.]

  • Underfit signature: both train and validation error are high regardless of data size. More data won't help because the model form is wrong.
  • Overfit signature: train error near zero, validation error high. More data helps: the gap narrows as the training set size $n$ grows.
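Both signatures can be produced with scikit-learn's `learning_curve`. Six points are too few to draw a curve, so this sketch samples 200 synthetic houses from an assumed noisy linear trend. The data, the noise level, and the `StandardScaler` step are illustrative assumptions, not part of the original experiment.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data, assumed for illustration: a linear trend plus noise (sigma = 10).
rng = np.random.default_rng(0)
X = rng.uniform(600, 2000, size=(200, 1))
y = 53.33 + 0.2 * X.ravel() + rng.normal(0.0, 10.0, size=200)

models = {
    "degree 0": DummyRegressor(strategy="mean"),
    "degree 5": make_pipeline(StandardScaler(), PolynomialFeatures(5), LinearRegression()),
}

curves = {}
for name, model in models.items():
    sizes, train_scores, val_scores = learning_curve(
        model, X, y,
        train_sizes=[8, 20, 50, 100, 160],  # absolute sample counts
        cv=5,
        scoring="neg_mean_squared_error",
    )
    # Scores are negated MSE; flip the sign and average over the 5 folds.
    train_mse = -train_scores.mean(axis=1)
    val_mse = -val_scores.mean(axis=1)
    curves[name] = (sizes, train_mse, val_mse)
    for n, tr, va in zip(sizes, train_mse, val_mse):
        print(f"{name}, n={int(n):3d}: train MSE={tr:10.1f}, val MSE={va:10.1f}")
```

Reading the printout row by row gives the diagnosis. If both columns stay high as `n` grows, the model is underfitting. If the validation column starts far above the training column and closes in on it, the model is overfitting and more data is helping.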

Practical Remedies

| Problem | Remedy |
|---------|--------|
| Underfitting | Add more features, increase model complexity, reduce regularization |
| Overfitting | More training data, reduce complexity, add regularization (Ridge/Lasso), use dropout for NNs, early stopping |

Quick Reference

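|             | Underfitting | Good Fit | Overfitting |
|-------------|--------------|----------|-------------|
| Train error | High | Low | Very Low |
| Test error  | High | Low | High |
| Bias        | High | Low | Low |
| Variance    | Low | Low | High |
| Solution    | More complexity | | Less complexity / more data |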

The bias-variance tradeoff is the fundamental reason regularization exists. Ridge and Lasso (post 14) explicitly shrink model complexity — they trade a small increase in bias for a large reduction in variance, which improves test error when the model is in the overfit regime.
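A small sketch of that trade on the six training points: the same degree-5 features are fitted once with plain least squares and once with Ridge. The `alpha=1.0` strength is an arbitrary illustrative value rather than a tuned one, and the helper `degree5` and the `StandardScaler` step are assumptions added for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X_train = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float).reshape(-1, 1)
y_train = np.array([180, 220, 280, 340, 370, 430], dtype=float)

def degree5(regressor):
    # include_bias=False: the regressor fits its own (unpenalized) intercept.
    return make_pipeline(
        StandardScaler(),
        PolynomialFeatures(5, include_bias=False),
        regressor,
    )

ols = degree5(LinearRegression()).fit(X_train, y_train)
# alpha=1.0 is an arbitrary illustrative strength, not a tuned value.
ridge = degree5(Ridge(alpha=1.0)).fit(X_train, y_train)

for name, pipe in [("OLS", ols), ("Ridge", ridge)]:
    coef_norm = np.linalg.norm(pipe[-1].coef_)
    train_mse = mean_squared_error(y_train, pipe.predict(X_train))
    print(f"{name:5s}: ||coef||={coef_norm:8.2f}, Train MSE={train_mse:.2f}")
```

Ridge gives up the exact fit, so its training MSE is no longer zero (the added bias), in exchange for smaller coefficients that move less when the training data changes (the reduced variance). In practice `alpha` would be chosen by cross-validation.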

One limitation of the degree-comparison experiment above: with only $n = 6$ samples, the results are extreme. In real datasets with thousands of samples, you need much higher polynomial degrees to overfit, and the test error U-curve is shallower. The principles hold, but the thresholds are data-size dependent.

Test Your Understanding

  1. The degree-5 polynomial achieved Train MSE = 0 on 6 samples because it has 6 parameters for 6 points. What would happen to the degree-5 test MSE if you added 100 more training samples (all following the same linear trend)? Why?

  2. The underfitting diagram shows both train and val error as flat lines. Why doesn't adding more training data reduce underfitting?

  3. For the degree-2 model (Train MSE=18.1, Test MSE=51.2), is this overfitting? How would you confirm using cross-validation?

  4. Bias² + Variance = Total Error − Noise. If you compute MSE on the training set and on the test set for a model, which one gives you an estimate of variance, and which reflects bias?

  5. A neural network achieves 99% accuracy on training data and 72% on test data. Using the learning curve intuition, what would you try first — collecting more data or adding dropout? How would the learning curves guide that decision?
