Overfitting and Underfitting


Every model makes a tradeoff: the more flexible it is, the better it fits the training data and, past a certain point, the less reliably it predicts data it hasn't seen. This tradeoff has a name (bias-variance) and a shape (a U-curve in test error), and understanding it is how you diagnose a failing model before spending days collecting more data or redesigning the architecture.

The Setup: Train Set and a Held-Out Test Point

python

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

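# Training data: house size in square feet, price in thousands of dollars
X_train = np.array([650, 850, 1100, 1400, 1600, 1900]).reshape(-1, 1)
y_train = np.array([180, 220, 280, 340, 370, 430])

# Single held-out test point: 1250 sq ft, $310k
X_test = np.array([[1250]])
y_test = np.array([310])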

The test point (sq_ft = 1250, price = $310k) was withheld from training. It sits in the middle of the training range, at exactly the mean of the six training inputs (7500 / 6 = 1250), which is where a model should interpolate most reliably. The sections below test how models of increasing flexibility predict it.

Underfitting (High Bias)

A degree-0 polynomial, a constant model that always predicts $\bar{y}$, is maximally simple. It captures no information from $x$.

python

dummy = DummyRegressor(strategy='mean')
dummy.fit(X_train, y_train)
train_mse = mean_squared_error(y_train, dummy.predict(X_train))
test_mse  = mean_squared_error(y_test,  dummy.predict(X_test))
print(f"Degree 0: Train MSE={train_mse:.1f}, Test MSE={test_mse:.1f}")

Degree 0: Train MSE=7422.2, Test MSE=44.4

Train $R^{2} = 0.000$: the model captures nothing. Test MSE is accidentally low here because $\bar{y} = 303.33$ happens to be close to the test value 310. Near either end of the training range the constant prediction misses by more than 120 (thousand dollars), which is what drives the train MSE of 7422.2.
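The statement that the constant model explains nothing can be checked by hand. The following is a minimal sketch that recomputes the mean prediction with NumPy instead of `DummyRegressor`; the variable names are illustrative and not part of the original code.

```python
import numpy as np

# Training prices in thousands of dollars (same values as in the setup)
y_train = np.array([180, 220, 280, 340, 370, 430], dtype=float)

# A degree-0 model predicts the training mean for every input
y_bar = y_train.mean()
residuals = y_train - y_bar

# R^2 = 1 - SS_res / SS_tot; for the mean predictor SS_res equals SS_tot
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_train - y_train.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

train_mse = ss_res / len(y_train)
worst_miss = np.max(np.abs(residuals))

print(f"mean prediction = {y_bar:.2f}")
print(f"R^2 = {r2:.3f}, train MSE = {train_mse:.1f}")
print(f"largest absolute error = {worst_miss:.2f}")
```

Because the residual sum of squares of the mean predictor is the total sum of squares by definition, $R^{2}$ is zero on any training set, not just this one.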

Figure: the six training points with the constant prediction ŷ = 303.33 drawn as a dashed horizontal line, and vertical residuals from each point to that line. Underfit: ignores all information in X.

Bias: the systematic error from assuming a wrong model form. A horizontal line has maximum bias: it's wrong everywhere except where $y = \bar{y}$.

Good Fit (Low Bias, Low Variance)

A degree-1 polynomial — linear regression — fits the data well and generalizes.

python

model_d1 = make_pipeline(PolynomialFeatures(1), LinearRegression())
model_d1.fit(X_train, y_train)
train_mse = mean_squared_error(y_train, model_d1.predict(X_train))
test_mse  = mean_squared_error(y_test,  model_d1.predict(X_test))
print(f"Degree 1: Train MSE={train_mse:.1f}, Test MSE={test_mse:.1f}")

Degree 1: Train MSE=22.2, Test MSE=44.4

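Test prediction: $\hat{y}(1250) = 53.33 + 0.20 \times 1250 = 303.33$. True value = 310. Test error = $(310 - 303.33)^{2} \approx 44.4$. The prediction equals $\bar{y}$ because a least-squares line always passes through $(\bar{x}, \bar{y})$ and the test point sits at $\bar{x} = 1250$. That is why the constant and linear models tie on this one point even though the line fits the rest of the range far better.

Train and test errors are of the same order (22.2 vs 44.4), so the model generalizes. The underestimate ($303.33$ vs $310$) is the same size as the largest training residuals (6.67), so it looks like ordinary noise rather than a systematic failure.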
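The linear fit is small enough to reproduce with the closed-form formulas for simple regression. The sketch below assumes nothing beyond the data in the setup and uses illustrative variable names; it shows the slope, the intercept, and the fact that the prediction at the mean input equals the mean price.

```python
import numpy as np

x = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)

# Closed-form simple linear regression
x_bar, y_bar = x.mean(), y.mean()
slope = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
intercept = y_bar - slope * x_bar

# Prediction at the held-out point, which coincides with x_bar
x_test, y_test = 1250.0, 310.0
pred = intercept + slope * x_test
sq_err = (y_test - pred) ** 2

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"x_bar = {x_bar:.0f}, prediction at 1250 = {pred:.2f}, y_bar = {y_bar:.2f}")
print(f"squared test error = {sq_err:.1f}")
```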

Overfitting (High Variance)

A degree-5 polynomial has 6 parameters for 6 training points — it can interpolate exactly.

python

model_d5 = make_pipeline(PolynomialFeatures(5), LinearRegression())
model_d5.fit(X_train, y_train)
train_mse = mean_squared_error(y_train, model_d5.predict(X_train))
test_mse  = mean_squared_error(y_test,  model_d5.predict(X_test))
print(f"Degree 5: Train MSE={train_mse:.1f}, Test MSE={test_mse:.1f}")

Degree 5: Train MSE=0.0, Test MSE=4876.3

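Zero training error: with six coefficients, the polynomial passes exactly through all six points. The reported test MSE is about a hundred times that of the linear model. Nothing constrains a high-degree polynomial between the training points, so it is free to swing above and below the data there, and its predictions for unseen inputs can land far from the truth.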
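A practical caveat for a degree-5 fit on raw inputs is numerical rather than statistical. Square footage raised to the fifth power reaches roughly $2.5 \times 10^{16}$, so the columns of the design matrix differ by many orders of magnitude and the least-squares problem is badly conditioned. The sketch below uses plain NumPy rather than the pipeline from this post (the variable names are illustrative) to compare the condition number of the degree-5 design matrix on raw and on standardized inputs.

```python
import numpy as np

x = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)

# Degree-5 design matrix on raw square footage: columns x^5 ... x^0
V_raw = np.vander(x, 6)

# Same matrix after standardizing x to zero mean and unit variance
z = (x - x.mean()) / x.std()
V_scaled = np.vander(z, 6)

cond_raw = np.linalg.cond(V_raw)
cond_scaled = np.linalg.cond(V_scaled)

print(f"largest raw feature value: {V_raw.max():.3e}")
print(f"condition number, raw inputs:          {cond_raw:.3e}")
print(f"condition number, standardized inputs: {cond_scaled:.3e}")
```

When the raw condition number approaches the reciprocal of double-precision machine epsilon (about $4.5 \times 10^{15}$), the coefficients and errors printed by an unscaled fit can depend on solver details. Placing a `StandardScaler` in front of `PolynomialFeatures` is one common way to avoid that.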

Figure: a degree-5 polynomial threaded through all six training points and swinging above and below the data between them, with the held-out test point highlighted. Overfit: zero training error, large test error.

The Bias-Variance Tradeoff

Total expected error decomposes as:

$\text{Total Error} = \text{Bias}^{2} + \text{Variance} + \text{Irreducible Noise}$

Bias²: error from wrong model assumptions. A constant model is all bias.
Variance: error from sensitivity to training data. A degree-5 polynomial changes dramatically with small changes in training samples.
Irreducible noise: the $\varepsilon$ in the true data-generating process. No choice of model can reduce it.
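The three terms above can be estimated by simulation when the true function is known. The sketch below is illustrative only and rests on assumptions that are not in the original post: the true relationship is taken to be the fitted line $53.33 + 0.20x$, the noise is Gaussian with standard deviation 5, and the query point `x0 = 1800` is a hypothetical input chosen away from the mean of the training inputs. It redraws the training prices many times, refits polynomials of degree 0, 1, and 5, and measures how the prediction at `x0` is centred (bias) and how much it scatters (variance).

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# ASSUMPTION: the true relationship is f(x) = 53.33 + 0.20 x
# with Gaussian noise of standard deviation 5 (not from the original post).
def true_f(x):
    return 53.33 + 0.20 * x

x_train = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
x0 = 1800.0        # hypothetical query point, away from the mean of x_train
noise_sd = 5.0
n_repeats = 2000

results = {}
for degree in (0, 1, 5):
    preds = np.empty(n_repeats)
    for r in range(n_repeats):
        # Draw a fresh noisy training set at the same x locations
        y = true_f(x_train) + rng.normal(0.0, noise_sd, size=x_train.size)
        # Polynomial.fit rescales x internally, which keeps the fit stable
        model = Polynomial.fit(x_train, y, degree)
        preds[r] = model(x0)
    bias_sq = (preds.mean() - true_f(x0)) ** 2
    variance = preds.var()
    results[degree] = (bias_sq, variance)
    print(f"degree {degree}: bias^2 = {bias_sq:9.2f}, variance = {variance:7.2f}")
```

Under these assumptions the constant model should show a large squared bias and little variance, while the degree-5 model should show almost no bias and the largest variance of the three.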

Figure: train and test error against model complexity (polynomial degree 0, 1, 2, 3, 5). Train error keeps falling as degree grows; test error falls, then rises steeply, with best generalization marked at degree 1.

Polynomial Degree Comparison

python

degrees = [0, 1, 2, 3, 5]
for d in degrees:
    if d == 0:
        # Degree 0 is the constant (mean) predictor
        model = DummyRegressor(strategy='mean')
    else:
        # Expand x into powers 0..d, then fit ordinary least squares
        model = make_pipeline(PolynomialFeatures(d), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse  = mean_squared_error(y_test,  model.predict(X_test))
    print(f"Degree {d}: Train MSE={train_mse:.1f}, Test MSE={test_mse:.1f}")

Degree 0: Train MSE=7422.2, Test MSE=44.4
Degree 1: Train MSE=22.2,   Test MSE=44.4
Degree 2: Train MSE=18.1,   Test MSE=51.2
Degree 3: Train MSE=9.4,    Test MSE=198.6
Degree 5: Train MSE=0.0,    Test MSE=4876.3

| Degree | Parameters | Train MSE | Test MSE | Diagnosis |
|---|---|---|---|---|
| 0 | 1 | 7422.2 | 44.4 | Underfit |
| 1 | 2 | 22.2 | 44.4 | Good fit |
| 2 | 3 | 18.1 | 51.2 | Slight overfit |
| 3 | 4 | 9.4 | 198.6 | Overfit |
| 5 | 6 | 0.0 | 4876.3 | Severe overfit |

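Train MSE can only decrease as degree increases, because each higher-degree polynomial contains the lower-degree ones as special cases. In the table, test MSE is lowest at degree 1 (tied with degree 0 only through the coincidence noted earlier) and rises sharply beyond it.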
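A single held-out point gives a noisy picture of test error, so a cross-validated comparison is a useful cross-check. The sketch below is one way to do that and is not the procedure used for the table above: it pools the six training points with the held-out point, standardizes the input (an added step, to keep high powers numerically stable), and scores each degree with leave-one-out cross-validation. Its numbers are expected to differ from the single-point figures in the table.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# All seven points from this post: six training points plus the held-out one
X = np.array([650, 850, 1100, 1250, 1400, 1600, 1900], dtype=float).reshape(-1, 1)
y = np.array([180, 220, 280, 310, 340, 370, 430], dtype=float)

cv_mse = {}
for d in [0, 1, 2, 3, 5]:
    if d == 0:
        model = DummyRegressor(strategy="mean")
    else:
        # Standardize first so high powers of x stay well scaled
        model = make_pipeline(StandardScaler(), PolynomialFeatures(d), LinearRegression())
    # Each fold trains on six points and scores the one left out
    scores = cross_val_score(model, X, y, cv=LeaveOneOut(),
                             scoring="neg_mean_squared_error")
    cv_mse[d] = -scores.mean()
    print(f"Degree {d}: leave-one-out MSE = {cv_mse[d]:.1f}")
```

Averaging over seven held-out points instead of one makes the ranking of degrees less dependent on where a single test point happens to fall.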

Learning Curves — Diagnosing from Data Size

Figure: learning curves, error against training size. Underfit panel: train MSE (7422) and validation MSE (7500) both stay high with no improvement. Overfit panel: train MSE ≈ 0 while validation MSE starts high and the gap shrinks with more data.

Underfit signature: both train and validation error are high regardless of data size. More data won't help — the model form is wrong.
Overfit signature: train error near zero, validation error high. More data helps — the gap narrows as $n$ grows.
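The two signatures above can be reproduced with scikit-learn's `learning_curve`. Six points are too few for that, so the sketch below generates synthetic data; the linear trend, the noise level of 10, the sample size of 200, and the choice of degree 8 for the flexible model are all assumptions made for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)

# ASSUMPTION: synthetic data following price = 53.33 + 0.20 * sq_ft + noise
n = 200
X = rng.uniform(650, 1900, size=(n, 1))
y = 53.33 + 0.20 * X[:, 0] + rng.normal(0.0, 10.0, size=n)

models = {
    "underfit (degree 0)": DummyRegressor(strategy="mean"),
    "flexible (degree 8)": make_pipeline(
        StandardScaler(), PolynomialFeatures(8), LinearRegression()
    ),
}

curves = {}
for name, model in models.items():
    sizes, train_scores, val_scores = learning_curve(
        model, X, y,
        train_sizes=[12, 40, 160],
        cv=5,
        scoring="neg_mean_squared_error",
    )
    # Scores are negated MSE; flip the sign and average over the folds
    train_mse = -train_scores.mean(axis=1)
    val_mse = -val_scores.mean(axis=1)
    curves[name] = (train_mse, val_mse)
    for s, tr, va in zip(sizes, train_mse, val_mse):
        print(f"{name:20s} n={s:3d}  train MSE={tr:10.1f}  val MSE={va:10.1f}")
```

For the constant model both columns should stay high at every training size. For the flexible model the gap between validation and train MSE should be widest at the smallest size and narrow as the size grows.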

Practical Remedies

| Problem | Remedy |
|---|---|
| Underfitting | Add more features, increase model complexity, reduce regularization |
| Overfitting | More training data, reduce complexity, add regularization (Ridge/Lasso), use dropout for NNs, early stopping |
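As an illustration of the regularization remedy, the sketch below fits the same degree-5 features twice, once with ordinary least squares and once with Ridge. The standardization step and the penalty `alpha=1.0` are arbitrary choices made for this example, not tuned values, and the helper `degree5` is a hypothetical name.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X_train = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float).reshape(-1, 1)
y_train = np.array([180, 220, 280, 340, 370, 430], dtype=float)

def degree5(regressor):
    # Standardize, expand to degree 5 (no bias column; the regressor fits the intercept)
    return make_pipeline(
        StandardScaler(),
        PolynomialFeatures(5, include_bias=False),
        regressor,
    )

ols = degree5(LinearRegression()).fit(X_train, y_train)
# alpha=1.0 is an arbitrary illustrative penalty, not a tuned value
ridge = degree5(Ridge(alpha=1.0)).fit(X_train, y_train)

ols_norm = np.linalg.norm(ols[-1].coef_)
ridge_norm = np.linalg.norm(ridge[-1].coef_)
ols_mse = mean_squared_error(y_train, ols.predict(X_train))
ridge_mse = mean_squared_error(y_train, ridge.predict(X_train))

print(f"OLS:   coefficient norm = {ols_norm:8.2f}, train MSE = {ols_mse:.2f}")
print(f"Ridge: coefficient norm = {ridge_norm:8.2f}, train MSE = {ridge_mse:.2f}")
```

Ridge gives up the exact fit to the training points in exchange for smaller coefficients. In practice `alpha` would be chosen by cross-validation rather than fixed in advance.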

Quick Reference

| | Underfitting | Good Fit | Overfitting |
|---|---|---|---|
| Train error | High | Low | Very Low |
| Test error | High | Low | High |
| Bias | High | Low | Low |
| Variance | Low | Low | High |
| Solution | More complexity | (none needed) | Less complexity / more data |

Related Concepts and Honest Limitations

The bias-variance tradeoff is the fundamental reason regularization exists. Ridge and Lasso (post 14) explicitly shrink model complexity — they trade a small increase in bias for a large reduction in variance, which improves test error when the model is in the overfit regime.

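One limitation of the degree-comparison experiment above: with $n = 6$ samples, the results are extreme, and a single held-out point is a very noisy estimate of test error (the constant model's low test MSE is one example of a point flattering a model by chance). In real datasets with thousands of samples, you need much higher polynomial degrees to overfit, and the test error U-curve is shallower. The principles hold, but the thresholds are data-size dependent.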

Test Your Understanding

1. The degree-5 polynomial achieved Train MSE = 0 on 6 samples because it has 6 parameters for 6 points. What would happen to the degree-5 test MSE if you added 100 more training samples (all following the same linear trend)? Why?
2. The underfitting learning curve shows both train and validation error as flat lines. Why doesn't adding more training data reduce underfitting?
3. For the degree-2 model (Train MSE=18.1, Test MSE=51.2), is this overfitting? How would you confirm using cross-validation?
4. Bias² + Variance = Total Error − Noise. If you compute MSE on the training set and on the test set for a model, which one gives you an estimate of variance, and which reflects bias?
5. A neural network achieves 99% accuracy on training data and 72% on test data. Using the learning curve intuition, what would you try first: collecting more data or adding dropout? How would the learning curves guide that decision?
