Overfitting and Underfitting
Every model makes a tradeoff: the more flexible it is, the better it fits training data and, past a certain point, the less reliably it predicts data it hasn't seen. This tradeoff has a name (bias-variance) and a shape (a U-curve in test error), and understanding it is how you diagnose a failing model before spending days collecting more data or redesigning the architecture.
The Setup: Train Set and a Held-Out Test Point
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error
# Six training houses: square footage (feature) and sale price in $k (target)
X_train = np.array([650, 850, 1100, 1400, 1600, 1900]).reshape(-1, 1)
y_train = np.array([180, 220, 280, 340, 370, 430])
# One held-out test house, never shown to any model during fitting
X_test = np.array([[1250]])
y_test = np.array([310])

The test point (sq_ft = 1250, price = $310k) was withheld from training. It sits in the middle of the training range, exactly where the model should interpolate reliably. We'll see that flexible models fail even here.
Underfitting (High Bias)
A degree-0 polynomial, a constant model that always predicts the training mean $\bar{y}$, is maximally simple. It captures no information from $X$.
# Baseline that ignores X and always predicts the mean of y_train
dummy = DummyRegressor(strategy='mean')
dummy.fit(X_train, y_train)
train_mse = mean_squared_error(y_train, dummy.predict(X_train))
test_mse = mean_squared_error(y_test, dummy.predict(X_test))
print(f"Degree 0: Train MSE={train_mse:.1f}, Test MSE={test_mse:.1f}")

Degree 0: Train MSE=7422.2, Test MSE=44.4

Train MSE = 7422.2: the model captures nothing. Test MSE is accidentally low here because the constant prediction $\bar{y} = 303.33$ happens to be close to the test value 310. For a house near either end of the training range, the same prediction would miss by more than 100 (thousand dollars).
Figure: the constant prediction ŷ = 303.33 drawn as a dashed horizontal line through the six training points, with the vertical residuals marked. Underfit: ignores all information in X.
Bias: the systematic error from assuming a wrong model form. A horizontal line has maximum bias: it's wrong everywhere except where the true price happens to equal $\bar{y}$.
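The constant model is simple enough to recompute by hand, which makes the source of its bias concrete. The following is a minimal sketch using only NumPy; the names `y_bar` and `residuals` are illustrative and not part of the original code.

```python
import numpy as np

y_train = np.array([180, 220, 280, 340, 370, 430])
y_test = np.array([310])

# Under squared error, the best constant prediction is the training mean.
y_bar = y_train.mean()

# Residuals are largest at both ends of the price range,
# because the prediction never moves with square footage.
residuals = y_train - y_bar
train_mse = np.mean(residuals ** 2)
test_mse = np.mean((y_test - y_bar) ** 2)

print(f"y_bar={y_bar:.2f}, train MSE={train_mse:.1f}, test MSE={test_mse:.1f}")
print("residuals:", np.round(residuals, 1))
```

The residuals change sign from the cheapest house to the most expensive one, which is what a systematic (rather than random) error looks like.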
Good Fit (Low Bias, Low Variance)
A degree-1 polynomial — linear regression — fits the data well and generalizes.
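# PolynomialFeatures(1) only adds a bias column, so this is ordinary linear regression
model_d1 = make_pipeline(PolynomialFeatures(1), LinearRegression())
model_d1.fit(X_train, y_train)
train_mse = mean_squared_error(y_train, model_d1.predict(X_train))
test_mse = mean_squared_error(y_test, model_d1.predict(X_test))
print(f"Degree 1: Train MSE={train_mse:.1f}, Test MSE={test_mse:.1f}")

Degree 1: Train MSE=22.2, Test MSE=44.4

Test prediction: $\hat{y} = 303.33$. True = 310. Test error = $(310 - 303.33)^2 \approx 44.4$.

Train and test errors are of the same order, so the model generalizes. The slight underestimation ($303.33$ vs $310$) is just noise, not systematic. The test MSE matches the degree-0 model's only because 1250 is exactly the mean of the training square footages, and a least-squares line always passes through $(\bar{x}, \bar{y})$.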
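The degree-1 fit can also be worked out in closed form, without scikit-learn. This is a small sketch of the textbook simple-regression formulas applied to the six training points; the variable names are illustrative.

```python
import numpy as np

x = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)

# Simple linear regression: slope = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
x_bar, y_bar = x.mean(), y.mean()
slope = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
intercept = y_bar - slope * x_bar

# Prediction and squared error for the held-out house at 1250 sq ft
y_hat = slope * 1250 + intercept
sq_error = (310 - y_hat) ** 2

print(f"slope={slope:.3f}, intercept={intercept:.2f}")
print(f"y_hat={y_hat:.2f}, squared error={sq_error:.1f}")
```

Since prices are in thousands of dollars, the slope reads as the price increase in $k per additional square foot.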
Overfitting (High Variance)
A degree-5 polynomial has 6 parameters for 6 training points — it can interpolate exactly.
# Degree 5 gives 6 coefficients, one per training point, so the fit can pass through every point
model_d5 = make_pipeline(PolynomialFeatures(5), LinearRegression())
model_d5.fit(X_train, y_train)
train_mse = mean_squared_error(y_train, model_d5.predict(X_train))
test_mse = mean_squared_error(y_test, model_d5.predict(X_test))
print(f"Degree 5: Train MSE={train_mse:.1f}, Test MSE={test_mse:.1f}")

Degree 5: Train MSE=0.0, Test MSE=4876.3
Zero training error — perfect interpolation through all 6 points. Test MSE explodes. The polynomial oscillates wildly between training points, shooting far above and below the data for unseen inputs.
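The "6 parameters for 6 points" claim can be checked by looking at the shape of the expanded design matrix. The sketch below assumes a single input feature, as in this post, and only counts columns; it does not fit anything.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_train = np.array([650, 850, 1100, 1400, 1600, 1900]).reshape(-1, 1)

# For one input feature, a degree-d expansion has d + 1 columns:
# the bias column plus x, x^2, ..., x^d. The bias column plays the role
# of the intercept, so d + 1 is also the number of free parameters.
n_params = {}
for d in [1, 2, 3, 5]:
    design = PolynomialFeatures(degree=d).fit_transform(X_train)
    n_params[d] = design.shape[1]
    print(f"degree {d}: design matrix shape {design.shape}")
```

At degree 5 the matrix is square (6 rows, 6 columns), which is the point at which the model has enough freedom to pass through every training point.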
Figure: the degree-5 curve passes through all six training points but swings far above and below them in between; the held-out test point is circled. Overfit: Train MSE=0, Test MSE=4876 (memorized noise).
The Bias-Variance Tradeoff
Total expected error decomposes as:

$$\text{Expected test error} = \text{Bias}^2 + \text{Variance} + \sigma^2$$

- Bias²: error from wrong model assumptions. A constant model is all bias.
- Variance: error from sensitivity to training data. A degree-5 polynomial changes dramatically with small changes in training samples.
- Irreducible noise: the $\sigma^2$ in the true data-generating process. Cannot be reduced.
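The variance term can be made tangible by nudging one training label and watching how far a prediction moves. The sketch below is an illustration under stated assumptions, not the post's pipeline: it uses `np.polyfit` on a rescaled copy of the square footage (rescaling is my choice, to keep the degree-5 fit numerically stable), and the helper `prediction_shift`, the bump of 10, and the choice of which label to bump are all hypothetical.

```python
import numpy as np

x = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)

# Assumption: rescale x to roughly [-1, 1] so high powers stay well-behaved.
x_scaled = (x - x.mean()) / 650.0
x_query = (1250.0 - x.mean()) / 650.0  # the held-out square footage, rescaled

def prediction_shift(degree, bump=10.0, index=2):
    """Absolute change in the prediction at x_query when one label moves by `bump`."""
    y_bumped = y.copy()
    y_bumped[index] += bump
    before = np.polyval(np.polyfit(x_scaled, y, degree), x_query)
    after = np.polyval(np.polyfit(x_scaled, y_bumped, degree), x_query)
    return abs(after - before)

shift_d1 = prediction_shift(1)
shift_d5 = prediction_shift(5)
print(f"degree 1 shift: {shift_d1:.2f}")
print(f"degree 5 shift: {shift_d5:.2f}")
```

The degree-5 prediction should move several times further than the degree-1 prediction for the same change in the data. That extra sensitivity, averaged over many possible training sets, is what the variance term measures.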
Figure: train and test error versus model complexity (polynomial degree 0, 1, 2, 3, 5). Train error falls steadily as degree increases; test error is lowest at degree 1 (best generalization) and climbs steeply by degree 5.
Polynomial Degree Comparison
degrees = [0, 1, 2, 3, 5]
for d in degrees:
    if d == 0:
        # Degree 0 is the constant (mean) baseline
        model = DummyRegressor(strategy='mean')
    else:
        model = make_pipeline(PolynomialFeatures(d), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"Degree {d}: Train MSE={train_mse:.1f}, Test MSE={test_mse:.1f}")

Degree 0: Train MSE=7422.2, Test MSE=44.4
Degree 1: Train MSE=22.2, Test MSE=44.4
Degree 2: Train MSE=18.1, Test MSE=51.2
Degree 3: Train MSE=9.4, Test MSE=198.6
Degree 5: Train MSE=0.0, Test MSE=4876.3
| Degree | Parameters | Train MSE | Test MSE | Diagnosis |
|---|---|---|---|---|
| 0 | 1 | 7422.2 | 44.4 | Underfit |
| 1 | 2 | 22.2 | 44.4 | Good fit |
| 2 | 3 | 18.1 | 51.2 | Slight overfit |
| 3 | 4 | 9.4 | 198.6 | Overfit |
| 5 | 6 | 0.0 | 4876.3 | Severe overfit |
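Train MSE decreases monotonically with degree. Test MSE is lowest at degree 1 (degree 0 ties it only because the training mean happens to land near the test value) and then rises, sharply beyond degree 2.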
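A single held-out point is a noisy yardstick, so a comparison like this is usually backed by cross-validation. Below is a hedged sketch using leave-one-out CV on the six training points. Two things are my assumptions rather than the post's setup: the `StandardScaler` step (added to keep higher powers well-conditioned) and the hypothetical helper `make_model`. Degree 5 is omitted because each fold trains on only five points.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X_train = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float).reshape(-1, 1)
y_train = np.array([180, 220, 280, 340, 370, 430], dtype=float)

def make_model(degree):
    """Constant baseline for degree 0, otherwise a scaled polynomial regression."""
    if degree == 0:
        return DummyRegressor(strategy="mean")
    # Assumption: standardize x before expanding it into powers.
    return make_pipeline(StandardScaler(), PolynomialFeatures(degree), LinearRegression())

cv_mse = {}
for d in [0, 1, 2, 3]:
    # scikit-learn maximizes scores, so MSE comes back negated; flip the sign.
    scores = cross_val_score(make_model(d), X_train, y_train,
                             cv=LeaveOneOut(), scoring="neg_mean_squared_error")
    cv_mse[d] = -scores.mean()
    print(f"Degree {d}: leave-one-out CV MSE = {cv_mse[d]:.1f}")
```

Each training point takes a turn as the test point, so the constant model no longer benefits from a lucky test value, and the degree with the lowest CV MSE is a more defensible pick than the one chosen from a single point.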
Learning Curves — Diagnosing from Data Size
Figure: learning curves (error vs. training size). Left panel, underfit: train MSE (about 7422) and validation MSE (about 7500) both stay high, with no improvement as data grows. Right panel, overfit: train MSE ≈ 0 while validation MSE starts high, and the gap shrinks with more data.
- Underfit signature: both train and validation error are high regardless of data size. More data won't help — the model form is wrong.
- Overfit signature: train error near zero, validation error high. More data helps: the gap narrows as the training set size $n$ grows.
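These signatures can be computed rather than eyeballed with scikit-learn's `learning_curve`. Six points are too few for a meaningful curve, so the sketch below uses hypothetical synthetic data (a linear trend plus Gaussian noise, seeded for repeatability). The data, the noise level, and the degree-10 model are all illustrative assumptions, not part of the original experiment.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical synthetic data (not from the post): linear price trend plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(600, 2000, size=(200, 1))
y = 0.2 * X[:, 0] + 53 + rng.normal(0, 5, size=200)

models = {
    "constant": DummyRegressor(strategy="mean"),
    "linear": LinearRegression(),
    # Deliberately flexible; degree 10 is an arbitrary illustrative choice.
    "degree-10": make_pipeline(StandardScaler(), PolynomialFeatures(10), LinearRegression()),
}

curves = {}
for name, model in models.items():
    sizes, train_scores, val_scores = learning_curve(
        model, X, y,
        train_sizes=np.linspace(0.1, 1.0, 5),
        cv=5,
        scoring="neg_mean_squared_error",
    )
    # Scores are negated MSE: average across folds and flip the sign.
    curves[name] = (sizes, -train_scores.mean(axis=1), -val_scores.mean(axis=1))
    for n, tr, va in zip(*curves[name]):
        print(f"{name:10s} n={int(n):3d}  train MSE={tr:10.1f}  val MSE={va:10.1f}")
```

When reading the output, look for the two patterns described above: for the constant model, train and validation MSE that stay high at every size; for the flexible model, a train/validation gap that is widest at the smallest size. Exact values depend on the seed and the noise level chosen here.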
Practical Remedies
| Problem | Remedy |
|---|---|
| Underfitting | Add more features, increase model complexity, reduce regularization |
| Overfitting | More training data, reduce complexity, add regularization (Ridge/Lasso), use dropout for NNs, early stopping |
Quick Reference
| | Underfitting | Good Fit | Overfitting |
|---|---|---|---|
| Train error | High | Low | Very Low |
| Test error | High | Low | High |
| Bias | High | Low | Low |
| Variance | Low | Low | High |
| Solution | More complexity | — | Less complexity / more data |
Related Concepts and Honest Limitations
The bias-variance tradeoff is the fundamental reason regularization exists. Ridge and Lasso (post 14) explicitly shrink model complexity — they trade a small increase in bias for a large reduction in variance, which improves test error when the model is in the overfit regime.
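As a rough illustration of that trade, the sketch below fits the same degree-5 features twice, once with plain least squares and once with Ridge, and compares coefficient size and training error. The `StandardScaler` step, the helper `degree5`, and `alpha=1.0` are my assumptions for the example; a real project would tune `alpha` by cross-validation. The sketch shows the shrinkage mechanism only and makes no claim about test error on this dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X_train = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float).reshape(-1, 1)
y_train = np.array([180, 220, 280, 340, 370, 430], dtype=float)

def degree5(regressor):
    """Degree-5 polynomial pipeline around the given regressor."""
    # Assumption: standardize x first so its powers stay at a moderate magnitude.
    return make_pipeline(StandardScaler(),
                         PolynomialFeatures(degree=5, include_bias=False),
                         regressor)

ols = degree5(LinearRegression()).fit(X_train, y_train)
ridge = degree5(Ridge(alpha=1.0)).fit(X_train, y_train)  # alpha is an arbitrary illustrative value

# The fitted regressor is the last step of each pipeline.
ols_norm = np.linalg.norm(ols.steps[-1][1].coef_)
ridge_norm = np.linalg.norm(ridge.steps[-1][1].coef_)
ols_train_mse = mean_squared_error(y_train, ols.predict(X_train))
ridge_train_mse = mean_squared_error(y_train, ridge.predict(X_train))

print(f"OLS:   coefficient norm={ols_norm:.2f}, train MSE={ols_train_mse:.4f}")
print(f"Ridge: coefficient norm={ridge_norm:.2f}, train MSE={ridge_train_mse:.4f}")
```

Ridge gives up the perfect training fit (more bias) in exchange for smaller coefficients, which is what makes the fitted curve less sensitive to any single training point (less variance).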
One limitation of the degree-comparison experiment above: with only 6 training samples and a single test point, the results are extreme. In real datasets with thousands of samples, you need much higher polynomial degrees to overfit, and the test error U-curve is shallower. The principles hold, but the thresholds are data-size dependent.
Test Your Understanding
- Degree-5 polynomial achieved Train MSE = 0 on 6 samples because it has 6 parameters for 6 points. What would happen to degree-5 test MSE if you added 100 more training samples (all following the same linear trend)? Why?
- The underfitting diagram shows both train and val error as flat lines. Why doesn't adding more training data reduce underfitting?
- For the degree-2 model (Train MSE=18.1, Test MSE=51.2), is this overfitting? How would you confirm using cross-validation?
- Bias² + Variance = Total Error − Noise. If you compute MSE on training set and test set for a model, which one gives you an estimate of variance, and which reflects bias?
- A neural network achieves 99% accuracy on training data and 72% on test data. Using the learning curve intuition, what would you try first: collecting more data or adding dropout? How would the learning curves guide that decision?