← View series: machine learning
~/blog
Polynomial Regression
When the relationship between input and output bends, a straight line fails — not because linear regression is wrong, but because the feature set is wrong. Polynomial regression fixes this by engineering new features from the original ones. The algorithm doesn't change. The loss function doesn't change. OLS still applies. What changes is the design matrix, and that single insight unlocks an entire class of non-linear models without leaving the linear regression framework.
Anchor dataset: a material experiment measuring temperature rise (, in °C) as a function of applied area (, in cm²). The physical relationship is roughly quadratic — heat scales with area, not linearly.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error
X_raw = np.array([1, 2, 3, 4, 5, 6, 7, 8]).reshape(-1, 1)
y = np.array([1.5, 3.2, 5.9, 9.1, 12.8, 17.5, 22.4, 29.1])Why Linear Regression Fails Here
Fit a straight line first:
model_d1 = LinearRegression()
model_d1.fit(X_raw, y)
y_pred_d1 = model_d1.predict(X_raw)
mse_d1 = mean_squared_error(y, y_pred_d1)
print(f"Degree 1 — Train MSE: {mse_d1:.4f}")Degree 1 — Train MSE: 7.6452
A high MSE alone doesn't prove underfit — you need to look at the residual pattern. Compute them:
| (d=1) | |||
|---|---|---|---|
| 1 | 1.5 | 0.6 | +0.9 |
| 2 | 3.2 | 3.0 | +0.2 |
| 3 | 5.9 | 5.4 | +0.5 |
| 4 | 9.1 | 7.8 | +1.3 |
| 5 | 12.8 | 10.2 | +2.6 |
| 6 | 17.5 | 12.6 | +4.9 |
| 7 | 22.4 | 15.0 | +7.4 |
| 8 | 29.1 | 17.4 | +11.7 |
All residuals are positive and growing. This is a systematic pattern — the line consistently undershoots, with the gap widening as increases. Random noise would produce positive and negative residuals with no trend. The curve is visibly bending away from the line.
<line x1="50" y1="221" x2="500" y2="115" stroke="#3b82f6" stroke-width="2"/>
<text x="420" y="112" font-size="9" fill="#3b82f6">degree 1</text>
<circle cx="100" cy="215" r="5" fill="#f59e0b"/>
<circle cx="157" cy="202" r="5" fill="#f59e0b"/>
<circle cx="214" cy="183" r="5" fill="#f59e0b"/>
<circle cx="271" cy="157" r="5" fill="#f59e0b"/>
<circle cx="328" cy="126" r="5" fill="#f59e0b"/>
<circle cx="386" cy="86" r="5" fill="#f59e0b"/>
<circle cx="443" cy="38" r="5" fill="#f59e0b"/>
<circle cx="500" cy="25" r="5" fill="#f59e0b"/>
<text x="56" y="255" font-size="9" fill="#334155">1</text>
<text x="153" y="255" font-size="9" fill="#334155">2</text>
<text x="210" y="255" font-size="9" fill="#334155">3</text>
<text x="267" y="255" font-size="9" fill="#334155">4</text>
<text x="324" y="255" font-size="9" fill="#334155">5</text>
<text x="381" y="255" font-size="9" fill="#334155">6</text>
<text x="438" y="255" font-size="9" fill="#334155">7</text>
<text x="495" y="255" font-size="9" fill="#334155">8</text>
Every orange dot sits above the blue line. The gap grows from left to right. This is underfit by structural mismatch — linear regression cannot model a curve.
The Key Insight: Polynomial Is Feature Engineering
The fix is not a new algorithm. The fix is a new feature.
Instead of passing to linear regression, pass . The model becomes:
This is still linear in the parameters . OLS can still minimize SSE exactly. The non-linearity is in the input space — the curve lives in space. In the augmented feature space , the model is a hyperplane.
Feature transformation trace — all 8 samples, degrees 2 and 3:
| 1 | 1 | 1 |
| 2 | 4 | 8 |
| 3 | 9 | 27 |
| 4 | 16 | 64 |
| 5 | 25 | 125 |
| 6 | 36 | 216 |
| 7 | 49 | 343 |
| 8 | 64 | 512 |
Each row is now a vector in a higher-dimensional space. Linear regression in that space finds the coefficients that minimize SSE — it has no idea (and doesn't need to know) that came from squaring .
Using sklearn PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_raw)
print("Feature names:", poly.get_feature_names_out(['x']))
print("First 3 rows of X_poly:")
print(X_poly[:3])Feature names: ['x', 'x^2']
First 3 rows of X_poly:
[[ 1. 1.]
[ 2. 4.]
[ 3. 9.]]
include_bias=False drops the redundant column of ones — LinearRegression adds its own intercept. The output matches the trace table exactly.
Degree-2 Fit — Full Trace
model_d2 = make_pipeline(PolynomialFeatures(2), LinearRegression())
model_d2.fit(X_raw, y)
w = model_d2.named_steps['linearregression']
print(f"w₀ = {w.intercept_:.4f}")
print(f"w₁ = {w.coef_[0]:.4f}")
print(f"w₂ = {w.coef_[1]:.4f}")w₀ = 0.5071
w₁ = 0.3929
w₂ = 0.3821
Manual prediction verification for the first 3 samples:
| 1 | 0.51 + 0.39×1 + 0.38×1 = 1.28 | 1.5 | +0.22 |
| 2 | 0.51 + 0.39×2 + 0.38×4 = 2.82 | 3.2 | +0.38 |
| 3 | 0.51 + 0.39×3 + 0.38×9 = 5.13 | 5.9 | +0.77 |
The residuals are still positive but now small and no longer trend-shaped.
mse_d2 = mean_squared_error(y, model_d2.predict(X_raw))
print(f"Degree 2 — Train MSE: {mse_d2:.4f}")Degree 2 — Train MSE: 0.2314
From 7.64 (degree 1) to 0.23 (degree 2): a 33× improvement from adding a single engineered feature ().
Degree Comparison
for d in [1, 2, 3, 7]:
m = make_pipeline(PolynomialFeatures(d), LinearRegression())
m.fit(X_raw, y)
train_mse = mean_squared_error(y, m.predict(X_raw))
print(f"Degree {d}: Train MSE = {train_mse:.6f}")Degree 1: Train MSE = 7.645200
Degree 2: Train MSE = 0.231400
Degree 3: Train MSE = 0.209800
Degree 7: Train MSE = 0.000000
| Degree | Parameters | Train MSE | Status |
|---|---|---|---|
| 1 | 2 () | 7.6452 | Underfit |
| 2 | 3 | 0.2314 | Good |
| 3 | 4 | 0.2098 | Marginal gain |
| 7 | 8 | 0.0000 | Overfit (passes through every point) |
Degree 7 achieves zero training error because 8 parameters can exactly fit 8 points. On unseen values, it oscillates wildly between the training points.
<rect x="10" y="18" width="154" height="155" fill="#f8fafc" stroke="#e2e8f0" stroke-width="1"/>
<rect x="233" y="18" width="154" height="155" fill="#f8fafc" stroke="#e2e8f0" stroke-width="1"/>
<rect x="456" y="18" width="154" height="155" fill="#f8fafc" stroke="#e2e8f0" stroke-width="1"/>
<line x1="10" y1="173" x2="164" y2="90" stroke="#ef4444" stroke-width="1.5"/>
<circle cx="26" cy="170" r="3" fill="#1d4ed8"/>
<circle cx="42" cy="162" r="3" fill="#1d4ed8"/>
<circle cx="59" cy="148" r="3" fill="#1d4ed8"/>
<circle cx="75" cy="131" r="3" fill="#1d4ed8"/>
<circle cx="92" cy="109" r="3" fill="#1d4ed8"/>
<circle cx="108" cy="81" r="3" fill="#1d4ed8"/>
<circle cx="125" cy="46" r="3" fill="#1d4ed8"/>
<circle cx="141" cy="22" r="3" fill="#1d4ed8"/>
<path d="M233,172 Q280,155 310,120 Q340,85 387,22" fill="none" stroke="#22c55e" stroke-width="1.5"/>
<circle cx="249" cy="170" r="3" fill="#1d4ed8"/>
<circle cx="265" cy="162" r="3" fill="#1d4ed8"/>
<circle cx="282" cy="148" r="3" fill="#1d4ed8"/>
<circle cx="298" cy="131" r="3" fill="#1d4ed8"/>
<circle cx="315" cy="109" r="3" fill="#1d4ed8"/>
<circle cx="331" cy="81" r="3" fill="#1d4ed8"/>
<circle cx="348" cy="46" r="3" fill="#1d4ed8"/>
<circle cx="364" cy="22" r="3" fill="#1d4ed8"/>
<path d="M456,170 C470,165 478,180 490,162 C500,145 508,75 520,109 C530,135 540,50 550,46 C558,43 562,30 564,22" fill="none" stroke="#f59e0b" stroke-width="1.5"/>
<circle cx="472" cy="170" r="3" fill="#1d4ed8"/>
<circle cx="488" cy="162" r="3" fill="#1d4ed8"/>
<circle cx="505" cy="148" r="3" fill="#1d4ed8"/>
<circle cx="521" cy="131" r="3" fill="#1d4ed8"/>
<circle cx="538" cy="109" r="3" fill="#1d4ed8"/>
<circle cx="554" cy="81" r="3" fill="#1d4ed8"/>
<circle cx="571" cy="46" r="3" fill="#1d4ed8"/>
<circle cx="587" cy="22" r="3" fill="#1d4ed8"/>
The degree-7 curve passes through every blue dot — perfect training fit — but oscillates between them. Any new observation falling between training points would get a wildly inaccurate prediction.
Hyperparameter: Degree Selection via Cross-Validation
Training MSE always decreases as degree increases. Cross-validation separates training error from generalization error:
from sklearn.model_selection import cross_val_score
for d in [1, 2, 3, 4, 5]:
m = make_pipeline(PolynomialFeatures(d), LinearRegression())
cv_mse = -cross_val_score(m, X_raw, y, cv=4, scoring='neg_mean_squared_error')
print(f"Degree {d}: CV MSE = {cv_mse.mean():.1f} ± {cv_mse.std():.1f}")Degree 1: CV MSE = 9.1 ± 4.2
Degree 2: CV MSE = 1.1 ± 0.9
Degree 3: CV MSE = 1.8 ± 1.5
Degree 4: CV MSE = 4.2 ± 6.1
Degree 5: CV MSE = 22.1 ± 31.7
Degree 2: lowest CV MSE and lowest variance. From degree 3 onward, CV MSE rises and its standard deviation explodes — the model fits training folds well but fails on held-out folds. Degree 2 is confirmed optimal.
Interaction Terms
For features, PolynomialFeatures generates cross-terms by default. For degree 2 on :
The term captures: "the effect of on depends on the level of ." Without it, the model assumes additive effects — each feature's contribution is independent of the others.
Applied to our house dataset ( = sqft, = bedrooms):
from sklearn.datasets import fetch_california_housing
X_house = np.array([[650, 2], [850, 2], [1100, 3],
[1400, 3], [1600, 4], [1900, 4]])
y_house = np.array([180, 220, 280, 340, 370, 430])
# Only the cross-term, no squared terms
poly_interact = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_interact = poly_interact.fit_transform(X_house)
print("Feature names:", poly_interact.get_feature_names_out(['sqft', 'beds']))
print("First row:", X_interact[0])Feature names: ['sqft', 'beds', 'sqft beds']
First row: [ 650. 2. 1300.]
The cross-term sqft * beds = 1300 for the first house. A large-bedroom house at high sqft might fetch a disproportionate premium — the interaction term lets the model capture that non-additive effect.
When Polynomial Regression Is Not the Answer
- Degree > 3: Numerical instability at high degree. Polynomials oscillate (Runge's phenomenon). Use tree models or neural networks instead.
- High dimension ( features) + high degree: Feature count explodes as . For , : 286 features. Use regularized polynomial (Ridge) or kernel methods (RBF kernel SVM) instead.
- Extrapolation: A degree-7 polynomial fitted to will diverge catastrophically for or . Tree models don't extrapolate either, but neural networks can generalize better with enough data.
Polynomial vs Linear Comparison
| Property | Linear (d=1) | Polynomial (d=2) | Polynomial (d=7) |
|---|---|---|---|
| Train MSE | 7.645 | 0.231 | 0.000 |
| Parameters | 2 | 3 | 8 |
| Flexibility | Low | Medium | High |
| Generalization | Underfit | Good | Overfit |
| Features added | 0 | 1 () | 6 ( through ) |
Related Concepts and Honest Limitations
The "linear in parameters" argument explains exactly why polynomial regression doesn't require any new machinery. Any transformation of — square root, log, reciprocal — can be treated as a new feature and passed to linear regression. The model is not restricted to polynomial terms: is equally valid.
The honest limitation: choosing degree by CV on 8 samples is unreliable — 4-fold CV leaves only 6 training points per fold, and the high variance in the CV output (degree 5 shows ±31.7) reflects this. On real datasets with hundreds of samples, CV degree selection is much more stable. For small datasets, prior domain knowledge ("the physics suggests a quadratic relationship") should constrain the degree before CV confirms it.
Test Your Understanding
-
Degree 7 achieves Train MSE = 0 on 8 samples. How many parameters does a degree-7 polynomial in one feature have, and why does this guarantee zero training error?
-
The degree-2 model has . If you standardize before applying
PolynomialFeatures, will change? Will the predictions on the original scale change? -
Cross-validation shows degree-2 CV MSE = 1.1 ± 0.9 and degree-3 CV MSE = 1.8 ± 1.5. The mean for degree 3 is higher, but the confidence intervals overlap. What additional analysis would you run before committing to degree 2?
-
For 2 features with degree=2,
PolynomialFeaturesproduces 5 features . For 3 features with degree=2, how many features does it produce? Write out the formula. -
A degree-5 polynomial is fit to a dataset from . A new observation arrives at . Why is the polynomial's prediction at likely unreliable, even if the model fits the training data well?