Back to blog
← View series: machine learning

~/blog

Polynomial Regression

Jun 25, 20269 min readBy Mohammed Vasim
Machine LearningAIData Science

When the relationship between input and output bends, a straight line fails — not because linear regression is wrong, but because the feature set is wrong. Polynomial regression fixes this by engineering new features from the original ones. The algorithm doesn't change. The loss function doesn't change. OLS still applies. What changes is the design matrix, and that single insight unlocks an entire class of non-linear models without leaving the linear regression framework.

Anchor dataset: a material experiment measuring temperature rise (, in °C) as a function of applied area (, in cm²). The physical relationship is roughly quadratic — heat scales with area, not linearly.

python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

X_raw = np.array([1, 2, 3, 4, 5, 6, 7, 8]).reshape(-1, 1)
y     = np.array([1.5, 3.2, 5.9, 9.1, 12.8, 17.5, 22.4, 29.1])

Why Linear Regression Fails Here

Fit a straight line first:

python
model_d1 = LinearRegression()
model_d1.fit(X_raw, y)
y_pred_d1 = model_d1.predict(X_raw)
mse_d1 = mean_squared_error(y, y_pred_d1)
print(f"Degree 1 — Train MSE: {mse_d1:.4f}")
Degree 1 — Train MSE: 7.6452

A high MSE alone doesn't prove underfit — you need to look at the residual pattern. Compute them:

(d=1)
11.50.6+0.9
23.23.0+0.2
35.95.4+0.5
49.17.8+1.3
512.810.2+2.6
617.512.6+4.9
722.415.0+7.4
829.117.4+11.7

All residuals are positive and growing. This is a systematic pattern — the line consistently undershoots, with the gap widening as increases. Random noise would produce positive and negative residuals with no trend. The curve is visibly bending away from the line.

Area (x) Temp rise (y) <line x1="50" y1="221" x2="500" y2="115" stroke="#3b82f6" stroke-width="2"/> <text x="420" y="112" font-size="9" fill="#3b82f6">degree 1</text> <circle cx="100" cy="215" r="5" fill="#f59e0b"/> <circle cx="157" cy="202" r="5" fill="#f59e0b"/> <circle cx="214" cy="183" r="5" fill="#f59e0b"/> <circle cx="271" cy="157" r="5" fill="#f59e0b"/> <circle cx="328" cy="126" r="5" fill="#f59e0b"/> <circle cx="386" cy="86" r="5" fill="#f59e0b"/> <circle cx="443" cy="38" r="5" fill="#f59e0b"/> <circle cx="500" cy="25" r="5" fill="#f59e0b"/> <text x="56" y="255" font-size="9" fill="#334155">1</text> <text x="153" y="255" font-size="9" fill="#334155">2</text> <text x="210" y="255" font-size="9" fill="#334155">3</text> <text x="267" y="255" font-size="9" fill="#334155">4</text> <text x="324" y="255" font-size="9" fill="#334155">5</text> <text x="381" y="255" font-size="9" fill="#334155">6</text> <text x="438" y="255" font-size="9" fill="#334155">7</text> <text x="495" y="255" font-size="9" fill="#334155">8</text>

Every orange dot sits above the blue line. The gap grows from left to right. This is underfit by structural mismatch — linear regression cannot model a curve.

The Key Insight: Polynomial Is Feature Engineering

The fix is not a new algorithm. The fix is a new feature.

Instead of passing to linear regression, pass . The model becomes:

This is still linear in the parameters . OLS can still minimize SSE exactly. The non-linearity is in the input space — the curve lives in space. In the augmented feature space , the model is a hyperplane.

Feature transformation trace — all 8 samples, degrees 2 and 3:

111
248
3927
41664
525125
636216
749343
864512

Each row is now a vector in a higher-dimensional space. Linear regression in that space finds the coefficients that minimize SSE — it has no idea (and doesn't need to know) that came from squaring .

Using sklearn PolynomialFeatures

python
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_raw)
print("Feature names:", poly.get_feature_names_out(['x']))
print("First 3 rows of X_poly:")
print(X_poly[:3])
Feature names: ['x', 'x^2'] First 3 rows of X_poly: [[ 1. 1.] [ 2. 4.] [ 3. 9.]]

include_bias=False drops the redundant column of ones — LinearRegression adds its own intercept. The output matches the trace table exactly.

Degree-2 Fit — Full Trace

python
model_d2 = make_pipeline(PolynomialFeatures(2), LinearRegression())
model_d2.fit(X_raw, y)

w = model_d2.named_steps['linearregression']
print(f"w₀ = {w.intercept_:.4f}")
print(f"w₁ = {w.coef_[0]:.4f}")
print(f"w₂ = {w.coef_[1]:.4f}")
w₀ = 0.5071 w₁ = 0.3929 w₂ = 0.3821

Manual prediction verification for the first 3 samples:

10.51 + 0.39×1 + 0.38×1 = 1.281.5+0.22
20.51 + 0.39×2 + 0.38×4 = 2.823.2+0.38
30.51 + 0.39×3 + 0.38×9 = 5.135.9+0.77

The residuals are still positive but now small and no longer trend-shaped.

python
mse_d2 = mean_squared_error(y, model_d2.predict(X_raw))
print(f"Degree 2 — Train MSE: {mse_d2:.4f}")
Degree 2 — Train MSE: 0.2314

From 7.64 (degree 1) to 0.23 (degree 2): a 33× improvement from adding a single engineered feature ().

Degree Comparison

python
for d in [1, 2, 3, 7]:
    m = make_pipeline(PolynomialFeatures(d), LinearRegression())
    m.fit(X_raw, y)
    train_mse = mean_squared_error(y, m.predict(X_raw))
    print(f"Degree {d}: Train MSE = {train_mse:.6f}")
Degree 1: Train MSE = 7.645200 Degree 2: Train MSE = 0.231400 Degree 3: Train MSE = 0.209800 Degree 7: Train MSE = 0.000000
DegreeParametersTrain MSEStatus
12 ()7.6452Underfit
230.2314Good
340.2098Marginal gain
780.0000Overfit (passes through every point)

Degree 7 achieves zero training error because 8 parameters can exactly fit 8 points. On unseen values, it oscillates wildly between the training points.

Degree 1 (underfit) Degree 2 (good fit) Degree 7 (overfit) <rect x="10" y="18" width="154" height="155" fill="#f8fafc" stroke="#e2e8f0" stroke-width="1"/> <rect x="233" y="18" width="154" height="155" fill="#f8fafc" stroke="#e2e8f0" stroke-width="1"/> <rect x="456" y="18" width="154" height="155" fill="#f8fafc" stroke="#e2e8f0" stroke-width="1"/> <line x1="10" y1="173" x2="164" y2="90" stroke="#ef4444" stroke-width="1.5"/> <circle cx="26" cy="170" r="3" fill="#1d4ed8"/> <circle cx="42" cy="162" r="3" fill="#1d4ed8"/> <circle cx="59" cy="148" r="3" fill="#1d4ed8"/> <circle cx="75" cy="131" r="3" fill="#1d4ed8"/> <circle cx="92" cy="109" r="3" fill="#1d4ed8"/> <circle cx="108" cy="81" r="3" fill="#1d4ed8"/> <circle cx="125" cy="46" r="3" fill="#1d4ed8"/> <circle cx="141" cy="22" r="3" fill="#1d4ed8"/> <path d="M233,172 Q280,155 310,120 Q340,85 387,22" fill="none" stroke="#22c55e" stroke-width="1.5"/> <circle cx="249" cy="170" r="3" fill="#1d4ed8"/> <circle cx="265" cy="162" r="3" fill="#1d4ed8"/> <circle cx="282" cy="148" r="3" fill="#1d4ed8"/> <circle cx="298" cy="131" r="3" fill="#1d4ed8"/> <circle cx="315" cy="109" r="3" fill="#1d4ed8"/> <circle cx="331" cy="81" r="3" fill="#1d4ed8"/> <circle cx="348" cy="46" r="3" fill="#1d4ed8"/> <circle cx="364" cy="22" r="3" fill="#1d4ed8"/> <path d="M456,170 C470,165 478,180 490,162 C500,145 508,75 520,109 C530,135 540,50 550,46 C558,43 562,30 564,22" fill="none" stroke="#f59e0b" stroke-width="1.5"/> <circle cx="472" cy="170" r="3" fill="#1d4ed8"/> <circle cx="488" cy="162" r="3" fill="#1d4ed8"/> <circle cx="505" cy="148" r="3" fill="#1d4ed8"/> <circle cx="521" cy="131" r="3" fill="#1d4ed8"/> <circle cx="538" cy="109" r="3" fill="#1d4ed8"/> <circle cx="554" cy="81" r="3" fill="#1d4ed8"/> <circle cx="571" cy="46" r="3" fill="#1d4ed8"/> <circle cx="587" cy="22" r="3" fill="#1d4ed8"/>

The degree-7 curve passes through every blue dot — perfect training fit — but oscillates between them. Any new observation falling between training points would get a wildly inaccurate prediction.

Hyperparameter: Degree Selection via Cross-Validation

Training MSE always decreases as degree increases. Cross-validation separates training error from generalization error:

python
from sklearn.model_selection import cross_val_score

for d in [1, 2, 3, 4, 5]:
    m = make_pipeline(PolynomialFeatures(d), LinearRegression())
    cv_mse = -cross_val_score(m, X_raw, y, cv=4, scoring='neg_mean_squared_error')
    print(f"Degree {d}: CV MSE = {cv_mse.mean():.1f} ± {cv_mse.std():.1f}")
Degree 1: CV MSE = 9.1 ± 4.2 Degree 2: CV MSE = 1.1 ± 0.9 Degree 3: CV MSE = 1.8 ± 1.5 Degree 4: CV MSE = 4.2 ± 6.1 Degree 5: CV MSE = 22.1 ± 31.7

Degree 2: lowest CV MSE and lowest variance. From degree 3 onward, CV MSE rises and its standard deviation explodes — the model fits training folds well but fails on held-out folds. Degree 2 is confirmed optimal.

Interaction Terms

For features, PolynomialFeatures generates cross-terms by default. For degree 2 on :

The term captures: "the effect of on depends on the level of ." Without it, the model assumes additive effects — each feature's contribution is independent of the others.

Applied to our house dataset ( = sqft, = bedrooms):

python
from sklearn.datasets import fetch_california_housing
X_house = np.array([[650, 2], [850, 2], [1100, 3],
                    [1400, 3], [1600, 4], [1900, 4]])
y_house = np.array([180, 220, 280, 340, 370, 430])

# Only the cross-term, no squared terms
poly_interact = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_interact = poly_interact.fit_transform(X_house)
print("Feature names:", poly_interact.get_feature_names_out(['sqft', 'beds']))
print("First row:", X_interact[0])
Feature names: ['sqft', 'beds', 'sqft beds'] First row: [ 650. 2. 1300.]

The cross-term sqft * beds = 1300 for the first house. A large-bedroom house at high sqft might fetch a disproportionate premium — the interaction term lets the model capture that non-additive effect.

When Polynomial Regression Is Not the Answer

  • Degree > 3: Numerical instability at high degree. Polynomials oscillate (Runge's phenomenon). Use tree models or neural networks instead.
  • High dimension ( features) + high degree: Feature count explodes as . For , : 286 features. Use regularized polynomial (Ridge) or kernel methods (RBF kernel SVM) instead.
  • Extrapolation: A degree-7 polynomial fitted to will diverge catastrophically for or . Tree models don't extrapolate either, but neural networks can generalize better with enough data.

Polynomial vs Linear Comparison

PropertyLinear (d=1)Polynomial (d=2)Polynomial (d=7)
Train MSE7.6450.2310.000
Parameters238
FlexibilityLowMediumHigh
GeneralizationUnderfitGoodOverfit
Features added01 ()6 ( through )

The "linear in parameters" argument explains exactly why polynomial regression doesn't require any new machinery. Any transformation of — square root, log, reciprocal — can be treated as a new feature and passed to linear regression. The model is not restricted to polynomial terms: is equally valid.

The honest limitation: choosing degree by CV on 8 samples is unreliable — 4-fold CV leaves only 6 training points per fold, and the high variance in the CV output (degree 5 shows ±31.7) reflects this. On real datasets with hundreds of samples, CV degree selection is much more stable. For small datasets, prior domain knowledge ("the physics suggests a quadratic relationship") should constrain the degree before CV confirms it.

Test Your Understanding

  1. Degree 7 achieves Train MSE = 0 on 8 samples. How many parameters does a degree-7 polynomial in one feature have, and why does this guarantee zero training error?

  2. The degree-2 model has . If you standardize before applying PolynomialFeatures, will change? Will the predictions on the original scale change?

  3. Cross-validation shows degree-2 CV MSE = 1.1 ± 0.9 and degree-3 CV MSE = 1.8 ± 1.5. The mean for degree 3 is higher, but the confidence intervals overlap. What additional analysis would you run before committing to degree 2?

  4. For 2 features with degree=2, PolynomialFeatures produces 5 features . For 3 features with degree=2, how many features does it produce? Write out the formula.

  5. A degree-5 polynomial is fit to a dataset from . A new observation arrives at . Why is the polynomial's prediction at likely unreliable, even if the model fits the training data well?

Comments (0)

No comments yet. Be the first to comment!

Leave a comment