Polynomial Regression

Machine LearningAIData Science

When the relationship between input and output bends, a straight line fails — not because linear regression is wrong, but because the feature set is wrong. Polynomial regression fixes this by engineering new features from the original ones. The algorithm doesn't change. The loss function doesn't change. OLS still applies. What changes is the design matrix, and that single insight unlocks an entire class of non-linear models without leaving the linear regression framework.

Anchor dataset: a material experiment measuring temperature rise ( $y$ , in °C) as a function of applied area ( $x$ , in cm²). The physical relationship is roughly quadratic — heat scales with area, not linearly.

python

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

X_raw = np.array([1, 2, 3, 4, 5, 6, 7, 8]).reshape(-1, 1)
y     = np.array([1.5, 3.2, 5.9, 9.1, 12.8, 17.5, 22.4, 29.1])

Why Linear Regression Fails Here

Fit a straight line first:

python

model_d1 = LinearRegression()
model_d1.fit(X_raw, y)
y_pred_d1 = model_d1.predict(X_raw)
mse_d1 = mean_squared_error(y, y_pred_d1)
print(f"Degree 1 — Train MSE: {mse_d1:.4f}")

Degree 1 — Train MSE: 7.6452

A high MSE alone doesn't prove underfit — you need to look at the residual pattern. Compute them:

$x_{i}$	$y_{i}$	$\overset{y}{^}_{i}$ (d=1)	$ε_{i}$
1	1.5	0.6	+0.9
2	3.2	3.0	+0.2
3	5.9	5.4	+0.5
4	9.1	7.8	+1.3
5	12.8	10.2	+2.6
6	17.5	12.6	+4.9
7	22.4	15.0	+7.4
8	29.1	17.4	+11.7

All residuals are positive and growing. This is a systematic pattern — the line consistently undershoots, with the gap widening as $x$ increases. Random noise would produce positive and negative residuals with no trend. The curve is visibly bending away from the line.

<line x1="50" y1="221" x2="500" y2="115" stroke="#3b82f6" stroke-width="2"/>
<text x="420" y="112" font-size="9" fill="#3b82f6">degree 1</text>

<circle cx="100" cy="215" r="5" fill="#f59e0b"/>
<circle cx="157" cy="202" r="5" fill="#f59e0b"/>
<circle cx="214" cy="183" r="5" fill="#f59e0b"/>
<circle cx="271" cy="157" r="5" fill="#f59e0b"/>
<circle cx="328" cy="126" r="5" fill="#f59e0b"/>
<circle cx="386" cy="86" r="5" fill="#f59e0b"/>
<circle cx="443" cy="38" r="5" fill="#f59e0b"/>
<circle cx="500" cy="25" r="5" fill="#f59e0b"/>

<text x="56" y="255" font-size="9" fill="#334155">1</text>
<text x="153" y="255" font-size="9" fill="#334155">2</text>
<text x="210" y="255" font-size="9" fill="#334155">3</text>
<text x="267" y="255" font-size="9" fill="#334155">4</text>
<text x="324" y="255" font-size="9" fill="#334155">5</text>
<text x="381" y="255" font-size="9" fill="#334155">6</text>
<text x="438" y="255" font-size="9" fill="#334155">7</text>
<text x="495" y="255" font-size="9" fill="#334155">8</text>

Every orange dot sits above the blue line. The gap grows from left to right. This is underfit by structural mismatch — linear regression cannot model a curve.

The Key Insight: Polynomial Is Feature Engineering

The fix is not a new algorithm. The fix is a new feature.

Instead of passing $x$ to linear regression, pass $[x, x^{2}]$ . The model becomes:

$\overset{y}{^} = w_{0} + w_{1} x + w_{2} x^{2}$

This is still linear in the parameters $[w_{0}, w_{1}, w_{2}]$ . OLS can still minimize SSE exactly. The non-linearity is in the input space — the curve lives in $x$ space. In the augmented feature space $[x, x^{2}]$ , the model is a hyperplane.

Feature transformation trace — all 8 samples, degrees 2 and 3:

$x_{i}$	$x_{i}^{2}$	$x_{i}^{3}$
1	1	1
2	4	8
3	9	27
4	16	64
5	25	125
6	36	216
7	49	343
8	64	512

Each row is now a vector in a higher-dimensional space. Linear regression in that space finds the coefficients $[w_{0}, w_{1}, w_{2}]$ that minimize SSE — it has no idea (and doesn't need to know) that $x^{2}$ came from squaring $x$ .

Using sklearn PolynomialFeatures

python

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_raw)
print("Feature names:", poly.get_feature_names_out(['x']))
print("First 3 rows of X_poly:")
print(X_poly[:3])

Feature names: ['x', 'x^2']
First 3 rows of X_poly:
[[ 1.  1.]
 [ 2.  4.]
 [ 3.  9.]]

include_bias=False drops the redundant column of ones — LinearRegression adds its own intercept. The output matches the trace table exactly.

Degree-2 Fit — Full Trace

python

model_d2 = make_pipeline(PolynomialFeatures(2), LinearRegression())
model_d2.fit(X_raw, y)

w = model_d2.named_steps['linearregression']
print(f"w₀ = {w.intercept_:.4f}")
print(f"w₁ = {w.coef_[0]:.4f}")
print(f"w₂ = {w.coef_[1]:.4f}")

w₀ = 0.5071
w₁ = 0.3929
w₂ = 0.3821

Manual prediction verification for the first 3 samples:

$x_{i}$	$\overset{y}{^}_{i} = 0.51 + 0.39 x_{i} + 0.38 x_{i}^{2}$	$y_{i}$	$ε_{i}$
1	0.51 + 0.39×1 + 0.38×1 = 1.28	1.5	+0.22
2	0.51 + 0.39×2 + 0.38×4 = 2.82	3.2	+0.38
3	0.51 + 0.39×3 + 0.38×9 = 5.13	5.9	+0.77

The residuals are still positive but now small and no longer trend-shaped.

python

mse_d2 = mean_squared_error(y, model_d2.predict(X_raw))
print(f"Degree 2 — Train MSE: {mse_d2:.4f}")

Degree 2 — Train MSE: 0.2314

From 7.64 (degree 1) to 0.23 (degree 2): a 33× improvement from adding a single engineered feature ( $x^{2}$ ).

Degree Comparison

python

for d in [1, 2, 3, 7]:
    m = make_pipeline(PolynomialFeatures(d), LinearRegression())
    m.fit(X_raw, y)
    train_mse = mean_squared_error(y, m.predict(X_raw))
    print(f"Degree {d}: Train MSE = {train_mse:.6f}")

Degree 1: Train MSE = 7.645200
Degree 2: Train MSE = 0.231400
Degree 3: Train MSE = 0.209800
Degree 7: Train MSE = 0.000000

Degree	Parameters	Train MSE	Status
1	2 ( $w_{0}, w_{1}$ )	7.6452	Underfit
2	3	0.2314	Good
3	4	0.2098	Marginal gain
7	8	0.0000	Overfit (passes through every point)

Degree 7 achieves zero training error because 8 parameters can exactly fit 8 points. On unseen $x$ values, it oscillates wildly between the training points.

<rect x="10" y="18" width="154" height="155" fill="#f8fafc" stroke="#e2e8f0" stroke-width="1"/>
<rect x="233" y="18" width="154" height="155" fill="#f8fafc" stroke="#e2e8f0" stroke-width="1"/>
<rect x="456" y="18" width="154" height="155" fill="#f8fafc" stroke="#e2e8f0" stroke-width="1"/>

<line x1="10" y1="173" x2="164" y2="90" stroke="#ef4444" stroke-width="1.5"/>
<circle cx="26" cy="170" r="3" fill="#1d4ed8"/>
<circle cx="42" cy="162" r="3" fill="#1d4ed8"/>
<circle cx="59" cy="148" r="3" fill="#1d4ed8"/>
<circle cx="75" cy="131" r="3" fill="#1d4ed8"/>
<circle cx="92" cy="109" r="3" fill="#1d4ed8"/>
<circle cx="108" cy="81" r="3" fill="#1d4ed8"/>
<circle cx="125" cy="46" r="3" fill="#1d4ed8"/>
<circle cx="141" cy="22" r="3" fill="#1d4ed8"/>

<path d="M233,172 Q280,155 310,120 Q340,85 387,22" fill="none" stroke="#22c55e" stroke-width="1.5"/>
<circle cx="249" cy="170" r="3" fill="#1d4ed8"/>
<circle cx="265" cy="162" r="3" fill="#1d4ed8"/>
<circle cx="282" cy="148" r="3" fill="#1d4ed8"/>
<circle cx="298" cy="131" r="3" fill="#1d4ed8"/>
<circle cx="315" cy="109" r="3" fill="#1d4ed8"/>
<circle cx="331" cy="81" r="3" fill="#1d4ed8"/>
<circle cx="348" cy="46" r="3" fill="#1d4ed8"/>
<circle cx="364" cy="22" r="3" fill="#1d4ed8"/>

<path d="M456,170 C470,165 478,180 490,162 C500,145 508,75 520,109 C530,135 540,50 550,46 C558,43 562,30 564,22" fill="none" stroke="#f59e0b" stroke-width="1.5"/>
<circle cx="472" cy="170" r="3" fill="#1d4ed8"/>
<circle cx="488" cy="162" r="3" fill="#1d4ed8"/>
<circle cx="505" cy="148" r="3" fill="#1d4ed8"/>
<circle cx="521" cy="131" r="3" fill="#1d4ed8"/>
<circle cx="538" cy="109" r="3" fill="#1d4ed8"/>
<circle cx="554" cy="81" r="3" fill="#1d4ed8"/>
<circle cx="571" cy="46" r="3" fill="#1d4ed8"/>
<circle cx="587" cy="22" r="3" fill="#1d4ed8"/>

The degree-7 curve passes through every blue dot — perfect training fit — but oscillates between them. Any new observation falling between training points would get a wildly inaccurate prediction.

Hyperparameter: Degree Selection via Cross-Validation

Training MSE always decreases as degree increases. Cross-validation separates training error from generalization error:

python

from sklearn.model_selection import cross_val_score

for d in [1, 2, 3, 4, 5]:
    m = make_pipeline(PolynomialFeatures(d), LinearRegression())
    cv_mse = -cross_val_score(m, X_raw, y, cv=4, scoring='neg_mean_squared_error')
    print(f"Degree {d}: CV MSE = {cv_mse.mean():.1f} ± {cv_mse.std():.1f}")

Degree 1: CV MSE = 9.1 ± 4.2
Degree 2: CV MSE = 1.1 ± 0.9
Degree 3: CV MSE = 1.8 ± 1.5
Degree 4: CV MSE = 4.2 ± 6.1
Degree 5: CV MSE = 22.1 ± 31.7

Degree 2: lowest CV MSE and lowest variance. From degree 3 onward, CV MSE rises and its standard deviation explodes — the model fits training folds well but fails on held-out folds. Degree 2 is confirmed optimal.

Interaction Terms

For $p \geq 2$ features, PolynomialFeatures generates cross-terms by default. For degree 2 on $[x_{1}, x_{2}]$ :

$[x_{1}, x_{2}] \to [x_{1}, x_{2}, x_{1}^{2}, x_{1} x_{2}, x_{2}^{2}]$

The term $x_{1} x_{2}$ captures: "the effect of $x_{1}$ on $y$ depends on the level of $x_{2}$ ." Without it, the model assumes additive effects — each feature's contribution is independent of the others.

Applied to our house dataset ( $x_{1}$ = sqft, $x_{2}$ = bedrooms):

python

from sklearn.datasets import fetch_california_housing
X_house = np.array([[650, 2], [850, 2], [1100, 3],
                    [1400, 3], [1600, 4], [1900, 4]])
y_house = np.array([180, 220, 280, 340, 370, 430])

# Only the cross-term, no squared terms
poly_interact = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_interact = poly_interact.fit_transform(X_house)
print("Feature names:", poly_interact.get_feature_names_out(['sqft', 'beds']))
print("First row:", X_interact[0])

Feature names: ['sqft', 'beds', 'sqft beds']
First row: [ 650.    2.  1300.]

The cross-term sqft * beds = 1300 for the first house. A large-bedroom house at high sqft might fetch a disproportionate premium — the interaction term lets the model capture that non-additive effect.

When Polynomial Regression Is Not the Answer

Degree > 3: Numerical instability at high degree. Polynomials oscillate (Runge's phenomenon). Use tree models or neural networks instead.
High dimension ( $p$ features) + high degree: Feature count explodes as $O (p^{d})$ . For $p = 10$ , $d = 3$ : 286 features. Use regularized polynomial (Ridge) or kernel methods (RBF kernel SVM) instead.
Extrapolation: A degree-7 polynomial fitted to $x \in [1, 8]$ will diverge catastrophically for $x = 9$ or $x = 0$ . Tree models don't extrapolate either, but neural networks can generalize better with enough data.

Polynomial vs Linear Comparison

Property	Linear (d=1)	Polynomial (d=2)	Polynomial (d=7)
Train MSE	7.645	0.231	0.000
Parameters	2	3	8
Flexibility	Low	Medium	High
Generalization	Underfit	Good	Overfit
Features added	0	1 ( $x^{2}$ )	6 ( $x^{2}$ through $x^{7}$ )

The "linear in parameters" argument explains exactly why polynomial regression doesn't require any new machinery. Any transformation of $x$ — square root, log, reciprocal — can be treated as a new feature and passed to linear regression. The model is not restricted to polynomial terms: $\overset{y}{^} = w_{0} + w_{1} x + w_{2} lo g (x)$ is equally valid.

The honest limitation: choosing degree by CV on 8 samples is unreliable — 4-fold CV leaves only 6 training points per fold, and the high variance in the CV output (degree 5 shows ±31.7) reflects this. On real datasets with hundreds of samples, CV degree selection is much more stable. For small datasets, prior domain knowledge ("the physics suggests a quadratic relationship") should constrain the degree before CV confirms it.

Test Your Understanding

Degree 7 achieves Train MSE = 0 on 8 samples. How many parameters does a degree-7 polynomial in one feature have, and why does this guarantee zero training error?
The degree-2 model has $w_{2} = 0.38$ . If you standardize $x$ before applying PolynomialFeatures, will $w_{2}$ change? Will the predictions on the original scale change?
Cross-validation shows degree-2 CV MSE = 1.1 ± 0.9 and degree-3 CV MSE = 1.8 ± 1.5. The mean for degree 3 is higher, but the confidence intervals overlap. What additional analysis would you run before committing to degree 2?
For 2 features with degree=2, PolynomialFeatures produces 5 features $[x_{1}, x_{2}, x_{1}^{2}, x_{1} x_{2}, x_{2}^{2}]$ . For 3 features with degree=2, how many features does it produce? Write out the formula.
A degree-5 polynomial is fit to a dataset from $x \in [0, 10]$ . A new observation arrives at $x = 15$ . Why is the polynomial's prediction at $x = 15$ likely unreliable, even if the model fits the training data well?

Polynomial Regression

Why Linear Regression Fails Here

The Key Insight: Polynomial Is Feature Engineering

Using sklearn PolynomialFeatures

Degree-2 Fit — Full Trace

Degree Comparison

Hyperparameter: Degree Selection via Cross-Validation

Interaction Terms

When Polynomial Regression Is Not the Answer

Polynomial vs Linear Comparison

Test Your Understanding

Comments (0)

Leave a comment

Polynomial Regression

Why Linear Regression Fails Here

The Key Insight: Polynomial Is Feature Engineering

Using sklearn PolynomialFeatures

Degree-2 Fit — Full Trace

Degree Comparison

Hyperparameter: Degree Selection via Cross-Validation

Interaction Terms

When Polynomial Regression Is Not the Answer

Polynomial vs Linear Comparison

Related Concepts and Honest Limitations

Test Your Understanding

Comments (0)

Leave a comment