Multiple Linear Regression

Simple linear regression finds the best line through one feature. Multiple linear regression extends that to $p$ features simultaneously — and the key word is simultaneously. Each coefficient now measures the isolated effect of its feature while all other features are held constant. That "holding constant" is what makes multiple regression harder to interpret and more powerful than running $p$ separate simple regressions.

From Simple to Multiple

Simple: $\hat{y} = w_{0} + w_{1} \times \text{sqft}$

Multiple: $\hat{y} = w_{0} + w_{1} \times \text{sqft} + w_{2} \times \text{bedrooms}$

Adding bedrooms introduces $w_{2}$, the partial effect of bedroom count while holding sqft fixed. This "ceteris paribus" interpretation (Latin for "all other things equal") is what each coefficient in multiple regression represents. It is also why coefficients can change when you add or remove features.

Anchor dataset:

python

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

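# Feature matrix: one row per house, columns are sqft and bedrooms
X = np.array([
    [650,  2],
    [850,  2],
    [1100, 3],
    [1400, 3],
    [1600, 4],
    [1900, 4]
])
# Target: sale price in thousands of dollars
y = np.array([180, 220, 280, 340, 370, 430])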

The Design Matrix $X$

Append a column of ones for the intercept. The augmented matrix $X_{aug}$ and weight vector $w$ are:

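$X_{aug} = \begin{bmatrix} 1 & 650 & 2 \\ 1 & 850 & 2 \\ 1 & 1100 & 3 \\ 1 & 1400 & 3 \\ 1 & 1600 & 4 \\ 1 & 1900 & 4 \end{bmatrix}, \quad w = \begin{bmatrix} w_{0} \\ w_{1} \\ w_{2} \end{bmatrix}$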

Predictions for all samples at once: $\hat{y} = X_{aug} w$ .
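As a minimal sketch of why the column of ones matters, the snippet below builds $X_{aug}$ for the anchor dataset and checks that the single matrix product gives the same numbers as computing $w_{0} + w_{1} \times \text{sqft} + w_{2} \times \text{bedrooms}$ row by row. The weights here are hypothetical, picked only to exercise the arithmetic.

```python
import numpy as np

# Anchor dataset: columns are sqft and bedrooms
X = np.array([[650, 2], [850, 2], [1100, 3], [1400, 3], [1600, 4], [1900, 4]])

# Prepend a column of ones so the intercept w0 is handled by the same dot product
X_aug = np.column_stack([np.ones(len(X)), X])

# Hypothetical weights [w0, w1, w2], not fitted to anything
w = np.array([10.0, 0.1, 5.0])

# Matrix form: every prediction in one product
y_hat_matrix = X_aug @ w

# Row-by-row form: w0 + w1 * sqft + w2 * bedrooms
y_hat_loop = np.array([w[0] + w[1] * sqft + w[2] * beds for sqft, beds in X])

print(X_aug.shape)
print(np.allclose(y_hat_matrix, y_hat_loop))
```

The two forms agree because the ones column multiplies $w_{0}$ in every row, so the intercept needs no special handling.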

Manual prediction trace with illustrative weights $w_{0} = 30$ , $w_{1} = 0.17$ , $w_{2} = 15$ :

sqft	bedrooms	$1 \times w_{0}$	$\text{sqft} \times w_{1}$	$\text{beds} \times w_{2}$	$\hat{y}$	$y$	$\varepsilon = y - \hat{y}$
650	2	30	110.5	30	170.5	180	9.5
850	2	30	144.5	30	204.5	220	15.5
1100	3	30	187.0	45	262.0	280	18.0
1400	3	30	238.0	45	313.0	340	27.0
1600	4	30	272.0	60	362.0	370	8.0
1900	4	30	323.0	60	413.0	430	17.0

SSE = $9.5^{2} + 15.5^{2} + 18^{2} + 27^{2} + 8^{2} + 17^{2} = 90.25 + 240.25 + 324 + 729 + 64 + 289 = 1736.5$

MSE = $1736.5/6 = 289.4$ — much worse than the simple regression MSE of 22.2 because these weights aren't optimal.
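The trace above can be reproduced in a few lines. This is a sketch that assumes the same illustrative weights ($w_{0} = 30$, $w_{1} = 0.17$, $w_{2} = 15$) and treats the prices as thousands of dollars.

```python
import numpy as np

X = np.array([[650, 2], [850, 2], [1100, 3], [1400, 3], [1600, 4], [1900, 4]])
y = np.array([180, 220, 280, 340, 370, 430])  # prices, assumed to be in $ thousands

X_aug = np.column_stack([np.ones(len(X)), X])
w_illustrative = np.array([30.0, 0.17, 15.0])  # w0, w1, w2 from the trace table

y_hat = X_aug @ w_illustrative   # predictions for all six houses
residuals = y - y_hat            # epsilon = y - y_hat
sse = float(np.sum(residuals ** 2))
mse = sse / len(y)

print(np.round(residuals, 1))
print(f"SSE = {sse:.1f}, MSE = {mse:.1f}")
```

Every residual is positive, which already hints that these weights underpredict across the board and leave room for a better fit.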

OLS Solution for Multiple Features

$w^{*} = (X^{⊤} X)^{- 1} X^{⊤} y$
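For readers who want the missing step, here is a short sketch of where that formula comes from. It assumes $X$ stands for the augmented matrix $X_{aug}$ and that $X^{\top} X$ is invertible.

```latex
\begin{aligned}
\mathrm{SSE}(w) &= \lVert y - Xw \rVert^{2} = (y - Xw)^{\top}(y - Xw) \\
\nabla_{w}\,\mathrm{SSE}(w) &= -2\,X^{\top}(y - Xw) \\
\nabla_{w}\,\mathrm{SSE}(w) = 0 \;&\Longrightarrow\; X^{\top}X\,w = X^{\top}y \\
w^{*} &= (X^{\top}X)^{-1}X^{\top}y
\end{aligned}
```

The third line is the system of normal equations. It can be solved for $w$ without ever forming the inverse explicitly.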

In practice, use np.linalg.solve instead of inverting directly:

python

X_aug = np.column_stack([np.ones(6), X])
w_ols = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
print(f"w₀ = {w_ols[0]:.4f}")
print(f"w₁ = {w_ols[1]:.6f}")
print(f"w₂ = {w_ols[2]:.4f}")

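w₀ = 53.3333
w₁ = 0.200000
w₂ = 0.0000

(w₂ is zero up to floating-point round-off, so it may print as -0.0000.)

Coefficient Interpretation

$w_{0} = 53.33$ : baseline price when sqft = 0 and bedrooms = 0. Mathematically required; not a meaningful real-world prediction (no house has zero sqft).
$w_{1} = 0.200$ : holding bedrooms fixed, each additional square foot adds $200 to the predicted price.
$w_{2} \approx 0$ : holding sqft fixed, an additional bedroom adds nothing to the predicted price on this dataset.

Critical observation: on this dataset $w_{1}$ is 0.200 in both the simple and the multiple regression, and $w_{2}$ is zero. Within each bedroom group, price rises by exactly 0.2 per square foot (for the two 2-bedroom houses, $(220 - 180) / (850 - 650) = 0.2$ ), so once sqft is in the model, bedroom count has nothing left to explain. That is a property of these six points, not a general rule. When an omitted feature is correlated with an included one and also affects the target on its own, the included coefficient absorbs part of its effect. This is omitted variable bias, and adding the missing feature then shifts the coefficient.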
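A quick way to check how the sqft coefficient behaves between the two models is to fit both on the anchor dataset and print the weights side by side. The sketch below uses `np.linalg.lstsq` rather than the normal equations; that is a choice made here for numerical convenience, not something the rest of the post depends on.

```python
import numpy as np

sqft = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
beds = np.array([2, 2, 3, 3, 4, 4], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)
ones = np.ones_like(sqft)

# Simple regression: intercept + sqft
w_simple, *_ = np.linalg.lstsq(np.column_stack([ones, sqft]), y, rcond=None)

# Multiple regression: intercept + sqft + bedrooms
w_multi, *_ = np.linalg.lstsq(np.column_stack([ones, sqft, beds]), y, rcond=None)

print("simple   [w0, w1]    :", np.round(w_simple, 4))
print("multiple [w0, w1, w2]:", np.round(w_multi, 4))
```

Comparing the second entry of each vector shows directly whether adding bedrooms moved the sqft coefficient.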

[Figure: price vs. sqft in two panels. Left: all six houses plotted together ("all mixed") with the fitted line, slope = 0.20. Right: the same houses colored by bedroom count (2, 3, and 4 beds).]

Within each bedroom group, the sqft slope equals the combined slope (0.200) on this dataset, which is why adding bedrooms leaves $w_{1}$ unchanged. In data where price also rises with bedroom count at a fixed sqft, the within-group slope would be shallower than the combined slope, because larger sqft correlates with more bedrooms and the combined slope picks up both effects.
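The within-group and combined slopes can be computed directly. This sketch fits a separate line of price against sqft inside each bedroom group with `np.polyfit`, then one line through all six points; each group has only two houses, so the group "fit" is just the line through those two points.

```python
import numpy as np

sqft = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
beds = np.array([2, 2, 3, 3, 4, 4])
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)

# Slope of price vs. sqft inside each bedroom group
group_slopes = {}
for b in np.unique(beds):
    mask = beds == b
    slope, intercept = np.polyfit(sqft[mask], y[mask], deg=1)
    group_slopes[int(b)] = float(slope)

# Slope of price vs. sqft with all houses pooled together
pooled_slope = float(np.polyfit(sqft, y, deg=1)[0])

for b, s in group_slopes.items():
    print(f"{b} beds: slope = {s:.3f}")
print(f"pooled: slope = {pooled_slope:.3f}")
```

If the group slopes and the pooled slope differ on some other dataset, that gap is the part of the pooled slope that really belongs to bedroom count.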

Predicting with sklearn

python

model = LinearRegression()
model.fit(X, y)

print(f"w₀: {model.intercept_:.2f}")
print(f"w₁ (sqft):     {model.coef_[0]:.4f}")
print(f"w₂ (bedrooms): {model.coef_[1]:.4f}")

w₀: 53.33
w₁ (sqft):     0.2000
w₂ (bedrooms): 0.0000

(w₂ is zero up to floating-point round-off, so it may print as -0.0000.)

python

new_house = np.array([[1200, 3]])
print(f"Predicted price: ${model.predict(new_house)[0]:.1f}k")

Predicted price: $293.3k

Manual check: $53.33 + 0.200 \times 1200 + 0 \times 3 = 53.33 + 240 + 0 = \$ 293.33k$ ✓

Multicollinearity Warning

If sqft and bedrooms are highly correlated, $X^{⊤} X$ becomes nearly singular — the inversion is numerically unstable.

python

np.corrcoef(X[:, 0], X[:, 1])[0, 1]

0.9492

A correlation of 0.949 is strong multicollinearity: sqft accounts for about 90% of the variance in bedrooms ( $r^{2} \approx 0.90$ ). On a real dataset with noisy prices this causes coefficient instability: small changes in the training data can swing $w_{1}$ and $w_{2}$ substantially while leaving predictions nearly the same. This is the signal to apply Ridge regularization (post 14).
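One way to see that instability is a small simulation. The sketch below adds random noise to the prices, refits by least squares many times, and compares how much the coefficients move with how much a single prediction moves. The noise level of 5 (in $k), the trial count, and the seed are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

X = np.array([[650, 2], [850, 2], [1100, 3], [1400, 3], [1600, 4], [1900, 4]], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)
X_aug = np.column_stack([np.ones(len(X)), X])
x_new = np.array([1.0, 1200.0, 3.0])  # intercept term, sqft, bedrooms

noise_sd = 5.0    # assumed noise level in $k, illustrative only
n_trials = 2000

weights = np.empty((n_trials, 3))
for i in range(n_trials):
    # Perturb the targets slightly and refit from scratch
    y_noisy = y + rng.normal(0.0, noise_sd, size=len(y))
    weights[i] = np.linalg.lstsq(X_aug, y_noisy, rcond=None)[0]

predictions = weights @ x_new

sd_w1 = weights[:, 1].std()
sd_w2 = weights[:, 2].std()
print(f"sd of w1: {sd_w1:.4f}")
print(f"sd of w2: {sd_w2:.2f}")
print(f"sd of w2's contribution for 3 bedrooms: {3 * sd_w2:.2f}")
print(f"sd of the full prediction at (1200 sqft, 3 beds): {predictions.std():.2f}")
```

The spread of the bedroom term alone comes out much larger than the spread of the full prediction. Errors in $w_{1}$ and $w_{2}$ largely cancel each other, which is what makes the individual coefficients hard to trust even when the predictions look stable.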

Simple vs Multiple Regression Comparison

Aspect	Simple LR	Multiple LR
Model	$\hat{y} = w_{0} + w_{1} x$	$\hat{y} = w_{0} + w_{1} x_{1} + \dots + w_{p} x_{p}$
Geometry	Line (2D)	Hyperplane ( $p + 1$ D)
OLS solution	Normal equations ( $2 \times 2$ )	$(X^{⊤} X)^{- 1} X^{⊤} y$ ( $(p + 1) \times (p + 1)$ )
Coefficient meaning	Slope of the line	Partial effect (all others held constant)

The "partial effect" interpretation relies on being able to vary one feature while the others stay fixed, and the data can only support that when the features are not close to linearly dependent. Near-multicollinearity undermines this without making the fit fail: the model can achieve low training error while the individual coefficients have standard errors so large that they are statistically uninformative. Always check VIF (Variance Inflation Factor) when features are correlated.
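A minimal sketch of a VIF check follows, written with NumPy only. The helper name `vif` is hypothetical. It regresses each feature on the remaining ones (plus an intercept) and returns $1 / (1 - R^{2})$ for each. Values above roughly 5 to 10 are commonly treated as a warning sign, though that threshold is a rule of thumb rather than a hard cutoff.

```python
import numpy as np

def vif(X):
    """Variance inflation factor per column of X (features only, no intercept column)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    factors = []
    for j in range(p):
        target = X[:, j]
        # Regress feature j on an intercept plus all other features
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

X = np.array([[650, 2], [850, 2], [1100, 3], [1400, 3], [1600, 4], [1900, 4]])
print(np.round(vif(X), 2))
```

With only two features, both VIF values are identical and equal $1 / (1 - r^{2})$, where $r$ is the correlation between sqft and bedrooms.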

Adding features always reduces or holds constant the training SSE — it can never increase it. Adjusted $R^{2}$ (next post) corrects for this by penalizing model complexity.
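The claim about training SSE is easy to check empirically. This sketch appends a column of pure random noise to the anchor features and compares the least-squares training SSE before and after. The helper `training_sse` and the seed are hypothetical, introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

X = np.array([[650, 2], [850, 2], [1100, 3], [1400, 3], [1600, 4], [1900, 4]], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)

def training_sse(features, target):
    """Fit OLS with an intercept and return the sum of squared training residuals."""
    A = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ coef
    return float(resid @ resid)

noise = rng.normal(size=(len(X), 1))  # a feature unrelated to price

sse_base = training_sse(X, y)
sse_with_noise = training_sse(np.column_stack([X, noise]), y)

print(f"SSE with sqft + bedrooms:        {sse_base:.3f}")
print(f"SSE with an added noise feature: {sse_with_noise:.3f}")
```

The second number cannot exceed the first, because the larger model can always set the new weight to zero and recover the smaller fit. Any drop comes from fitting noise in the training set, which is why a lower training SSE says nothing about test performance.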

Test Your Understanding

On this dataset the sqft coefficient is 0.200 in both the simple and the multiple regression. Intuitively, what happens to this coefficient if you add a third feature that is completely uncorrelated with both sqft and bedrooms?
Compute the prediction for a house with sqft = 750 and 2 bedrooms using the OLS weights ( $w_{0} = 53.33$ , $w_{1} = 0.200$ , $w_{2} = 0$ ). Compare to the simple regression prediction for sqft = 750. Do they differ, and why?
Bedrooms and sqft are strongly correlated here, yet $w_{1}$ is the same in both models. What property of the data makes that possible? If bedrooms and sqft had zero correlation instead, could the multiple regression $w_{1}$ differ from the simple regression $w_{1}$ ?
In the design matrix $X_{aug}$ , what happens if two columns are identical? What does $X^{⊤} X$ look like and why does $(X^{⊤} X)^{- 1}$ fail?
The OLS solution minimizes SSE over all possible $w$ . Could you find weights with lower SSE on the training set by adding a random noise feature? Would that generalize to test data?
