Multiple Linear Regression

Jun 25, 2026 · 6 min read · By Mohammed Vasim
Machine Learning · AI · Data Science

Simple linear regression finds the best line through one feature. Multiple linear regression extends that to several features simultaneously, and the key word is simultaneously. Each coefficient now measures the isolated effect of its feature while all other features are held constant. That "holding constant" is what makes multiple regression both harder to interpret and more powerful than running separate simple regressions.

From Simple to Multiple

Simple:

$$\hat{y} = w_0 + w_1 \cdot \text{sqft}$$

Multiple:

$$\hat{y} = w_0 + w_1 \cdot \text{sqft} + w_2 \cdot \text{bedrooms}$$

Adding bedrooms introduces $w_2$, the partial effect of bedroom count while holding sqft fixed. This "ceteris paribus" interpretation (Latin for "all other things equal") is what each coefficient in multiple regression represents. It's also why the coefficients can change when you add or remove features.

Anchor dataset (features: sqft and bedrooms; target: price in thousands of dollars):

python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

X = np.array([
    [650,  2],
    [850,  2],
    [1100, 3],
    [1400, 3],
    [1600, 4],
    [1900, 4]
])
y = np.array([180, 220, 280, 340, 370, 430])

The Design Matrix

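Append a column of ones for the intercept. The augmented matrix $X$ and weight vector $\mathbf{w}$ are:

$$X = \begin{bmatrix} 1 & 650 & 2 \\ 1 & 850 & 2 \\ 1 & 1100 & 3 \\ 1 & 1400 & 3 \\ 1 & 1600 & 4 \\ 1 & 1900 & 4 \end{bmatrix}, \qquad \mathbf{w} = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix}$$

Predictions for all samples at once: $\hat{\mathbf{y}} = X\mathbf{w}$.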

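Manual prediction trace with illustrative weights $w_0 = 30$, $w_1 = 0.17$, $w_2 = 15$:

| sq_ft | bedrooms | $w_0$ | $w_1 \cdot$ sq_ft | $w_2 \cdot$ bedrooms | $\hat{y}$ | $y$ | $y - \hat{y}$ |
|---|---|---|---|---|---|---|---|
| 650 | 2 | 30 | 110.5 | 30 | 170.5 | 180 | 9.5 |
| 850 | 2 | 30 | 144.5 | 30 | 204.5 | 220 | 15.5 |
| 1100 | 3 | 30 | 187.0 | 45 | 262.0 | 280 | 18.0 |
| 1400 | 3 | 30 | 238.0 | 45 | 313.0 | 340 | 27.0 |
| 1600 | 4 | 30 | 272.0 | 60 | 362.0 | 370 | 8.0 |
| 1900 | 4 | 30 | 323.0 | 60 | 413.0 | 430 | 17.0 |

SSE = $9.5^2 + 15.5^2 + 18.0^2 + 27.0^2 + 8.0^2 + 17.0^2 = 1736.5$

MSE = $1736.5 / 6 \approx 289.4$, much worse than the simple regression MSE of 22.2 because these weights aren't optimal.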
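The trace above is a single matrix-vector product, so it can be reproduced in a few lines. This is a minimal sketch using the anchor dataset and the illustrative weights from the table; the names `w_trial`, `y_hat`, and `residuals` are chosen here for illustration.

```python
import numpy as np

# Anchor dataset: columns are sqft and bedrooms; y is the price.
X = np.array([[650, 2], [850, 2], [1100, 3], [1400, 3], [1600, 4], [1900, 4]], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)

# Design matrix: the leading column of ones carries the intercept.
X_aug = np.column_stack([np.ones(len(X)), X])

# Illustrative (non-optimal) weights from the trace: w0, w1, w2.
w_trial = np.array([30.0, 0.17, 15.0])

y_hat = X_aug @ w_trial            # all six predictions at once
residuals = y - y_hat              # same sign convention as the table
sse = float(np.sum(residuals ** 2))
mse = sse / len(y)

print("predictions:", np.round(y_hat, 1))
print(f"SSE = {sse:.1f}, MSE = {mse:.1f}")   # SSE = 1736.5, MSE = 289.4
```

Swapping other values into `w_trial` is a quick way to see how far a hand-picked guess sits from the optimum.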

OLS Solution for Multiple Features

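The closed-form OLS solution comes from the normal equations:

$$\mathbf{w} = (X^\top X)^{-1} X^\top \mathbf{y}$$

In practice, use `np.linalg.solve` instead of inverting directly: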

python
X_aug = np.column_stack([np.ones(6), X])
w_ols = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
print(f"w₀ = {w_ols[0]:.4f}")
print(f"w₁ = {w_ols[1]:.6f}")
print(f"w₂ = {w_ols[2]:.4f}")
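w₀ = 53.3333 w₁ = 0.200000 w₂ = 0.0000

(w₂ is zero up to floating-point rounding, so it may print as -0.0000.)

Coefficient Interpretation

  • $w_0 \approx 53.33$: baseline price when sqft = 0 and bedrooms = 0. Mathematically required; not a meaningful real-world prediction (no house has zero sqft).
  • $w_1 = 0.200$: holding bedrooms fixed, each additional square foot adds \$200 to the predicted price.
  • $w_2 \approx 0$: holding sqft fixed, an additional bedroom does not change the predicted price on this dataset.

Critical observation: on this dataset $w_1$ does not move. It is 0.200 in simple regression and 0.200 in multiple regression, and $w_2$ comes out as zero. The reason is visible in the data: within each bedroom group the price rises by exactly 0.2 per square foot (40/200 for the 2-bedroom pair, 60/300 for the 3- and 4-bedroom pairs), and the offsets left after removing that slope (50, 60, 50 for 2, 3, 4 bedrooms) have no linear trend in bedroom count. That is a property of these six points, not of regression in general. Larger houses tend to have more bedrooms, and when an omitted feature is correlated with both an included feature and the target, the included coefficient absorbs part of the omitted feature's effect. This is omitted variable bias, and adding the omitted feature to the model is what separates the two contributions.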
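One way to check the fitted coefficients against simple regression is to fit both models on the anchor dataset and compare the weights and training SSE. The sketch below uses `np.linalg.lstsq` as a swapped-in solver rather than the `np.linalg.solve` call above; the helper `sse` and the variable names are illustrative.

```python
import numpy as np

X = np.array([[650, 2], [850, 2], [1100, 3], [1400, 3], [1600, 4], [1900, 4]], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)
ones = np.ones(len(y))

# Simple regression: intercept + sqft only.
A_simple = np.column_stack([ones, X[:, 0]])
w_simple, *_ = np.linalg.lstsq(A_simple, y, rcond=None)

# Multiple regression: intercept + sqft + bedrooms.
A_multi = np.column_stack([ones, X])
w_multi, *_ = np.linalg.lstsq(A_multi, y, rcond=None)

def sse(A, w):
    """Sum of squared residuals of the fit A @ w against y."""
    r = y - A @ w
    return float(r @ r)

print("simple   weights:", np.round(w_simple, 4), "SSE:", round(sse(A_simple, w_simple), 3))
print("multiple weights:", np.round(w_multi, 4), "SSE:", round(sse(A_multi, w_multi), 3))

# At the OLS optimum the residuals are orthogonal to every design-matrix column.
print("X^T r:", np.round(A_multi.T @ (y - A_multi @ w_multi), 6))
```

The orthogonality check on the last line is the normal equations restated: if $X^\top (\mathbf{y} - X\mathbf{w})$ is zero, no other weight vector can have a lower training SSE.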

[Figure: price vs. sqft in two panels. Left panel, "Simple LR": all points mixed, with the fitted line. Right panel, "Multiple LR": the same points colored by bedroom count (2, 3, and 4 beds), with the fitted sqft line.]

On this dataset the sqft slope within each bedroom group is 0.200, the same as the combined slope. When the within-group slope is shallower than the combined slope, the combined slope is inflated because larger sqft correlates with more bedrooms.
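Omitted variable bias is easiest to see on a target built with a known bedroom effect. The sketch below reuses the sqft and bedroom columns from the anchor dataset but swaps in a hypothetical price vector `y_synth` (an assumption made purely for illustration, not the post's prices), constructed as 0.15 per square foot plus 20 per bedroom.

```python
import numpy as np

# Feature columns reused from the anchor dataset.
sqft = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
beds = np.array([2, 2, 3, 3, 4, 4], dtype=float)

# Hypothetical target (NOT the post's prices): known slopes of 0.15 and 20.
y_synth = 0.15 * sqft + 20.0 * beds

ones = np.ones_like(sqft)
w_simple, *_ = np.linalg.lstsq(np.column_stack([ones, sqft]), y_synth, rcond=None)
w_multi, *_ = np.linalg.lstsq(np.column_stack([ones, sqft, beds]), y_synth, rcond=None)

# With bedrooms omitted, the sqft slope absorbs part of the bedroom effect.
print(f"simple   sqft slope: {w_simple[1]:.4f}")
# With bedrooms included, the constructed slopes are recovered.
print(f"multiple sqft slope: {w_multi[1]:.4f}")
print(f"multiple bedroom coefficient: {w_multi[2]:.4f}")
```

The gap between the two sqft slopes equals the bedroom coefficient times the slope of bedrooms regressed on sqft, which is the standard omitted-variable-bias formula.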

Predicting with sklearn

python
model = LinearRegression()
model.fit(X, y)

print(f"w₀: {model.intercept_:.2f}")
print(f"w₁ (sqft):     {model.coef_[0]:.4f}")
print(f"w₂ (bedrooms): {model.coef_[1]:.4f}")
w₀: 53.33 w₁ (sqft): 0.2000 w₂ (bedrooms): 0.0000
python
new_house = np.array([[1200, 3]])
print(f"Predicted price: ${model.predict(new_house)[0]:.1f}k")
Predicted price: $293.3k

Manual check: $53.33 + 0.200 \times 1200 + 0 \times 3 = 53.33 + 240 + 0 = \$293.33\text{k}$ ✓

Multicollinearity Warning

If sqft and bedrooms are highly correlated, $X^\top X$ becomes nearly singular, and the inversion is numerically unstable.

python
np.corrcoef(X[:, 0], X[:, 1])[0, 1]
0.9492

A correlation of about 0.95 is strong multicollinearity. On a real dataset this would cause coefficient instability: tiny changes in the training data could swing $w_1$ and $w_2$ dramatically while leaving predictions nearly the same. This is the signal to apply Ridge regularization (post 14).
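The instability can be made concrete by refitting on slightly perturbed targets and watching how much the weights move compared with a prediction. This is a rough sketch, not a formal diagnostic: the noise level `sigma`, the 500 refits, the seed, and the query house `x_new` are all assumptions chosen for illustration.

```python
import numpy as np

X = np.array([[650, 2], [850, 2], [1100, 3], [1400, 3], [1600, 4], [1900, 4]], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)
X_aug = np.column_stack([np.ones(len(y)), X])
x_new = np.array([1.0, 1200.0, 3.0])     # intercept term, sqft, bedrooms

# A large condition number means X^T X is close to singular.
print(f"cond(X^T X) = {np.linalg.cond(X_aug.T @ X_aug):.3g}")

rng = np.random.default_rng(0)           # fixed seed so the run is repeatable
sigma = 2.0                              # assumed noise level, in price units
weights, preds = [], []
for _ in range(500):
    y_noisy = y + rng.normal(0.0, sigma, size=len(y))
    w, *_ = np.linalg.lstsq(X_aug, y_noisy, rcond=None)
    weights.append(w)
    preds.append(x_new @ w)

coef_sd = np.std(weights, axis=0)        # spread of w0, w1, w2 across refits
pred_sd = np.std(preds)                  # spread of the prediction for x_new

print("std of [w0, w1, w2]:", np.round(coef_sd, 4))
print(f"std of prediction for 1200 sqft, 3 beds: {pred_sd:.3f}")
```

Comparing the spread of the bedroom weight with the spread of the prediction shows the pattern described above: the individual coefficient wanders much more than the fitted value does.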

Simple vs Multiple Regression Comparison

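| Aspect | Simple LR | Multiple LR |
|---|---|---|
| Model | $\hat{y} = w_0 + w_1 x$ | $\hat{y} = w_0 + w_1 x_1 + \dots + w_n x_n$ |
| Geometry | Line (2D) | Hyperplane ($(n+1)$D) |
| OLS solution | Normal equations ($2 \times 2$ system) | $\mathbf{w} = (X^\top X)^{-1} X^\top \mathbf{y}$ ($(n+1) \times (n+1)$ system) |
| Coefficient meaning | Slope of the line | Partial effect (all others held constant) |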

Multiple regression's "partial effect" interpretation only makes sense when one feature can vary while the others stay fixed, which requires the features to be linearly independent. Near-multicollinearity undermines this without making the model fail to fit: the model might achieve low training error while the individual coefficients have such high variance that they are statistically meaningless. Always check VIF (Variance Inflation Factor) when features are correlated.
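VIF can be computed directly from its definition: regress each feature on the remaining ones and take $1 / (1 - R_j^2)$. The function below is a minimal NumPy sketch of that definition; the name `vif` is hypothetical, not a library call.

```python
import numpy as np

def vif(features: np.ndarray) -> np.ndarray:
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the other columns."""
    n, p = features.shape
    out = np.empty(p)
    for j in range(p):
        target = features[:, j]
        # Regress feature j on an intercept plus all remaining features.
        others = np.column_stack([np.ones(n), np.delete(features, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Anchor features: sqft and bedrooms.
X = np.array([[650, 2], [850, 2], [1100, 3], [1400, 3], [1600, 4], [1900, 4]], dtype=float)
print("VIF (sqft, bedrooms):", np.round(vif(X), 2))
```

With only two features the two VIFs coincide, because each reduces to $1 / (1 - r^2)$ for the pairwise correlation $r$. A commonly quoted rule of thumb treats values above roughly 5 to 10 as a warning sign, though the threshold is a convention rather than a hard limit.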

Adding features always reduces or holds constant the training SSE; it can never increase it. Adjusted $R^2$ (next post) corrects for this by penalizing model complexity.
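The claim about training SSE can be checked by appending a feature that has nothing to do with price. In this sketch the extra column is pure random noise; the seed and the helper `train_sse` are assumptions made for illustration.

```python
import numpy as np

X = np.array([[650, 2], [850, 2], [1100, 3], [1400, 3], [1600, 4], [1900, 4]], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)

def train_sse(features: np.ndarray, target: np.ndarray) -> float:
    """Training SSE of an OLS fit with an intercept."""
    A = np.column_stack([np.ones(len(target)), features])
    w, *_ = np.linalg.lstsq(A, target, rcond=None)
    r = target - A @ w
    return float(r @ r)

rng = np.random.default_rng(42)
noise = rng.normal(size=(len(y), 1))      # a feature unrelated to price

sse_base = train_sse(X, y)
sse_with_noise = train_sse(np.hstack([X, noise]), y)

print(f"SSE with sqft + bedrooms:        {sse_base:.3f}")
print(f"SSE with an extra noise feature: {sse_with_noise:.3f}")
```

The second number cannot exceed the first, because the larger model can always set the new weight to zero and reproduce the smaller fit. Any drop here is fitting noise, which is why training SSE alone is a poor guide to adding features.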

Test Your Understanding

  1. Suppose the sqft coefficient on a housing dataset is 0.200 in simple regression and drops to 0.175 once bedrooms are added. Intuitively, what happens to this coefficient if you then add a third feature that is completely uncorrelated with sqft?

  2. Compute the prediction for a house with sqft = 750 and 2 bedrooms using the optimal OLS weights ($w_0 = 53.33$, $w_1 = 0.200$, $w_2 = 0$). Compare to the simple regression prediction for sqft = 750. Do the two differ, and why or why not?

  3. If bedrooms and sqft had zero correlation in our dataset, could the multiple regression $w_1$ differ from the simple regression $w_1$? Why?

  4. In the design matrix $X$, what happens if two columns are identical? What does $X^\top X$ look like and why does $(X^\top X)^{-1}$ fail?

  5. The OLS solution minimizes SSE over all possible $\mathbf{w}$. Could you find weights with lower SSE on the training set by adding a random noise feature? Would that generalize to test data?
