← View series: machine learning
~/blog
Practicals: Simple and Multiple Linear Regression
The theory is complete — now you run it on real data. The California Housing dataset has 20,640 census blocks, 8 features, and enough messiness (outliers, correlated features, heteroscedasticity) to expose everything a single-feature or toy dataset hides. The goal is concrete numbers you can verify and a residual analysis that tells you what the linear model still can't do.
Dataset Exploration
from sklearn.datasets import fetch_california_housing
import pandas as pd
import numpy as np
data = fetch_california_housing()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['MedHouseVal'] = data.target
print(df.shape)
print(df.describe())
print(df.isnull().sum())(20640, 9)
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude MedHouseVal
count 20640.0 20640.0 20640.0 20640.0 20640.0 20640.0 20640.0 20640.0 20640.0
mean 3.9 28.6 5.4 1.1 1425.5 3.1 35.6 -119.6 2.1
std 1.9 12.6 2.5 0.5 1132.5 10.4 2.1 2.0 1.2
min 0.5 1.0 0.8 0.3 3.0 0.7 32.5 -124.4 0.1
max 15.0 52.0 141.9 34.1 35682.0 1243.3 42.0 -114.3 5.0
MedInc 0
HouseAge 0
...
dtype: int64
No null values. Two columns warrant attention:
- AveRooms: max = 141.9 (mean 5.4) — a handful of census blocks have extreme values, likely data errors.
- AveOccup: max = 1243.3 (mean 3.1) — severe outliers. Some sparsely populated blocks with a few homes show absurd occupancy ratios.
These outliers will inflate RMSE and create heteroscedasticity in residuals.
Simple Linear Regression — Single Feature
Start with MedInc (median income), which has the highest expected correlation with house value:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
X_simple = df[['MedInc']].values
y = df['MedHouseVal'].values
X_train, X_test, y_train, y_test = train_test_split(
X_simple, y, test_size=0.2, random_state=42
)
model_simple = LinearRegression()
model_simple.fit(X_train, y_train)
print(f"Intercept: {model_simple.intercept_:.4f}")
print(f"Coef[MedInc]: {model_simple.coef_[0]:.4f}")Intercept: 0.4494
Coef[MedInc]: 0.4163
For each unit increase in median income ($10k), house value increases by 0.4163 × $100k = $41,630.
y_pred_test = model_simple.predict(X_test)
print(f"Test RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_test)):.4f}")
print(f"Test R²: {r2_score(y_test, y_pred_test):.4f}")Test RMSE: 0.8345
Test R²: 0.4687
: income alone explains 47% of house value variance. RMSE of 0.83 in target units ($100k) means predictions are off by roughly $83k on average — a wide band.
<text x="290" y="268" text-anchor="middle" font-size="12" fill="#334155">MedInc (×$10k)</text>
<text x="18" y="130" text-anchor="middle" font-size="12" fill="#334155" transform="rotate(-90,18,130)">MedHouseVal (×$100k)</text>
<g fill="#3b82f6" opacity="0.35">
<circle cx="80" cy="180" r="2"/><circle cx="95" cy="200" r="2"/><circle cx="90" cy="160" r="2"/>
<circle cx="110" cy="190" r="2"/><circle cx="120" cy="150" r="2"/><circle cx="135" cy="170" r="2"/>
<circle cx="150" cy="140" r="2"/><circle cx="160" cy="155" r="2"/><circle cx="175" cy="130" r="2"/>
<circle cx="190" cy="145" r="2"/><circle cx="205" cy="120" r="2"/><circle cx="220" cy="105" r="2"/>
<circle cx="240" cy="115" r="2"/><circle cx="260" cy="100" r="2"/><circle cx="280" cy="90" r="2"/>
<circle cx="300" cy="80" r="2"/><circle cx="330" cy="75" r="2"/><circle cx="360" cy="65" r="2"/>
<circle cx="85" cy="220" r="2"/><circle cx="100" cy="210" r="2"/><circle cx="130" cy="200" r="2"/>
<circle cx="145" cy="185" r="2"/><circle cx="165" cy="175" r="2"/><circle cx="180" cy="165" r="2"/>
<circle cx="215" cy="135" r="2"/><circle cx="235" cy="125" r="2"/><circle cx="270" cy="110" r="2"/>
<circle cx="310" cy="85" r="2"/><circle cx="340" cy="70" r="2"/><circle cx="380" cy="55" r="2"/>
<circle cx="400" cy="50" r="2"/><circle cx="430" cy="45" r="2"/><circle cx="460" cy="40" r="2"/>
</g>
<line x1="60" y1="228" x2="520" y2="28" stroke="#3b82f6" stroke-width="2"/>
<text x="420" y="45" font-size="10" fill="#3b82f6">ŷ = 0.45 + 0.42·MedInc</text>
<path d="M60,240 Q290,150 520,60" fill="none" stroke="#94a3b8" stroke-width="1" stroke-dasharray="3,3"/>
<path d="M60,210 Q290,110 520,20" fill="none" stroke="#94a3b8" stroke-width="1" stroke-dasharray="3,3"/>
<text x="120" y="240" font-size="9" fill="#94a3b8">±RMSE envelope</text>
The wide residual scatter shows that income captures the trend but leaves substantial unexplained variance — geography and house characteristics matter.
Multiple Linear Regression — All 8 Features
X_multi = df[data.feature_names].values
X_train, X_test, y_train, y_test = train_test_split(
X_multi, y, test_size=0.2, random_state=42
)
model_multi = LinearRegression()
model_multi.fit(X_train, y_train)
print("Coefficients:")
for name, coef in zip(data.feature_names, model_multi.coef_):
print(f" {name:12s}: {coef:.4f}")
print(f"Intercept: {model_multi.intercept_:.4f}")Coefficients:
MedInc : 0.4367
HouseAge : 0.0094
AveRooms : -0.1073
AveBedrms : 0.6454
Population : -0.0000
AveOccup : -0.0038
Latitude : -0.4190
Longitude : -0.4333
Intercept: -36.9419
Notable signs:
- AveRooms: −0.107 but AveBedrms: +0.645 — these are highly correlated, and multicollinearity caused the sign flip. Don't interpret these individually.
- Latitude/Longitude: both negative — house values drop as you move north or east from coastal California. The model is encoding geography.
- Population: ≈0 — census block population, once other features are included, adds no marginal signal.
y_pred_multi = model_multi.predict(X_test)
print(f"Test RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_multi)):.4f}")
print(f"Test R²: {r2_score(y_test, y_pred_multi):.4f}")Test RMSE: 0.7358
Test R²: 0.5958
improved from 0.47 (simple) to 0.60 (multiple) — 13 points from 7 additional features. RMSE dropped from 0.835 to 0.736 ($83.5k → $73.6k average error). The improvement is real but the ceiling for a linear model on this dataset is around .
Residual Analysis
import matplotlib.pyplot as plt
residuals = y_test - y_pred_multi
print(f"Mean residual: {residuals.mean():.4f}")
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.hist(residuals, bins=50)
plt.xlabel('Residual')
plt.ylabel('Count')
plt.title('Residual Distribution')
plt.subplot(1, 2, 2)
plt.scatter(y_pred_multi, residuals, alpha=0.3, s=5)
plt.axhline(0, color='red')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Fitted')
plt.tight_layout()
plt.show()Mean residual: -0.0024
Two observations:
- Residual distribution: roughly symmetric around zero (mean ≈ 0 ✓) but with heavy left tail — high-value houses are underpredicted.
- Residuals vs Fitted: a fan shape for high fitted values (). This is heteroscedasticity — variance of errors grows with predicted value. One cause: the dataset caps house values at $500k ($5.0 in target units), so wealthy neighborhoods are systematically compressed.
Model Comparison
| Model | Features | Train | Test | Test RMSE | Notes |
|---|---|---|---|---|---|
| Simple LR | MedInc only | 0.478 | 0.469 | 0.835 | Baseline single feature |
| Multiple LR | All 8 | 0.606 | 0.596 | 0.736 | Meaningful improvement |
| (Next: Ridge) | All 8 | ~0.607 | ~0.600 | ~0.733 | Regularization adds stability |
Related Concepts and Honest Limitations
The AveRooms/AveBedrms sign flip is multicollinearity in action. Their correlation is ~0.85 — the model assigns arbitrary individual signs to explain variance that belongs to both. Ridge regression (post 14) stabilizes this by shrinking both toward zero proportionally.
The 0.60 ceiling for on California Housing is a property of the dataset, not a flaw in linear regression. The true relationship between geography, income, and house price is non-linear — tree models (Random Forest, XGBoost) reach on the same data. Linear regression is an honest baseline, not the end state.
Test Your Understanding
-
The multiple regression improved from 0.47 to 0.60 by adding 7 features. If you added 100 more randomly generated noise features, would increase, decrease, or stay the same? What about adjusted ?
-
AveRooms has a negative coefficient (−0.107) despite larger homes typically costing more. Explain how multicollinearity with AveBedrms produces this sign.
-
The mean residual is −0.0024 (near zero). Why is unbiasedness guaranteed by OLS on the training set? Is it guaranteed on the test set?
-
The residuals vs fitted plot shows a fan shape for high values. Name two transformations of that could reduce heteroscedasticity, and explain the trade-off of each.
-
Both AveRooms and AveBedrms have unreliable coefficients due to multicollinearity. If you had to drop one of these features, how would you decide which one to keep?