Back to blog
← View series: machine learning

~/blog

Practicals: Simple and Multiple Linear Regression

Jun 25, 20265 min readBy Mohammed Vasim
Machine LearningAIData Science

The theory is complete — now you run it on real data. The California Housing dataset has 20,640 census blocks, 8 features, and enough messiness (outliers, correlated features, heteroscedasticity) to expose everything a single-feature or toy dataset hides. The goal is concrete numbers you can verify and a residual analysis that tells you what the linear model still can't do.

Dataset Exploration

python
from sklearn.datasets import fetch_california_housing
import pandas as pd
import numpy as np

data = fetch_california_housing()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['MedHouseVal'] = data.target

print(df.shape)
print(df.describe())
print(df.isnull().sum())
(20640, 9) MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude MedHouseVal count 20640.0 20640.0 20640.0 20640.0 20640.0 20640.0 20640.0 20640.0 20640.0 mean 3.9 28.6 5.4 1.1 1425.5 3.1 35.6 -119.6 2.1 std 1.9 12.6 2.5 0.5 1132.5 10.4 2.1 2.0 1.2 min 0.5 1.0 0.8 0.3 3.0 0.7 32.5 -124.4 0.1 max 15.0 52.0 141.9 34.1 35682.0 1243.3 42.0 -114.3 5.0 MedInc 0 HouseAge 0 ... dtype: int64

No null values. Two columns warrant attention:

  • AveRooms: max = 141.9 (mean 5.4) — a handful of census blocks have extreme values, likely data errors.
  • AveOccup: max = 1243.3 (mean 3.1) — severe outliers. Some sparsely populated blocks with a few homes show absurd occupancy ratios.

These outliers will inflate RMSE and create heteroscedasticity in residuals.

Simple Linear Regression — Single Feature

Start with MedInc (median income), which has the highest expected correlation with house value:

python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

X_simple = df[['MedInc']].values
y = df['MedHouseVal'].values

X_train, X_test, y_train, y_test = train_test_split(
    X_simple, y, test_size=0.2, random_state=42
)

model_simple = LinearRegression()
model_simple.fit(X_train, y_train)

print(f"Intercept:    {model_simple.intercept_:.4f}")
print(f"Coef[MedInc]: {model_simple.coef_[0]:.4f}")
Intercept: 0.4494 Coef[MedInc]: 0.4163

For each unit increase in median income ($10k), house value increases by 0.4163 × $100k = $41,630.

python
y_pred_test = model_simple.predict(X_test)
print(f"Test RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_test)):.4f}")
print(f"Test R²:   {r2_score(y_test, y_pred_test):.4f}")
Test RMSE: 0.8345 Test R²: 0.4687

: income alone explains 47% of house value variance. RMSE of 0.83 in target units ($100k) means predictions are off by roughly $83k on average — a wide band.

<text x="290" y="268" text-anchor="middle" font-size="12" fill="#334155">MedInc (×$10k)</text> <text x="18" y="130" text-anchor="middle" font-size="12" fill="#334155" transform="rotate(-90,18,130)">MedHouseVal (×$100k)</text> <g fill="#3b82f6" opacity="0.35"> <circle cx="80" cy="180" r="2"/><circle cx="95" cy="200" r="2"/><circle cx="90" cy="160" r="2"/> <circle cx="110" cy="190" r="2"/><circle cx="120" cy="150" r="2"/><circle cx="135" cy="170" r="2"/> <circle cx="150" cy="140" r="2"/><circle cx="160" cy="155" r="2"/><circle cx="175" cy="130" r="2"/> <circle cx="190" cy="145" r="2"/><circle cx="205" cy="120" r="2"/><circle cx="220" cy="105" r="2"/> <circle cx="240" cy="115" r="2"/><circle cx="260" cy="100" r="2"/><circle cx="280" cy="90" r="2"/> <circle cx="300" cy="80" r="2"/><circle cx="330" cy="75" r="2"/><circle cx="360" cy="65" r="2"/> <circle cx="85" cy="220" r="2"/><circle cx="100" cy="210" r="2"/><circle cx="130" cy="200" r="2"/> <circle cx="145" cy="185" r="2"/><circle cx="165" cy="175" r="2"/><circle cx="180" cy="165" r="2"/> <circle cx="215" cy="135" r="2"/><circle cx="235" cy="125" r="2"/><circle cx="270" cy="110" r="2"/> <circle cx="310" cy="85" r="2"/><circle cx="340" cy="70" r="2"/><circle cx="380" cy="55" r="2"/> <circle cx="400" cy="50" r="2"/><circle cx="430" cy="45" r="2"/><circle cx="460" cy="40" r="2"/> </g> <line x1="60" y1="228" x2="520" y2="28" stroke="#3b82f6" stroke-width="2"/> <text x="420" y="45" font-size="10" fill="#3b82f6">ŷ = 0.45 + 0.42·MedInc</text> <path d="M60,240 Q290,150 520,60" fill="none" stroke="#94a3b8" stroke-width="1" stroke-dasharray="3,3"/> <path d="M60,210 Q290,110 520,20" fill="none" stroke="#94a3b8" stroke-width="1" stroke-dasharray="3,3"/> <text x="120" y="240" font-size="9" fill="#94a3b8">±RMSE envelope</text>

The wide residual scatter shows that income captures the trend but leaves substantial unexplained variance — geography and house characteristics matter.

Multiple Linear Regression — All 8 Features

python
X_multi = df[data.feature_names].values

X_train, X_test, y_train, y_test = train_test_split(
    X_multi, y, test_size=0.2, random_state=42
)

model_multi = LinearRegression()
model_multi.fit(X_train, y_train)

print("Coefficients:")
for name, coef in zip(data.feature_names, model_multi.coef_):
    print(f"  {name:12s}: {coef:.4f}")
print(f"Intercept: {model_multi.intercept_:.4f}")
Coefficients: MedInc : 0.4367 HouseAge : 0.0094 AveRooms : -0.1073 AveBedrms : 0.6454 Population : -0.0000 AveOccup : -0.0038 Latitude : -0.4190 Longitude : -0.4333 Intercept: -36.9419

Notable signs:

  • AveRooms: −0.107 but AveBedrms: +0.645 — these are highly correlated, and multicollinearity caused the sign flip. Don't interpret these individually.
  • Latitude/Longitude: both negative — house values drop as you move north or east from coastal California. The model is encoding geography.
  • Population: ≈0 — census block population, once other features are included, adds no marginal signal.
python
y_pred_multi = model_multi.predict(X_test)
print(f"Test RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_multi)):.4f}")
print(f"Test R²:   {r2_score(y_test, y_pred_multi):.4f}")
Test RMSE: 0.7358 Test R²: 0.5958

improved from 0.47 (simple) to 0.60 (multiple) — 13 points from 7 additional features. RMSE dropped from 0.835 to 0.736 ($83.5k → $73.6k average error). The improvement is real but the ceiling for a linear model on this dataset is around .

Residual Analysis

python
import matplotlib.pyplot as plt

residuals = y_test - y_pred_multi

print(f"Mean residual: {residuals.mean():.4f}")

plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
plt.hist(residuals, bins=50)
plt.xlabel('Residual')
plt.ylabel('Count')
plt.title('Residual Distribution')

plt.subplot(1, 2, 2)
plt.scatter(y_pred_multi, residuals, alpha=0.3, s=5)
plt.axhline(0, color='red')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Fitted')
plt.tight_layout()
plt.show()
Mean residual: -0.0024

Two observations:

  • Residual distribution: roughly symmetric around zero (mean ≈ 0 ✓) but with heavy left tail — high-value houses are underpredicted.
  • Residuals vs Fitted: a fan shape for high fitted values (). This is heteroscedasticity — variance of errors grows with predicted value. One cause: the dataset caps house values at $500k ($5.0 in target units), so wealthy neighborhoods are systematically compressed.

Model Comparison

ModelFeaturesTrain Test Test RMSENotes
Simple LRMedInc only0.4780.4690.835Baseline single feature
Multiple LRAll 80.6060.5960.736Meaningful improvement
(Next: Ridge)All 8~0.607~0.600~0.733Regularization adds stability

The AveRooms/AveBedrms sign flip is multicollinearity in action. Their correlation is ~0.85 — the model assigns arbitrary individual signs to explain variance that belongs to both. Ridge regression (post 14) stabilizes this by shrinking both toward zero proportionally.

The 0.60 ceiling for on California Housing is a property of the dataset, not a flaw in linear regression. The true relationship between geography, income, and house price is non-linear — tree models (Random Forest, XGBoost) reach on the same data. Linear regression is an honest baseline, not the end state.

Test Your Understanding

  1. The multiple regression improved from 0.47 to 0.60 by adding 7 features. If you added 100 more randomly generated noise features, would increase, decrease, or stay the same? What about adjusted ?

  2. AveRooms has a negative coefficient (−0.107) despite larger homes typically costing more. Explain how multicollinearity with AveBedrms produces this sign.

  3. The mean residual is −0.0024 (near zero). Why is unbiasedness guaranteed by OLS on the training set? Is it guaranteed on the test set?

  4. The residuals vs fitted plot shows a fan shape for high values. Name two transformations of that could reduce heteroscedasticity, and explain the trade-off of each.

  5. Both AveRooms and AveBedrms have unreliable coefficients due to multicollinearity. If you had to drop one of these features, how would you decide which one to keep?

Comments (0)

No comments yet. Be the first to comment!

Leave a comment