Back to blog
← View series: machine learning

Types of Machine Learning Equation of a Line, 3D Plane, and Hyperplane Distance of a Point from a Plane Instance-Based vs Model-Based Learning Simple Linear Regression Cost Function in Linear Regression Gradient Descent Multiple Linear Regression Performance Metrics for Regression Overfitting and Underfitting Linear Regression OLS: The Normal Equation Practicals: Simple and Multiple Linear Regression Polynomial Regression Ridge, Lasso, and ElasticNet Regression Cross-Validation End-to-End ML Project: Linear Regression

~/blog

Practicals: Simple and Multiple Linear Regression

Jun 25, 2026•5 min read•By Mohammed Vasim

Machine LearningAIData Science

The theory is complete — now you run it on real data. The California Housing dataset has 20,640 census blocks, 8 features, and enough messiness (outliers, correlated features, heteroscedasticity) to expose everything a single-feature or toy dataset hides. The goal is concrete numbers you can verify and a residual analysis that tells you what the linear model still can't do.

Dataset Exploration

python

from sklearn.datasets import fetch_california_housing
import pandas as pd
import numpy as np

data = fetch_california_housing()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['MedHouseVal'] = data.target

print(df.shape)
print(df.describe())
print(df.isnull().sum())

(20640, 9)

       MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  MedHouseVal
count  20640.0   20640.0   20640.0    20640.0     20640.0   20640.0   20640.0    20640.0      20640.0
mean       3.9      28.6       5.4        1.1      1425.5       3.1      35.6     -119.6          2.1
std        1.9      12.6       2.5        0.5      1132.5      10.4       2.1        2.0          1.2
min        0.5       1.0       0.8        0.3         3.0       0.7      32.5     -124.4          0.1
max       15.0      52.0     141.9       34.1     35682.0    1243.3      42.0     -114.3          5.0

MedInc        0
HouseAge      0
...
dtype: int64

No null values. Two columns warrant attention:

AveRooms: max = 141.9 (mean 5.4) — a handful of census blocks have extreme values, likely data errors.
AveOccup: max = 1243.3 (mean 3.1) — severe outliers. Some sparsely populated blocks with a few homes show absurd occupancy ratios.

These outliers will inflate RMSE and create heteroscedasticity in residuals.

Simple Linear Regression — Single Feature

Start with MedInc (median income), which has the highest expected correlation with house value:

python

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

X_simple = df[['MedInc']].values
y = df['MedHouseVal'].values

X_train, X_test, y_train, y_test = train_test_split(
    X_simple, y, test_size=0.2, random_state=42
)

model_simple = LinearRegression()
model_simple.fit(X_train, y_train)

print(f"Intercept:    {model_simple.intercept_:.4f}")
print(f"Coef[MedInc]: {model_simple.coef_[0]:.4f}")

Intercept:    0.4494
Coef[MedInc]: 0.4163

For each unit increase in median income ($10k), house value increases by 0.4163 × $100k = $41,630.

python

y_pred_test = model_simple.predict(X_test)
print(f"Test RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_test)):.4f}")
print(f"Test R²:   {r2_score(y_test, y_pred_test):.4f}")

Test RMSE: 0.8345
Test R²:   0.4687

$R^{2} = 0.47$ : income alone explains 47% of house value variance. RMSE of 0.83 in target units ($100k) means predictions are off by roughly $83k on average — a wide band.

<text x="290" y="268" text-anchor="middle" font-size="12" fill="#334155">MedInc (×$10k)</text>
<text x="18" y="130" text-anchor="middle" font-size="12" fill="#334155" transform="rotate(-90,18,130)">MedHouseVal (×$100k)</text>

<g fill="#3b82f6" opacity="0.35">
  <circle cx="80" cy="180" r="2"/><circle cx="95" cy="200" r="2"/><circle cx="90" cy="160" r="2"/>
  <circle cx="110" cy="190" r="2"/><circle cx="120" cy="150" r="2"/><circle cx="135" cy="170" r="2"/>
  <circle cx="150" cy="140" r="2"/><circle cx="160" cy="155" r="2"/><circle cx="175" cy="130" r="2"/>
  <circle cx="190" cy="145" r="2"/><circle cx="205" cy="120" r="2"/><circle cx="220" cy="105" r="2"/>
  <circle cx="240" cy="115" r="2"/><circle cx="260" cy="100" r="2"/><circle cx="280" cy="90" r="2"/>
  <circle cx="300" cy="80" r="2"/><circle cx="330" cy="75" r="2"/><circle cx="360" cy="65" r="2"/>
  <circle cx="85" cy="220" r="2"/><circle cx="100" cy="210" r="2"/><circle cx="130" cy="200" r="2"/>
  <circle cx="145" cy="185" r="2"/><circle cx="165" cy="175" r="2"/><circle cx="180" cy="165" r="2"/>
  <circle cx="215" cy="135" r="2"/><circle cx="235" cy="125" r="2"/><circle cx="270" cy="110" r="2"/>
  <circle cx="310" cy="85" r="2"/><circle cx="340" cy="70" r="2"/><circle cx="380" cy="55" r="2"/>
  <circle cx="400" cy="50" r="2"/><circle cx="430" cy="45" r="2"/><circle cx="460" cy="40" r="2"/>
</g>

<line x1="60" y1="228" x2="520" y2="28" stroke="#3b82f6" stroke-width="2"/>
<text x="420" y="45" font-size="10" fill="#3b82f6">ŷ = 0.45 + 0.42·MedInc</text>

<path d="M60,240 Q290,150 520,60" fill="none" stroke="#94a3b8" stroke-width="1" stroke-dasharray="3,3"/>
<path d="M60,210 Q290,110 520,20" fill="none" stroke="#94a3b8" stroke-width="1" stroke-dasharray="3,3"/>
<text x="120" y="240" font-size="9" fill="#94a3b8">±RMSE envelope</text>

The wide residual scatter shows that income captures the trend but leaves substantial unexplained variance — geography and house characteristics matter.

Multiple Linear Regression — All 8 Features

python

X_multi = df[data.feature_names].values

X_train, X_test, y_train, y_test = train_test_split(
    X_multi, y, test_size=0.2, random_state=42
)

model_multi = LinearRegression()
model_multi.fit(X_train, y_train)

print("Coefficients:")
for name, coef in zip(data.feature_names, model_multi.coef_):
    print(f"  {name:12s}: {coef:.4f}")
print(f"Intercept: {model_multi.intercept_:.4f}")

Coefficients:
  MedInc      :  0.4367
  HouseAge    :  0.0094
  AveRooms    : -0.1073
  AveBedrms   :  0.6454
  Population  : -0.0000
  AveOccup    : -0.0038
  Latitude    : -0.4190
  Longitude   : -0.4333
Intercept: -36.9419

Notable signs:

AveRooms: −0.107 but AveBedrms: +0.645 — these are highly correlated, and multicollinearity caused the sign flip. Don't interpret these individually.
Latitude/Longitude: both negative — house values drop as you move north or east from coastal California. The model is encoding geography.
Population: ≈0 — census block population, once other features are included, adds no marginal signal.

python

y_pred_multi = model_multi.predict(X_test)
print(f"Test RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_multi)):.4f}")
print(f"Test R²:   {r2_score(y_test, y_pred_multi):.4f}")

Test RMSE: 0.7358
Test R²:   0.5958

$R^{2}$ improved from 0.47 (simple) to 0.60 (multiple) — 13 points from 7 additional features. RMSE dropped from 0.835 to 0.736 ($83.5k → $73.6k average error). The improvement is real but the ceiling for a linear model on this dataset is around $R^{2} = 0.60$ .

Residual Analysis

python

import matplotlib.pyplot as plt

residuals = y_test - y_pred_multi

print(f"Mean residual: {residuals.mean():.4f}")

plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
plt.hist(residuals, bins=50)
plt.xlabel('Residual')
plt.ylabel('Count')
plt.title('Residual Distribution')

plt.subplot(1, 2, 2)
plt.scatter(y_pred_multi, residuals, alpha=0.3, s=5)
plt.axhline(0, color='red')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Fitted')
plt.tight_layout()
plt.show()

Mean residual: -0.0024

Two observations:

Residual distribution: roughly symmetric around zero (mean ≈ 0 ✓) but with heavy left tail — high-value houses are underpredicted.
Residuals vs Fitted: a fan shape for high fitted values ( $> 3.5$ ). This is heteroscedasticity — variance of errors grows with predicted value. One cause: the dataset caps house values at $500k ($5.0 in target units), so wealthy neighborhoods are systematically compressed.

Model Comparison

Model	Features	Train $R^{2}$	Test $R^{2}$	Test RMSE	Notes
Simple LR	MedInc only	0.478	0.469	0.835	Baseline single feature
Multiple LR	All 8	0.606	0.596	0.736	Meaningful improvement
(Next: Ridge)	All 8	~0.607	~0.600	~0.733	Regularization adds stability

The AveRooms/AveBedrms sign flip is multicollinearity in action. Their correlation is ~0.85 — the model assigns arbitrary individual signs to explain variance that belongs to both. Ridge regression (post 14) stabilizes this by shrinking both toward zero proportionally.

The 0.60 ceiling for $R^{2}$ on California Housing is a property of the dataset, not a flaw in linear regression. The true relationship between geography, income, and house price is non-linear — tree models (Random Forest, XGBoost) reach $R^{2} \approx 0.82$ on the same data. Linear regression is an honest baseline, not the end state.

Test Your Understanding

The multiple regression $R^{2}$ improved from 0.47 to 0.60 by adding 7 features. If you added 100 more randomly generated noise features, would $R^{2}$ increase, decrease, or stay the same? What about adjusted $R^{2}$ ?
AveRooms has a negative coefficient (−0.107) despite larger homes typically costing more. Explain how multicollinearity with AveBedrms produces this sign.
The mean residual is −0.0024 (near zero). Why is unbiasedness guaranteed by OLS on the training set? Is it guaranteed on the test set?
The residuals vs fitted plot shows a fan shape for high values. Name two transformations of $y$ that could reduce heteroscedasticity, and explain the trade-off of each.
Both AveRooms and AveBedrms have unreliable coefficients due to multicollinearity. If you had to drop one of these features, how would you decide which one to keep?

Practicals: Simple and Multiple Linear Regression

Dataset Exploration

Simple Linear Regression — Single Feature

Multiple Linear Regression — All 8 Features

Residual Analysis

Model Comparison

Test Your Understanding

Comments (0)

Leave a comment

Practicals: Simple and Multiple Linear Regression

Dataset Exploration

Simple Linear Regression — Single Feature

Multiple Linear Regression — All 8 Features

Residual Analysis

Model Comparison

Related Concepts and Honest Limitations

Test Your Understanding

Comments (0)

Leave a comment