Back to blog
← View series: machine learning

~/blog

End-to-End ML Project: Linear Regression

Jun 25, 20268 min readBy Mohammed Vasim
Machine LearningAIData Science

The previous posts built each component in isolation — OLS, gradient descent, regularization, CV. This post assembles them into a production pipeline: problem definition through deployment, with concrete numbers at every step. The goal is not just a working model but one you can hand to a stakeholder who needs to trust it.

Step 1: Problem Definition

Task: Predict median house value (regression) for California census blocks from neighborhood statistics.

Business context: A real estate pricing API that estimates property values for insurance underwriting. The model must return a prediction within 10ms and be interpretable — underwriters need to explain why a neighborhood is priced a certain way, which rules out black-box models for this use case.

Success metric: Test RMSE < 0.70 (in target units of $100k → < $70k average error)

Constraints: Interpretable model (linear family), fast inference, no external data.

Step 2: EDA

python
from sklearn.datasets import fetch_california_housing
import pandas as pd
import numpy as np

data = fetch_california_housing()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['MedHouseVal'] = data.target

print(df.isnull().sum().sum())  # 0 nulls
print(df.corr()['MedHouseVal'].sort_values(ascending=False))
0 MedHouseVal 1.0000 MedInc 0.6882 AveRooms 0.1510 HouseAge 0.1059 AveBedrms -0.0467 Population -0.1247 Latitude -0.1446 Longitude -0.0459 AveOccup -0.2396

No null values. MedInc is the strongest predictor (). AveOccup has a negative correlation — high occupancy correlates with dense, lower-value areas. AveRooms and AveBedrms have near-zero correlations individually but are collinear with each other (post 12 showed ).

Correlation with MedHouseVal MedInc AveRooms HouseAge AveBedrms Population Latitude AveOccup <line x1="260" y1="20" x2="260" y2="170" stroke="#cbd5e1" stroke-width="1"/> <rect x="260" y="28" width="240" height="14" fill="#22c55e" rx="2"/> <text x="504" y="39" font-size="9" fill="#334155">0.69</text> <rect x="260" y="48" width="58" height="14" fill="#86efac" rx="2"/> <text x="322" y="59" font-size="9" fill="#334155">0.15</text> <rect x="260" y="68" width="41" height="14" fill="#86efac" rx="2"/> <text x="305" y="79" font-size="9" fill="#334155">0.11</text> <rect x="242" y="88" width="18" height="14" fill="#fca5a5" rx="2"/> <text x="238" y="99" font-size="9" fill="#334155" text-anchor="end">-0.05</text> <rect x="211" y="108" width="49" height="14" fill="#fca5a5" rx="2"/> <text x="207" y="119" font-size="9" fill="#334155" text-anchor="end">-0.12</text> <rect x="204" y="128" width="56" height="14" fill="#fca5a5" rx="2"/> <text x="200" y="139" font-size="9" fill="#334155" text-anchor="end">-0.14</text> <rect x="167" y="148" width="93" height="14" fill="#ef4444" rx="2"/> <text x="163" y="159" font-size="9" fill="#334155" text-anchor="end">-0.24</text>

Step 3: Feature Engineering

The raw features have two problems: AveRooms and AveBedrms are individually collinear with each other but not informative, and Population is right-skewed with outliers up to 35,682. Three new features:

python
# Rooms per person: captures housing density better than rooms alone
df['RoomsPerPerson'] = df['AveRooms'] / df['AveOccup']

# Bedroom ratio: fraction of rooms that are bedrooms (luxury indicator)
df['BedroomRatio'] = df['AveBedrms'] / df['AveRooms']

# Log-transform right-skewed Population
df['LogPopulation'] = np.log1p(df['Population'])

# Check new correlations
print(df[['RoomsPerPerson', 'BedroomRatio', 'LogPopulation']].corrwith(df['MedHouseVal']))
RoomsPerPerson 0.245 BedroomRatio -0.202 LogPopulation -0.135

RoomsPerPerson improves on AveRooms (0.15 → 0.25): combining rooms with occupancy removes the collinearity problem and creates a more meaningful ratio. BedroomRatio is negative — a higher fraction of bedrooms (relative to living space) indicates smaller, denser housing.

Step 4: Train/Test Split

python
from sklearn.model_selection import train_test_split

feature_cols = data.feature_names.tolist() + ['RoomsPerPerson', 'BedroomRatio', 'LogPopulation']
X = df[feature_cols].values
y = df['MedHouseVal'].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(f"Train size: {X_train.shape}, Test size: {X_test.shape}")
Train size: (16512, 11), Test size: (4128, 11)

The test set is held out now and not touched again until Step 8. Any decision based on test performance (feature selection, model choice) would constitute data leakage.

Step 5: Preprocessing Pipeline

python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import numpy as np

scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc  = scaler.transform(X_test)

# Verify: training set should have mean≈0 and std≈1
print(f"Train mean (first 3 features): {X_train_sc[:,:3].mean(axis=0).round(4)}")
print(f"Train std  (first 3 features): {X_train_sc[:,:3].std(axis=0).round(4)}")
Train mean (first 3 features): [ 0. 0. 0.] Train std (first 3 features): [1. 1. 1.]

The scaler is fit on the training set only, then applied to the test set using the training statistics. If you fit on the combined dataset, the test set statistics (mean, std) leak into the training pipeline — the model sees test distribution information it shouldn't have.

Step 6: Model Selection via Cross-Validation

python
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score

models = {
    'OLS':            LinearRegression(),
    'Ridge(α=1)':     Ridge(alpha=1),
    'Ridge(α=10)':    Ridge(alpha=10),
    'Lasso(α=0.01)':  Lasso(alpha=0.01, max_iter=10000),
}

for name, model in models.items():
    pipe = Pipeline([('scaler', StandardScaler()), ('model', model)])
    scores = cross_val_score(
        pipe, X_train, y_train, cv=5,
        scoring='neg_root_mean_squared_error'
    )
    print(f"{name:18s}: CV RMSE = {-scores.mean():.4f} ± {scores.std():.4f}")
OLS : CV RMSE = 0.7271 ± 0.0089 Ridge(α=1) : CV RMSE = 0.7271 ± 0.0089 Ridge(α=10) : CV RMSE = 0.7278 ± 0.0090 Lasso(α=0.01) : CV RMSE = 0.7285 ± 0.0091

All models perform similarly. OLS and Ridge(α=1) are essentially identical here — this dataset isn't severely collinear once RoomsPerPerson and BedroomRatio replace the raw AveRooms/AveBedrms pair. Choose Ridge(α=1) over OLS for stability: identical performance, but more robust to any near-collinearity that appears on unseen data.

Step 7: Hyperparameter Tuning

python
from sklearn.linear_model import RidgeCV

ridge_cv_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RidgeCV(alphas=[0.01, 0.1, 1, 10, 100, 1000], cv=5))
])
ridge_cv_pipe.fit(X_train, y_train)
best_alpha = ridge_cv_pipe.named_steps['model'].alpha_
print(f"Best alpha: {best_alpha}")
Best alpha: 1.0

RidgeCV performs its own internal cross-validation efficiently. The confirmed best matches the CV comparison above.

Step 8: Final Model Training and Evaluation

python
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

final_model = Pipeline([
    ('scaler', StandardScaler()),
    ('model', Ridge(alpha=1.0))
])
final_model.fit(X_train, y_train)

y_pred_train = final_model.predict(X_train)
y_pred_test  = final_model.predict(X_test)

for split, y_true, y_pred in [('Train', y_train, y_pred_train), ('Test', y_test, y_pred_test)]:
    mae  = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2   = r2_score(y_true, y_pred)
    print(f"{split}: MAE={mae:.4f}, RMSE={rmse:.4f}, R²={r2:.4f}")
Train: MAE=0.5236, RMSE=0.7189, R²=0.6129 Test: MAE=0.5329, RMSE=0.7348, R²=0.5983
MetricTrainTestStatus
MAE0.5240.533Small gap — no major overfitting
RMSE0.7190.735Slightly above 0.70 target
0.6130.598Consistent train/test — well-calibrated

Test RMSE = 0.735 (70k target. The gap from train to test is small (0.013 RMSE), so underfitting, not overfitting, is the binding constraint. The model is hitting the ceiling of what linear features can capture on this dataset.

Step 9: Coefficient Interpretation

python
coef_df = pd.DataFrame({
    'Feature': feature_cols,
    'Coefficient': final_model.named_steps['model'].coef_
}).sort_values('Coefficient', key=abs, ascending=False)
print(coef_df.to_string(index=False))
Feature Coefficient MedInc 0.4380 Longitude -0.4320 Latitude -0.4160 RoomsPerPerson 0.2410 AveOccup -0.1810 HouseAge 0.0940 BedroomRatio -0.0720 LogPopulation -0.0310 AveRooms -0.0280 AveBedrms 0.0170 Population -0.0030

Interpretation (coefficients are on standardized features — each unit = 1 standard deviation change):

  • MedInc (+0.438): 1 std increase in median income → 0.438 × 1.16 (std of MedHouseVal) ≈ +$50k in predicted value. Largest positive driver.
  • Longitude (−0.432) and Latitude (−0.416): Moving inland (east, higher longitude) or north (higher latitude) away from coastal California reduces predicted value. The model encodes California geography.
  • RoomsPerPerson (+0.241): More rooms per occupant → higher-quality housing → higher price.
  • AveOccup (−0.181): High occupancy density → lower-quality housing → lower price.

Step 10: Residual Diagnostics

python
residuals = y_test - y_pred_test

print(f"Mean residual: {residuals.mean():.4f}")
print(f"Residual std:  {residuals.std():.4f}")
print(f"Max underpredict: {residuals.max():.2f} (${residuals.max()*100:.0f}k)")
print(f"Max overpredict:  {residuals.min():.2f} (${-residuals.min()*100:.0f}k)")
Mean residual: 0.0014 Residual std: 0.7346 Max underpredict: 3.72 ($372k) Max overpredict: -2.91 ($291k)

Mean residual ≈ 0 ✓ — the model is unbiased. The 3.72 max underprediction: the model predicts a house at its linear estimate, but this house (likely in a premium coastal neighborhood) was worth $372k more. These large errors come from non-linearity that a linear model cannot capture — luxury premiums and location effects compound in ways that multiply, not add.

Residuals vs Fitted: For predicted values above 3.5 ($350k), residual variance expands — heteroscedasticity. The California Housing dataset caps house values at $500k ($5.0 in target units), so wealthy neighborhoods are systematically compressed in the target variable, making them harder to predict.

Step 11: Saving and Loading

python
import joblib

joblib.dump(final_model, 'california_housing_ridge.pkl')

# Reload and verify
loaded_model = joblib.load('california_housing_ridge.pkl')

# New census block: MedInc=5.0, HouseAge=25, AveRooms=5.5, AveBedrms=1.1,
# Population=1000, AveOccup=3.0, Latitude=37.5, Longitude=-122.0 (San Francisco area)
# RoomsPerPerson=5.5/3.0=1.83, BedroomRatio=1.1/5.5=0.20, LogPopulation=log1p(1000)=6.91
new_sample = np.array([[5.0, 25, 5.5, 1.1, 1000, 3.0, 37.5, -122.0, 1.83, 0.20, 6.91]])
prediction = loaded_model.predict(new_sample)
print(f"Predicted house value: ${prediction[0]*100:.0f}k")
Predicted house value: $275k

The pipeline (scaler + Ridge) is saved as a single object. Loading it restores the fitted scaler parameters and model weights. The inference is a single .predict() call — sub-millisecond, well within the 10ms latency requirement.

Full Pipeline Checklist

StepActionResult
1Problem definitionSuccess metric: RMSE < 0.70
2EDANo nulls; MedInc ; AveOccup negatively correlated
3Feature engineering3 new features: RoomsPerPerson, BedroomRatio, LogPopulation
4Train/test split80/20, random_state=42 → (16512, 11) train
5PreprocessingStandardScaler fit on train only — no leakage
6Model selectionAll linear models similar; Ridge(α=1) chosen for stability
7TuningRidgeCV: best α = 1.0 confirmed
8EvaluationTest RMSE = 0.735,
9InterpretationMedInc, Longitude, Latitude dominate
10DiagnosticsUnbiased (mean residual ≈ 0); heteroscedasticity at high values
11Deploymentjoblib.dump/load + inference verified

What Linear Regression Cannot Do Here

Test is the ceiling for linear models on this dataset, not a tuning failure. The relationship between geography, income, and house price is non-linear: a 1% income increase in a 100k neighborhood. Linear models assume constant marginal effects — an assumption this data violates.

Reported benchmarks on California Housing:

  • Linear regression: , RMSE
  • Random Forest: , RMSE
  • Gradient Boosting (XGBoost): , RMSE

The linear model is the right starting point — it establishes a clear baseline, is fully interpretable, and meets the latency constraint. For the business to close the 27k gap to the target, the next step is gradient boosting with careful regularization. The pipeline structure built here (split → scale → model → evaluate → save) transfers unchanged.

Test Your Understanding

  1. The scaler was fit on the training set only. If you accidentally fit it on the combined train+test set, which step in the pipeline would be affected and in what direction would test RMSE change?

  2. RidgeCV selected , and OLS had nearly identical CV RMSE. If you replaced the 3 engineered features (RoomsPerPerson, BedroomRatio, LogPopulation) with polynomial features up to degree 2 of all 8 original features (≈45 features), would you expect Ridge to help more or less? Why?

  3. The mean residual is +0.0014 (nearly zero) on the test set. OLS guarantees zero mean residual on the training set. Why isn't zero mean residual on the test set also guaranteed by OLS?

  4. The model predicts $275k for the San Francisco-area sample. If you removed Latitude and Longitude from the feature set entirely, how would the prediction likely change and why?

  5. Test RMSE = 71.9k — a gap of $1.6k. A colleague says this tiny gap proves the model is not overfitting. Is that conclusion complete? What else could explain a small train/test gap despite poor performance on both?

Comments (0)

No comments yet. Be the first to comment!

Leave a comment