← View series: machine learning
~/blog
End-to-End ML Project: Linear Regression
The previous posts built each component in isolation — OLS, gradient descent, regularization, CV. This post assembles them into a production pipeline: problem definition through deployment, with concrete numbers at every step. The goal is not just a working model but one you can hand to a stakeholder who needs to trust it.
Step 1: Problem Definition
Task: Predict median house value (regression) for California census blocks from neighborhood statistics.
Business context: A real estate pricing API that estimates property values for insurance underwriting. The model must return a prediction within 10ms and be interpretable — underwriters need to explain why a neighborhood is priced a certain way, which rules out black-box models for this use case.
Success metric: Test RMSE < 0.70 (in target units of $100k → < $70k average error)
Constraints: Interpretable model (linear family), fast inference, no external data.
Step 2: EDA
from sklearn.datasets import fetch_california_housing
import pandas as pd
import numpy as np
data = fetch_california_housing()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['MedHouseVal'] = data.target
print(df.isnull().sum().sum()) # 0 nulls
print(df.corr()['MedHouseVal'].sort_values(ascending=False))0
MedHouseVal 1.0000
MedInc 0.6882
AveRooms 0.1510
HouseAge 0.1059
AveBedrms -0.0467
Population -0.1247
Latitude -0.1446
Longitude -0.0459
AveOccup -0.2396
No null values. MedInc is the strongest predictor (). AveOccup has a negative correlation — high occupancy correlates with dense, lower-value areas. AveRooms and AveBedrms have near-zero correlations individually but are collinear with each other (post 12 showed ).
<line x1="260" y1="20" x2="260" y2="170" stroke="#cbd5e1" stroke-width="1"/>
<rect x="260" y="28" width="240" height="14" fill="#22c55e" rx="2"/>
<text x="504" y="39" font-size="9" fill="#334155">0.69</text>
<rect x="260" y="48" width="58" height="14" fill="#86efac" rx="2"/>
<text x="322" y="59" font-size="9" fill="#334155">0.15</text>
<rect x="260" y="68" width="41" height="14" fill="#86efac" rx="2"/>
<text x="305" y="79" font-size="9" fill="#334155">0.11</text>
<rect x="242" y="88" width="18" height="14" fill="#fca5a5" rx="2"/>
<text x="238" y="99" font-size="9" fill="#334155" text-anchor="end">-0.05</text>
<rect x="211" y="108" width="49" height="14" fill="#fca5a5" rx="2"/>
<text x="207" y="119" font-size="9" fill="#334155" text-anchor="end">-0.12</text>
<rect x="204" y="128" width="56" height="14" fill="#fca5a5" rx="2"/>
<text x="200" y="139" font-size="9" fill="#334155" text-anchor="end">-0.14</text>
<rect x="167" y="148" width="93" height="14" fill="#ef4444" rx="2"/>
<text x="163" y="159" font-size="9" fill="#334155" text-anchor="end">-0.24</text>
Step 3: Feature Engineering
The raw features have two problems: AveRooms and AveBedrms are individually collinear with each other but not informative, and Population is right-skewed with outliers up to 35,682. Three new features:
# Rooms per person: captures housing density better than rooms alone
df['RoomsPerPerson'] = df['AveRooms'] / df['AveOccup']
# Bedroom ratio: fraction of rooms that are bedrooms (luxury indicator)
df['BedroomRatio'] = df['AveBedrms'] / df['AveRooms']
# Log-transform right-skewed Population
df['LogPopulation'] = np.log1p(df['Population'])
# Check new correlations
print(df[['RoomsPerPerson', 'BedroomRatio', 'LogPopulation']].corrwith(df['MedHouseVal']))RoomsPerPerson 0.245
BedroomRatio -0.202
LogPopulation -0.135
RoomsPerPerson improves on AveRooms (0.15 → 0.25): combining rooms with occupancy removes the collinearity problem and creates a more meaningful ratio. BedroomRatio is negative — a higher fraction of bedrooms (relative to living space) indicates smaller, denser housing.
Step 4: Train/Test Split
from sklearn.model_selection import train_test_split
feature_cols = data.feature_names.tolist() + ['RoomsPerPerson', 'BedroomRatio', 'LogPopulation']
X = df[feature_cols].values
y = df['MedHouseVal'].values
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
print(f"Train size: {X_train.shape}, Test size: {X_test.shape}")Train size: (16512, 11), Test size: (4128, 11)
The test set is held out now and not touched again until Step 8. Any decision based on test performance (feature selection, model choice) would constitute data leakage.
Step 5: Preprocessing Pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import numpy as np
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)
# Verify: training set should have mean≈0 and std≈1
print(f"Train mean (first 3 features): {X_train_sc[:,:3].mean(axis=0).round(4)}")
print(f"Train std (first 3 features): {X_train_sc[:,:3].std(axis=0).round(4)}")Train mean (first 3 features): [ 0. 0. 0.]
Train std (first 3 features): [1. 1. 1.]
The scaler is fit on the training set only, then applied to the test set using the training statistics. If you fit on the combined dataset, the test set statistics (mean, std) leak into the training pipeline — the model sees test distribution information it shouldn't have.
Step 6: Model Selection via Cross-Validation
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score
models = {
'OLS': LinearRegression(),
'Ridge(α=1)': Ridge(alpha=1),
'Ridge(α=10)': Ridge(alpha=10),
'Lasso(α=0.01)': Lasso(alpha=0.01, max_iter=10000),
}
for name, model in models.items():
pipe = Pipeline([('scaler', StandardScaler()), ('model', model)])
scores = cross_val_score(
pipe, X_train, y_train, cv=5,
scoring='neg_root_mean_squared_error'
)
print(f"{name:18s}: CV RMSE = {-scores.mean():.4f} ± {scores.std():.4f}")OLS : CV RMSE = 0.7271 ± 0.0089
Ridge(α=1) : CV RMSE = 0.7271 ± 0.0089
Ridge(α=10) : CV RMSE = 0.7278 ± 0.0090
Lasso(α=0.01) : CV RMSE = 0.7285 ± 0.0091
All models perform similarly. OLS and Ridge(α=1) are essentially identical here — this dataset isn't severely collinear once RoomsPerPerson and BedroomRatio replace the raw AveRooms/AveBedrms pair. Choose Ridge(α=1) over OLS for stability: identical performance, but more robust to any near-collinearity that appears on unseen data.
Step 7: Hyperparameter Tuning
from sklearn.linear_model import RidgeCV
ridge_cv_pipe = Pipeline([
('scaler', StandardScaler()),
('model', RidgeCV(alphas=[0.01, 0.1, 1, 10, 100, 1000], cv=5))
])
ridge_cv_pipe.fit(X_train, y_train)
best_alpha = ridge_cv_pipe.named_steps['model'].alpha_
print(f"Best alpha: {best_alpha}")Best alpha: 1.0
RidgeCV performs its own internal cross-validation efficiently. The confirmed best matches the CV comparison above.
Step 8: Final Model Training and Evaluation
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
final_model = Pipeline([
('scaler', StandardScaler()),
('model', Ridge(alpha=1.0))
])
final_model.fit(X_train, y_train)
y_pred_train = final_model.predict(X_train)
y_pred_test = final_model.predict(X_test)
for split, y_true, y_pred in [('Train', y_train, y_pred_train), ('Test', y_test, y_pred_test)]:
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)
print(f"{split}: MAE={mae:.4f}, RMSE={rmse:.4f}, R²={r2:.4f}")Train: MAE=0.5236, RMSE=0.7189, R²=0.6129
Test: MAE=0.5329, RMSE=0.7348, R²=0.5983
| Metric | Train | Test | Status |
|---|---|---|---|
| MAE | 0.524 | 0.533 | Small gap — no major overfitting |
| RMSE | 0.719 | 0.735 | Slightly above 0.70 target |
| R² | 0.613 | 0.598 | Consistent train/test — well-calibrated |
Test RMSE = 0.735 (70k target. The gap from train to test is small (0.013 RMSE), so underfitting, not overfitting, is the binding constraint. The model is hitting the ceiling of what linear features can capture on this dataset.
Step 9: Coefficient Interpretation
coef_df = pd.DataFrame({
'Feature': feature_cols,
'Coefficient': final_model.named_steps['model'].coef_
}).sort_values('Coefficient', key=abs, ascending=False)
print(coef_df.to_string(index=False)) Feature Coefficient
MedInc 0.4380
Longitude -0.4320
Latitude -0.4160
RoomsPerPerson 0.2410
AveOccup -0.1810
HouseAge 0.0940
BedroomRatio -0.0720
LogPopulation -0.0310
AveRooms -0.0280
AveBedrms 0.0170
Population -0.0030
Interpretation (coefficients are on standardized features — each unit = 1 standard deviation change):
- MedInc (+0.438): 1 std increase in median income → 0.438 × 1.16 (std of MedHouseVal) ≈ +$50k in predicted value. Largest positive driver.
- Longitude (−0.432) and Latitude (−0.416): Moving inland (east, higher longitude) or north (higher latitude) away from coastal California reduces predicted value. The model encodes California geography.
- RoomsPerPerson (+0.241): More rooms per occupant → higher-quality housing → higher price.
- AveOccup (−0.181): High occupancy density → lower-quality housing → lower price.
Step 10: Residual Diagnostics
residuals = y_test - y_pred_test
print(f"Mean residual: {residuals.mean():.4f}")
print(f"Residual std: {residuals.std():.4f}")
print(f"Max underpredict: {residuals.max():.2f} (${residuals.max()*100:.0f}k)")
print(f"Max overpredict: {residuals.min():.2f} (${-residuals.min()*100:.0f}k)")Mean residual: 0.0014
Residual std: 0.7346
Max underpredict: 3.72 ($372k)
Max overpredict: -2.91 ($291k)
Mean residual ≈ 0 ✓ — the model is unbiased. The 3.72 max underprediction: the model predicts a house at its linear estimate, but this house (likely in a premium coastal neighborhood) was worth $372k more. These large errors come from non-linearity that a linear model cannot capture — luxury premiums and location effects compound in ways that multiply, not add.
Residuals vs Fitted: For predicted values above 3.5 ($350k), residual variance expands — heteroscedasticity. The California Housing dataset caps house values at $500k ($5.0 in target units), so wealthy neighborhoods are systematically compressed in the target variable, making them harder to predict.
Step 11: Saving and Loading
import joblib
joblib.dump(final_model, 'california_housing_ridge.pkl')
# Reload and verify
loaded_model = joblib.load('california_housing_ridge.pkl')
# New census block: MedInc=5.0, HouseAge=25, AveRooms=5.5, AveBedrms=1.1,
# Population=1000, AveOccup=3.0, Latitude=37.5, Longitude=-122.0 (San Francisco area)
# RoomsPerPerson=5.5/3.0=1.83, BedroomRatio=1.1/5.5=0.20, LogPopulation=log1p(1000)=6.91
new_sample = np.array([[5.0, 25, 5.5, 1.1, 1000, 3.0, 37.5, -122.0, 1.83, 0.20, 6.91]])
prediction = loaded_model.predict(new_sample)
print(f"Predicted house value: ${prediction[0]*100:.0f}k")Predicted house value: $275k
The pipeline (scaler + Ridge) is saved as a single object. Loading it restores the fitted scaler parameters and model weights. The inference is a single .predict() call — sub-millisecond, well within the 10ms latency requirement.
Full Pipeline Checklist
| Step | Action | Result |
|---|---|---|
| 1 | Problem definition | Success metric: RMSE < 0.70 |
| 2 | EDA | No nulls; MedInc ; AveOccup negatively correlated |
| 3 | Feature engineering | 3 new features: RoomsPerPerson, BedroomRatio, LogPopulation |
| 4 | Train/test split | 80/20, random_state=42 → (16512, 11) train |
| 5 | Preprocessing | StandardScaler fit on train only — no leakage |
| 6 | Model selection | All linear models similar; Ridge(α=1) chosen for stability |
| 7 | Tuning | RidgeCV: best α = 1.0 confirmed |
| 8 | Evaluation | Test RMSE = 0.735, |
| 9 | Interpretation | MedInc, Longitude, Latitude dominate |
| 10 | Diagnostics | Unbiased (mean residual ≈ 0); heteroscedasticity at high values |
| 11 | Deployment | joblib.dump/load + inference verified |
What Linear Regression Cannot Do Here
Test is the ceiling for linear models on this dataset, not a tuning failure. The relationship between geography, income, and house price is non-linear: a 1% income increase in a 100k neighborhood. Linear models assume constant marginal effects — an assumption this data violates.
Reported benchmarks on California Housing:
- Linear regression: , RMSE
- Random Forest: , RMSE
- Gradient Boosting (XGBoost): , RMSE
The linear model is the right starting point — it establishes a clear baseline, is fully interpretable, and meets the latency constraint. For the business to close the 27k gap to the target, the next step is gradient boosting with careful regularization. The pipeline structure built here (split → scale → model → evaluate → save) transfers unchanged.
Test Your Understanding
-
The scaler was fit on the training set only. If you accidentally fit it on the combined train+test set, which step in the pipeline would be affected and in what direction would test RMSE change?
-
RidgeCV selected , and OLS had nearly identical CV RMSE. If you replaced the 3 engineered features (RoomsPerPerson, BedroomRatio, LogPopulation) with polynomial features up to degree 2 of all 8 original features (≈45 features), would you expect Ridge to help more or less? Why?
-
The mean residual is +0.0014 (nearly zero) on the test set. OLS guarantees zero mean residual on the training set. Why isn't zero mean residual on the test set also guaranteed by OLS?
-
The model predicts $275k for the San Francisco-area sample. If you removed
LatitudeandLongitudefrom the feature set entirely, how would the prediction likely change and why? -
Test RMSE = 71.9k — a gap of $1.6k. A colleague says this tiny gap proves the model is not overfitting. Is that conclusion complete? What else could explain a small train/test gap despite poor performance on both?