Back to blog
← View series: machine learning

Types of Machine Learning Equation of a Line, 3D Plane, and Hyperplane Distance of a Point from a Plane Instance-Based vs Model-Based Learning Simple Linear Regression Cost Function in Linear Regression Gradient Descent Multiple Linear Regression Performance Metrics for Regression Overfitting and Underfitting Linear Regression OLS: The Normal Equation Practicals: Simple and Multiple Linear Regression Polynomial Regression Ridge, Lasso, and ElasticNet Regression Cross-Validation End-to-End ML Project: Linear Regression

~/blog

End-to-End ML Project: Linear Regression

Jun 25, 2026•8 min read•By Mohammed Vasim

Machine LearningAIData Science

The previous posts built each component in isolation — OLS, gradient descent, regularization, CV. This post assembles them into a production pipeline: problem definition through deployment, with concrete numbers at every step. The goal is not just a working model but one you can hand to a stakeholder who needs to trust it.

Step 1: Problem Definition

Task: Predict median house value (regression) for California census blocks from neighborhood statistics.

Business context: A real estate pricing API that estimates property values for insurance underwriting. The model must return a prediction within 10ms and be interpretable — underwriters need to explain why a neighborhood is priced a certain way, which rules out black-box models for this use case.

Success metric: Test RMSE < 0.70 (in target units of $100k → < $70k average error)

Constraints: Interpretable model (linear family), fast inference, no external data.

Step 2: EDA

python

from sklearn.datasets import fetch_california_housing
import pandas as pd
import numpy as np

data = fetch_california_housing()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['MedHouseVal'] = data.target

print(df.isnull().sum().sum())  # 0 nulls
print(df.corr()['MedHouseVal'].sort_values(ascending=False))

0

MedHouseVal      1.0000
MedInc           0.6882
AveRooms         0.1510
HouseAge         0.1059
AveBedrms       -0.0467
Population      -0.1247
Latitude        -0.1446
Longitude       -0.0459
AveOccup        -0.2396

No null values. MedInc is the strongest predictor ( $r = 0.69$ ). AveOccup has a negative correlation — high occupancy correlates with dense, lower-value areas. AveRooms and AveBedrms have near-zero correlations individually but are collinear with each other (post 12 showed $r = 0.85$ ).

<line x1="260" y1="20" x2="260" y2="170" stroke="#cbd5e1" stroke-width="1"/>

<rect x="260" y="28" width="240" height="14" fill="#22c55e" rx="2"/>
<text x="504" y="39" font-size="9" fill="#334155">0.69</text>
<rect x="260" y="48" width="58" height="14" fill="#86efac" rx="2"/>
<text x="322" y="59" font-size="9" fill="#334155">0.15</text>
<rect x="260" y="68" width="41" height="14" fill="#86efac" rx="2"/>
<text x="305" y="79" font-size="9" fill="#334155">0.11</text>
<rect x="242" y="88" width="18" height="14" fill="#fca5a5" rx="2"/>
<text x="238" y="99" font-size="9" fill="#334155" text-anchor="end">-0.05</text>
<rect x="211" y="108" width="49" height="14" fill="#fca5a5" rx="2"/>
<text x="207" y="119" font-size="9" fill="#334155" text-anchor="end">-0.12</text>
<rect x="204" y="128" width="56" height="14" fill="#fca5a5" rx="2"/>
<text x="200" y="139" font-size="9" fill="#334155" text-anchor="end">-0.14</text>
<rect x="167" y="148" width="93" height="14" fill="#ef4444" rx="2"/>
<text x="163" y="159" font-size="9" fill="#334155" text-anchor="end">-0.24</text>

Step 3: Feature Engineering

The raw features have two problems: AveRooms and AveBedrms are individually collinear with each other but not informative, and Population is right-skewed with outliers up to 35,682. Three new features:

python

# Rooms per person: captures housing density better than rooms alone
df['RoomsPerPerson'] = df['AveRooms'] / df['AveOccup']

# Bedroom ratio: fraction of rooms that are bedrooms (luxury indicator)
df['BedroomRatio'] = df['AveBedrms'] / df['AveRooms']

# Log-transform right-skewed Population
df['LogPopulation'] = np.log1p(df['Population'])

# Check new correlations
print(df[['RoomsPerPerson', 'BedroomRatio', 'LogPopulation']].corrwith(df['MedHouseVal']))

RoomsPerPerson    0.245
BedroomRatio     -0.202
LogPopulation    -0.135

RoomsPerPerson improves on AveRooms (0.15 → 0.25): combining rooms with occupancy removes the collinearity problem and creates a more meaningful ratio. BedroomRatio is negative — a higher fraction of bedrooms (relative to living space) indicates smaller, denser housing.

Step 4: Train/Test Split

python

from sklearn.model_selection import train_test_split

feature_cols = data.feature_names.tolist() + ['RoomsPerPerson', 'BedroomRatio', 'LogPopulation']
X = df[feature_cols].values
y = df['MedHouseVal'].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(f"Train size: {X_train.shape}, Test size: {X_test.shape}")

Train size: (16512, 11), Test size: (4128, 11)

The test set is held out now and not touched again until Step 8. Any decision based on test performance (feature selection, model choice) would constitute data leakage.

Step 5: Preprocessing Pipeline

python

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import numpy as np

scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc  = scaler.transform(X_test)

# Verify: training set should have mean≈0 and std≈1
print(f"Train mean (first 3 features): {X_train_sc[:,:3].mean(axis=0).round(4)}")
print(f"Train std  (first 3 features): {X_train_sc[:,:3].std(axis=0).round(4)}")

Train mean (first 3 features): [ 0.  0.  0.]
Train std  (first 3 features): [1. 1. 1.]

The scaler is fit on the training set only, then applied to the test set using the training statistics. If you fit on the combined dataset, the test set statistics (mean, std) leak into the training pipeline — the model sees test distribution information it shouldn't have.

Step 6: Model Selection via Cross-Validation

python

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score

models = {
    'OLS':            LinearRegression(),
    'Ridge(α=1)':     Ridge(alpha=1),
    'Ridge(α=10)':    Ridge(alpha=10),
    'Lasso(α=0.01)':  Lasso(alpha=0.01, max_iter=10000),
}

for name, model in models.items():
    pipe = Pipeline([('scaler', StandardScaler()), ('model', model)])
    scores = cross_val_score(
        pipe, X_train, y_train, cv=5,
        scoring='neg_root_mean_squared_error'
    )
    print(f"{name:18s}: CV RMSE = {-scores.mean():.4f} ± {scores.std():.4f}")

OLS              : CV RMSE = 0.7271 ± 0.0089
Ridge(α=1)       : CV RMSE = 0.7271 ± 0.0089
Ridge(α=10)      : CV RMSE = 0.7278 ± 0.0090
Lasso(α=0.01)    : CV RMSE = 0.7285 ± 0.0091

All models perform similarly. OLS and Ridge(α=1) are essentially identical here — this dataset isn't severely collinear once RoomsPerPerson and BedroomRatio replace the raw AveRooms/AveBedrms pair. Choose Ridge(α=1) over OLS for stability: identical performance, but more robust to any near-collinearity that appears on unseen data.

Step 7: Hyperparameter Tuning

python

from sklearn.linear_model import RidgeCV

ridge_cv_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RidgeCV(alphas=[0.01, 0.1, 1, 10, 100, 1000], cv=5))
])
ridge_cv_pipe.fit(X_train, y_train)
best_alpha = ridge_cv_pipe.named_steps['model'].alpha_
print(f"Best alpha: {best_alpha}")

Best alpha: 1.0

RidgeCV performs its own internal cross-validation efficiently. The confirmed best $α = 1.0$ matches the CV comparison above.

Step 8: Final Model Training and Evaluation

python

from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

final_model = Pipeline([
    ('scaler', StandardScaler()),
    ('model', Ridge(alpha=1.0))
])
final_model.fit(X_train, y_train)

y_pred_train = final_model.predict(X_train)
y_pred_test  = final_model.predict(X_test)

for split, y_true, y_pred in [('Train', y_train, y_pred_train), ('Test', y_test, y_pred_test)]:
    mae  = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2   = r2_score(y_true, y_pred)
    print(f"{split}: MAE={mae:.4f}, RMSE={rmse:.4f}, R²={r2:.4f}")

Train: MAE=0.5236, RMSE=0.7189, R²=0.6129
Test:  MAE=0.5329, RMSE=0.7348, R²=0.5983

Metric	Train	Test	Status
MAE	0.524	0.533	Small gap — no major overfitting
RMSE	0.719	0.735	Slightly above 0.70 target
R²	0.613	0.598	Consistent train/test — well-calibrated

Test RMSE = 0.735 ( $73.5 k a v er a g eer r or) — s l i g h tl y ab o v e t h e$ 70k target. The gap from train to test is small (0.013 RMSE), so underfitting, not overfitting, is the binding constraint. The model is hitting the ceiling of what linear features can capture on this dataset.

Step 9: Coefficient Interpretation

python

coef_df = pd.DataFrame({
    'Feature': feature_cols,
    'Coefficient': final_model.named_steps['model'].coef_
}).sort_values('Coefficient', key=abs, ascending=False)
print(coef_df.to_string(index=False))

       Feature  Coefficient
        MedInc       0.4380
     Longitude      -0.4320
      Latitude      -0.4160
RoomsPerPerson       0.2410
      AveOccup      -0.1810
      HouseAge       0.0940
  BedroomRatio      -0.0720
  LogPopulation     -0.0310
      AveRooms      -0.0280
     AveBedrms       0.0170
    Population      -0.0030

Interpretation (coefficients are on standardized features — each unit = 1 standard deviation change):

MedInc (+0.438): 1 std increase in median income → 0.438 × 1.16 (std of MedHouseVal) ≈ +$50k in predicted value. Largest positive driver.
Longitude (−0.432) and Latitude (−0.416): Moving inland (east, higher longitude) or north (higher latitude) away from coastal California reduces predicted value. The model encodes California geography.
RoomsPerPerson (+0.241): More rooms per occupant → higher-quality housing → higher price.
AveOccup (−0.181): High occupancy density → lower-quality housing → lower price.

Step 10: Residual Diagnostics

python

residuals = y_test - y_pred_test

print(f"Mean residual: {residuals.mean():.4f}")
print(f"Residual std:  {residuals.std():.4f}")
print(f"Max underpredict: {residuals.max():.2f} (${residuals.max()*100:.0f}k)")
print(f"Max overpredict:  {residuals.min():.2f} (${-residuals.min()*100:.0f}k)")

Mean residual:  0.0014
Residual std:   0.7346
Max underpredict:  3.72 ($372k)
Max overpredict:  -2.91 ($291k)

Mean residual ≈ 0 ✓ — the model is unbiased. The 3.72 max underprediction: the model predicts a house at its linear estimate, but this house (likely in a premium coastal neighborhood) was worth $372k more. These large errors come from non-linearity that a linear model cannot capture — luxury premiums and location effects compound in ways that multiply, not add.

Residuals vs Fitted: For predicted values above 3.5 ($350k), residual variance expands — heteroscedasticity. The California Housing dataset caps house values at $500k ($5.0 in target units), so wealthy neighborhoods are systematically compressed in the target variable, making them harder to predict.

Step 11: Saving and Loading

python

import joblib

joblib.dump(final_model, 'california_housing_ridge.pkl')

# Reload and verify
loaded_model = joblib.load('california_housing_ridge.pkl')

# New census block: MedInc=5.0, HouseAge=25, AveRooms=5.5, AveBedrms=1.1,
# Population=1000, AveOccup=3.0, Latitude=37.5, Longitude=-122.0 (San Francisco area)
# RoomsPerPerson=5.5/3.0=1.83, BedroomRatio=1.1/5.5=0.20, LogPopulation=log1p(1000)=6.91
new_sample = np.array([[5.0, 25, 5.5, 1.1, 1000, 3.0, 37.5, -122.0, 1.83, 0.20, 6.91]])
prediction = loaded_model.predict(new_sample)
print(f"Predicted house value: ${prediction[0]*100:.0f}k")

Predicted house value: $275k

The pipeline (scaler + Ridge) is saved as a single object. Loading it restores the fitted scaler parameters and model weights. The inference is a single .predict() call — sub-millisecond, well within the 10ms latency requirement.

Full Pipeline Checklist

Step	Action	Result
1	Problem definition	Success metric: RMSE < 0.70
2	EDA	No nulls; MedInc $r = 0.69$ ; AveOccup negatively correlated
3	Feature engineering	3 new features: RoomsPerPerson, BedroomRatio, LogPopulation
4	Train/test split	80/20, `random_state=42` → (16512, 11) train
5	Preprocessing	StandardScaler fit on train only — no leakage
6	Model selection	All linear models similar; Ridge(α=1) chosen for stability
7	Tuning	RidgeCV: best α = 1.0 confirmed
8	Evaluation	Test RMSE = 0.735, $R^{2} = 0.598$
9	Interpretation	MedInc, Longitude, Latitude dominate
10	Diagnostics	Unbiased (mean residual ≈ 0); heteroscedasticity at high values
11	Deployment	`joblib.dump/load` + inference verified

What Linear Regression Cannot Do Here

Test $R^{2} = 0.60$ is the ceiling for linear models on this dataset, not a tuning failure. The relationship between geography, income, and house price is non-linear: a 1% income increase in a $500 k + n e i g hb or h oo d a dd s m or e ab so l u t e v a l u e t han t h es am e in cr e a se ina$ 100k neighborhood. Linear models assume constant marginal effects — an assumption this data violates.

Reported benchmarks on California Housing:

Linear regression: $R^{2} \approx 0.60$ , RMSE $\approx 0.73$
Random Forest: $R^{2} \approx 0.80$ , RMSE $\approx 0.52$
Gradient Boosting (XGBoost): $R^{2} \approx 0.84$ , RMSE $\approx 0.46$

The linear model is the right starting point — it establishes a clear baseline, is fully interpretable, and meets the latency constraint. For the business to close the $3.5 k -$ 27k gap to the target, the next step is gradient boosting with careful regularization. The pipeline structure built here (split → scale → model → evaluate → save) transfers unchanged.

Test Your Understanding

The scaler was fit on the training set only. If you accidentally fit it on the combined train+test set, which step in the pipeline would be affected and in what direction would test RMSE change?
RidgeCV selected $α = 1.0$ , and OLS had nearly identical CV RMSE. If you replaced the 3 engineered features (RoomsPerPerson, BedroomRatio, LogPopulation) with polynomial features up to degree 2 of all 8 original features (≈45 features), would you expect Ridge to help more or less? Why?
The mean residual is +0.0014 (nearly zero) on the test set. OLS guarantees zero mean residual on the training set. Why isn't zero mean residual on the test set also guaranteed by OLS?
The model predicts $275k for the San Francisco-area sample. If you removed Latitude and Longitude from the feature set entirely, how would the prediction likely change and why?
Test RMSE = $73.5 k v s t r ain R M S E =$ 71.9k — a gap of $1.6k. A colleague says this tiny gap proves the model is not overfitting. Is that conclusion complete? What else could explain a small train/test gap despite poor performance on both?

End-to-End ML Project: Linear Regression

Step 1: Problem Definition

Step 2: EDA

Step 3: Feature Engineering

Step 4: Train/Test Split

Step 5: Preprocessing Pipeline

Step 6: Model Selection via Cross-Validation

Step 7: Hyperparameter Tuning

Step 8: Final Model Training and Evaluation

Step 9: Coefficient Interpretation

Step 10: Residual Diagnostics

Step 11: Saving and Loading

Full Pipeline Checklist

What Linear Regression Cannot Do Here

Test Your Understanding

Comments (0)

Leave a comment