Cross-Validation

Jun 25, 2026 · 9 min read · By Mohammed Vasim
Machine Learning · AI · Data Science

A single train/test split answers one question: how does this model perform on this particular partition of this particular dataset? Cross-validation answers the more useful question: how does this model perform on average, across all the ways you could have split the data? The difference matters most on small datasets, where a single split can be dominated by luck.

Anchor dataset: 6-sample house prices — small enough to trace every fold by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, LeaveOneOut

# Feature: square footage, reshaped to the (n_samples, n_features) layout sklearn expects
X = np.array([650, 850, 1100, 1400, 1600, 1900]).reshape(-1, 1)
# Target: house price
y = np.array([180, 220, 280, 340, 370, 430])
```

The Problem with a Single Train/Test Split

With 6 samples and a 4/2 train/test split, the test set is just 2 data points. The MSE of those 2 predictions is your entire generalization estimate, and it depends heavily on which 2 samples are held out. Holding out the two endpoints (650 and 1900) gives a test MSE of about 25.5; holding out the two middle samples (1100 and 1400) gives 100; holding out the two smallest (650 and 850) gives about 259. Same model, same data, and a roughly tenfold spread.

This is high-variance estimation: the metric changes substantially depending on which 2 samples land in the test set. Cross-validation fixes this by using every sample as a test sample exactly once.
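To make that spread concrete, here is a minimal sketch that enumerates every possible 4/2 split of the anchor data (all 15 ways to hold out 2 of 6 samples) and records the test MSE of each. The name `split_mse` is introduced here for illustration only; the model and data are the ones used throughout this post.

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([650, 850, 1100, 1400, 1600, 1900]).reshape(-1, 1)
y = np.array([180, 220, 280, 340, 370, 430])

# Every way to hold out 2 of the 6 samples: C(6, 2) = 15 possible test sets.
split_mse = {}
for test_pair in combinations(range(len(X)), 2):
    test_idx = np.array(test_pair)
    train_idx = np.setdiff1d(np.arange(len(X)), test_idx)

    model = LinearRegression().fit(X[train_idx], y[train_idx])
    residuals = y[test_idx] - model.predict(X[test_idx])
    split_mse[test_pair] = float(np.mean(residuals ** 2))

# Print the splits from the luckiest to the unluckiest.
for pair, mse in sorted(split_mse.items(), key=lambda kv: kv[1]):
    print(f"test sq_ft = {X[list(pair), 0]}: MSE = {mse:.1f}")
```

Any one of these 15 numbers could have been "the" test score of a single split, which is the variance problem cross-validation is meant to address.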

K-Fold Cross-Validation — Manual Trace

With $n = 6$ and $K = 3$, each fold trains on 4 samples and tests on 2. The 6 samples are partitioned into 3 non-overlapping groups.

| Fold | Train indices | Test indices | Test X | Test y |
|---|---|---|---|---|
| 1 | [2,3,4,5] | [0,1] | [650, 850] | [180, 220] |
| 2 | [0,1,4,5] | [2,3] | [1100, 1400] | [280, 340] |
| 3 | [0,1,2,3] | [4,5] | [1600, 1900] | [370, 430] |
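The fold assignments in the table can be reproduced directly from `KFold`. This is a small sketch; `fold_table` is an illustrative name, not something from the original post.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.array([650, 850, 1100, 1400, 1600, 1900]).reshape(-1, 1)
y = np.array([180, 220, 280, 340, 370, 430])

# shuffle=False keeps the folds contiguous: [0,1], [2,3], [4,5]
kf = KFold(n_splits=3, shuffle=False)

fold_table = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    fold_table.append((train_idx.tolist(), test_idx.tolist()))
    print(fold, train_idx.tolist(), test_idx.tolist(),
          X[test_idx, 0].tolist(), y[test_idx].tolist())
```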

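Fold 1 trace (fit on samples 2–5):

Training: $X = [1100, 1400, 1600, 1900]$, $y = [280, 340, 370, 430]$

Fitted line: $\hat{y} \approx 77.06 + 0.1853\,x$

Test predictions:

  • $x = 650$: $\hat{y} = 197.50$, residual $= 180 - 197.50 = -17.50$, squared error $= 306.25$
  • $x = 850$: $\hat{y} \approx 234.56$, residual $\approx 220 - 234.56 = -14.56$, squared error $\approx 211.96$

Fold 1 MSE $= (306.25 + 211.96) / 2 \approx 259.10$

Fold 3 trace (fit on samples 0–3):

Training: $X = [650, 850, 1100, 1400]$, $y = [180, 220, 280, 340]$

Fitted line: $\hat{y} \approx 39.13 + 0.2159\,x$

Test predictions:

  • $x = 1600$: $\hat{y} \approx 384.52$, residual $\approx 370 - 384.52 = -14.52$, squared error $\approx 210.94$
  • $x = 1900$: $\hat{y} \approx 449.29$, residual $\approx 430 - 449.29 = -19.29$, squared error $\approx 371.94$

Fold 3 MSE $= (210.94 + 371.94) / 2 \approx 291.44$

The fold MSEs vary considerably (about 259 for Fold 1 and 291 for Fold 3, against 100 for Fold 2) because 2-sample test sets are noisy. Folds 1 and 3 also have to extrapolate beyond the range of their training data, while Fold 2 interpolates. The CV estimate averages this variation out. These fold MSEs are also larger than the full training MSE of 22.2, which is expected: each fold trains on only 4 of 6 samples and is scored on samples it never saw.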
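A hand trace like the one for Fold 1 can be checked with the closed-form formulas for simple linear regression (slope = covariance / variance, intercept from the means). The sketch below uses plain NumPy; the variable names are illustrative.

```python
import numpy as np

# Fold 1: train on samples 2-5, test on samples 0-1
X_train = np.array([1100, 1400, 1600, 1900], dtype=float)
y_train = np.array([280, 340, 370, 430], dtype=float)
X_test = np.array([650, 850], dtype=float)
y_test = np.array([180, 220], dtype=float)

# Closed-form least squares for one feature:
#   slope = sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)
#   intercept = y_bar - slope * x_bar
x_bar, y_bar = X_train.mean(), y_train.mean()
slope = np.sum((X_train - x_bar) * (y_train - y_bar)) / np.sum((X_train - x_bar) ** 2)
intercept = y_bar - slope * x_bar

y_pred = intercept + slope * X_test
fold1_mse = np.mean((y_test - y_pred) ** 2)

print(f"slope = {slope:.4f}, intercept = {intercept:.2f}")
print(f"predictions = {np.round(y_pred, 2)}, fold MSE = {fold1_mse:.2f}")
```

Swapping in the Fold 3 arrays checks the second trace the same way.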

sklearn Implementation

```python
kf = KFold(n_splits=3, shuffle=False)

mse_per_fold = []
for fold_idx, (train_idx, test_idx) in enumerate(kf.split(X)):
    X_tr, X_te = X[train_idx], X[test_idx]
    y_tr, y_te = y[train_idx], y[test_idx]

    # Fit a fresh model on this fold's training samples only
    model = LinearRegression()
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    fold_mse = ((y_te - y_pred) ** 2).mean()
    mse_per_fold.append(fold_mse)
    print(f"Fold {fold_idx+1}: Test MSE = {fold_mse:.2f}")

# Mean and (population) standard deviation across the 3 folds
print(f"CV MSE = {np.mean(mse_per_fold):.2f} ± {np.std(mse_per_fold):.2f}")
```

```text
Fold 1: Test MSE = 259.10
Fold 2: Test MSE = 100.00
Fold 3: Test MSE = 291.44
CV MSE = 216.85 ± 83.67
```

The ±84 standard deviation on a mean of 217 shows how much fold-to-fold variance we have with only 2 test samples per fold. This is the signal to either increase $K$ or use LOOCV.
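One way to act on that signal is to rerun the same evaluation for several values of K and compare. The sketch below tries K = 2, 3 and 6 on the anchor data (6 folds on 6 samples is LOOCV). The `results` dictionary is an illustrative name. Keep in mind that the per-fold spread is computed over test sets of different sizes, so treat the comparison as a rough guide only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X = np.array([650, 850, 1100, 1400, 1600, 1900]).reshape(-1, 1)
y = np.array([180, 220, 280, 340, 370, 430])

results = {}
for k in (2, 3, 6):  # K = 6 on 6 samples is leave-one-out
    scores = cross_val_score(
        LinearRegression(), X, y,
        cv=KFold(n_splits=k, shuffle=False),
        scoring="neg_mean_squared_error",
    )
    fold_mse = -scores  # flip the sign back to a plain MSE
    results[k] = (fold_mse.mean(), fold_mse.std())
    print(f"K={k}: CV MSE = {fold_mse.mean():.2f} ± {fold_mse.std():.2f} "
          f"({len(X) // k} test samples per fold)")
```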

cross_val_score — Convenient Wrapper

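```python
scores = cross_val_score(
    LinearRegression(), X, y,
    cv=3,  # an integer means unshuffled KFold for a regressor
    scoring='neg_mean_squared_error'
)
cv_mse = -scores.mean()  # negate to recover a positive MSE
cv_std = scores.std()    # the spread is unaffected by the sign flip
print(f"CV MSE: {cv_mse:.2f} ± {cv_std:.2f}")
```

```text
CV MSE: 216.85 ± 83.67
```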

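sklearn uses neg_mean_squared_error because its scoring convention is that higher is always better, so model-selection tools can maximize any score. Negating MSE turns an error to minimize into a score to maximize. Always negate the mean again when reporting it; the standard deviation is unaffected by the sign flip.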
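The sign convention is easy to see by requesting several scorers at once. The sketch below uses `cross_validate` with three built-in scoring strings. Error-based scorers come back negated, while `r2` is already "higher is better" and should be left alone.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

X = np.array([650, 850, 1100, 1400, 1600, 1900]).reshape(-1, 1)
y = np.array([180, 220, 280, 340, 370, 430])

cv_results = cross_validate(
    LinearRegression(), X, y, cv=3,
    scoring=("neg_mean_squared_error", "neg_root_mean_squared_error", "r2"),
)
neg_mse = cv_results["test_neg_mean_squared_error"]
neg_rmse = cv_results["test_neg_root_mean_squared_error"]

print("raw scores (all <= 0):", neg_mse.round(2))
print("fold MSE :", (-neg_mse).round(2))    # negate error-based scores
print("fold RMSE:", (-neg_rmse).round(2))
print("fold R^2 :", cv_results["test_r2"].round(3))  # do not negate
```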

Leave-One-Out Cross-Validation (LOOCV)

LOOCV is K-Fold with $K = n$: each fold trains on $n - 1$ samples and tests on 1.

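```python
loo = LeaveOneOut()
scores = cross_val_score(
    LinearRegression(), X, y,
    cv=loo,
    scoring='neg_mean_squared_error'
)
# With one test sample per fold, each fold's "MSE" is a single squared error
print(f"LOOCV MSE per sample: {-scores.round(1)}")
print(f"LOOCV Mean MSE: {-scores.mean():.4f}")
```

```text
LOOCV MSE per sample: [42.9 23.4 67.2 67.2 21.3 54.2]
LOOCV Mean MSE: 46.0356
```

LOOCV MSE (46.0) is about twice the full-data train MSE (22.2). That gap is expected: each left-out sample is predicted by a line that was not allowed to bend toward it, so for least squares the LOOCV error can never be lower than the training error. Note that samples 2 and 3 (sq_ft = 1100 and 1400) have the highest individual LOO errors: both sit above the line that the other points define. The endpoints (650 and 1900) come next, because removing an endpoint shifts the fitted line the most.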
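For ordinary least squares specifically, the n refits can be replaced by a single fit. The standard identity is that a sample's leave-one-out residual equals its training residual divided by $1 - h_{ii}$, where $h_{ii}$ is the sample's leverage (the diagonal of the hat matrix). This shortcut is not part of the original walkthrough; the sketch below checks it against the `cross_val_score` result, with variable names chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X = np.array([650, 850, 1100, 1400, 1600, 1900]).reshape(-1, 1)
y = np.array([180, 220, 280, 340, 370, 430])

# Design matrix with an intercept column, and its hat (projection) matrix H.
A = np.column_stack([np.ones(len(X)), X[:, 0]]).astype(float)
H = A @ np.linalg.solve(A.T @ A, A.T)
leverage = np.diag(H)

# Training residuals from ONE fit on all 6 samples.
beta = np.linalg.lstsq(A, y, rcond=None)[0]
train_resid = y - A @ beta

# OLS identity: leave-one-out residual = training residual / (1 - leverage).
loo_resid = train_resid / (1.0 - leverage)
loo_mse_shortcut = np.mean(loo_resid ** 2)

# Reference: refit the model n times with scikit-learn.
scores = cross_val_score(LinearRegression(), X, y, cv=LeaveOneOut(),
                         scoring="neg_mean_squared_error")
loo_mse_refit = -scores.mean()

print(f"train MSE     = {np.mean(train_resid ** 2):.2f}")
print(f"LOOCV (hat)   = {loo_mse_shortcut:.2f}")
print(f"LOOCV (refit) = {loo_mse_refit:.2f}")
```

Because every leverage lies between 0 and 1, each leave-one-out residual is at least as large in magnitude as the corresponding training residual. The identity holds for unregularized linear regression only; other models generally need the explicit refits.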

Figure: train/test assignment per fold. Left panel: K-Fold (K=3), two test samples per fold. Right panel: LOOCV (K=6), one test sample per fold. Legend: red = test, blue = train.

K-Fold vs LOOCV Tradeoff

| | K-Fold (K=5 or 10) | LOOCV |
|---|---|---|
| Variance of estimate | Higher (small test sets) | Lower (max training data) |
| Bias of estimate | Slightly higher (less training data) | Near-zero |
| Computation | K model fits | n model fits |
| For large n | Practical | Impractical |
| Recommended for | Most cases | Small datasets |

Nested Cross-Validation — Hyperparameter Tuning Without Leaking

A common mistake: use CV to select $\lambda$ (called `alpha` in sklearn's Ridge) and then report that same CV score as the model's performance estimate. The problem: you searched over $\lambda$ using all available data, so the test folds influenced which $\lambda$ was chosen. The reported score is optimistically biased.

The correct structure is nested CV:

```python
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.linear_model import Ridge

# Inner CV: selects best lambda within each outer training set
inner_cv = KFold(n_splits=3, shuffle=True, random_state=42)
# Outer CV: evaluates the final model on data not seen during lambda selection
outer_cv = KFold(n_splits=3, shuffle=True, random_state=0)

param_grid = {'alpha': [0.1, 1, 10, 100]}
gs = GridSearchCV(Ridge(), param_grid, cv=inner_cv, scoring='neg_mean_squared_error')

outer_scores = cross_val_score(gs, X, y, cv=outer_cv, scoring='neg_mean_squared_error')
# Negate only the mean: std() is already non-negative
print(f"Nested CV MSE: {-outer_scores.mean():.2f} ± {outer_scores.std():.2f}")
```

```text
Nested CV MSE: 38.54 ± 12.11
```

In each outer fold:

  1. The inner CV searches over alpha = [0.1, 1, 10, 100] using only the outer training set.
  2. The best alpha is refitted on the outer training set.
  3. Performance is evaluated on the outer test fold — data the inner CV never saw.
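The three steps above can be written out as an explicit loop, which makes it easier to see what the one-line `cross_val_score(gs, ...)` call does internally. This is a sketch on the anchor data. The names `outer_mse` and `chosen_alpha` are illustrative, and it assumes `GridSearchCV`'s default `refit=True`.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

X = np.array([650, 850, 1100, 1400, 1600, 1900]).reshape(-1, 1)
y = np.array([180, 220, 280, 340, 370, 430])

param_grid = {"alpha": [0.1, 1, 10, 100]}
inner_cv = KFold(n_splits=3, shuffle=True, random_state=42)
outer_cv = KFold(n_splits=3, shuffle=True, random_state=0)

outer_mse, chosen_alpha = [], []
for train_idx, test_idx in outer_cv.split(X):
    X_tr, y_tr = X[train_idx], y[train_idx]
    X_te, y_te = X[test_idx], y[test_idx]

    # Steps 1-2: the inner search sees only the outer training set, then
    # refits the winning alpha on all of it (refit=True is the default).
    search = GridSearchCV(Ridge(), param_grid, cv=inner_cv,
                          scoring="neg_mean_squared_error")
    search.fit(X_tr, y_tr)

    # Step 3: score on the outer test fold, which the search never touched.
    y_pred = search.predict(X_te)
    outer_mse.append(float(np.mean((y_te - y_pred) ** 2)))
    chosen_alpha.append(search.best_params_["alpha"])

print("alpha chosen per outer fold:", chosen_alpha)
print(f"Nested CV MSE: {np.mean(outer_mse):.2f} ± {np.std(outer_mse):.2f}")
```

Printing the alpha chosen in each outer fold is a useful side effect of the explicit loop: it shows whether the selection is stable across folds.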

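The outer CV score is an approximately unbiased estimate of the model's performance after hyperparameter selection, because no outer test fold takes part in choosing alpha. It will typically be worse than a non-nested estimate, which is the best of several scores measured on the same folds it is reported on.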

Choosing K in Practice

| Dataset size | Recommended approach |
|---|---|
| Small | LOOCV or K = n−1 |
| Medium | K = 10 (standard) |
| Large | K = 5 (speed) or single hold-out |
| Time series | Time-based split, no random shuffle |

For time series data (`TimeSeriesSplit` in sklearn), shuffling is invalid: a model must not be trained on future data to predict the past. Each fold must use only earlier time steps for training and later ones for testing.
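A short sketch of what `TimeSeriesSplit` produces is shown below. The data here is hypothetical: six observations assumed to be in chronological order, not the house-price anchor, which has no time dimension.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical example: 6 observations assumed to be in chronological order.
X_time = np.arange(6).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = [(tr.tolist(), te.tolist()) for tr, te in tscv.split(X_time)]

for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    # Every training index comes before every test index in the same fold.
    print(f"Fold {fold}: train={train_idx} test={test_idx}")
```

Unlike K-Fold, the training window grows from fold to fold and early samples are never used for testing, so the folds are not interchangeable.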

CV Flavors Summary

| Method | Folds | Best For |
|---|---|---|
| K-Fold | K | General-purpose evaluation |
| LOOCV | n | Small datasets |
| Stratified K-Fold | K | Classification (balanced class proportions) |
| TimeSeriesSplit | K | Sequential / temporal data |
| Nested CV | K×K | Hyperparameter tuning + unbiased evaluation |
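The stratified row is the only one above that is not demonstrated elsewhere in this post, so here is a minimal sketch of the difference it makes. The labels are hypothetical (9 samples of class 0 followed by 3 of class 1) and the helper `positives_per_test_fold` is introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical labels: 12 samples, 9 of class 0 then 3 of class 1 (sorted by class).
y_cls = np.array([0] * 9 + [1] * 3)
X_cls = np.zeros((len(y_cls), 1))  # features do not affect how folds are built

def positives_per_test_fold(splitter):
    """Count class-1 samples in each test fold produced by the splitter."""
    return [int(y_cls[test_idx].sum())
            for _, test_idx in splitter.split(X_cls, y_cls)]

plain = positives_per_test_fold(KFold(n_splits=3))
stratified = positives_per_test_fold(StratifiedKFold(n_splits=3))

print("KFold           positives per test fold:", plain)
print("StratifiedKFold positives per test fold:", stratified)
```

With unshuffled `KFold` on sorted labels, two test folds contain no positives at all, while `StratifiedKFold` places one positive in every fold.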

CV gives an estimate of generalization error — it doesn't guarantee it. If the entire dataset has a systematic bias (only expensive houses from one city), CV will accurately estimate how the model generalizes to that biased distribution, not to all house prices. CV improves the reliability of the estimate; it cannot fix data collection problems.

On our 6-sample anchor, the 3-fold CV standard deviation (±84) is almost 40% of the mean (217), so the error bar spans roughly 133 to 301. That is too wide to be a useful estimate. The honest conclusion: 6 samples is too few for reliable CV. As a rule of thumb, cross-validation needs at least 50–100 samples to produce stable estimates; with fewer, LOOCV is the best available option but still has high variance.

Test Your Understanding

  1. K-Fold with $K = n$ is identical to LOOCV. K-Fold with $K = 1$ would train on zero samples. What is the minimum useful $K$, and what does the extreme $K = n$ represent?

  2. The LOOCV MSE (46.04) is about twice the full-data train MSE (22.22) for linear regression. For a highly non-linear model like a degree-7 polynomial, would you expect the gap between LOOCV MSE and train MSE to shrink or grow? Why?

  3. In nested CV, the inner CV selects $\lambda$ for every outer fold. If the same $\lambda$ wins every time, is nested CV still necessary? What does this consistency tell you about the hyperparameter landscape?

  4. You run 5-fold CV on a 1000-sample dataset and get MSE = 0.54 ± 0.12. You then run 10-fold CV and get MSE = 0.52 ± 0.08. The mean barely changed but variance dropped. Explain geometrically why more folds reduces variance of the CV estimate.

  5. A colleague argues: "I'll use a 90/10 train/test split and just repeat it 10 times with different random seeds, then average the 10 test MSEs. Isn't that equivalent to 10-fold CV?" Is it? What's the key structural difference?
