Cross-Validation

A single train/test split answers one question: how does this model perform on this particular partition of this particular dataset? Cross-validation answers the more useful question: how does this model perform on average across several different partitions of the data? The difference matters most on small datasets, where a single split can be dominated by luck.

Anchor dataset: 6-sample house prices — small enough to trace every fold by hand.

python

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, LeaveOneOut

X = np.array([650, 850, 1100, 1400, 1600, 1900]).reshape(-1, 1)
y = np.array([180, 220, 280, 340, 370, 430])

The Problem with a Single Train/Test Split

With 6 samples and a 4/2 train/test split, the test set is just 2 data points. The MSE of those 2 predictions is your entire generalization estimate. If the test set happens to contain the two smallest houses (650 and 850), the line is fit to the other four and must extrapolate downward, and the test MSE comes out roughly ten times larger than when the test set holds the two extreme samples (650 and 1900). Same model, same data, very different verdict.

This is high-variance estimation: the metric changes substantially depending on which 2 samples land in the test set. Cross-validation reduces this dependence by using every sample as a test sample exactly once and averaging the per-fold errors.
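To make the spread concrete, the sketch below enumerates every possible 4/2 split of the anchor dataset (15 in total), fits a line to each training set, and records the test MSE. It uses `np.polyfit` as a stand-in for `LinearRegression`, and the names `split_mse` and `values` are illustrative, not from the original.

```python
from itertools import combinations

import numpy as np

X = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)

split_mse = {}
for test_pair in combinations(range(len(X)), 2):
    test_idx = list(test_pair)
    train_idx = [i for i in range(len(X)) if i not in test_pair]

    # Fit a straight line on the 4 training samples
    slope, intercept = np.polyfit(X[train_idx], y[train_idx], deg=1)

    # Score it on the 2 held-out samples
    y_pred = intercept + slope * X[test_idx]
    split_mse[test_pair] = float(np.mean((y[test_idx] - y_pred) ** 2))

values = np.array(list(split_mse.values()))
print(f"{len(values)} possible 4/2 splits")
print(f"test MSE ranges from {values.min():.1f} to {values.max():.1f}")
```

Any one of these 15 numbers could have been the reported "generalization error" had that split been chosen.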

K-Fold Cross-Validation — Manual Trace

With $K = 3$ and $n = 6$, each fold trains on 4 samples and tests on 2. The 6 samples are partitioned into 3 non-overlapping groups.

Fold	Train indices	Test indices	Test X	Test y
1	[2,3,4,5]	[0,1]	[650, 850]	[180, 220]
2	[0,1,4,5]	[2,3]	[1100, 1400]	[280, 340]
3	[0,1,2,3]	[4,5]	[1600, 1900]	[370, 430]

Fold 1 trace (fit on samples 2–5):

Training: $x = [1100, 1400, 1600, 1900]$, $y = [280, 340, 370, 430]$

$\bar{x} = 1500, \bar{y} = 355$

$w_{1} = \frac{\sum (x_{i} - \bar{x})(y_{i} - \bar{y})}{\sum (x_{i} - \bar{x})^{2}} = \frac{(-400)(-75) + (-100)(-15) + (100)(15) + (400)(75)}{160000 + 10000 + 10000 + 160000} = \frac{63000}{340000} \approx 0.185294$

$w_{0} = \bar{y} - w_{1} \bar{x} = 355 - 0.185294 \times 1500 = 355 - 277.941 = 77.059$

Test predictions:

$x = 650$: $\hat{y} = 77.059 + 0.185294 \times 650 = 197.500$, $\varepsilon = 180 - 197.500 = -17.500$, $\varepsilon^{2} = 306.25$
$x = 850$: $\hat{y} = 77.059 + 0.185294 \times 850 = 234.559$, $\varepsilon = 220 - 234.559 = -14.559$, $\varepsilon^{2} \approx 212.0$

Fold 1 MSE $= (306.25 + 212.0)/2 \approx 259.1$

Fold 3 trace (fit on samples 0–3):

Training: $x = [650, 850, 1100, 1400]$, $y = [180, 220, 280, 340]$

$\bar{x} = 1000, \bar{y} = 255$

$w_{1} = \frac{(-350)(-75) + (-150)(-35) + (100)(25) + (400)(85)}{122500 + 22500 + 10000 + 160000} = \frac{68000}{315000} \approx 0.215873$

$w_{0} = 255 - 0.215873 \times 1000 = 39.127$

Test predictions:

$x = 1600$: $\hat{y} = 39.127 + 0.215873 \times 1600 = 384.524$, $\varepsilon = 370 - 384.524 = -14.524$, $\varepsilon^{2} \approx 210.9$
$x = 1900$: $\hat{y} = 39.127 + 0.215873 \times 1900 = 449.286$, $\varepsilon = 430 - 449.286 = -19.286$, $\varepsilon^{2} \approx 371.9$

Fold 3 MSE $= (210.9 + 371.9)/2 \approx 291.4$

The fold MSEs vary considerably (Fold 2, traced the same way, gives $w_{1} = 0.2$, $w_{0} = 50$ and an MSE of exactly 100) because 2-sample test sets are noisy. The CV estimate averages this variation out. These fold MSEs are also larger than the full-data training MSE of 22.2, as expected: each fold trains on only 4 of 6 samples, and Folds 1 and 3 must extrapolate beyond the range of their training data.
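The hand trace can be checked with a few lines of NumPy. This is a minimal sketch that applies the closed-form slope and intercept formulas from the trace to each of the three folds; the helper name `fit_line` is hypothetical.

```python
import numpy as np

X = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)


def fit_line(x_tr, y_tr):
    """Closed-form simple linear regression; returns (w0, w1)."""
    x_bar, y_bar = x_tr.mean(), y_tr.mean()
    w1 = np.sum((x_tr - x_bar) * (y_tr - y_bar)) / np.sum((x_tr - x_bar) ** 2)
    w0 = y_bar - w1 * x_bar
    return w0, w1


# Same contiguous folds as the table: test on [0,1], then [2,3], then [4,5]
folds = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]

fold_mse = []
for k, test_idx in enumerate(folds, start=1):
    train_idx = np.setdiff1d(np.arange(len(X)), test_idx)
    w0, w1 = fit_line(X[train_idx], y[train_idx])
    residuals = y[test_idx] - (w0 + w1 * X[test_idx])
    fold_mse.append(float(np.mean(residuals ** 2)))
    print(f"Fold {k}: w1 = {w1:.6f}, w0 = {w0:.3f}, test MSE = {fold_mse[-1]:.2f}")
```

Carrying full precision through the computation avoids the small rounding drift that creeps into a hand calculation.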

sklearn Implementation

python

kf = KFold(n_splits=3, shuffle=False)

mse_per_fold = []
for fold_idx, (train_idx, test_idx) in enumerate(kf.split(X)):
    X_tr, X_te = X[train_idx], X[test_idx]
    y_tr, y_te = y[train_idx], y[test_idx]
    
    model = LinearRegression()
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    fold_mse = ((y_te - y_pred) ** 2).mean()
    mse_per_fold.append(fold_mse)
    print(f"Fold {fold_idx+1}: Test MSE = {fold_mse:.2f}")

print(f"CV MSE = {np.mean(mse_per_fold):.2f} ± {np.std(mse_per_fold):.2f}")

Fold 1: Test MSE = 259.10
Fold 2: Test MSE = 100.00
Fold 3: Test MSE = 291.44
CV MSE = 216.85 ± 83.67

The ±84 standard deviation on a mean of 217 shows how much fold-to-fold variation remains with only 2 test samples per fold. A spread this wide relative to the mean is a signal to try a larger $K$ or LOOCV, so that each model trains on more of the data.
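One way to act on that signal is to rerun the same evaluation for several values of $K$ and compare. The sketch below does this for $K = 2, 3, 6$ on the anchor dataset; the choice of values and the `summary` dictionary are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X = np.array([650, 850, 1100, 1400, 1600, 1900]).reshape(-1, 1)
y = np.array([180, 220, 280, 340, 370, 430])

summary = {}
for k in (2, 3, 6):
    # Unshuffled K-fold, so the folds are contiguous blocks of samples
    scores = cross_val_score(
        LinearRegression(), X, y,
        cv=KFold(n_splits=k),
        scoring="neg_mean_squared_error",
    )
    fold_mse = -scores  # flip the sign to get plain MSE per fold
    summary[k] = (float(fold_mse.mean()), float(fold_mse.std()))
    print(f"K={k}: CV MSE = {summary[k][0]:.2f} ± {summary[k][1]:.2f}")
```

With $K = 6$ every fold holds a single test sample, which is exactly LOOCV.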

cross_val_score — Convenient Wrapper

python

scores = cross_val_score(
    LinearRegression(), X, y,
    cv=3,
    scoring='neg_mean_squared_error'
)
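cv_mse = -scores.mean()  # scores are negated MSEs, so flip the sign
cv_std = scores.std()    # the spread is unaffected by the sign flip
print(f"CV MSE: {cv_mse:.2f} ± {cv_std:.2f}")

CV MSE: 216.85 ± 83.67

sklearn's scorers follow a greater-is-better convention, so error metrics are exposed in negated form (neg_mean_squared_error). That lets model-selection tools such as GridSearchCV always maximize the score, whatever the metric. Negate the mean to recover the MSE; the standard deviation needs no sign change.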
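A short sketch of the sign convention in practice: the raw scores are all non-positive, the "best" (largest) score corresponds to the smallest MSE, and RMSE can be derived from the negated values. Variable names here are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X = np.array([650, 850, 1100, 1400, 1600, 1900]).reshape(-1, 1)
y = np.array([180, 220, 280, 340, 370, 430])

scores = cross_val_score(
    LinearRegression(), X, y, cv=3, scoring="neg_mean_squared_error"
)
print("raw scores:", scores)  # every entry is <= 0

mse_per_fold = -scores                 # plain MSE, one value per fold
rmse_per_fold = np.sqrt(mse_per_fold)  # same units as y

# Greater-is-better: the largest score is the fold with the smallest MSE
best_fold = int(scores.argmax())
print(f"best fold index: {best_fold}, MSE there: {mse_per_fold[best_fold]:.2f}")
print("RMSE per fold:", rmse_per_fold.round(2))
```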

Leave-One-Out Cross-Validation (LOOCV)

LOOCV is K-Fold with $K = n$: each fold trains on $n - 1$ samples and tests on 1.
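The equivalence is easy to confirm: the sketch below compares the index splits produced by `LeaveOneOut` with those of an unshuffled `KFold` whose `n_splits` equals the number of samples. The placeholder array `X_idx` is an assumption used only to give the splitters something to index.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X_idx = np.arange(6).reshape(-1, 1)  # 6 placeholder samples

loo_splits = [(tr.tolist(), te.tolist()) for tr, te in LeaveOneOut().split(X_idx)]
kf_splits = [
    (tr.tolist(), te.tolist())
    for tr, te in KFold(n_splits=len(X_idx)).split(X_idx)
]

for train_idx, test_idx in loo_splits:
    print(f"train={train_idx} test={test_idx}")
print("identical to KFold(n_splits=n):", loo_splits == kf_splits)
```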

python

loo = LeaveOneOut()
scores = cross_val_score(
    LinearRegression(), X, y,
    cv=loo,
    scoring='neg_mean_squared_error'
)
print(f"LOOCV MSE per sample: {-scores.round(1)}")
print(f"LOOCV Mean MSE: {-scores.mean():.4f}")

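LOOCV MSE per sample: [42.9 23.4 67.2 67.2 21.3 54.2]
LOOCV Mean MSE: 46.0356

LOOCV MSE (46.0) is about twice the full-data training MSE (22.2). For ordinary least squares this direction is guaranteed: each leave-one-out residual is at least as large in magnitude as that sample's residual from the full-data fit, because the left-out sample no longer pulls the line toward itself. The largest individual errors belong to samples 2 and 3 (sq_ft = 1100 and 1400), which sit furthest from the fitted line, followed by the endpoints (650 and 1900), whose removal shifts the line the most.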
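For ordinary least squares, the leave-one-out residuals do not require $n$ refits: a standard identity gives each one as the training residual divided by $1 - h_{ii}$, where $h_{ii}$ is the sample's leverage (the diagonal of the hat matrix). The sketch below applies that identity and checks it against brute-force refitting; it assumes a plain linear model with intercept, and the variable names are illustrative.

```python
import numpy as np

x = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)

# Design matrix with an intercept column
A = np.column_stack([np.ones_like(x), x])

# Full-data least-squares fit and its training residuals
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
train_resid = y - A @ beta

# Leverage h_ii from a thin QR factorization: squared row norms of Q
Q, _ = np.linalg.qr(A)
leverage = np.sum(Q ** 2, axis=1)

# Shortcut: leave-one-out residual = training residual / (1 - leverage)
loo_resid = train_resid / (1.0 - leverage)

# Brute-force check: refit without sample i, then predict sample i
brute = []
for i in range(len(x)):
    keep = np.arange(len(x)) != i
    slope, intercept = np.polyfit(x[keep], y[keep], deg=1)
    brute.append(y[i] - (intercept + slope * x[i]))
brute = np.array(brute)

train_mse = float(np.mean(train_resid ** 2))
loo_mse = float(np.mean(loo_resid ** 2))
print(f"train MSE = {train_mse:.4f}, LOOCV MSE = {loo_mse:.4f}")
print("shortcut matches refitting:", np.allclose(loo_resid, brute))
```

Because $0 < 1 - h_{ii} \leq 1$, dividing by it can only enlarge a residual, which is why LOOCV MSE sits above training MSE for this model.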

Figure: fold layouts on the 6 samples. Left: 3-fold CV (Folds 1–3), two test samples per fold. Right: LOOCV (Folds 1–6), one test sample per fold. Red = test, blue = train.

K-Fold vs LOOCV Tradeoff

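Criterion	K-Fold (K=5 or 10)	LOOCV
Variance of estimate	Usually lower (fold models differ more, so their errors are less correlated)	Often higher (the $n$ training sets are nearly identical; model-dependent)
Bias of estimate	Slightly higher (less training data per fold)	Near-zero (trains on $n - 1$ samples)
Computation	K model fits	n model fits
For large n	Practical	Impractical
Recommended for	Most cases	Small datasets ( $n < 100$ )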
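The computation row can be read straight off the splitter objects. A minimal sketch, assuming a hypothetical 1000-sample dataset (the array `X_big` holds placeholder zeros, since only its length matters here):

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X_big = np.zeros((1000, 1))  # hypothetical dataset; values are irrelevant

splitters = {
    "5-fold": KFold(n_splits=5),
    "10-fold": KFold(n_splits=10),
    "LOOCV": LeaveOneOut(),
}

n_fits = {}
for name, cv in splitters.items():
    n_fits[name] = cv.get_n_splits(X_big)  # one model fit per split
    train_size = len(X_big) - len(X_big) // n_fits[name]
    print(f"{name:>7}: {n_fits[name]:4d} fits, {train_size} training samples each")
```

Going from 10-fold to LOOCV buys about 100 extra training samples per fit at the cost of 100 times as many fits.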

Nested Cross-Validation — Hyperparameter Tuning Without Leaking

A common mistake is to use CV to select $\lambda$ (called alpha in sklearn's Ridge) and then report that same CV score as the model's performance estimate. The search over $\lambda$ used all available data, so the test folds influenced which $\lambda$ was chosen, and the reported score is optimistically biased.

The correct structure is nested CV:

python

from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.linear_model import Ridge

# Inner CV: selects best lambda within each outer training set
inner_cv = KFold(n_splits=3, shuffle=True, random_state=42)
# Outer CV: evaluates the final model on data not seen during lambda selection
outer_cv = KFold(n_splits=3, shuffle=True, random_state=0)

param_grid = {'alpha': [0.1, 1, 10, 100]}
gs = GridSearchCV(Ridge(), param_grid, cv=inner_cv, scoring='neg_mean_squared_error')

outer_scores = cross_val_score(gs, X, y, cv=outer_cv, scoring='neg_mean_squared_error')
# Negate the mean to recover MSE; the standard deviation is already non-negative
print(f"Nested CV MSE: {-outer_scores.mean():.2f} ± {outer_scores.std():.2f}")

Nested CV MSE: 38.54 ± 12.11

In each outer fold:

The inner CV searches over alpha = [0.1, 1, 10, 100] using only the outer training set.
A model with the best alpha is refit on the full outer training set.
Performance is evaluated on the outer test fold, which the inner CV never saw.

The outer CV score is an approximately unbiased estimate of the performance of the whole procedure, hyperparameter selection included. It will typically look worse than a non-nested estimate because the selection step can no longer flatter it.
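The three steps can also be written out as an explicit loop, which makes it possible to inspect which alpha each outer fold selected. This is a sketch under the same settings as the snippet above; the lists `outer_mse` and `chosen_alpha` are illustrative names, and the printed values depend on the shuffle seeds.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

X = np.array([650, 850, 1100, 1400, 1600, 1900]).reshape(-1, 1)
y = np.array([180, 220, 280, 340, 370, 430])

inner_cv = KFold(n_splits=3, shuffle=True, random_state=42)
outer_cv = KFold(n_splits=3, shuffle=True, random_state=0)
param_grid = {"alpha": [0.1, 1, 10, 100]}

outer_mse, chosen_alpha = [], []
for train_idx, test_idx in outer_cv.split(X):
    # Step 1: the inner search sees only the outer training set
    gs = GridSearchCV(
        Ridge(), param_grid, cv=inner_cv, scoring="neg_mean_squared_error"
    )
    # Step 2: refit=True (the default) refits the best alpha on all of it
    gs.fit(X[train_idx], y[train_idx])

    # Step 3: score on the outer test fold, untouched by the search
    y_pred = gs.predict(X[test_idx])
    outer_mse.append(float(np.mean((y[test_idx] - y_pred) ** 2)))
    chosen_alpha.append(gs.best_params_["alpha"])
    print(f"chosen alpha = {chosen_alpha[-1]}, outer test MSE = {outer_mse[-1]:.2f}")

print(f"Nested CV MSE = {np.mean(outer_mse):.2f}")
```

If different outer folds pick different alphas, that is itself useful information: the selection is sensitive to which samples are available.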

Choosing K in Practice

Dataset size	Recommended approach
$n < 100$	LOOCV ($K = n$)
$100 \leq n \leq 10{,}000$	K = 10 (standard)
$n > 10{,}000$	K = 5 (speed) or single hold-out
Time series	Time-based split — no random shuffle

For time series data (TimeSeriesSplit in sklearn), random shuffling is invalid: future data cannot be used to predict the past. Each fold must use only earlier time steps for training and later ones for testing.
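A minimal sketch of what such splits look like, assuming (hypothetically) that the row order of a 6-sample dataset is its time order. The array `t` is a placeholder; only the indices matter.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

t = np.arange(6).reshape(-1, 1)  # hypothetical: row index = time step

tscv = TimeSeriesSplit(n_splits=3)
splits = [(tr.tolist(), te.tolist()) for tr, te in tscv.split(t)]

for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"Fold {fold}: train={train_idx} test={test_idx}")

# Every training index precedes every test index in each fold
no_leak = all(max(tr) < min(te) for tr, te in splits)
print("training always precedes testing:", no_leak)
```

Unlike K-Fold, the training window grows from fold to fold and early samples are never used for testing.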

CV Flavors Summary

Method	Folds	Best For
K-Fold	K	General-purpose evaluation
LOOCV	n	Small datasets ( $n < 100$ )
Stratified K-Fold	K	Classification (balanced class proportions)
TimeSeriesSplit	K	Sequential / temporal data
Nested CV	K×K	Hyperparameter tuning + unbiased evaluation
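The stratified variant is the only row the house-price anchor cannot illustrate, since it needs class labels. The sketch below uses a hypothetical 9-sample classification target with a 2:1 class imbalance (`y_cls` and `X_cls` are made up for illustration) and counts the classes in each test fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: six samples of class 0, three of class 1
y_cls = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1])
X_cls = np.zeros((len(y_cls), 1))  # features are irrelevant to the split

skf = StratifiedKFold(n_splits=3)
test_counts = []
for fold, (train_idx, test_idx) in enumerate(skf.split(X_cls, y_cls), start=1):
    counts = np.bincount(y_cls[test_idx], minlength=2).tolist()
    test_counts.append(counts)
    print(f"Fold {fold}: test class counts [class 0, class 1] = {counts}")
```

An unshuffled plain K-Fold on the same labels would put all three class-1 samples in the last fold, leaving the other two test folds with no positives at all.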

Related Concepts and Honest Limitations

CV gives an estimate of generalization error; it does not guarantee it. If the entire dataset has a systematic bias (only expensive houses from one city), CV will accurately estimate how the model generalizes to that biased distribution, not to all house prices. CV improves the reliability of the estimate; it cannot fix data collection problems.

On our 6-sample anchor, the 3-fold CV standard deviation of ±84 is nearly 40% of the mean of 217. One standard deviation either side of the mean spans roughly 133 to 301, which is too wide to be a useful estimate. The honest conclusion: 6 samples is too few for reliable CV. As a rule of thumb, cross-validation needs at least 50–100 samples to produce stable estimates; with fewer, LOOCV is the best available option but still has high variance.

Test Your Understanding

K-Fold with $K = n$ is identical to LOOCV. K-Fold with $K = 1$ would train on zero samples. What is the minimum useful $K$, and what does the extreme $K = 2$ represent?
The LOOCV MSE (46.04) is about twice the full-data training MSE (22.22) for linear regression. For a highly non-linear model like a degree-7 polynomial, would you expect the gap between LOOCV MSE and training MSE to shrink or grow? Why?
In nested CV, the inner CV selects $\lambda = 1$ for every outer fold. If the same $\lambda$ wins every time, is nested CV still necessary? What does this consistency tell you about the hyperparameter landscape?
You run 5-fold CV on a 1000-sample dataset and get MSE = 0.54 ± 0.12. You then run 10-fold CV and get MSE = 0.52 ± 0.08. The mean barely changed but variance dropped. Explain geometrically why more folds reduces variance of the CV estimate.
A colleague argues: "I'll use a 90/10 train/test split and just repeat it 10 times with different random seeds, then average the 10 test MSEs. Isn't that equivalent to 10-fold CV?" Is it? What's the key structural difference?
