Sampling

Apr 11, 2026 · 8 min read · By mohammed.vasim
Statistics · Math · Data Science

Every model evaluation involves sampling, whether you think about it that way or not. When you run 6-fold cross-validation, you are not measuring the model's performance on all possible data — you are measuring it on six specific data subsets and using those to estimate something broader. That act of estimation only works if you understand what "population" you are trying to generalize to, and whether your sample actually represents it.

This distinction — population versus sample — ripples through every statistical method you will ever use in machine learning.

The Anchor Dataset

Throughout this post, every example uses six cross-validation accuracy scores from a classifier:

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

These six folds are a sample. The population is the model's accuracy on all possible held-out test sets drawn from the same distribution — a quantity you can never fully observe.

What Do These Words Actually Mean?

A population is the complete set of all items or observations you are interested in. In ML:

  • All possible test examples the model might encounter in production
  • All possible random seeds and data splits for a given training procedure
  • Every email that will ever be scored by a spam classifier

A sample is the portion you actually observe. The six CV folds in accuracy are a sample. From that sample, you estimate the population parameter — the model's true generalization accuracy.

The goal is to use the sample to understand the population. That only works if your sample actually represents the population.
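To make the distinction concrete, here is a minimal sketch built on a synthetic population. The population values, the seed, and the names `population` and `sample` are illustrative assumptions, not measurements from the classifier above.

```python
import random
import statistics

# Hypothetical population: 10,000 per-test-set accuracies, invented for illustration.
rng = random.Random(0)
population = [min(1.0, max(0.0, rng.gauss(0.84, 0.05))) for _ in range(10_000)]

# In practice only a small sample like this one is ever observed.
sample = rng.sample(population, 6)

mu = statistics.mean(population)   # population parameter (normally unobservable)
x_bar = statistics.mean(sample)    # sample statistic (what you actually compute)

print(f"population mean: {mu:.3f}")
print(f"sample mean:     {x_bar:.3f}")
print(f"estimation gap:  {abs(x_bar - mu):.3f}")
```

Changing the seed changes the sample mean but leaves the population mean essentially fixed, which is the asymmetry the rest of this post relies on.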

Why You Cannot Just Measure the Whole Population

You might ask: why not just use all the data? The answer depends on context:

Physical impossibility. Future production data does not exist yet. You cannot measure model accuracy on data that has not arrived.

Cost and time. Training 1,000 models with different hyperparameters to find a good configuration would take weeks. You sample configurations instead.

Destructive measurement. In some domains (materials testing, clinical trials), measuring one sample makes it unusable — you cannot reuse the same patient in a double-blind trial.

The Notation You Will Encounter

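| Symbol | Meaning |
|---|---|
| $N$ | Population size |
| $n$ | Sample size (here, $n = 6$) |
| $\mu$ (mu) | Population mean: the model's true generalization accuracy |
| $\bar{x}$ (x-bar) | Sample mean, computed from our six folds: $\bar{x} \approx 0.838$ |
| $\sigma$ (sigma) | Population standard deviation |
| $s$ | Sample standard deviation |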

This notation matters because it reminds you constantly: am I looking at a fixed truth (population parameter, like $\mu$) or an estimate that varies depending on which sample I happened to draw (sample statistic, like $\bar{x}$)?

Sampling Methods: How You Sample Matters

Simple random sampling gives every member of the population an equal chance of selection. Shuffled k-fold cross-validation approximates this when it assigns examples to folds at random; stratified k-fold does the same within each class.

Stratified sampling divides the population into subgroups and samples from each. Stratified k-fold ensures that each fold has the same class distribution as the full dataset. This matters for imbalanced datasets — without stratification, one fold might have no minority-class examples at all.
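The fold assignment behind stratification can be sketched in a few lines. This is a simplified illustration, not scikit-learn's `StratifiedKFold` (the usual tool in practice); the helper `stratified_folds` and the 114/6 class split are assumptions made up for the example.

```python
import random
from collections import Counter

def stratified_folds(labels, k, seed=0):
    """Assign each index to one of k folds, spreading every class evenly."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    fold_of = [None] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal this class's examples round-robin across the folds.
        for position, idx in enumerate(indices):
            fold_of[idx] = position % k
    return fold_of

# Hypothetical imbalanced dataset: 95% class A, 5% class B.
labels = ["A"] * 114 + ["B"] * 6
fold_of = stratified_folds(labels, k=6)

for fold in range(6):
    counts = Counter(labels[i] for i in range(len(labels)) if fold_of[i] == fold)
    print(fold, dict(counts))
```

Because each class is dealt out separately, every fold receives 19 examples of class A and exactly one of class B. A purely random assignment of the same 120 examples gives no such guarantee for the six minority examples.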

Systematic sampling selects every $k$-th item. Time-series models sometimes use this to create expanding or sliding window folds.
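Systematic sampling with a fixed step maps directly onto list slicing. The sketch below assumes an ordered list of records and a random starting offset; `systematic_sample` and `records` are hypothetical names used only for illustration.

```python
import random

def systematic_sample(items, k, seed=0):
    """Return every k-th item, starting from a random offset in [0, k)."""
    start = random.Random(seed).randrange(k)
    return items[start::k]

# Hypothetical ordered records, e.g. 20 time-stamped observations.
records = list(range(20))
picked = systematic_sample(records, k=5)
print(picked)
```

The random offset keeps the first item from always being selected. If the data has a periodic pattern whose period matches the step, this scheme can produce a biased sample.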

Cluster sampling randomly selects whole groups. In some NLP datasets, all documents from one author are kept together to avoid data leakage — those authors form natural clusters.
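Keeping whole clusters together amounts to assigning folds per group rather than per document. Here is a minimal sketch under that assumption; the author names and the helper `group_folds` are invented for illustration, and scikit-learn's `GroupKFold` offers a production version of the same idea.

```python
import random

def group_folds(groups, k, seed=0):
    """Assign whole groups to folds so that no group is split across folds."""
    unique_groups = sorted(set(groups))
    random.Random(seed).shuffle(unique_groups)
    fold_of_group = {g: i % k for i, g in enumerate(unique_groups)}
    return [fold_of_group[g] for g in groups]

# Hypothetical corpus: one author label per document.
authors = ["ann", "bob", "ann", "cat", "bob", "dan", "cat", "eve", "ann", "dan"]
fold_of = group_folds(authors, k=3)

for doc_id, (author, fold) in enumerate(zip(authors, fold_of)):
    print(doc_id, author, fold)
```

Every document by a given author lands in the same fold, so the model is never evaluated on an author it saw during training.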

Convenience sampling — using whatever is easiest — is common and often problematic. Scraping data from one website, using only English-language text, testing only on one device: all of these create samples that may not represent the population you care about.

[Figure: simple random sampling from 12 items compared with stratified sampling from two classes (Class A, Class B) of 8 items. Legend: selected, not selected.]

The One Rule That Trumps Everything

Your sample must be representative of the population. A biased sample gives wrong answers even when your calculations are perfect.

The most famous ML-adjacent example: in the early days of image classification, models trained on ImageNet achieved high benchmark accuracy but failed badly in the wild. The benchmark population (curated, centered images) did not represent the real-world population (blurry, partially occluded, unusual angles). The math was fine. The sampling was broken.

The Sampling Distribution

When you compute the mean of your six CV folds, that mean is itself a random variable. If you ran 6-fold CV a hundred times with different random splits, you would get a hundred different sample means. That collection of means is called the sampling distribution of the mean.
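The repeated experiment described above can be simulated. The sketch below assumes a synthetic population of fold-level accuracies (normal, mean 0.84, standard deviation 0.05); those numbers are invented for illustration and are not derived from the anchor dataset.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population of fold-level accuracies (invented for illustration).
population = [rng.gauss(0.84, 0.05) for _ in range(50_000)]

# Repeat the "draw six scores, average them" experiment 100 times.
sample_means = [statistics.mean(rng.sample(population, 6)) for _ in range(100)]

print(f"smallest sample mean: {min(sample_means):.3f}")
print(f"largest sample mean:  {max(sample_means):.3f}")
print(f"spread (std dev) of the sample means: {statistics.stdev(sample_means):.3f}")
```

The list `sample_means` is an empirical picture of the sampling distribution: each entry is one mean you could have reported, and their spread is far smaller than the spread of the individual scores.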

The standard deviation of this sampling distribution has a special name: standard error. It tells you how much your sample estimates vary across hypothetical repetitions:

$$SE = \frac{s}{\sqrt{n}}$$

For our CV scores, $s \approx 0.051$ and $n = 6$:

$$SE = \frac{0.051}{\sqrt{6}} \approx 0.021$$

This means the mean accuracy across repetitions of the same 6-fold procedure would typically vary by about ±0.021. More folds reduce this, but by a square root relationship, so you need four times as many folds to cut the error in half.
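The square root relationship is easy to check numerically. This sketch holds the sample standard deviation fixed while the number of folds grows, which is a simplifying assumption: in a real experiment, changing the number of folds also changes the fold scores themselves.

```python
import math
import statistics

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
s = statistics.stdev(accuracy)  # sample standard deviation (n - 1 denominator)

def standard_error(s, n):
    """Standard error of the mean for n independent observations."""
    return s / math.sqrt(n)

# Assumption: s stays the same as folds are added.
for n in (6, 24, 96):
    print(f"n = {n:3d}  SE = {standard_error(s, n):.4f}")
```

Each fourfold increase in `n` halves the standard error, so halving the uncertainty twice would take sixteen times as many folds.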

Python Example

```python
import numpy as np
from scipy import stats

# Six cross-validation accuracy scores (the anchor dataset)
accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

n = len(accuracy)
mean_acc = np.mean(accuracy)
std_acc = np.std(accuracy, ddof=1)  # ddof=1 gives the sample standard deviation
se = std_acc / np.sqrt(n)           # standard error of the mean

print(f"Sample mean (x-bar): {mean_acc:.3f}")
print(f"Sample std dev (s):  {std_acc:.3f}")
print(f"Standard error:      {se:.3f}")

# 95% t-interval with n - 1 degrees of freedom
ci = stats.t.interval(0.95, df=n-1, loc=mean_acc, scale=se)
print(f"95% CI for true accuracy: ({ci[0]:.3f}, {ci[1]:.3f})")
```

```
Sample mean (x-bar): 0.838
Sample std dev (s):  0.051
Standard error:      0.021
95% CI for true accuracy: (0.785, 0.892)
```

The sample mean 0.838 is the descriptive statistic. The 95% CI (0.785, 0.892) is the inferential claim: if you repeated this exact CV procedure many times, about 95% of the intervals you constructed would contain the model's true generalization accuracy.

Calculation Trace

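| Phase | Formula | Values | Result |
|---|---|---|---|
| Sample mean | $\bar{x} = \frac{1}{n}\sum x_i$ | $5.03 / 6$ | $0.838$ |
| Sample std dev | $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$ | Deviations squared and summed: $\sqrt{0.01308 / 5}$ | $0.051$ |
| Standard error | $SE = s / \sqrt{n}$ | $0.051 / \sqrt{6}$ | $0.021$ |
| 95% CI | $\bar{x} \pm t_{0.975,\,5} \cdot SE$ | $0.8383 \pm 2.571 \times 0.02088$ | $(0.785, 0.892)$ |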

The previous posts established what statistics is and how its two branches differ. This post provides the foundation for everything inferential: the concept of a population parameter, a sample statistic, and the gap between them. Central tendency and dispersion (covered next) are the specific descriptive quantities you compute from samples. Standard error — introduced here — is how you quantify the reliability of those estimates when making inferential claims.

When This Framework Breaks Down

The formula $SE = s / \sqrt{n}$ assumes your observations are independent. Even in ordinary k-fold CV this holds only approximately, because the folds share most of their training data. In time-series CV the dependence is stronger: fold 3 overlaps with fold 2 in time. The standard error formula underestimates true uncertainty in that case. For time-series models, use proper temporal cross-validation (no look-ahead) and interpret confidence intervals cautiously. With fewer than 10 independent folds and a skewed accuracy distribution, the t-interval may not be reliable; consider bootstrapped confidence intervals instead.
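A bootstrapped interval for the mean can be sketched with the standard library alone. This is a plain percentile bootstrap, chosen here for simplicity; the helper `bootstrap_ci`, the number of resamples, and the seed are assumptions for illustration, and other bootstrap variants exist.

```python
import random
import statistics

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

def bootstrap_ci(values, n_resamples=10_000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    # Resample with replacement, recording the mean of each resample.
    means = sorted(
        statistics.mean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lower_idx = int((1 - level) / 2 * n_resamples)
    upper_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lower_idx], means[upper_idx]

low, high = bootstrap_ci(accuracy)
print(f"95% bootstrap CI for the mean: ({low:.3f}, {high:.3f})")
```

With only six values the bootstrap is itself rough, and a percentile interval on a sample this small tends to come out narrower than the t-interval. Treat it as a second opinion rather than a fix, and note that it still assumes the fold scores are independent.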

Test Your Understanding

  1. The six folds in accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88] are a sample. What is the population? Can you ever observe the population directly?

  2. If you increase from 6 to 24 CV folds, by what factor does the standard error decrease? Show your calculation.

  3. A model trained on English tweets is evaluated on an English tweet test set. The sample mean accuracy is 0.91. The model is deployed to process tweets in Spanish, French, and English. Is the 95% CI computed from this CV evaluation a valid estimate of production accuracy? Why or why not?

  4. Stratified 6-fold CV ensures each fold has the same class ratio as the full dataset. For a dataset that is 95% class A and 5% class B, why is stratification especially important? What goes wrong without it?


Understanding this population-sample distinction sets you up well for the next topic: Measure of Central Tendency — how to find the "typical" value in your data.


Previous: Types of Statistics | Next: Measure of Central Tendency
