
Population and Sample

Apr 11, 2026•14 min read•By Mohammed Vasim

Statistics • Math • Data Science

Every model evaluation involves sampling, whether you think about it that way or not. When you run 6-fold cross-validation, you are not measuring the model's performance on all possible data — you are measuring it on six specific data subsets and using those to estimate something broader. That act of estimation only works if you understand what "population" you are trying to generalize to, and whether your sample actually represents it.

This distinction — population versus sample — ripples through every statistical method you will ever use in machine learning.

The Anchor Dataset

Throughout this post, every example uses six cross-validation accuracy scores from a classifier:

```python
accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
```

These six folds are a sample. The population is the model's accuracy on all possible held-out test sets drawn from the same distribution — a quantity you can never fully observe.

What Do These Words Actually Mean?

A population is the complete set of all items or observations you are interested in. It is usually infinite or too large to measure entirely. In ML:

All possible test examples the model might encounter in production
All possible random seeds and data splits for a given training procedure
Every email that will ever be scored by a spam classifier

A sample is the portion you actually observe. The six CV folds in accuracy are a sample. From that sample, you estimate the population parameter — the model's true generalization accuracy.

The goal is to use the sample to understand the population. That only works if your sample actually represents the population.

Why You Cannot Just Measure the Whole Population

You might ask: why not just use all the data? The answer depends on context:

Physical impossibility. Future production data does not exist yet. You cannot measure model accuracy on data that has not arrived.

Cost and time. Training 1,000 models with different hyperparameters to find a good configuration would take weeks. You sample configurations instead.

Destructive measurement. In some domains (materials testing, clinical trials), measuring one sample makes it unusable — you cannot reuse the same patient in a double-blind trial.

Parameters vs Statistics

This is the notation distinction that trips up every newcomer, but it encodes something important: are you looking at a fixed truth about the world, or an estimate computed from the data you happened to collect?

Parameters describe the population. They are fixed — they do not change from study to study — but they are unknown. You are always trying to estimate them.

μ (mu): the true population mean. In the anchor: the model's true generalization accuracy across all possible folds.
σ²: the true population variance. Unknown.
σ: the true population standard deviation. Unknown.
π: the true population proportion (for binary outcomes). Unknown.
Notation: always Greek letters.

Statistics describe the sample. They are computed from the data you have. They are known, but they vary — collect a different sample and you get a different statistic.

x̄ (x-bar): the sample mean. Estimator of μ. From the anchor: x̄ = 0.838.
s²: the sample variance. Estimator of σ².
s: the sample standard deviation. Estimator of σ.
p̂ (p-hat): the sample proportion. Estimator of π.
Notation: Latin letters with hats or bars.

The full symbol reference:

Symbol	Type	What it measures	Known?
μ	Parameter	Population mean	Unknown
σ	Parameter	Population std deviation	Unknown
σ²	Parameter	Population variance	Unknown
x̄	Statistic	Sample mean	Known (0.838)
s	Statistic	Sample std deviation	Known (0.051)
s²	Statistic	Sample variance	Known
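As a quick check on the symbol table, the sample statistics can be computed directly from the anchor scores. This is a minimal sketch using only Python's standard `statistics` module; the variable names are illustrative. It also contrasts the sample variance (divides by n − 1) with the divide-by-n formula you would use if the six scores were the entire population.

```python
import statistics

# Anchor dataset from this post: six cross-validation accuracy scores.
accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

x_bar = statistics.mean(accuracy)      # sample mean, the estimator of mu
s2 = statistics.variance(accuracy)     # sample variance (n - 1 denominator), estimator of sigma^2
s = statistics.stdev(accuracy)         # sample standard deviation, estimator of sigma

# For contrast: the population formula divides by n instead of n - 1.
# It is only appropriate when the data ARE the whole population.
s2_divide_by_n = statistics.pvariance(accuracy)

print(f"x-bar = {x_bar:.3f}")
print(f"s^2   = {s2:.4f}")
print(f"s     = {s:.3f}")
print(f"divide-by-n variance = {s2_divide_by_n:.4f}")
```

The divide-by-n value is always the smaller of the two, which is why treating a sample as if it were the population understates the spread.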

Sampling Variability — Why One Sample Is Not Enough

The six folds in accuracy are not the only 6-fold split you could have drawn. A different random seed would give six different folds and a different x̄. This is sampling variability — and it is not error in your calculation. It is a fundamental property of working with samples.

To make it concrete, consider three hypothetical draws from the same distribution:

Sample A (the anchor): x̄ = 0.838 — the folds you actually ran
Sample B: x̄ = 0.821 — a harder split where more difficult examples landed in the validation set
Sample C: x̄ = 0.849 — an easier split with more representative validation folds

All three are valid estimates of μ. None is "wrong." They differ because a sample is not the population.

The collection of all possible x̄ values — one per hypothetical repetition of the CV procedure — is called the sampling distribution of the mean. Its spread tells you how reliable any single sample mean is.

As you increase the number of folds n, the sampling distribution narrows. The standard error quantifies this:

$SE = \frac{\sigma}{\sqrt{n}}$

In practice, σ is unknown, so you substitute s. With s ≈ 0.051 and n = 6:

$SE = \frac{0.051}{\sqrt{6}} \approx 0.021$
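The square-root relationship can be checked with a small simulation. The sketch below assumes, purely for illustration, that fold accuracies come from a normal distribution with mean 0.84 and standard deviation 0.05; the real population is unknown, and the function name is hypothetical. It draws many samples of size n, takes the mean of each, and compares the spread of those means with σ/√n.

```python
import random
import statistics

# Assumption for illustration only: fold accuracies are normally distributed
# with these parameters. The true population is never observed in practice.
TRUE_MU, TRUE_SIGMA = 0.84, 0.05

def spread_of_sample_means(n, repetitions=10_000, seed=0):
    """Standard deviation of the sample mean across many simulated samples of size n."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(TRUE_MU, TRUE_SIGMA) for _ in range(n))
        for _ in range(repetitions)
    ]
    return statistics.stdev(means)

for n in (6, 24):
    simulated = spread_of_sample_means(n)
    theoretical = TRUE_SIGMA / n ** 0.5
    print(f"n={n:2d}  simulated SE={simulated:.4f}  sigma/sqrt(n)={theoretical:.4f}")
```

Under these assumptions the simulated spread should land close to σ/√n for both sample sizes, and the n = 24 value should be roughly half the n = 6 value.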

Sampling Methods: How You Sample Matters

Simple Random Sampling (SRS) gives every member of the population an equal probability of selection. In CV, randomly selecting which data points go into each fold approximates SRS. Limitation: with rare classes, one or more of the 6 folds may contain zero minority-class examples purely by chance.

Stratified Sampling divides the population into subgroups (strata) and samples from each proportionally. scikit-learn's StratifiedKFold does this for class labels: it keeps the class ratio in each fold as close as possible to that of the full dataset. For an imbalanced dataset with 95% class A and 5% class B, this prevents folds where class B never appears. Without it, accuracy on class B cannot be measured in any fold that happens to contain no class B examples.
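To see the difference between random and stratified fold assignment, here is a standard-library sketch. It is not scikit-learn's implementation; `random_folds` and `stratified_folds` are hypothetical helpers, and the 114/6 dataset is made up to match the 95%/5% ratio above.

```python
import random
from collections import Counter

def random_folds(labels, k, seed=0):
    """Shuffle all indices, then deal them into k equal-sized folds."""
    idx = list(range(len(labels)))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def stratified_folds(labels, k, seed=0):
    """Shuffle within each class, then deal each class round-robin across folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for position, i in enumerate(members):
            folds[position % k].append(i)
    return folds

# Hypothetical imbalanced dataset: 114 examples of class "A", 6 of class "B" (5%).
labels = ["A"] * 114 + ["B"] * 6

for name, make_folds in [("random", random_folds), ("stratified", stratified_folds)]:
    counts = [Counter(labels[i] for i in fold)["B"] for fold in make_folds(labels, k=6)]
    print(f"{name:10s} class-B count per fold: {counts}")
```

With stratification every fold receives exactly one class B example. With plain random assignment the counts depend on the shuffle, and a fold with zero class B examples is entirely possible.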

Systematic Sampling selects every k-th element from a sorted or ordered list. In ML, logging metrics at every 100th training step is systematic sampling of the loss curve. Limitation: if the data has a periodic pattern at the same frequency as k, you will consistently hit the same phase of the cycle — biasing the sample.
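The periodicity problem is easy to reproduce. The sketch below uses a made-up metric that repeats every 10 steps and compares a sampling interval that matches the period with one that does not.

```python
import statistics

# Hypothetical metric with a period of 10 steps: 0, 1, ..., 9, 0, 1, ...
signal = [step % 10 for step in range(1_000)]

every_10th = signal[::10]   # sampling interval equals the period
every_7th = signal[::7]     # sampling interval shares no factor with the period

print(f"full mean:       {statistics.fmean(signal):.2f}")
print(f"every 10th mean: {statistics.fmean(every_10th):.2f}")
print(f"every 7th mean:  {statistics.fmean(every_7th):.2f}")
```

Sampling every 10th step always lands on the same phase of the cycle, so the sample sees only one value. An interval of 7 cycles through every phase and recovers a mean close to the true one.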

Cluster Sampling randomly selects whole groups (clusters) rather than individual elements. In NLP, sampling entire documents (clusters of sentences) rather than individual sentences keeps the linguistic context intact and avoids leakage at sentence boundaries. Limitation: if documents from one source are internally homogeneous, you get less diversity than SRS of the same total number of sentences.

Convenience Sampling uses whatever is easiest to collect: the first 1,000 rows of a CSV, all reviews from one platform, all users from one geographic region. It is common and almost always biased. If you evaluate a model only on the first 1,000 rows because they loaded quickly, and those rows were collected before a distribution shift, your accuracy estimate does not generalize.

Sampling Methods Summary

Method	When to Use	DS/ML Example	Main Limitation
SRS	Default when population is accessible	Random fold splits	Might miss rare classes
Stratified	Class imbalance, subgroup fairness	`StratifiedKFold`	Requires knowing strata
Systematic	Sequential or streaming data	Log every 100th training step	Periodic pattern bias
Cluster	Expensive individual collection	Sample entire NLP documents	Low diversity if clusters are homogeneous
Convenience	(Never recommended)	First 1,000 rows of a CSV	Always biased

Sampling Bias

Even a well-chosen sampling method can produce biased results if the process of collecting data is systematically skewed.

Selection bias means certain elements are more likely to be selected than others. Testing a sentiment model on Amazon product reviews excludes customers who never write reviews, so the sample comes from a different population than all customers. The measured accuracy can then differ from, and will often be higher than, the true production accuracy on the full customer base.

Survivorship bias means you only analyze elements that passed some filter. If you evaluate a model on examples where it was already confident (confidence > 0.7), the difficult edge cases are invisible. The reported accuracy looks great; the model fails on the inputs that matter most.

Temporal bias is specific to ML: if you accidentally train on data from the future and test on data from the past, the model has already "seen" the test distribution. This is data leakage. Proper temporal CV always trains on the past and tests on the future — never the reverse.
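A minimal sketch of a train-on-past, test-on-future split is shown below. It assumes the samples are already sorted by time, oldest first, and `expanding_window_splits` is a hypothetical helper written for this post. scikit-learn's TimeSeriesSplit offers a similar expanding-window scheme.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) pairs for time-ordered data.

    Every training index precedes every test index, so the model never
    sees the future. Assumes samples are sorted oldest first.
    """
    fold_size = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train_end = k * fold_size
        # The last split absorbs any leftover samples.
        test_end = n_samples if k == n_splits else train_end + fold_size
        yield list(range(train_end)), list(range(train_end, test_end))

for train, test in expanding_window_splits(n_samples=12, n_splits=3):
    print(f"train={train}  test={test}")
```

Each successive split grows the training window and moves the test window forward, which is the opposite of shuffling time-ordered rows into random folds.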

For each of these, the biased conclusion feels internally consistent. The problem is not in the math — it is in the mismatch between the sample you collected and the population you care about.

The One Rule That Trumps Everything

Your sample must be representative of the population. A biased sample gives wrong answers even when your calculations are perfect.

The most famous ML-adjacent example: in the early days of image classification, models trained on ImageNet achieved high benchmark accuracy but failed badly in the wild. The benchmark population (curated, centered images) did not represent the real-world population (blurry, partially occluded, unusual angles). The math was fine. The sampling was broken.

The Sampling Distribution

When you compute the mean of your six CV folds, that mean is itself a random variable. If you ran 6-fold CV a hundred times with different random splits, you would get a hundred different sample means. That collection of means is called the sampling distribution of the mean.

The standard deviation of this sampling distribution has a special name: standard error. It tells you how much your sample estimates vary across hypothetical repetitions:

$SE = \frac{s}{\sqrt{n}}$

For our CV scores, $s \approx 0.051$ and $n = 6$:

$SE = \frac{0.051}{\sqrt{6}} \approx 0.021$

This means the mean accuracy across repetitions of the same 6-fold procedure would typically vary by about ±0.021. More folds reduce this, but only through a square-root relationship: you need four times as many folds to cut the standard error in half.

Python Example

```python
import numpy as np
from scipy import stats

# Six cross-validation accuracy scores (the anchor dataset)
accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

n = len(accuracy)
mean_acc = np.mean(accuracy)
std_acc = np.std(accuracy, ddof=1)  # ddof=1: sample std dev (n - 1 denominator)
se = std_acc / np.sqrt(n)           # standard error of the mean: s / sqrt(n)

print(f"Sample mean (x-bar): {mean_acc:.3f}")
print(f"Sample std dev (s):  {std_acc:.3f}")
print(f"Standard error:      {se:.3f}")

# 95% t-interval with n - 1 degrees of freedom
ci = stats.t.interval(0.95, df=n-1, loc=mean_acc, scale=se)
print(f"95% CI for true accuracy: ({ci[0]:.3f}, {ci[1]:.3f})")
```

```text
Sample mean (x-bar): 0.838
Sample std dev (s):  0.051
Standard error:      0.021
95% CI for true accuracy: (0.785, 0.892)
```

The sample mean 0.838 is the descriptive statistic. The 95% CI (0.785, 0.892) is the inferential claim: if you repeated this exact CV procedure many times, about 95% of the intervals you constructed would contain the model's true generalization accuracy.

Calculation Trace

Phase	Formula	Values	Result
Sample mean	$\bar{x} = \frac{\sum x_i}{n}$	$(0.82 + 0.79 + 0.91 + 0.85 + 0.78 + 0.88) / 6$	$0.838$
Sample std dev	$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$	$\sqrt{0.01308 / 5}$	$0.051$
Standard error	$SE = s / \sqrt{n}$	$0.051 / \sqrt{6}$	$0.021$
95% CI	$\bar{x} \pm t^{*} \cdot SE$	$0.8383 \pm 2.571 \times 0.02088$	$(0.785, 0.892)$

Population vs Sample: At a Glance

Concept	Population	Sample
Definition	All possible elements	Subset actually collected
Describes	Parameters	Statistics
Notation	Greek (μ, σ, π)	Latin (x̄, s, p̂)
Known?	Unknown (target)	Known (computed)
Changes per study?	No	Yes

Finite Population Correction

The formula $SE = s / \sqrt{n}$ assumes the population is large relative to the sample. When the sample is a substantial fraction of the population (a common rule of thumb is n/N > 5%), the formula overestimates uncertainty. The corrected form is:

$SE_{corrected} = SE \times \sqrt{\frac{N - n}{N - 1}}$

In most ML scenarios the population is effectively infinite (all possible future predictions, all possible random splits), so FPC does not apply. It becomes relevant when sampling from a finite, enumerable list — all employees in a company, all unique users in a given week, all labeled examples in a fixed dataset. If you are sampling 500 users from a pool of 2,000, n/N = 25% and FPC matters.
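As a rough sketch, the correction is a single multiplication. The helper name `fpc_standard_error` and the uncorrected SE of 0.020 are hypothetical; the sample and population sizes are the 500-of-2,000 case described above.

```python
import math

def fpc_standard_error(se, n, N):
    """Apply the finite population correction sqrt((N - n) / (N - 1)) to an uncorrected SE."""
    if N <= 1 or not 0 < n <= N:
        raise ValueError("need N > 1 and 0 < n <= N")
    return se * math.sqrt((N - n) / (N - 1))

# 500 users sampled from a pool of 2,000 (n/N = 25%).
uncorrected = 0.020  # hypothetical uncorrected standard error
corrected = fpc_standard_error(uncorrected, n=500, N=2_000)

print(f"correction factor: {corrected / uncorrected:.3f}")
print(f"corrected SE:      {corrected:.4f}")
```

The factor is always at most 1, and it approaches 1 as N grows relative to n, which is why the correction is usually ignored for effectively infinite populations.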

Related Concepts

The previous posts established what statistics is and how its two branches differ. This post provides the foundation for everything inferential: the concept of a population parameter, a sample statistic, and the gap between them. Central tendency and dispersion (covered next) are the specific descriptive quantities you compute from samples. Standard error, introduced here, is how you quantify the reliability of those estimates when making inferential claims.

When This Framework Breaks Down

The formula $SE = s / \sqrt{n}$ assumes your observations are independent. In time-series CV, folds are not independent: fold 3 overlaps with fold 2 in time. The standard error formula underestimates true uncertainty in that case. Even in ordinary k-fold CV the fold scores are only approximately independent, because the training sets overlap. For time-series models, use proper temporal cross-validation (no look-ahead) and interpret confidence intervals cautiously. With fewer than 10 independent folds and a skewed accuracy distribution, the t-interval may not be reliable; consider bootstrapped confidence intervals instead.
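One way to build such an interval is the percentile bootstrap, sketched below with the standard library only. `bootstrap_ci` is a hypothetical helper, and the resample count and seed are arbitrary choices. Note that resampling six scores is itself a coarse approximation, and it does not repair dependence between folds.

```python
import random
import statistics

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

def bootstrap_ci(data, n_resamples=10_000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `data`."""
    rng = random.Random(seed)
    # Resample with replacement, same size as the original sample, and record each mean.
    means = sorted(
        statistics.fmean(rng.choices(data, k=len(data)))
        for _ in range(n_resamples)
    )
    alpha = (1 - level) / 2
    lower = means[int(alpha * n_resamples)]
    upper = means[int((1 - alpha) * n_resamples) - 1]
    return lower, upper

low, high = bootstrap_ci(accuracy)
print(f"95% bootstrap CI for mean accuracy: ({low:.3f}, {high:.3f})")
```

With so few observations the percentile interval tends to come out somewhat narrower than the t-interval, so treat it as a cross-check rather than a replacement.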

Test Your Understanding

The six folds in accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88] are a sample. What is the population? Can you ever observe the population directly?
If you increase from 6 to 24 CV folds, by what factor does the standard error decrease? Show your calculation.
A model trained on English tweets is evaluated on an English tweet test set. The sample mean accuracy is 0.91. The model is deployed to process tweets in Spanish, French, and English. Is the 95% CI computed from this CV evaluation a valid estimate of production accuracy? Why or why not?
Stratified 6-fold CV ensures each fold has the same class ratio as the full dataset. For a dataset that is 95% class A and 5% class B, why is stratification especially important? What goes wrong without it?
You are sampling 400 users from a company with 1,000 total employees to survey about tool usage. Should you apply the finite population correction? Compute the corrected SE if the uncorrected SE is 0.025.
