Descriptive and Inferential Statistics
When you train a model, you collect performance numbers. Then you have to decide: what do these numbers actually mean? That question splits into two fundamentally different acts, and statistics has a branch for each.
The first act is describing what you measured. The second is inferring something about what you did not measure. Getting these confused — treating descriptive results as inferential claims, or skipping description entirely — is one of the most common sources of misleading ML papers and engineering decisions.
The Anchor Dataset
Throughout this post, every example uses six cross-validation accuracy scores from a classifier:
accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
These represent six folds of cross-validation. Each value is the fraction of test examples classified correctly in that fold.
Descriptive Statistics: Summarizing What You Have
When you summarize the six CV scores with a mean or a chart, you are doing descriptive statistics. You are not making any guesses beyond the data in front of you. You are organizing it so a human brain can grasp it.
The mean of those six folds:

$$\bar{x} = \frac{0.82 + 0.79 + 0.91 + 0.85 + 0.78 + 0.88}{6} = \frac{5.03}{6} \approx 0.838$$
That 0.838 is a descriptive statistic. It tells you the average performance across those specific six folds — nothing more.
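A minimal sketch of that kind of summary, using only Python's standard `statistics` module. The helper name `describe` is an illustrative choice, not something from the post; every number it returns describes these six folds and nothing beyond them.

```python
import statistics

# Six cross-validation accuracy scores from the anchor dataset.
accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]


def describe(scores):
    """Return a small dictionary of descriptive statistics for the scores."""
    return {
        "n": len(scores),
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "min": min(scores),
        "max": max(scores),
        # statistics.stdev divides by n - 1 (sample standard deviation).
        "std": statistics.stdev(scores),
    }


summary = describe(accuracy)
print(f"n      = {summary['n']}")
print(f"mean   = {summary['mean']:.3f}")
print(f"median = {summary['median']:.3f}")
print(f"range  = {summary['min']:.2f} to {summary['max']:.2f}")
print(f"std    = {summary['std']:.3f}")
```

Reporting the median and range next to the mean is a cheap way to see whether a single fold is pulling the average around.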
Inferential Statistics: Generalizing Beyond Your Data
But you do not usually care only about those six specific folds. You care about how the model will perform on new, unseen data — data from the real world that was never in your training or test sets. Making a claim about that requires inferential statistics.
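The sample mean 0.838 is used to estimate the model's true generalization accuracy across all possible future test data. Because you only measured six folds, there is uncertainty. A 95% confidence interval quantifies that uncertainty:

$$\bar{x} \pm t_{0.975,\,n-1} \cdot \frac{s}{\sqrt{n}} = 0.838 \pm 2.571 \cdot \frac{0.051}{\sqrt{6}} \approx 0.838 \pm 0.054$$

That gives an interval of roughly 0.785 to 0.892.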
The sample mean is the bridge. You compute it (descriptive) and use it to estimate the population parameter (inferential). The inference can be no better than the descriptive work and the sampling behind it.
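The step from sample mean to interval can be sketched by hand with the standard library. This assumes a Student's t interval with n - 1 = 5 degrees of freedom; the critical value 2.571 is read from a t-table and hard-coded here as an assumption rather than computed.

```python
import math
import statistics

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

n = len(accuracy)
mean_acc = statistics.mean(accuracy)        # descriptive: sample mean
std_acc = statistics.stdev(accuracy)        # descriptive: sample std (divides by n - 1)
std_err = std_acc / math.sqrt(n)            # standard error of the mean

# Assumption: two-sided 95% critical value of Student's t with 5 degrees
# of freedom, taken from a t-table.
T_CRIT_95_DF5 = 2.571

margin = T_CRIT_95_DF5 * std_err            # half-width of the interval
lower, upper = mean_acc - margin, mean_acc + margin

print(f"mean = {mean_acc:.3f}, standard error = {std_err:.3f}")
print(f"95% CI = ({lower:.3f}, {upper:.3f})")
```

The first three quantities are descriptive. Only the last step, which attaches a margin based on how much the mean would vary across samples, is inferential.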
How the Two Connect
Population (all future test data)
↓ [sample with CV folds]
Sample Data: accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
↓ [Descriptive Statistics]
Summary: mean = 0.838, std = 0.051
↓ [Inferential Statistics]
Estimate: true generalization accuracy is roughly 0.785 to 0.892
Notation Differences
You will encounter different symbols depending on whether you are working with a population or a sample:
| Concept | Population | Sample |
|---|---|---|
| Mean | $\mu$ | $\bar{x}$ |
| Standard Deviation | $\sigma$ | $s$ |
| Size | $N$ | $n$ |
The population mean $\mu$ is the true generalization accuracy of the model: a fixed but unknown quantity. The sample mean $\bar{x}$ is an estimate that would change if you ran different CV folds. This randomness in sample statistics is what inferential methods model in order to quantify uncertainty.
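One way to see that the sample mean is itself a random quantity is to resample the six scores with replacement and recompute the mean each time. This is a rough sketch only: resampling the observed scores is a stand-in for rerunning CV with different folds, and the function name, the seed, and the number of resamples are illustrative assumptions.

```python
import random
import statistics

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]


def resampled_means(scores, n_resamples=1000, seed=0):
    """Mean of each bootstrap resample (drawn with replacement, same size)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        statistics.mean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    ]


means = resampled_means(accuracy)
print(f"observed sample mean:      {statistics.mean(accuracy):.3f}")
print(f"resampled means range:     {min(means):.3f} to {max(means):.3f}")
print(f"spread of resampled means: {statistics.stdev(means):.3f}")
```

The observed mean is one draw from a distribution of possible means. The population mean does not move; only the estimate does.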
When Each Approach Makes Sense
Use descriptive statistics when you need to summarize the data you have. A leaderboard showing CV scores for ten candidate models — that is descriptive. A training loss curve — descriptive. A confusion matrix — descriptive.
Use inferential statistics when you want to generalize. "Is Model A significantly better than Model B?" requires inference because your CV scores are a sample, and the difference you observe might be due to chance. A/B tests in production, hypothesis tests in research, confidence intervals on reported metrics — all inferential.
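A model comparison of that kind can be sketched as a paired t-test on per-fold differences. The Model B scores below are hypothetical, invented purely for illustration, and the critical value 2.571 (two-sided, 5% level, 5 degrees of freedom) is taken from a t-table as an assumption. The statistic is computed by hand so the sketch needs only the standard library.

```python
import math
import statistics

# Model A: the six CV scores used throughout the post.
model_a = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
# Model B: HYPOTHETICAL scores on the same six folds (illustration only).
model_b = [0.80, 0.78, 0.87, 0.84, 0.77, 0.85]


def paired_t_statistic(a, b):
    """t statistic for the mean of the per-fold differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff / std_err


t_stat = paired_t_statistic(model_a, model_b)

# Assumption: two-sided 5% critical value for n - 1 = 5 degrees of freedom.
T_CRIT_95_DF5 = 2.571
significant = abs(t_stat) > T_CRIT_95_DF5

print(f"paired t statistic: {t_stat:.3f}")
print(f"exceeds the 5% critical value: {significant}")
```

Pairing by fold matters because both models see the same splits, so the test looks at the consistency of the per-fold gap instead of the overall spread of scores. If SciPy is available, `scipy.stats.ttest_rel(model_a, model_b)` performs the same paired test and also returns a p-value.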
Python Example
import numpy as np
from scipy import stats

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

mean_acc = np.mean(accuracy)
std_acc = np.std(accuracy, ddof=1)  # ddof=1 gives the sample standard deviation (divides by n - 1)
n = len(accuracy)
print(f"Descriptive - Mean: {mean_acc:.3f}")
print(f"Descriptive - Std Dev: {std_acc:.3f}")

se = stats.sem(accuracy)  # standard error of the mean: std_acc / sqrt(n)
ci = stats.t.interval(0.95, df=n-1, loc=mean_acc, scale=se)  # 95% t-interval with n - 1 degrees of freedom
print(f"\nInferential - 95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")

Output:

Descriptive - Mean: 0.838
Descriptive - Std Dev: 0.051

Inferential - 95% CI: (0.785, 0.892)
The descriptive part summarizes the sample. The confidence interval is inferential: if you repeated this CV procedure many times with different random splits, roughly 95% of the intervals you constructed would contain the model's true generalization accuracy. Because CV folds share training data, their scores are not fully independent, so treat the interval as approximate.
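The "roughly 95% of intervals" reading can be checked with a small simulation. This is a sketch under strong assumptions: fold scores are drawn independently from a normal distribution, and the true accuracy (0.84) and fold-to-fold spread (0.05) are hypothetical values chosen for the demo. Real CV folds are not independent, so actual coverage can differ.

```python
import math
import random
import statistics

TRUE_ACCURACY = 0.84    # hypothetical population mean, known only in a simulation
TRUE_SPREAD = 0.05      # hypothetical fold-to-fold standard deviation
N_FOLDS = 6
T_CRIT_95_DF5 = 2.571   # assumption: t-table value, two-sided 95%, 5 df


def interval_covers_truth(rng):
    """Simulate one six-fold experiment and test whether its CI contains the truth."""
    scores = [rng.gauss(TRUE_ACCURACY, TRUE_SPREAD) for _ in range(N_FOLDS)]
    mean = statistics.mean(scores)
    margin = T_CRIT_95_DF5 * statistics.stdev(scores) / math.sqrt(N_FOLDS)
    return mean - margin <= TRUE_ACCURACY <= mean + margin


rng = random.Random(42)  # fixed seed so the run is repeatable
n_trials = 5000
coverage = sum(interval_covers_truth(rng) for _ in range(n_trials)) / n_trials
print(f"fraction of intervals containing the true value: {coverage:.3f}")
```

The 95% describes the long-run behavior of the procedure. Any single interval either contains the true accuracy or it does not.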
Related Concepts
The previous post introduced what statistics is. This post distinguishes its two branches. The population-sample distinction, which the next post covers in depth, is the foundation of inferential statistics. Without understanding what population you are trying to say something about, inferential results are uninterpretable. From there, central tendency and dispersion measures are the descriptive building blocks that inferential procedures like t-tests and confidence intervals are built on top of.
When This Framework Breaks Down
Inferential statistics assumes your sample is representative of the population. In ML, that means your CV folds must be drawn from the same distribution as future production data. If you train on 2020–2022 data and your CV folds are also from 2020–2022, but the model runs in 2025 on shifted data, no amount of correct inferential calculation will give you a reliable estimate of production performance. The statistical machinery is only as trustworthy as the sampling procedure.
Test Your Understanding
- You compute the mean validation accuracy across 5 CV folds and get 0.83. A teammate says "our model achieves 83% accuracy." Is that statement descriptive or inferential? What additional information would make it a complete inferential claim?
- Given accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88], compute the sample standard deviation by hand (divide by $n - 1$). What does this number tell you that the mean alone does not?
- Suppose you add a seventh CV fold with accuracy 0.60 (a bad fold due to a skewed class distribution). How does this affect the mean? Should you remove this fold before reporting? What would removing it imply about your inferential claims?
- Why does a 95% confidence interval become narrower when you increase from 6 CV folds to 30 CV folds? What property of inference does this illustrate?
If you are ready to go deeper, the next topic worth understanding is Population vs Sample — because almost everything in inferential statistics depends on how well you understand this distinction.