Statistics
Every time you pick a machine learning model based on cross-validation scores, you are using statistics. When you compare two models and say "Model A achieves 84% accuracy on average across six folds," you are summarizing data. When you then say "Model A is likely better than Model B in production," you are making an inference. Both of those acts, summarizing and inferring, are what statistics does.
The problem is that most introductions to statistics skip straight to formulas without explaining why those formulas exist. This post builds the conceptual foundation: what statistics is for, how it works, and why learning it changes how you think about data.
The Anchor Dataset
Throughout this series, every calculation will use the same six cross-validation accuracy scores from a classification model:
accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
These represent six folds of cross-validation on a binary classifier. Each value is the fraction of test-set examples the model classified correctly in that fold. This is a small, realistic ML dataset — the kind you generate every time you call cross_val_score in scikit-learn.
What Statistics Actually Is
Statistics is the practice of collecting, organizing, analyzing, interpreting, and presenting data. The goal is to extract meaningful insights from raw information — which sounds obvious, but the discipline gets nuanced quickly.
The key distinction worth sitting with: statistics lives in the territory of uncertainty. Unlike pure mathematics with its exact proofs, statistics works with imperfect data and draws conclusions that are probably right, not definitely right. This is not a weakness — it is the whole point. Real-world data is messy, and statistics gives you tools to reason clearly despite that mess.
Two Branches, One Goal
Statistics splits into two main areas:
Descriptive statistics summarizes and describes what you have. If you have six accuracy scores from cross-validation, descriptive statistics tells you the mean accuracy, how much it varies across folds, and what the distribution looks like. It does not try to generalize beyond the data you have.
Inferential statistics uses a sample to say something about a larger population. If those six CV folds are a sample of how the model would perform across all possible test data, inferential statistics lets you estimate the model's true generalization accuracy with some stated confidence.
Descriptive statistics tells you what the data says. Inferential statistics tells you what the data means beyond itself.
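To make the two branches concrete, here is a minimal sketch on the anchor dataset using only the standard library. The first half is descriptive (mean and sample standard deviation). The second half is inferential: an approximate 95% confidence interval for the mean. Two assumptions are mine, not part of the post: the fold scores are treated as independent, roughly normal draws (CV folds share training data, so this is optimistic), and the critical value 2.571 is taken from a standard Student's t table for 5 degrees of freedom.

```python
import math
import statistics

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

# Descriptive: summarize the six scores we actually observed.
mean_acc = statistics.mean(accuracy)
std_acc = statistics.stdev(accuracy)  # sample standard deviation (divides by n - 1)

# Inferential: estimate a plausible range for the unobserved "true" mean accuracy.
# Assumption: folds behave like independent, roughly normal draws.
T_CRIT_95_DF5 = 2.571  # two-sided 95% critical value, Student's t, 5 degrees of freedom
std_err = std_acc / math.sqrt(len(accuracy))
ci_low = mean_acc - T_CRIT_95_DF5 * std_err
ci_high = mean_acc + T_CRIT_95_DF5 * std_err

print(f"Mean: {mean_acc:.3f}, sample std: {std_acc:.3f}")
print(f"Approx. 95% CI for the mean: ({ci_low:.3f}, {ci_high:.3f})")
```

The mean and standard deviation only describe these six folds. The interval is a claim about data you have not seen, which is why it carries assumptions the descriptive numbers do not.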
The Process Matters
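Data analysis is not just about running calculations. The typical flow follows the stages named in the definition above: collect the data, organize it, analyze it, interpret the results, and present them.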
Skipping steps or doing them poorly leads to garbage. Plenty of ML projects jump straight to complex models without understanding the data first — it rarely ends well.
Basic Terminology You Will Need
- Data: Facts, numbers, or measurements you collect for analysis
- Variable: Something that can take different values (like accuracy or loss)
- Observation: A single record or measurement (one CV fold's accuracy score)
- Dataset: A collection of observations (accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
Where Statistics Shows Up in ML and DS
Cross-validation: You train a model six times on different splits and record accuracy each time. The mean of those six numbers is a descriptive statistic. The confidence interval around that mean is inferential statistics.
A/B testing models: You deploy two versions of a recommendation model and measure click-through rate. Statistical tests tell you whether the difference in rates is real or just noise. This is practical — the decision to ship or rollback depends on it.
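As a rough illustration of what "real or just noise" means for an A/B test, the sketch below runs a two-proportion z-test with the standard library. The click and impression counts are made up for this example, and the z-test is one common choice for comparing two rates, not necessarily the test you would use in production.

```python
import math
from statistics import NormalDist

# Hypothetical A/B results (invented for illustration): clicks and impressions.
clicks_a, views_a = 520, 10_000
clicks_b, views_b = 580, 10_000

rate_a = clicks_a / views_a
rate_b = clicks_b / views_b

# Pooled rate under the null hypothesis that both variants share one click-through rate.
pooled = (clicks_a + clicks_b) / (views_a + views_b)
std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))

# z-score of the observed difference, and its two-sided p-value.
z = (rate_b - rate_a) / std_err
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"CTR A: {rate_a:.3%}, CTR B: {rate_b:.3%}")
print(f"z = {z:.2f}, two-sided p-value = {p_value:.3f}")
```

With these invented counts, variant B looks better on paper, yet the p-value lands slightly above the conventional 0.05 threshold. That is exactly the situation where the ship-or-rollback decision needs a statistical argument rather than a glance at two percentages.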
Hyperparameter comparison: You tune learning rate and record validation loss across runs. Descriptive statistics summarizes those runs. Inferential statistics helps you decide whether one learning rate is genuinely better.
Reporting model performance: Saying "the model achieves 84.5% accuracy" is a descriptive claim. Saying "we expect the model to generalize to 83–86% accuracy on unseen data" is an inferential claim. Both are statistics.
Getting Started with Code
accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

# Mean: sum of the scores divided by the number of folds
mean_acc = sum(accuracy) / len(accuracy)

# Median: with six scores (an even count), average the two middle values
sorted_acc = sorted(accuracy)
median_acc = (sorted_acc[2] + sorted_acc[3]) / 2

print(f"Mean accuracy: {mean_acc:.3f}")
print(f"Median accuracy: {median_acc:.3f}")
print(f"Min accuracy: {min(accuracy):.3f}")
print(f"Max accuracy: {max(accuracy):.3f}")

Output:

Mean accuracy: 0.838
Median accuracy: 0.835
Min accuracy: 0.780
Max accuracy: 0.910
The mean and median are close here, which suggests the accuracy scores are roughly symmetric — no single fold is dramatically pulling the average in one direction.
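To see why a small gap between mean and median is informative, it helps to break the symmetry on purpose. In this sketch, the skewed list is hypothetical: it is the anchor dataset with the weakest fold replaced by a badly failed one (0.40). The median barely notices, while the mean gets dragged down.

```python
import statistics

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
# Hypothetical variant: the 0.78 fold is replaced by a fold that went badly wrong.
skewed = [0.82, 0.79, 0.91, 0.85, 0.40, 0.88]

for name, scores in [("original", accuracy), ("one bad fold", skewed)]:
    mean = statistics.mean(scores)
    median = statistics.median(scores)
    gap = abs(mean - median)
    print(f"{name}: mean={mean:.3f}, median={median:.3f}, gap={gap:.3f}")
```

A gap that grows like this is a cue to look at individual folds before trusting the average.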
The Shift in Thinking That Matters
When I first started learning statistics, I expected clean formulas and definitive answers. What actually happens is that you develop intuition for uncertainty. You learn to ask "how confident are we?" instead of "is this true?"
In ML, this shift is critical. A model that achieves 91% accuracy in one fold is not a "91% accurate model" — it is a model with one data point about its performance. The full picture requires understanding the distribution of that performance, not just the peak.
Common Mistakes
Reporting best-fold performance as model performance. If one CV fold gives 91% accuracy and the other five average about 82%, the model is not a 91% model. Its honest summary is the mean across all six folds, about 84%. Always report the mean (or median), not the maximum.
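A quick check on the anchor dataset shows how much the best fold overstates things. This is only a sketch; the variable names are mine.

```python
import statistics

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

best_fold = max(accuracy)
mean_all = statistics.mean(accuracy)
# Mean of the remaining folds once the single best one is set aside.
mean_rest = statistics.mean(sorted(accuracy)[:-1])

print(f"Best fold:         {best_fold:.3f}")
print(f"Mean of all folds: {mean_all:.3f}")
print(f"Mean without best: {mean_rest:.3f}")
print(f"Overstatement from reporting the best fold: {best_fold - mean_all:.3f}")
```

Reporting the best fold would inflate the headline number by roughly seven percentage points compared with the mean of all six folds.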
Assuming more data is always better. A biased sample of 100,000 observations is still biased. The quality of your CV setup (whether folds are stratified, whether there is data leakage) matters more than raw quantity.
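A small simulation can make the point about biased samples tangible. Everything here is hypothetical: a model that is right 90% of the time on "easy" inputs (80% of traffic) and 50% of the time on "hard" inputs (20% of traffic), for a true accuracy of 0.82. The large sample only ever sees easy inputs; the small one samples traffic fairly.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population parameters (invented for illustration).
P_EASY, ACC_EASY, ACC_HARD = 0.8, 0.9, 0.5
true_accuracy = P_EASY * ACC_EASY + (1 - P_EASY) * ACC_HARD  # about 0.82


def draw(only_easy: bool) -> int:
    """Return 1 if the model gets one sampled input right, else 0."""
    is_easy = only_easy or rng.random() < P_EASY
    return int(rng.random() < (ACC_EASY if is_easy else ACC_HARD))


# 100,000 observations, but collected only from easy inputs (biased).
biased_large = sum(draw(only_easy=True) for _ in range(100_000)) / 100_000
# 500 observations sampled from the real mix of easy and hard inputs.
fair_small = sum(draw(only_easy=False) for _ in range(500)) / 500

print(f"True accuracy:         {true_accuracy:.3f}")
print(f"Biased, n=100,000:     {biased_large:.3f}")
print(f"Representative, n=500: {fair_small:.3f}")
```

The large sample converges very precisely, but to the wrong number. The small representative sample is noisier yet centered on the right one.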
Treating statistics as proof. Statistics does not prove anything; it reveals patterns and quantifies uncertainty. Interpretation always matters, and that is where expertise and honest reporting are essential.
Related Concepts
This post sits at the beginning of the descriptive statistics series. Everything that follows — measures of central tendency, dispersion, percentiles, histograms — builds on the vocabulary and framing introduced here. Once you finish the descriptive series, inferential statistics (confidence intervals, hypothesis testing, regression) becomes much more accessible because you will already understand the distinction between a sample statistic and a population parameter.
When This Framework Breaks Down
Descriptive statistics only describes what you measured. If your CV folds are not independent (for instance, if you are working with time-series data without proper temporal splitting), the statistics you compute do not generalize to unseen data, regardless of how correctly you calculate them. The math can be perfect while the setup is broken. Always think about whether your data collection procedure supports the inferences you want to make.
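For the time-series case, "proper temporal splitting" means every training example comes before every test example. Below is a minimal sketch of one such scheme, an expanding window. The helper expanding_window_splits is hypothetical and written for this post; scikit-learn ships a TimeSeriesSplit class in sklearn.model_selection for the same purpose.

```python
from typing import Iterator, List, Tuple


def expanding_window_splits(
    n_samples: int, n_folds: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists where training always precedes testing."""
    fold_size = n_samples // (n_folds + 1)
    if fold_size == 0:
        raise ValueError("not enough samples for the requested number of folds")
    for k in range(1, n_folds + 1):
        train_end = k * fold_size
        # The last fold absorbs any leftover samples at the end of the series.
        test_end = n_samples if k == n_folds else train_end + fold_size
        yield list(range(train_end)), list(range(train_end, test_end))


# Assumes samples are already sorted by time, oldest first.
for train_idx, test_idx in expanding_window_splits(n_samples=12, n_folds=3):
    print(f"train={train_idx} test={test_idx}")
```

Each fold trains on a longer prefix of the series and tests on the block that follows it, so the model never sees the future during training.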
Test Your Understanding
- You run 10-fold cross-validation and get accuracy scores ranging from 0.74 to 0.93. What single number would you report as the model's accuracy, and why?
- A colleague reports that their model achieves "95% accuracy" on a dataset of 1,000 examples where 950 belong to class A and 50 to class B. What descriptive statistic is missing from this report that would make it meaningful?
- Given accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88], compute the range (max minus min). Does this range tell you whether the model's performance is consistent across folds? What additional measure would you need?
- Two models each achieve a mean CV accuracy of 0.84. Model A's fold scores range from 0.81 to 0.87. Model B's range from 0.70 to 0.96. Which model would you trust more for production deployment, and why does descriptive statistics alone not answer this question completely?
The next topic covers the two types of statistics, descriptive and inferential, and when you would use each.