
What is Statistics

Apr 11, 2026 • 10 min read • By Mohammed Vasim

Tags: Statistics, Math, Data Science

Every time you pick a machine learning model based on cross-validation scores, you are using statistics. When you compare two models and say "Model A achieves 84% accuracy on average across five folds," you are summarizing data. When you then say "Model A is likely better than Model B in production," you are making an inference. Both of those acts — summarizing and inferring — are what statistics does.

The problem is that most introductions to statistics skip straight to formulas without explaining why those formulas exist. This post builds the conceptual foundation: what statistics is for, how it works, and why learning it changes how you think about data.

The Anchor Dataset

Throughout this series, every calculation will use the same six cross-validation accuracy scores from a classification model:

```python
accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
```

These represent six folds of cross-validation on a binary classifier. Each value is the fraction of test-set examples the model classified correctly in that fold. This is a small, realistic ML dataset — the kind you generate every time you call cross_val_score in scikit-learn.

Cross-validation scores were chosen deliberately: they have natural variation (0.78–0.91), a meaningful domain interpretation, no repeating values (which makes the mode discussion interesting later), and are small enough (n=6) to trace every calculation by hand. Every statistical concept in this series — mean, variance, t-test, confidence interval — applies directly to this dataset and produces answers a practitioner actually cares about.

What Statistics Actually Is

Statistics is the discipline of reasoning under uncertainty. Unlike pure mathematics with its exact proofs, statistics works with imperfect data and draws conclusions that are probably right, not definitely right. This is not a weakness — it is the whole point.

The concrete problem it solves: without statistics, you cannot tell whether one model's 84% accuracy is genuinely better than another's 82%, or just noise from which folds happened to be selected. Statistics gives you the tools to make that distinction clearly and honestly.

Two Branches

Statistics splits into two main areas, and the distinction matters for how you interpret any result.

Descriptive statistics summarizes what you have. No claims beyond the data you collected.

Applied to the anchor: "The mean accuracy across these 6 folds is 0.838. The lowest fold scored 0.78. The highest scored 0.91." That's it — no claim about what would happen with more folds or different test data.

Inferential statistics uses a sample to reason about a larger population. Applied to the anchor: "These 6 CV folds are a sample of how this model would perform across all possible test conditions. We can estimate the model's true generalization accuracy with 95% confidence as roughly [0.78, 0.90]." That is an inference — a claim beyond the data, with stated uncertainty.

Descriptive statistics tells you what the data says. Inferential statistics tells you what the data means beyond itself.

| Aspect | Descriptive | Inferential |
| --- | --- | --- |
| Goal | Summarize the data at hand | Reason about the larger population |
| Tools | Mean, SD, histogram, percentiles | t-tests, CIs, p-values, regression |
| Anchor result | x̄ = 0.838, range = 0.13 | 95% CI: [0.78, 0.90] |
| Claims beyond sample | No | Yes, with stated uncertainty |
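To make the two columns concrete, here is a minimal sketch that computes both anchor results with the standard library only. The critical value `T_CRIT` is copied from a standard t-table (two-sided 95%, 5 degrees of freedom), and the interval treats the six folds as independent draws, which is a simplification for CV folds.

```python
import math
import statistics

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
n = len(accuracy)

# Descriptive: statements about these six numbers and nothing else.
mean_acc = statistics.mean(accuracy)
acc_range = max(accuracy) - min(accuracy)

# Inferential: a 95% t-interval for the unknown mean accuracy.
# T_CRIT is the two-sided 95% critical value of Student's t with
# n - 1 = 5 degrees of freedom, taken from a t-table (an assumption
# of this sketch; a stats library would compute it for you).
T_CRIT = 2.571
sample_sd = statistics.stdev(accuracy)   # divides by n - 1
std_err = sample_sd / math.sqrt(n)
ci_low = mean_acc - T_CRIT * std_err
ci_high = mean_acc + T_CRIT * std_err

print(f"mean = {mean_acc:.3f}, range = {acc_range:.2f}")
print(f"95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

The interval comes out at about [0.785, 0.892], in line with the rough [0.78, 0.90] quoted above. The first print line makes no claim beyond the sample; the second one does, and carries its uncertainty with it.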

Types of Data

The statistical tools that apply to a dataset depend entirely on what kind of data you have. Reach for the wrong tool and the result is meaningless — or wrong in subtle ways.

Data falls into two top-level categories:

Categorical (Qualitative) — values that represent categories, not amounts.

Nominal: no natural ordering. Examples: error type (timeout / null_ptr / OOM), model architecture name (ResNet / ViT / LSTM). The only valid operations are frequency counts and mode — you cannot say "timeout is greater than null_ptr."
Ordinal: ordered categories, but the gaps between them are not equal. Example: alert severity (low / medium / high / critical). You can say high is worse than medium, but not by how much. Frequency, median, and mode are valid. Arithmetic mean is not — the average of "low" and "high" is not "medium" in any meaningful sense.

Numerical (Quantitative) — values that represent measurable amounts.

Discrete: countable values, no fractions between integers. Example: number of errors in a batch (0, 1, 2, 3…), number of estimators in a random forest. All arithmetic operations apply.
Continuous: any value in a range, including fractions. Example: accuracy (0.82, 0.7834…), inference latency (42.7 ms). All arithmetic plus density-based measures (histograms, KDE) apply.

One more distinction that matters for ML metrics — interval vs ratio scale:

Interval: equal gaps between values, but no true zero. Temperature in Celsius is interval: 40°C is 20° hotter than 20°C, but it is not "twice as hot" because 0°C is not "no temperature."
Ratio: equal gaps AND a true zero. Accuracy, response time, and error count are all ratio scale. A model with 0.88 accuracy is 1.07× as accurate as one at 0.82 (0.88 / 0.82 ≈ 1.07), and a latency of 0 ms means truly no latency.

This distinction matters because most ML metrics (accuracy, F1, loss, latency) are ratio scale. Ratios between values are meaningful, which is why saying "model A is 7% more accurate" (a ratio of 1.07) is a valid claim while "20°C is twice as cold as 40°C" is not.
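A quick way to check whether a ratio is meaningful is to change the unit and see whether the ratio survives. The sketch below does that for the temperature example, using the standard Celsius-to-Kelvin offset of 273.15; the variable names are just for illustration.

```python
# Ratio scale: accuracy has a true zero, so dividing two values is meaningful.
acc_a, acc_b = 0.88, 0.82
accuracy_ratio = acc_a / acc_b

# Interval scale: the zero of Celsius is arbitrary, so the ratio depends
# on the unit. Converting to Kelvin (true zero) changes it completely.
hot_c, cold_c = 40.0, 20.0
celsius_ratio = hot_c / cold_c
kelvin_ratio = (hot_c + 273.15) / (cold_c + 273.15)

# Differences, by contrast, are the same in both units.
celsius_diff = hot_c - cold_c
kelvin_diff = (hot_c + 273.15) - (cold_c + 273.15)

print(f"accuracy ratio: {accuracy_ratio:.3f}")
print(f"Celsius ratio:  {celsius_ratio:.3f}")
print(f"Kelvin ratio:   {kelvin_ratio:.3f}")
print(f"differences:    {celsius_diff:.1f} vs {kelvin_diff:.1f}")
```

The Celsius ratio is 2.0 while the Kelvin ratio is about 1.07, so "twice as hot" is an artifact of where the scale puts its zero. The 20-degree difference is identical in both units, which is exactly what interval scale guarantees.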

| Type | Sub-type | ML Example | Valid Operations |
| --- | --- | --- | --- |
| Categorical | Nominal | Model name, error type | Count, mode |
| Categorical | Ordinal | Severity level | Count, median, mode |
| Numerical | Discrete | Error count, n_estimators | All arithmetic |
| Numerical | Continuous | Accuracy, loss, latency | All arithmetic + density |

The anchor dataset — accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88] — is numerical, continuous, ratio scale. Every statistical operation in this series applies to it.
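The table of valid operations can be sketched in code. The `error_types` and `alerts` lists below are made-up illustration data, not part of the anchor dataset; only `accuracy` comes from this post.

```python
import statistics
from collections import Counter

# Nominal: only counts and the mode make sense.
error_types = ["timeout", "OOM", "timeout", "null_ptr", "timeout"]  # made-up
counts = Counter(error_types)
most_common_error = counts.most_common(1)[0][0]

# Ordinal: order is meaningful, so the median is valid. Map each label to
# its rank, take the median rank, and map back to a label. median_low
# always returns an actual data point, so the result is a real category.
SEVERITY_ORDER = ["low", "medium", "high", "critical"]
alerts = ["low", "high", "medium", "critical", "high"]  # made-up
ranks = [SEVERITY_ORDER.index(a) for a in alerts]
median_severity = SEVERITY_ORDER[statistics.median_low(ranks)]

# Numerical, continuous, ratio scale: the arithmetic mean is valid.
accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
mean_acc = statistics.mean(accuracy)

print(f"most common error: {most_common_error} ({counts[most_common_error]} of {len(error_types)})")
print(f"median severity:   {median_severity}")
print(f"mean accuracy:     {mean_acc:.3f}")
```

Note that the ordinal branch never averages the ranks. Averaging would silently assume the gaps between severity levels are equal, which is the assumption ordinal data does not support.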

The Statistical Process

Statistics is not just about calculations. The typical flow applies a sequence of steps, and skipping any of them leads to meaningless results:

1. Question formulation: "Is this model's CV accuracy reliably above 0.80?"
2. Data collection: 6 CV folds on a binary classifier → [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
3. Data organization: sorted array [0.78, 0.79, 0.82, 0.85, 0.88, 0.91], range = 0.13
4. Analysis: compute mean = 0.838, sample SD ≈ 0.051, run one-sample t-test against μ₀ = 0.80 (t ≈ 1.84, df = 5)
5. Interpretation: one-sided p ≈ 0.06, so H₀ is not rejected at α = 0.05. Six folds suggest the accuracy is above 0.80 but do not establish it
6. Presentation: report 95% CI [0.78, 0.90] with sample size and test details

Plenty of ML projects jump from step 2 straight to step 6, skipping the analysis and interpretation entirely. The result is a number without meaning.
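As a rough sketch of the organization, analysis, and interpretation steps, the code below runs a one-sample t-test by hand. The critical value is copied from a standard t-table (one-sided α = 0.05, 5 degrees of freedom), and the folds are treated as independent observations, which is an assumption of this sketch rather than a property of cross-validation.

```python
import math
import statistics

# Question and data collection
MU_0 = 0.80                                        # threshold from the question
accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]    # six CV folds

# Data organization
ordered = sorted(accuracy)
acc_range = ordered[-1] - ordered[0]

# Analysis: one-sample t statistic against MU_0
n = len(accuracy)
mean_acc = statistics.mean(accuracy)
sample_sd = statistics.stdev(accuracy)             # divides by n - 1
t_stat = (mean_acc - MU_0) / (sample_sd / math.sqrt(n))

# Interpretation: compare with the one-sided critical value for
# alpha = 0.05 and df = 5, taken from a t-table.
T_CRIT_ONE_SIDED = 2.015
reject_h0 = t_stat > T_CRIT_ONE_SIDED

print(f"sorted = {ordered}, range = {acc_range:.2f}")
print(f"mean = {mean_acc:.3f}, sample SD = {sample_sd:.3f}, t = {t_stat:.2f}")
print(f"reject H0 (accuracy <= {MU_0}) at alpha = 0.05? {reject_h0}")
```

The t statistic lands near 1.84, below the 2.015 cutoff, so with only six folds the data does not support a confident "above 0.80" claim. Jumping from the raw scores straight to a headline number would hide exactly this.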

Population and Sample

Two terms you will see throughout this series:

Population: all possible evaluation conditions — every possible test set, all future users, all future traffic this model will encounter. You almost never have this.
Sample: the data you actually have. Here, 6 CV folds.
Parameter: a number describing the population. The model's true generalization accuracy μ — unknown.
Statistic: a number computed from the sample. The sample mean x̄ = 0.838 — known.

The full treatment of population vs sample is its own post. For now, just carry this vocabulary: a statistic estimates a parameter, and inferential statistics is the machinery for making that estimation rigorous.
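A small simulation can make the statistic-versus-parameter vocabulary tangible. Everything here is hypothetical: we pretend to know the true accuracy (`TRUE_ACCURACY`), assume each fold has 100 test examples, and model each prediction as an independent coin flip. In real work the parameter is never visible like this.

```python
import random
import statistics

random.seed(0)            # fixed seed so the run is repeatable

TRUE_ACCURACY = 0.84      # the parameter mu: known here only because we made it up
FOLDS = 6                 # sample size, as in the anchor dataset
FOLD_SIZE = 100           # assumed number of test examples per fold


def simulate_fold_accuracy():
    """Fraction correct in one fold, each prediction right with prob. mu."""
    correct = sum(random.random() < TRUE_ACCURACY for _ in range(FOLD_SIZE))
    return correct / FOLD_SIZE


def simulate_sample_mean():
    """The statistic x-bar: mean accuracy over one set of six folds."""
    return statistics.mean([simulate_fold_accuracy() for _ in range(FOLDS)])


# Repeat the whole six-fold experiment many times.
sample_means = [simulate_sample_mean() for _ in range(500)]

print(f"parameter mu:            {TRUE_ACCURACY}")
print(f"first three x-bar values: {[round(m, 3) for m in sample_means[:3]]}")
print(f"average of all x-bars:    {statistics.mean(sample_means):.3f}")
print(f"spread of x-bars (SD):    {statistics.stdev(sample_means):.3f}")
```

Each individual x̄ misses μ by a little, in either direction, but the x̄ values cluster around it. Inferential statistics is the toolkit for saying how far a single x̄ is likely to be from μ when you only get to run the experiment once.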

Basic Terminology

Variable: something that can take different values (accuracy, loss)
Observation: a single record (one CV fold's accuracy score)
Dataset: a collection of observations (accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88])

Computing Basic Summaries

```python
accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

# Mean: sum of the scores divided by how many there are
mean_acc = sum(accuracy) / len(accuracy)

# Median: with an even count, average the two middle values of the sorted list
sorted_acc = sorted(accuracy)
mid = len(sorted_acc) // 2
median_acc = (sorted_acc[mid - 1] + sorted_acc[mid]) / 2

print(f"Mean accuracy:   {mean_acc:.3f}")
print(f"Median accuracy: {median_acc:.3f}")
print(f"Min accuracy:    {min(accuracy):.3f}")
print(f"Max accuracy:    {max(accuracy):.3f}")
```

```text
Mean accuracy:   0.838
Median accuracy: 0.835
Min accuracy:    0.780
Max accuracy:    0.910
```

The mean and median are close here, which suggests the accuracy scores are roughly symmetric — no single fold is dramatically pulling the average in one direction. You will quantify exactly how spread out these values are in the next post on dispersion.

Where This Goes Wrong

Reporting best-fold performance as model performance. If one CV fold out of six gives 0.91 accuracy and the others average 0.82, the mean over all folds is about 0.84, and that is the number to report, not 0.91. The maximum is a descriptive statistic too, just a misleading one to report in isolation.
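For the anchor dataset, the gap between the best fold and the mean is easy to put a number on. This is a small illustrative sketch; the variable names are not from any library.

```python
from statistics import mean

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

best_fold = max(accuracy)
all_folds_mean = mean(accuracy)

# Mean of the five folds left after setting the best one aside
remaining_mean = mean(sorted(accuracy)[:-1])

# How much the best fold overstates the all-fold mean
optimism = best_fold - all_folds_mean

print(f"best fold:          {best_fold:.3f}")
print(f"mean of all folds:  {all_folds_mean:.3f}")
print(f"mean of the others: {remaining_mean:.3f}")
print(f"overstatement:      {optimism:.3f}")
```

Quoting the best fold would overstate the all-fold mean by roughly seven percentage points on this data.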

More data is not always better. A biased sample of 100,000 observations is still biased. The quality of your CV setup — whether folds are stratified, whether there is data leakage between folds — matters more than raw quantity.

Statistics does not prove anything. It reveals patterns and quantifies uncertainty. Interpretation always matters.

Computing correct statistics on a broken setup. If your CV folds are not independent (for instance, time-series data without proper temporal splitting), the statistics you compute do not generalize to unseen data, regardless of how correctly you calculate them. The math can be perfect while the setup is broken.
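One common way to respect time order is an expanding-window split, where each test block comes strictly after all of its training data. The helper below, `expanding_window_splits`, is a hypothetical pure-Python sketch written for this post; scikit-learn ships `TimeSeriesSplit` for the same purpose, and its exact fold boundaries may differ from this sketch.

```python
def expanding_window_splits(n_samples, n_folds):
    """Yield (train_indices, test_indices) pairs in time order.

    Samples are assumed to be sorted by time. Every test index is later
    than every train index in the same split, so the model never sees
    the future during training.
    """
    fold_size = n_samples // (n_folds + 1)
    if fold_size == 0:
        raise ValueError("not enough samples for the requested number of folds")
    for k in range(1, n_folds + 1):
        train_end = k * fold_size
        test_end = train_end + fold_size
        yield list(range(train_end)), list(range(train_end, test_end))


splits = list(expanding_window_splits(n_samples=12, n_folds=3))
for train_idx, test_idx in splits:
    print(f"train 0..{train_idx[-1]}  ->  test {test_idx[0]}..{test_idx[-1]}")
```

With 12 samples and 3 folds, the training window grows from 3 to 6 to 9 samples while each test block is the next 3. Scores from splits like these can still be summarized with the same mean and standard deviation, but now they describe performance on data the model could not have seen.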

This post is the foundation for the entire descriptive statistics series. Measures of central tendency (mean, median, mode), dispersion (variance, standard deviation), and percentiles all build on the vocabulary and framing here. The anchor dataset — accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88] — carries through every post, so by the time you reach confidence intervals and hypothesis testing in the inferential series, every concept connects back to something you have already calculated by hand.

If you find any of this confusing right now, that is normal: the distinction between a population parameter and a sample statistic is genuinely subtle. It becomes concrete once you start computing confidence intervals and realize you are estimating an unknown μ from a known x̄.

Test Your Understanding

You run 10-fold cross-validation and get accuracy scores ranging from 0.74 to 0.93. What single number would you report as the model's accuracy, and why?
A colleague reports that their model achieves "95% accuracy" on a dataset of 1,000 examples where 950 belong to class A and 50 to class B. What descriptive statistic is missing from this report that would make it meaningful?
Given accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88], compute the range (max minus min). Does this range tell you whether the model's performance is consistent across folds? What additional measure would you need?
Two models each achieve a mean CV accuracy of 0.84. Model A's fold scores range from 0.81 to 0.87. Model B's range from 0.70 to 0.96. Which model would you trust more for production deployment, and why does descriptive statistics alone not answer this question completely?
The accuracy scores in this post are ratio-scale data. What would change about the valid operations if, instead of accuracy, you were analyzing severity levels (low / medium / high / critical) assigned to model errors?
