
Types of Statistics

Apr 11, 2026 • 9 min read • By Mohammed Vasim

Statistics, Math, Data Science

When you train a model, you collect performance numbers. Then you have to decide: what do these numbers actually mean? That question splits into two fundamentally different acts, and statistics has a branch for each.

The first act is describing what you measured. The second is inferring something about what you did not measure. Getting these confused — treating descriptive results as inferential claims, or skipping description entirely — is one of the most common sources of misleading ML papers and engineering decisions.

The Anchor Dataset

Throughout this post, every example uses six cross-validation accuracy scores from a classifier:

```python
accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
```

These represent six folds of cross-validation. Each value is the fraction of test examples classified correctly in that fold.

Descriptive Statistics: Summarizing What You Have

When you summarize the six CV scores with a mean or a chart, you are doing descriptive statistics. You are not making any guesses beyond the data in front of you. You are organizing it so a human brain can grasp it.

The mean of those six folds:

$\bar{x} = \frac{0.82 + 0.79 + 0.91 + 0.85 + 0.78 + 0.88}{6} = \frac{5.03}{6} \approx 0.838$

That 0.838 is a descriptive statistic. It tells you the average performance across those specific six folds — nothing more. No claim is made about how the model would perform on a different set of test examples.
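As a quick check of the arithmetic above, here is a minimal sketch that computes the same mean alongside a few other descriptive summaries, using only Python's standard library. The extra summaries (median, range, sample standard deviation) are an illustrative choice, not something this post prescribes.

```python
import statistics

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

# Each value below describes these six folds only; none is a claim about unseen data.
summary = {
    "mean": statistics.mean(accuracy),
    "median": statistics.median(accuracy),
    "min": min(accuracy),
    "max": max(accuracy),
    "sample_std": statistics.stdev(accuracy),  # divides by n - 1
}

for name, value in summary.items():
    print(f"{name:>10}: {value:.3f}")
```

Every number printed here is a fact about the observed folds and would stay the same no matter what population you had in mind.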

Inferential Statistics: Generalizing Beyond Your Data

But you do not usually care only about those six specific folds. You care about how the model will perform on new, unseen data — data from the real world that was never in your training or test sets. Making a claim about that requires inferential statistics.

The sample mean 0.838 is used to estimate the model's true generalization accuracy across all possible future test data. Because you only measured six folds, there is uncertainty. A 95% confidence interval quantifies that uncertainty:

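$CI_{95\%} = \bar{x} \pm t^{*} \cdot \frac{s}{\sqrt{n}}$

Here $s$ is the sample standard deviation, $n$ is the number of folds, and $t^{*}$ is the critical value of the t-distribution with $n - 1$ degrees of freedom.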
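As a rough worked example, the six fold scores can be plugged into the standard t-interval, whose standard error is $s / \sqrt{n}$. The inputs are rounded: the sample standard deviation is $s \approx 0.0512$, there are $n = 6$ folds, and the critical value for 5 degrees of freedom is $t^{*} \approx 2.571$.

$\frac{s}{\sqrt{n}} \approx \frac{0.0512}{2.449} \approx 0.0209$

$CI_{95\%} \approx 0.8383 \pm 2.571 \cdot 0.0209 \approx 0.8383 \pm 0.0537 \approx (0.785, 0.892)$

The interval is wide relative to the differences people usually care about between models, which is what six observations buy you.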

The sample mean is the bridge. You compute it (descriptive) and use it to estimate the population parameter (inferential). The quality of the inference depends entirely on the quality of the descriptive work.

Notation Differences

You will encounter different symbols depending on whether you are working with a population or a sample:

Concept	Population	Sample
Mean	$μ$	$\bar{x}$
Standard Deviation	$σ$	$s$
Size	$N$	$n$

The population mean $μ$ is the true generalization accuracy of the model: a fixed but unknown quantity. The sample mean $\bar{x} = 0.838$ is an estimate that would change if you ran different CV folds. This randomness in sample statistics is exactly what allows us to quantify uncertainty in inferential work.
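To make the fixed-parameter versus varying-estimate distinction concrete, here is a small simulation sketch. The population values `mu = 0.84` and `sigma = 0.05` are hypothetical, chosen only for illustration, and fold scores are modeled as independent normal draws, which is a simplification.

```python
import numpy as np

# Hypothetical population of fold accuracies: mu and sigma are illustrative
# assumptions, not values estimated in this post.
mu, sigma, n = 0.84, 0.05, 6
rng = np.random.default_rng(42)

# Draw five independent samples of n fold scores and record each sample mean.
sample_means = [rng.normal(mu, sigma, size=n).mean() for _ in range(5)]

for i, m in enumerate(sample_means, start=1):
    print(f"sample {i}: x-bar = {m:.3f}  (mu stays fixed at {mu})")
```

Each run of six folds yields a different sample mean, while the population mean never moves. That spread of sample means is the quantity a confidence interval tries to capture.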

When the Distinction Gets Ignored

Three failure modes that show up constantly in ML engineering:

Reporting training accuracy as generalization. Training accuracy is a descriptive statistic about the training set: it measures how well the model fits data it was optimized on, not how well it will perform on data it has never seen. Quoting it as an unqualified claim ("our model achieves 91% accuracy") conflates the sample with the population.

No uncertainty on benchmark results. A model achieves 94.2% on a test set. Another achieves 93.8%. Which is better? Without inferential statistics — a confidence interval or significance test — you have two descriptive facts that cannot be meaningfully compared. The difference might be noise from a particular test set split.

Non-representative sample, population-level claim. A model evaluated only on data from one geographic region produces accurate descriptive statistics for that region. Treating those results as representative of global performance is an inferential error: the sample was not drawn from the population being claimed about.
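The second failure mode, comparing 94.2% with 93.8%, can be sketched numerically. The test-set size `n_test = 1000` is a hypothetical value, since none is given here, and the helper `accuracy_interval` is an illustrative name. The sketch uses the normal approximation to the binomial and treats the two evaluations as independent. If both models were scored on the same test examples, a paired method such as McNemar's test would be more appropriate.

```python
import math

def accuracy_interval(acc, n_test, z=1.96):
    """Normal-approximation 95% interval for an accuracy measured on n_test examples."""
    se = math.sqrt(acc * (1 - acc) / n_test)
    return acc - z * se, acc + z * se

n_test = 1000  # hypothetical test-set size; the post does not state one

ci_a = accuracy_interval(0.942, n_test)
ci_b = accuracy_interval(0.938, n_test)

# z-score for the difference, treating the two evaluations as independent
se_diff = math.sqrt(0.942 * 0.058 / n_test + 0.938 * 0.062 / n_test)
z_diff = (0.942 - 0.938) / se_diff

print(f"Model A: ({ci_a[0]:.3f}, {ci_a[1]:.3f})")
print(f"Model B: ({ci_b[0]:.3f}, {ci_b[1]:.3f})")
print(f"z for the difference: {z_diff:.2f}  (|z| < 1.96 means not significant at 5%)")
```

Under these assumptions the two intervals overlap heavily, so a 0.4-point gap on 1,000 examples is well within what noise alone could produce.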

Parametric vs Nonparametric

Within inferential statistics, there is a second axis: whether the method assumes the data follows a specific probability distribution.

Parametric methods assume the data follows a known distribution — usually normal — with a finite set of parameters (μ, σ). The test statistic is derived from those distributional assumptions. Examples: z-test, t-test, ANOVA, Pearson correlation. When assumptions hold, parametric methods extract more information from the same number of observations.

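Nonparametric methods do not assume a specific distributional form, although they still carry assumptions of their own (independent observations, and for the Wilcoxon signed-rank test, differences that are symmetric around the median). Most of them work on ranks or counts rather than raw values. Use them when normality is clearly violated, the sample is too small to check normality reliably (a common rule of thumb is n < 30), the data is ordinal, or there are severe outliers.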

Parametric Test	Nonparametric Equivalent	When to Switch
One-sample t-test	Wilcoxon signed-rank	Non-normal data, n < 30
Two-sample t-test	Mann-Whitney U	Non-normal data or outliers (for unequal variances alone, Welch's t-test)
One-way ANOVA	Kruskal-Wallis	Non-normal, multiple groups
Pearson r	Spearman ρ	Non-linear monotone relationship
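The last row of the table can be illustrated with a small synthetic example. The data below is made up for illustration: `y` grows exponentially with `x`, so the relationship is perfectly monotone but far from linear.

```python
import numpy as np
from scipy import stats

# Synthetic data (an illustration, not from the post): y grows exponentially with x,
# so the relationship is perfectly monotone but far from linear.
x = np.arange(1, 11)
y = np.exp(x)

pearson_r, _ = stats.pearsonr(x, y)
spearman_rho, _ = stats.spearmanr(x, y)

print(f"Pearson r:    {pearson_r:.3f}")
print(f"Spearman rho: {spearman_rho:.3f}")
```

Spearman's ρ works on ranks, so it reports a perfect monotone association, while Pearson's r is pulled well below 1 because it measures only the linear part of the relationship.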

The decision rule: check distributional assumptions first. If they hold, use the parametric method. If they fail, use the nonparametric equivalent.

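```python
from scipy import stats

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

# Shapiro-Wilk tests the null hypothesis that the data came from a normal distribution
stat, p = stats.shapiro(accuracy)
print(f"Shapiro-Wilk: W={stat:.4f}, p={p:.4f}")
```

```text
Shapiro-Wilk: W=0.9761, p=0.9319
```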

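With p well above 0.05, normality is not rejected, so parametric methods are a reasonable choice for this dataset. Keep in mind that with only six observations the test has little power to detect non-normality, so this is weak evidence rather than confirmation. If p were below 0.05, the Wilcoxon signed-rank test would replace the one-sample t-test.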
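The decision rule can be sketched as a small helper. The function name `one_sample_test`, the baseline of 0.80, and the one-sided alternative are assumptions made for illustration. Using Shapiro-Wilk as a gate follows the rule described in this section rather than any universal standard.

```python
import numpy as np
from scipy import stats

def one_sample_test(scores, baseline, alpha=0.05):
    """Hypothetical helper: pick a one-sided test based on a Shapiro-Wilk pre-check."""
    scores = np.asarray(scores, dtype=float)
    _, p_normal = stats.shapiro(scores)
    if p_normal >= alpha:
        # Normality not rejected: one-sample t-test on the mean
        result = stats.ttest_1samp(scores, popmean=baseline, alternative="greater")
        return "t-test", result.pvalue
    # Normality rejected: Wilcoxon signed-rank test on the differences from baseline
    result = stats.wilcoxon(scores - baseline, alternative="greater")
    return "Wilcoxon signed-rank", result.pvalue

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
name, p_value = one_sample_test(accuracy, baseline=0.80)
print(f"{name}: p = {p_value:.3f}")
```

For the anchor dataset the helper takes the t-test branch. Whichever branch runs, the returned p-value answers the same question: is the observed accuracy above the baseline by more than chance would explain?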

Levels of Measurement

The scale of your data determines which statistical methods are valid. Using arithmetic on ordinal data — computing the "average" severity level — is a category error that produces meaningless results.

Scale	Properties	ML Example	Valid Operations	Parametric Valid?
Nominal	Categories, no order	Error type, model name	Counts, mode, chi-square	No
Ordinal	Ordered, unequal gaps	Severity (low / med / high)	Median, percentiles, rank tests	No (on raw values)
Interval	Equal gaps, no true zero	Temperature (°C)	Mean, SD, t-tests	Yes
Ratio	Equal gaps + true zero	Accuracy, loss, latency	All operations	Yes

The accuracy dataset is ratio scale: 0 means genuinely zero accuracy, and 0.88 is 1.07× as accurate as 0.82. Most ML metrics are ratio scale, so the full suite of parametric methods is available for them whenever the distributional assumptions also hold.
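To see why the mean of ordinal codes is unreliable, consider the sketch below. The two alert groups and both numeric encodings are hypothetical. Each encoding preserves the order low < med < high < critical, yet they disagree about which group has the higher mean.

```python
import statistics

# Two hypothetical groups of alert severities (illustrative data, not from the post)
group_a = ["low", "low", "critical"]
group_b = ["med", "high", "high"]

# Two encodings that both preserve the order low < med < high < critical
encodings = {
    "1-2-3-4": {"low": 1, "med": 2, "high": 3, "critical": 4},
    "1-2-3-10": {"low": 1, "med": 2, "high": 3, "critical": 10},
}

results = {}
for label, codes in encodings.items():
    a = [codes[s] for s in group_a]
    b = [codes[s] for s in group_b]
    results[label] = {
        "mean_a": statistics.mean(a),
        "mean_b": statistics.mean(b),
        "median_a": statistics.median(a),
        "median_b": statistics.median(b),
    }
    r = results[label]
    print(
        f"{label}: mean A={r['mean_a']:.2f}, mean B={r['mean_b']:.2f} | "
        f"median A={r['median_a']}, median B={r['median_b']}"
    )
```

The ranking by mean flips when the arbitrary code for "critical" changes, while the ranking by median does not. The median depends only on order, which is the one thing an ordinal scale guarantees.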

Python Example: Descriptive and Inferential Together

```python
import numpy as np
from scipy import stats

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

# Descriptive: summarize the six observed folds
mean_acc = np.mean(accuracy)
std_acc = np.std(accuracy, ddof=1)  # ddof=1 gives the sample standard deviation
n = len(accuracy)

print(f"Descriptive — Mean:    {mean_acc:.3f}")
print(f"Descriptive — Std Dev: {std_acc:.3f}")

# Inferential: 95% t-interval for the mean, with standard error s / sqrt(n)
se = stats.sem(accuracy)
ci = stats.t.interval(0.95, df=n-1, loc=mean_acc, scale=se)
print(f"\nInferential — 95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
```

```text
Descriptive — Mean:    0.838
Descriptive — Std Dev: 0.051

Inferential — 95% CI: (0.785, 0.892)
```

The descriptive part summarizes the sample. The confidence interval is inferential: if you repeated this CV procedure many times with different random splits, roughly 95% of the intervals you constructed would contain the model's true generalization accuracy. That coverage is approximate here, because CV folds share training data and are therefore not fully independent.
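The "95% of intervals" reading can be checked with a simulation sketch. The true accuracy `mu = 0.84` and spread `sigma = 0.05` are hypothetical, and fold scores are simulated as independent normal draws, which real CV folds only approximate.

```python
import numpy as np
from scipy import stats

# Hypothetical "true" accuracy and fold-to-fold spread; both are assumptions
# chosen for illustration, and folds are simulated as independent normal draws.
mu, sigma, n, n_repeats = 0.84, 0.05, 6, 5000
rng = np.random.default_rng(0)

samples = rng.normal(mu, sigma, size=(n_repeats, n))
means = samples.mean(axis=1)
ses = samples.std(axis=1, ddof=1) / np.sqrt(n)
t_star = stats.t.ppf(0.975, df=n - 1)

# An interval "covers" when the fixed true value mu lies inside it
covered = (means - t_star * ses <= mu) & (mu <= means + t_star * ses)
print(f"Coverage over {n_repeats} repeats: {covered.mean():.3f}")
```

The printed coverage should land close to 0.95. The 95% describes the long-run behavior of the procedure, not the probability that any single interval contains the true value.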

Types of Statistics at a Glance

Type	Branch	Distributional Assumption	Examples	When to Use
Descriptive	Descriptive	None	Mean, SD, histogram	Summarizing observed data
Parametric	Inferential	Normal (usually)	z-test, t-test, ANOVA, Pearson r	Normal data, comparing means
Nonparametric	Inferential	No specific distribution	Mann-Whitney, Wilcoxon, Spearman ρ	Non-normal, ordinal, small n
Bayesian	Inferential	Prior distribution	Beta-Binomial, Bayesian ANOVA	Incorporating prior knowledge

Related Concepts

The previous post introduced statistics as reasoning under uncertainty. This post shows the full taxonomy: descriptive vs inferential, then parametric vs nonparametric within the inferential branch. The population-sample distinction, which the next post covers in depth, is the foundation of inferential statistics. Without understanding what population you are trying to reason about, inferential results are uninterpretable. Measures of central tendency and dispersion are the descriptive building blocks that parametric tests like t-tests are built on top of.

When This Framework Breaks Down

Inferential statistics assumes your sample is representative of the population. In ML, that means your CV folds must be drawn from the same distribution as future production data. If you train on 2020–2022 data and your CV folds are also from 2020–2022, but the model runs in 2025 on shifted data, no amount of correct inferential calculation will give you a reliable estimate of production performance. The statistical machinery is only as trustworthy as the sampling procedure.

Test Your Understanding

1. You compute the mean validation accuracy across 5 CV folds and get 0.83. A teammate says "our model achieves 83% accuracy." Is that statement descriptive or inferential? What additional information would make it a complete inferential claim?
2. Given accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88], the Shapiro-Wilk test returns p = 0.93. Would you use a one-sample t-test or the Wilcoxon signed-rank test to check if mean accuracy exceeds 0.80? Why?
3. A colleague reports model alert severity ratings (low / medium / high / critical) encoded as 1, 2, 3, 4 and computes the arithmetic mean. What level of measurement applies here, and what is wrong with computing the mean?
4. Why does a 95% confidence interval become narrower when you increase from 6 CV folds to 30 CV folds? What property of inference does this illustrate?
5. Two models each achieve a mean CV accuracy of 0.84. Without running a significance test, can you claim one is better than the other? What type of statistics would you need to support that claim, and why?
