t-Distribution
You are evaluating a new model variant on a small internal benchmark: 8 evaluation folds, because generating quality annotations for more is expensive. You do not know the true variance of evaluation scores across possible folds — you only have what your 8 folds tell you. If you use the Z-distribution and pretend you know the true variance, your confidence intervals will be too narrow, making you overconfident in the model's real-world performance. The t-distribution corrects exactly this mistake.
The t-distribution was derived by William Sealy Gosset, who worked at Guinness Brewery and published under the pseudonym "Student" — his employer did not want competitors knowing their statistical methods. He needed to make quality decisions from small batches of barley samples. The problem he solved is identical to yours: inference from small samples when you do not know the true population spread.
Why The Extra Uncertainty Matters
When you compute a Z-statistic with known $\sigma$, you get $Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} \sim N(0, 1)$ exactly. But when you replace $\sigma$ with the sample standard deviation $s$, you are replacing a fixed number with a random variable that itself has sampling error.
That extra source of uncertainty has to go somewhere. It inflates the tails of the sampling distribution. The t-distribution is what emerges from this: heavier tails than the normal, reflecting the honest admission that you do not know $\sigma$.
For the t-statistic:
$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$
where $s$ is the sample standard deviation. This follows a t-distribution with $n - 1$ degrees of freedom.
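One way to see the cost of replacing $\sigma$ with $s$ is to simulate it. The sketch below draws many hypothetical 8-fold experiments from a Normal distribution (the values of `mu` and `sigma` are illustrative assumptions, not estimates from the benchmark) and counts how often a nominal 95% interval built with the z or the t critical value actually contains the true mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable
n, trials = 8, 100_000           # 8 folds per experiment, many simulated experiments
mu, sigma = 0.83, 0.016          # hypothetical "true" values, chosen for illustration

samples = rng.normal(mu, sigma, size=(trials, n))
x_bar = samples.mean(axis=1)
se = samples.std(axis=1, ddof=1) / np.sqrt(n)   # sigma replaced by the estimate s

z_crit = stats.norm.ppf(0.975)
t_crit = stats.t.ppf(0.975, n - 1)

# Fraction of simulated intervals that contain the true mean
cover_z = np.mean(np.abs(x_bar - mu) <= z_crit * se)
cover_t = np.mean(np.abs(x_bar - mu) <= t_crit * se)

print(f"coverage with z critical value: {cover_z:.3f}")
print(f"coverage with t critical value: {cover_t:.3f}")
```

With these settings the z-based coverage should come out near 0.91 and the t-based coverage near 0.95, which is the gap the heavier tails are there to close.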
The Dataset
Suppose you have 8 cross-validation fold F1 scores for a classification model:
0.821, 0.847, 0.835, 0.812, 0.859, 0.828, 0.841, 0.816
Sample mean: $\bar{x} = 0.8324$
Sample std: $s = 0.0162$
You want to estimate the true expected F1 score with a 95% confidence interval.
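Degrees of Freedom: Why $n - 1$?
When you calculate sample variance:
$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
you lose one degree of freedom because the deviations from the mean must sum to zero:
$$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$$
Only $n - 1$ of these deviations are free to vary. The last one is determined by the others. For your 8 folds, $df = 8 - 1 = 7$.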
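The constraint is easy to check numerically. A minimal sketch, using the fold scores from the Python section of this post, that confirms the deviations sum to zero, rebuilds the last deviation from the others, and matches the $n - 1$ divisor against NumPy's `ddof=1`:

```python
import numpy as np

fold_scores = np.array([0.821, 0.847, 0.835, 0.812, 0.859, 0.828, 0.841, 0.816])
n = len(fold_scores)
deviations = fold_scores - fold_scores.mean()

# The deviations are constrained to sum to zero (up to floating-point error)
print(f"sum of deviations: {deviations.sum():.2e}")

# So the last deviation is fully determined by the other n - 1
rebuilt_last = -deviations[:-1].sum()
print("last deviation recovered:", np.isclose(rebuilt_last, deviations[-1]))

# Dividing the squared deviations by n - 1 is what ddof=1 does
var_manual = (deviations ** 2).sum() / (n - 1)
print("matches var(ddof=1):", np.isclose(var_manual, fold_scores.var(ddof=1)))
```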
How Tails Change with Degrees of Freedom
With small $df$, the t-distribution has much heavier tails than the normal. Those heavier tails produce wider confidence intervals, appropriately reflecting that you have less information about $\sigma$.
| df | $t_{0.975,\,df}$ | $z_{0.975}$ | Ratio |
|---|---|---|---|
| 3 | 3.182 | 1.96 | 1.62 |
| 5 | 2.571 | 1.96 | 1.31 |
| 7 | 2.365 | 1.96 | 1.21 |
| 10 | 2.228 | 1.96 | 1.14 |
| 30 | 2.042 | 1.96 | 1.04 |
| $\infty$ | 1.96 | 1.96 | 1.00 |
For your 8 folds ($df = 7$), the critical value is 2.365, about 21% larger than the Z critical value of 1.96. Using the normal distribution would give you a confidence interval only $1.96 / 2.365 \approx 83\%$ as wide as it should be, that is, about 17% too narrow.
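Critical values are one view of the heavier tails. Another is the probability mass beyond ±1.96, the cutoff that leaves about 5% in the tails of a standard normal. A minimal sketch using SciPy's survival function `sf`:

```python
from scipy import stats

cutoff = 1.96
z_tail = 2 * stats.norm.sf(cutoff)  # two-sided tail mass under the standard normal

t_tails = {}
for df in [3, 5, 7, 10, 30]:
    # sf(x) = 1 - cdf(x); doubling gives the two-sided tail by symmetry
    t_tails[df] = 2 * stats.t.sf(cutoff, df)
    print(f"df={df:2d}: P(|T| > {cutoff}) = {t_tails[df]:.3f}   normal: {z_tail:.3f}")
```

For small df the tail mass sits well above the nominal 5%, and it shrinks toward the normal value as df grows.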
Confidence Interval Calculation
For the 8 fold F1 scores:
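| Phase | Formula | Values | Result |
|---|---|---|---|
| Standard error | $SE = s / \sqrt{n}$ | $0.016195 / \sqrt{8}$ | 0.005726 |
| Critical value | $t_{0.975,\,7}$ | df=7, 95% CI | 2.3646 |
| Margin of error | $ME = t \times SE$ | $2.3646 \times 0.005726$ | 0.01354 |
| Confidence interval | $\bar{x} \pm ME$ | $0.83238 \pm 0.01354$ | [0.8188, 0.8459] |
If you incorrectly used Z: $ME = 1.96 \times 0.005726 \approx 0.01122$, giving $[0.8212, 0.8436]$, which is about 17% too narrow and conveys false precision.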
The PDF and Key Properties
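The probability density function, with $\nu$ denoting the degrees of freedom:
$$f(t) = \frac{\Gamma\left(\frac{\nu + 1}{2}\right)}{\sqrt{\nu \pi}\, \Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu + 1}{2}}$$
Key properties:
- Symmetric around 0
- Bell-shaped with heavier tails than normal
- Shape depends only on degrees of freedom
- As $\nu \to \infty$: converges to standard normal
- Mean: $0$ (for $\nu > 1$)
- Variance: $\frac{\nu}{\nu - 2}$ (for $\nu > 2$), always greater than 1
Special cases: at $\nu = 1$ the t-distribution becomes the Cauchy distribution, which has no mean or variance at all. At $\nu = 2$, variance is infinite. This is why very small samples ($n = 2$ or $n = 3$) produce extraordinarily wide confidence intervals. That width is telling the truth about how little you know.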
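To make the density concrete, the sketch below writes out the standard Gamma-function form of the Student's t density by hand and checks it against `scipy.stats.t.pdf`. The helper name `t_pdf` and the symbol `nu` (degrees of freedom) are choices made for this example. It also compares SciPy's reported variance with $\nu / (\nu - 2)$ for a few values of $\nu > 2$.

```python
import numpy as np
from scipy import stats
from scipy.special import gamma

def t_pdf(t, nu):
    """Student's t density written out from the Gamma-function formula."""
    coef = gamma((nu + 1) / 2) / (np.sqrt(nu * np.pi) * gamma(nu / 2))
    return coef * (1 + t**2 / nu) ** (-(nu + 1) / 2)

t_grid = np.linspace(-4, 4, 9)
print("matches scipy:", np.allclose(t_pdf(t_grid, 7), stats.t.pdf(t_grid, 7)))

# Variance is nu / (nu - 2) for nu > 2 and approaches 1 as nu grows
for nu in [3, 7, 30]:
    print(f"nu={nu:2d}: scipy var={stats.t.var(nu):.4f}  nu/(nu-2)={nu / (nu - 2):.4f}")
```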
Python Code
```python
import numpy as np
from scipy import stats

# 8 cross-validation fold F1 scores
fold_scores = np.array([0.821, 0.847, 0.835, 0.812, 0.859, 0.828, 0.841, 0.816])
n = len(fold_scores)
df = n - 1
x_bar = fold_scores.mean()
s = fold_scores.std(ddof=1)  # ddof=1 divides by n - 1 (sample standard deviation)
se = s / np.sqrt(n)          # standard error of the mean

# Two-sided 95% critical values: 2.5% in each tail
t_crit = stats.t.ppf(0.975, df)
z_crit = stats.norm.ppf(0.975)
me_t = t_crit * se
me_z = z_crit * se

print(f"n={n}, df={df}")
print(f"Mean F1: {x_bar:.4f}")
print(f"Sample std: {s:.4f}")
print(f"Standard error: {se:.5f}")
print(f"t critical (df={df}): {t_crit:.4f}")
print(f"Z critical: {z_crit:.4f}")
print(f"95% CI using t: [{x_bar - me_t:.4f}, {x_bar + me_t:.4f}]")
print(f"95% CI using Z: [{x_bar - me_z:.4f}, {x_bar + me_z:.4f}]  <-- too narrow")

# Critical value comparison across df values
print("\nCritical value by df:")
for df_val in [3, 5, 7, 10, 30, 100]:
    t_val = stats.t.ppf(0.975, df_val)
    print(f"  df={df_val:3d}: t={t_val:.4f}  ratio to Z={t_val/z_crit:.4f}")
```

```
n=8, df=7
Mean F1: 0.8324
Sample std: 0.0162
Standard error: 0.00573
t critical (df=7): 2.3646
Z critical: 1.9600
95% CI using t: [0.8188, 0.8459]
95% CI using Z: [0.8212, 0.8436]  <-- too narrow

Critical value by df:
  df=  3: t=3.1824  ratio to Z=1.6237
  df=  5: t=2.5706  ratio to Z=1.3115
  df=  7: t=2.3646  ratio to Z=1.2065
  df= 10: t=2.2281  ratio to Z=1.1368
  df= 30: t=2.0423  ratio to Z=1.0420
  df=100: t=1.9840  ratio to Z=1.0122
```
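The t-based interval can also be obtained in a single call. This is a cross-check sketch using `scipy.stats.t.interval` and `scipy.stats.sem` (which defaults to `ddof=1`). The confidence level is passed positionally because its keyword name changed between SciPy versions.

```python
import numpy as np
from scipy import stats

fold_scores = np.array([0.821, 0.847, 0.835, 0.812, 0.859, 0.828, 0.841, 0.816])

# loc is the sample mean, scale is the standard error s / sqrt(n)
ci_low, ci_high = stats.t.interval(
    0.95,
    len(fold_scores) - 1,
    loc=fold_scores.mean(),
    scale=stats.sem(fold_scores),
)
print(f"95% CI using t: [{ci_low:.4f}, {ci_high:.4f}]")
```

This should agree with the interval built step by step from the critical value and the standard error.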
Robustness and When It Fails
The t-test is more robust to non-normality than its derivation suggests:
- Robust to: Moderate skewness (especially with balanced designs), slight non-normality when $n$ is reasonably large (a common rule of thumb is $n \geq 30$)
- Not robust to: Severe outliers (they inflate $s$), extreme skewness with very small samples, large variance differences when comparing groups
For the 8 fold scores, if one fold produced F1 = 0.5 due to a data split issue, it would drag $s$ up substantially and widen the confidence interval. That is actually correct behavior, because a model with high variance across folds is genuinely uncertain.
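The effect of a single bad fold can be sketched directly. Below, one of the 8 fold scores is replaced with 0.5. Which fold fails is an arbitrary choice for illustration, and `t_interval` is a hypothetical helper written for this example.

```python
import numpy as np
from scipy import stats

def t_interval(scores, confidence=0.95):
    """Return (sample std, lower, upper) for a t-based CI on the mean."""
    n = len(scores)
    s = scores.std(ddof=1)
    margin = stats.t.ppf(0.5 + confidence / 2, n - 1) * s / np.sqrt(n)
    return s, scores.mean() - margin, scores.mean() + margin

clean = np.array([0.821, 0.847, 0.835, 0.812, 0.859, 0.828, 0.841, 0.816])
broken = clean.copy()
broken[-1] = 0.5   # hypothetical bad data split on the last fold

for label, scores in [("clean", clean), ("one bad fold", broken)]:
    s, lo, hi = t_interval(scores)
    print(f"{label:>12}: s={s:.4f}  95% CI=[{lo:.4f}, {hi:.4f}]  width={hi - lo:.4f}")
```

The outlier inflates $s$ several times over, and the interval widens in proportion.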
Beyond One Dimension
For multivariate problems, there is the multivariate t-distribution, used in robust regression and Bayesian inference with heavy-tailed priors. The intuition is the same: heavier tails than the multivariate Normal, to account for additional uncertainty in the covariance structure.
Related Concepts
The t-distribution directly extends the Z-test (post 5) by replacing known $\sigma$ with estimated $s$, producing a distribution with heavier tails. The t-tests (post 7) apply this distribution to one-sample, two-sample, and paired designs. Welch's t-test (also in post 7) uses the Welch-Satterthwaite approximation to handle unequal variances, a further application of degree-of-freedom adjustments. The t-distribution is also the basis for t-based confidence intervals (post 11) and for the t-statistics reported in regression output, making it one of the most pervasive distributions in applied statistics.
Honest Limitations
Inference based on the t-distribution assumes the underlying data are approximately Normal. More precisely, it assumes the test statistic follows a t-distribution, which holds exactly when the data are Normal and approximately when the sample is large enough for the central limit theorem (CLT) to apply. With 8 folds and severely skewed scores, consider non-parametric alternatives: the Wilcoxon signed-rank test for a one-sample comparison, or bootstrap confidence intervals, which avoid the Normality assumption.
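As one illustration of the bootstrap alternative, here is a minimal percentile-bootstrap sketch in plain NumPy. The number of resamples and the seed are arbitrary choices, and the percentile method is only one of several bootstrap interval constructions.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable
fold_scores = np.array([0.821, 0.847, 0.835, 0.812, 0.859, 0.828, 0.841, 0.816])
n_boot = 10_000                  # number of bootstrap resamples (arbitrary)

# Resample the 8 folds with replacement and record the mean of each resample
idx = rng.integers(0, len(fold_scores), size=(n_boot, len(fold_scores)))
boot_means = fold_scores[idx].mean(axis=1)

# Percentile interval: middle 95% of the bootstrap means
boot_low, boot_high = np.percentile(boot_means, [2.5, 97.5])
print(f"95% percentile bootstrap CI: [{boot_low:.4f}, {boot_high:.4f}]")
```

With only 8 observations, a percentile bootstrap interval tends to come out somewhat narrower than the t interval, so it is better treated as a sanity check than as a replacement.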
Test Your Understanding
- You add two more evaluation folds ($n = 10$ total). The sample mean and std remain the same. How does the 95% confidence interval width change, and why?
- A model achieves mean F1 = 0.85 with sample standard deviation $s$ on 5 folds. Using the t-distribution, construct a 95% CI. Would you report this model as having F1 above 0.80 with 95% confidence?
- Why is the variance of the t-distribution equal to $\frac{\nu}{\nu - 2}$ (where $\nu$ is the degrees of freedom), which is always greater than 1 for any finite $\nu > 2$? What does this imply about the tails?
- Two research teams report the same sample mean and standard error for a model comparison, but Team A used fewer folds than Team B. Team A's 95% CI is wider. Is this a flaw in Team A's analysis or correct behavior?
- When is it acceptable to use Z instead of t for computing confidence intervals on model performance metrics, and what is the practical consequence of using Z when you should use t?