t-Distribution

Apr 11, 2026 · 7 min read · By mohammed.vasim
Statistics · Math · Data Science

You are evaluating a new model variant on a small internal benchmark: 8 evaluation folds, because generating quality annotations for more is expensive. You do not know the true variance of evaluation scores across possible folds — you only have what your 8 folds tell you. If you use the Z-distribution and pretend you know the true variance, your confidence intervals will be too narrow, making you overconfident in the model's real-world performance. The t-distribution corrects exactly this mistake.

The t-distribution was derived by William Sealy Gosset, who worked at Guinness Brewery and published under the pseudonym "Student" — his employer did not want competitors knowing their statistical methods. He needed to make quality decisions from small batches of barley samples. The problem he solved is identical to yours: inference from small samples when you do not know the true population spread.

Why The Extra Uncertainty Matters

When you compute a Z-statistic with known $\sigma$, you get $Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} \sim N(0, 1)$ exactly. But when you replace $\sigma$ with the sample standard deviation $s$, you are replacing a fixed number with a random variable that itself has sampling error.

That extra source of uncertainty has to go somewhere. It inflates the tails of the sampling distribution. The t-distribution is what emerges from this: heavier tails than the normal, reflecting the honest admission that you do not know $\sigma$.

For the t-statistic:

$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$

where $s$ is the sample standard deviation. This follows a t-distribution with $n - 1$ degrees of freedom.
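To make "a random variable that itself has sampling error" concrete, here is a minimal simulation sketch. It assumes Normal fold scores with a hypothetical true mean of 0.83 and true spread $\sigma = 0.016$ (illustrative values, not estimates from any real benchmark) and looks at how far $s$ strays from $\sigma$ when $n = 8$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)   # fixed seed so the sketch is reproducible
n, sigma = 8, 0.016              # hypothetical sample size and true spread
n_sims = 100_000

# Draw many independent samples of size n and compute s for each one
samples = rng.normal(0.83, sigma, size=(n_sims, n))
s = samples.std(axis=1, ddof=1)

# How far does s typically land from the true sigma?
lo, hi = np.percentile(s / sigma, [5, 95])
print(f"s / sigma, 5th to 95th percentile: {lo:.2f} to {hi:.2f}")

# For Normal data, (n-1) * s^2 / sigma^2 follows a chi-square with n-1 df
lo_theory, hi_theory = np.sqrt(stats.chi2.ppf([0.05, 0.95], n - 1) / (n - 1))
print(f"chi-square theory:                 {lo_theory:.2f} to {hi_theory:.2f}")
```

With only 8 observations, $s$ can sit well below or well above $\sigma$, and the t-distribution's heavier tails account for exactly that spread.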

The Dataset

Suppose you have 8 cross-validation fold F1 scores for a classification model:

0.821, 0.847, 0.835, 0.812, 0.859, 0.828, 0.841, 0.816

Sample mean: $\bar{x} = 0.8324$

Sample std: $s = 0.0162$

You want to estimate the true expected F1 score with a 95% confidence interval.

Degrees of Freedom: Why $n - 1$?

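When you calculate sample variance:

$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

you lose one degree of freedom because the deviations from the mean must sum to zero:

$$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$$

Only $n - 1$ of these deviations are free to vary. The last one is determined by the others. For your 8 folds, $df = n - 1 = 7$.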
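The constraint is easy to verify numerically. The sketch below uses the eight fold scores from this post's Python section and checks that the last deviation is fully determined by the first seven, and that dividing by $n - 1$ matches NumPy's `ddof=1`.

```python
import numpy as np

fold_scores = np.array([0.821, 0.847, 0.835, 0.812, 0.859, 0.828, 0.841, 0.816])
n = len(fold_scores)

deviations = fold_scores - fold_scores.mean()
print(f"Sum of deviations: {deviations.sum():.2e}")  # zero up to floating-point rounding

# Knowing the first n-1 deviations forces the value of the last one
last_from_others = -deviations[:-1].sum()
print("Last deviation recovered from the others:",
      np.isclose(last_from_others, deviations[-1]))

# Dividing the sum of squares by n-1 is what ddof=1 does
s2_manual = (deviations ** 2).sum() / (n - 1)
print("Manual variance matches var(ddof=1):",
      np.isclose(s2_manual, fold_scores.var(ddof=1)))
```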

How Tails Change with Degrees of Freedom

With small $df$, the t-distribution has much heavier tails than the normal. Those heavier tails produce wider confidence intervals, appropriately reflecting that you have less information about $\sigma$.

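| df | t critical (95%, two-sided) | Z critical | Ratio t / Z |
|---|---|---|---|
| 3 | 3.182 | 1.96 | 1.62 |
| 5 | 2.571 | 1.96 | 1.31 |
| 7 | 2.365 | 1.96 | 1.21 |
| 10 | 2.228 | 1.96 | 1.14 |
| 30 | 2.042 | 1.96 | 1.04 |
| $\infty$ | 1.96 | 1.96 | 1.00 |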

For your 8 folds ($df = 7$), the critical value is 2.365, which is 21% larger than the Z critical value. Using the normal distribution would give you a confidence interval about 17% narrower than it should be ($1.96 / 2.365 \approx 0.83$).
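Critical values are one view of the heavier tails; tail probabilities are another. The sketch below compares the chance of landing more than 3 units from the center under the normal and under t-distributions with a few df values. The cutoff of 3 is an arbitrary choice for illustration.

```python
from scipy import stats

threshold = 3.0  # arbitrary "far out in the tail" cutoff, chosen for illustration

# Two-sided tail probability under the standard normal
p_normal = 2 * stats.norm.sf(threshold)
print(f"Normal: P(|Z| > {threshold}) = {p_normal:.4f}")

# Same tail probability under t-distributions with increasing df
tail_probs = {}
for df_val in [3, 7, 30]:
    tail_probs[df_val] = 2 * stats.t.sf(threshold, df_val)
    print(f"df={df_val:2d}: P(|t| > {threshold}) = {tail_probs[df_val]:.4f} "
          f"({tail_probs[df_val] / p_normal:.1f}x the normal)")
```

The tail mass shrinks toward the normal value as df grows, which is the same convergence the critical values show.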

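Figure: t-distribution densities for df = 3, 7, and 30 compared with the standard normal, plotted from -5 to 5, with the 95% critical values marked (t = 3.182 for df = 3, t = 2.365 for df = 7, t = 2.042 for df = 30, Z = 1.96). Heavier tails at low df push critical values outward: wider CIs, more honest uncertainty.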

Confidence Interval Calculation

For the 8 fold F1 scores:

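| Phase | Formula | Values | Result |
|---|---|---|---|
| Standard error | $SE = s / \sqrt{n}$ | $0.01619 / \sqrt{8}$ | $0.005726$ |
| Critical value | $t_{0.975,\, df}$ | df=7, 95% CI | $2.365$ |
| Margin of error | $ME = t \times SE$ | $2.365 \times 0.005726$ | $0.01354$ |
| Confidence interval | $\bar{x} \pm ME$ | $0.83238 \pm 0.01354$ | $[0.8188, 0.8459]$ |

If you incorrectly used Z: $ME = 1.96 \times 0.005726 = 0.01122$, giving $[0.8212, 0.8436]$, which is 17% too narrow and conveys false precision.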
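"Too narrow" has a measurable consequence: the interval misses the true mean more often than advertised. The following is a simulation sketch under stated assumptions: fold scores are drawn from a Normal distribution with a hypothetical true mean of 0.83 and spread of 0.016, and we count how often each kind of 95% interval actually contains the true mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)      # fixed seed for reproducibility
n, mu, sigma = 8, 0.83, 0.016       # hypothetical "true" population values
n_sims = 100_000

# Simulate many 8-fold evaluations
samples = rng.normal(mu, sigma, size=(n_sims, n))
x_bar = samples.mean(axis=1)
se = samples.std(axis=1, ddof=1) / np.sqrt(n)

t_crit = stats.t.ppf(0.975, n - 1)
z_crit = stats.norm.ppf(0.975)

# Fraction of simulated intervals that contain the true mean
cover_t = np.mean(np.abs(x_bar - mu) <= t_crit * se)
cover_z = np.mean(np.abs(x_bar - mu) <= z_crit * se)

print(f"Coverage of nominal 95% t interval: {cover_t:.3f}")
print(f"Coverage of nominal 95% Z interval: {cover_z:.3f}")
```

Under these assumptions the t interval should cover close to 95% of the time, while the Z interval built from $s$ falls noticeably short of its nominal level.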

The PDF and Key Properties

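The probability density function:

$$f(t) = \frac{\Gamma\left(\frac{\nu + 1}{2}\right)}{\sqrt{\nu \pi}\, \Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu + 1}{2}}$$

where $\nu$ is the degrees of freedom ($\nu = df$).

Key properties:

  • Symmetric around 0
  • Bell-shaped with heavier tails than normal
  • Shape depends only on degrees of freedom
  • As $\nu \to \infty$: converges to standard normal
  • Mean: $0$ (for $\nu > 1$)
  • Variance: $\frac{\nu}{\nu - 2}$ (for $\nu > 2$), always greater than 1

Special cases: at $\nu = 1$ the t-distribution becomes the Cauchy distribution, which has no mean or variance at all. At $\nu = 2$, variance is infinite. This is why very small samples ($n = 2$ or $n = 3$) produce extraordinarily wide confidence intervals: that width is telling the truth about how little you know.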
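The density and the properties in this section can be checked numerically. The sketch below writes the standard closed-form Student-t density by hand (the helper name `t_pdf` is illustrative, not a SciPy function) and compares it with `scipy.stats.t.pdf`, then confirms the Cauchy special case and the variance behavior.

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln

def t_pdf(t, nu):
    """Student-t density:
    Gamma((nu+1)/2) / (sqrt(nu*pi) * Gamma(nu/2)) * (1 + t^2/nu)^(-(nu+1)/2)
    computed on the log scale for numerical stability."""
    t = np.asarray(t, dtype=float)
    log_norm = gammaln((nu + 1) / 2) - gammaln(nu / 2) - 0.5 * np.log(nu * np.pi)
    return np.exp(log_norm - (nu + 1) / 2 * np.log1p(t ** 2 / nu))

grid = np.linspace(-5, 5, 11)

# Hand-written formula agrees with SciPy for several df values
for nu in [1, 7, 30]:
    assert np.allclose(t_pdf(grid, nu), stats.t.pdf(grid, nu))

# nu = 1 coincides with the standard Cauchy density
print("nu=1 matches Cauchy:", np.allclose(t_pdf(grid, 1), stats.cauchy.pdf(grid)))

# Variance is nu / (nu - 2) for nu > 2 and infinite at nu = 2
print("Variance at nu=7:", stats.t.var(7), "vs nu/(nu-2) =", 7 / (7 - 2))
print("Variance at nu=2:", stats.t.var(2))
```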

Python Code

```python
import numpy as np
from scipy import stats

# 8 cross-validation fold F1 scores
fold_scores = np.array([0.821, 0.847, 0.835, 0.812, 0.859, 0.828, 0.841, 0.816])
n = len(fold_scores)
df = n - 1

# Sample statistics (ddof=1 divides by n-1)
x_bar = fold_scores.mean()
s = fold_scores.std(ddof=1)
se = s / np.sqrt(n)

# Two-sided 95% critical values
t_crit = stats.t.ppf(0.975, df)
z_crit = stats.norm.ppf(0.975)

# Margins of error
me_t = t_crit * se
me_z = z_crit * se

print(f"n={n}, df={df}")
print(f"Mean F1: {x_bar:.4f}")
print(f"Sample std: {s:.4f}")
print(f"Standard error: {se:.5f}")
print(f"t critical (df={df}): {t_crit:.4f}")
print(f"Z critical:         {z_crit:.4f}")
print(f"95% CI using t: [{x_bar - me_t:.4f}, {x_bar + me_t:.4f}]")
print(f"95% CI using Z: [{x_bar - me_z:.4f}, {x_bar + me_z:.4f}]  <-- too narrow")

# Critical value comparison across df values
print("\nCritical value by df:")
for df_val in [3, 5, 7, 10, 30, 100]:
    t_val = stats.t.ppf(0.975, df_val)
    print(f"  df={df_val:3d}: t={t_val:.4f}  ratio to Z={t_val/z_crit:.4f}")
```

Output:

```
n=8, df=7
Mean F1: 0.8324
Sample std: 0.0162
Standard error: 0.00573
t critical (df=7): 2.3646
Z critical:         1.9600
95% CI using t: [0.8188, 0.8459]
95% CI using Z: [0.8212, 0.8436]  <-- too narrow

Critical value by df:
  df=  3: t=3.1824  ratio to Z=1.6237
  df=  5: t=2.5706  ratio to Z=1.3115
  df=  7: t=2.3646  ratio to Z=1.2065
  df= 10: t=2.2281  ratio to Z=1.1368
  df= 30: t=2.0423  ratio to Z=1.0420
  df=100: t=1.9840  ratio to Z=1.0122
```

Robustness and When It Fails

The t-test is more robust to non-normality than its derivation suggests:

  • Robust to: Moderate skewness (especially with balanced designs), slight non-normality with larger samples (a common rule of thumb is $n \geq 30$)
  • Not robust to: Severe outliers (they inflate $s$), extreme skewness with very small samples, large variance differences when comparing groups

For the 8 fold scores, if one fold produced F1 = 0.5 due to a data split issue, it would drag $s$ up sharply and widen the confidence interval to match. That is actually correct behavior, because a model with high variance across folds is genuinely uncertain.
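The effect of a single bad fold can be sketched directly. The helper `t_interval` below is a hypothetical convenience function written for this illustration, and replacing the first fold with 0.5 is an invented scenario rather than an observed result.

```python
import numpy as np
from scipy import stats

def t_interval(scores, confidence=0.95):
    """Return (mean, sample std, CI lower, CI upper) for a t-based interval."""
    scores = np.asarray(scores, dtype=float)
    n = len(scores)
    x_bar = scores.mean()
    s = scores.std(ddof=1)
    margin = stats.t.ppf((1 + confidence) / 2, n - 1) * s / np.sqrt(n)
    return x_bar, s, x_bar - margin, x_bar + margin

clean = np.array([0.821, 0.847, 0.835, 0.812, 0.859, 0.828, 0.841, 0.816])
with_outlier = clean.copy()
with_outlier[0] = 0.5  # hypothetical: pretend the first fold had a bad split

for label, scores in [("clean", clean), ("one bad fold", with_outlier)]:
    x_bar, s, lo, hi = t_interval(scores)
    print(f"{label:>12}: mean={x_bar:.4f}  s={s:.4f}  "
          f"CI=[{lo:.4f}, {hi:.4f}]  width={hi - lo:.4f}")
```

One outlying fold pulls the mean down and multiplies $s$ several times over, so the interval widens accordingly. Whether to keep or investigate that fold is a judgment call the interval alone cannot make.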

Beyond One Dimension

For multivariate problems, there is the multivariate t-distribution, used in robust regression and Bayesian inference with heavy-tailed priors. The intuition is the same: heavier tails than the multivariate Normal, to account for additional uncertainty in the covariance structure.

The t-distribution directly extends the Z-test (post 5) by replacing known $\sigma$ with estimated $s$, producing a distribution with heavier tails. The t-tests (post 7) apply this distribution to one-sample, two-sample, and paired designs. Welch's t-test (also in post 7) uses the Welch-Satterthwaite approximation to handle unequal variances, a further application of degree-of-freedom adjustments. The t-distribution is also the basis for t-based confidence intervals (post 11) and for the t-statistics reported in regression output, making it one of the most pervasive distributions in applied statistics.

Honest Limitations

The t-distribution assumes the underlying data are approximately Normal. More precisely, the test statistic follows a t-distribution when the data are Normal, and approximately so when the sample size is large enough for the CLT to apply. With 8 folds and severely skewed scores, consider non-parametric alternatives: the Wilcoxon signed-rank test for a one-sample comparison, or bootstrap confidence intervals, which do not rely on the Normality assumption.
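As a rough illustration of the bootstrap alternative, here is a minimal percentile-bootstrap sketch using plain NumPy on the same eight fold scores. The number of resamples and the seed are arbitrary choices, and the percentile method is only one of several bootstrap interval constructions.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed for reproducibility
fold_scores = np.array([0.821, 0.847, 0.835, 0.812, 0.859, 0.828, 0.841, 0.816])
n = len(fold_scores)
n_boot = 10_000                  # number of bootstrap resamples (arbitrary)

# Resample the folds with replacement and record the mean of each resample
idx = rng.integers(0, n, size=(n_boot, n))
boot_means = fold_scores[idx].mean(axis=1)

# Percentile interval: middle 95% of the bootstrap means
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"Percentile bootstrap 95% CI: [{ci_low:.4f}, {ci_high:.4f}]")
```

With only 8 data points the percentile bootstrap tends to run somewhat narrow, so it is better treated as a cross-check on the t interval than as a replacement for it.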

Test Your Understanding

  1. You add two more evaluation folds ($n = 10$ total). The sample mean and std remain the same. How does the 95% confidence interval width change, and why?
  2. A model achieves mean F1 = 0.85 with sample standard deviation $s$ on 5 folds. Using the t-distribution, construct a 95% CI. Would you report this model as having F1 above 0.80 with 95% confidence?
  3. Why is the variance of the t-distribution equal to $\frac{\nu}{\nu - 2}$ (with $\nu$ the degrees of freedom), which is always greater than 1 for any finite $\nu > 2$? What does this imply about the tails?
  4. Two research teams report the same sample mean and standard error for a model comparison, but Team A used fewer folds than Team B. Team A's 95% CI is wider. Is this a flaw in Team A's analysis or correct behavior?
  5. When is it acceptable to use Z instead of t for computing confidence intervals on model performance metrics, and what is the practical consequence of using Z when you should use t?
