
t-Distribution

Apr 11, 2026 • 12 min read • By Mohammed Vasim

Statistics · Math · Data Science

You have 8 F1 scores from an internal benchmark. You do not know the true variance of model performance across all possible evaluation folds — you only have what these 8 scores tell you. The t-distribution exists specifically to handle this situation honestly.

The Problem It Solves

When the data are Normal and σ is known, Z = (X̄ − μ) / (σ/√n) follows N(0,1) exactly. In practice, σ is unknown. You substitute S (the sample SD), giving:

T = (X̄ − μ) / (S/√n)

S is a random variable — it varies from sample to sample. That extra uncertainty has to go somewhere. It inflates the tails of the sampling distribution. The t-distribution captures this: heavier tails than the Normal, reflecting that you estimated σ from the same small data you are making inferences from.
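
A quick simulation makes the extra tail weight visible. This is a minimal sketch, assuming Normal data; the true mean and SD (`mu`, `sigma`), the seed, and the number of repetitions are illustrative choices, not values taken from the benchmark. It draws many samples of size 8 and compares how often |Z| and |T| exceed 1.96.

```python
import numpy as np

# Illustrative assumptions: true mean/SD, seed, and repetition count are arbitrary.
rng = np.random.default_rng(0)
mu, sigma, n, reps = 0.83, 0.016, 8, 200_000

samples = rng.normal(mu, sigma, size=(reps, n))
x_bar = samples.mean(axis=1)
s = samples.std(axis=1, ddof=1)          # sample SD, re-estimated in every sample

z = (x_bar - mu) / (sigma / np.sqrt(n))  # uses the known sigma
t = (x_bar - mu) / (s / np.sqrt(n))      # uses the estimated S

frac_z = np.mean(np.abs(z) > 1.96)
frac_t = np.mean(np.abs(t) > 1.96)
print(f"P(|Z| > 1.96) ~ {frac_z:.3f}")
print(f"P(|T| > 1.96) ~ {frac_t:.3f}")
```

Under these assumptions the Z fraction should land near 0.05 and the T fraction near 0.09: swapping σ for S is the only difference between the two statistics, and it nearly doubles the tail area.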

William Sealy Gosset derived this distribution in 1908 while working at Guinness Brewery. He published under the pseudonym "Student" to keep his employer's methods confidential. His problem was making quality decisions from small batches of barley, which is structurally the same as the small-n evaluation problem in ML.

The Anchor

python

f1_scores = [0.821, 0.847, 0.835, 0.812, 0.859, 0.828, 0.841, 0.816]
# n=8, x̄=0.832, s=0.0162, df=7

n=8 is intentional: at this sample size the t and Normal critical values differ enough to change conclusions (2.365 vs 1.960 for a two-tailed test at α=0.05).

The Formula

f(t; ν) = [Γ((ν+1)/2)] / [√(νπ) · Γ(ν/2)] · (1 + t²/ν)^(−(ν+1)/2)

where ν (nu) = degrees of freedom = n − 1.

Γ (Gamma function): a generalization of the factorial. Γ(n) = (n−1)! for positive integers n. Here it appears only in the normalizing constant, which ensures the density integrates to 1.

The tail-shaping term: (1 + t²/ν)^(−(ν+1)/2). Compare to the Normal's tail: e^(−t²/2). The Normal decays exponentially; the t-distribution decays polynomially. As |t| grows, the t-distribution's tail decays much slower — that is the heavier tail. Larger ν makes the polynomial decay faster, approaching Normal behavior.

The parameter: the standard t-distribution has ONE free parameter, ν. It is always centered at 0, and only its shape changes with df. Unlike the Normal (which has μ and σ), it has no location or scale parameter of its own; shifting and scaling happen in the statistic, through x̄ and the standard error.
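
The density can be written out term by term to check the formula. This is a minimal sketch: the helper name `t_pdf` is hypothetical, and SciPy's `stats.t.pdf` is used only as a reference value.

```python
import math

import numpy as np
from scipy import integrate, stats


def t_pdf(t, nu):
    """Density of the standard t-distribution, transcribed from the formula."""
    norm_const = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    tail_term = (1 + t**2 / nu) ** (-(nu + 1) / 2)
    return norm_const * tail_term


nu = 7
for t in [0.0, 1.0, 2.365, 5.0]:
    ratio = t_pdf(t, nu) / stats.norm.pdf(t)  # how much denser than N(0,1) at this point
    print(f"t={t:5.3f}  formula={t_pdf(t, nu):.6f}  "
          f"scipy={stats.t.pdf(t, df=nu):.6f}  ratio to Normal={ratio:.2f}")

# The Gamma-function constant should make the total area equal to 1.
area, _ = integrate.quad(t_pdf, -np.inf, np.inf, args=(nu,))
print(f"area under f(t; {nu}) = {area:.6f}")
```

The ratio column is where the polynomial decay shows up: it sits below 1 near the center and grows quickly as |t| increases.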

Degrees of Freedom — Why n−1

When you compute s = √[Σ(xᵢ − x̄)²/(n−1)], the denominator is n−1, not n. You estimated x̄ from the data, consuming one degree of freedom. The deviations (xᵢ − x̄) must sum to zero — so the last deviation is not free. Only n−1 deviations carry independent information.

For n=8 folds: df = 8 − 1 = 7.
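
The constraint on the deviations is easy to check on the anchor data. A small sketch using only NumPy; the variable names are illustrative.

```python
import numpy as np

f1_scores = np.array([0.821, 0.847, 0.835, 0.812, 0.859, 0.828, 0.841, 0.816])
deviations = f1_scores - f1_scores.mean()

# The deviations sum to zero (up to floating-point error) ...
print(f"sum of deviations = {deviations.sum():.2e}")

# ... so the last one can be rebuilt from the first n-1.
last_from_others = -deviations[:-1].sum()
print(f"last deviation = {deviations[-1]:.6f}, rebuilt = {last_from_others:.6f}")

# ddof=1 divides by n-1, ddof=0 divides by n.
s_n_minus_1 = f1_scores.std(ddof=1)
s_n = f1_scores.std(ddof=0)
print(f"s with n-1: {s_n_minus_1:.4f}, with n: {s_n:.4f}")
```

Dividing by n gives a smaller SD because it ignores the degree of freedom already spent on x̄.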

How Shape Changes with df

ν (df)	Sample size n	Tail weight	Notes
1	2	Heaviest	Cauchy distribution — no finite mean
2	3	Very heavy	Infinite variance
5	6	Noticeably heavy	Getting closer to Normal
7	8	Moderate (our anchor)	Var=1.4, 40% more than Normal
10	11	Moderate tails	Roughly Normal for most purposes
30	31	Nearly Normal	Very close
∞	∞	Normal tails	t → N(0,1) exactly
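
To put numbers on the "tail weight" column, one option is to compare the area beyond a fixed cutoff across df. A sketch assuming a cutoff of 3, which is an arbitrary illustrative choice:

```python
from scipy import stats

# Two-tailed area beyond +/- 3 for several df, next to the Normal value.
cutoff = 3.0
tail_by_df = {nu: 2 * stats.t.sf(cutoff, df=nu) for nu in [1, 2, 5, 7, 10, 30]}
normal_tail = 2 * stats.norm.sf(cutoff)

for nu, p in tail_by_df.items():
    print(f"df={nu:>2}: P(|T| > 3) = {p:.4f}  ({p / normal_tail:.1f}x the Normal)")
print(f"Normal: P(|Z| > 3) = {normal_tail:.4f}")
```

The area shrinks monotonically as df grows but stays above the Normal value at every finite df.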

Heavy Tails Quantified

How much heavier are t(df=7) tails vs Normal at the same critical values?

Critical value	Normal P(|Z| > z*)	t(df=7) P(|T| > z*)
z=1.645	10.0%	14.4%
z=1.960	5.0%	9.1%
z=2.365	1.8%	5.0%
z=2.576	1.0%	3.7%

Key message: with df=7, you need t=2.365 to achieve the same 5% tail probability that z=1.960 achieves under the Normal. If you used Normal critical values with df=7, your true Type I error rate would be 9.1%, not 5%.
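
The tail areas in this comparison can be recomputed directly. A minimal sketch using SciPy's survival function `sf` (which is 1 − CDF); the variable names are illustrative.

```python
from scipy import stats

df = 7
cutoffs = [1.645, 1.960, 2.365, 2.576]

# Doubling the upper-tail area gives the two-tailed area beyond +/- cutoff.
rows = [(c, 2 * stats.norm.sf(c), 2 * stats.t.sf(c, df=df)) for c in cutoffs]

print("cutoff  Normal   t(df=7)")
for c, p_norm, p_t in rows:
    print(f"{c:5.3f}   {p_norm:6.1%}   {p_t:6.1%}")
```

Reading across a row shows the inflation in Type I error you would incur by applying a Normal cutoff to a statistic that actually follows t(df=7).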

T-Statistic Computed on the Anchor

This is not a test yet — just computing the statistic to see where it lands on the t-distribution.

Step 1 — Compute x̄: x̄ = (0.821 + 0.847 + 0.835 + 0.812 + 0.859 + 0.828 + 0.841 + 0.816) / 8 = 6.659 / 8 = 0.832

Step 2 — Deviations from x̄:

fold	xᵢ	xᵢ − x̄	(xᵢ − x̄)²
1	0.821	−0.011	0.000121
2	0.847	+0.015	0.000225
3	0.835	+0.003	0.000009
4	0.812	−0.020	0.000400
5	0.859	+0.027	0.000729
6	0.828	−0.004	0.000016
7	0.841	+0.009	0.000081
8	0.816	−0.016	0.000256

Step 3 — Sample SD: s = √(Σ(xᵢ − x̄)² / (n−1)) = √(0.001837 / 7) = √0.0002624 = 0.0162

Step 4 — Standard error: SE = s / √n = 0.0162 / √8 = 0.0162 / 2.828 = 0.00573

Step 5 — T-statistic (vs μ₀ = 0.80): T = (x̄ − μ₀) / SE = (0.832 − 0.80) / 0.00573 = 0.032 / 0.00573 = 5.59

Carrying full precision instead of the rounded intermediate values (x̄ = 0.832375, SE = 0.005726) gives T ≈ 5.65. The gap from 5.59 is rounding, not a different method.

With df=7, P(T ≥ 5.59) < 0.001. This is deep in the right tail.
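
As a cross-check on the hand calculation, SciPy's `ttest_1samp` computes the same statistic in one call. This is a sketch; the `alternative` argument assumes SciPy 1.6 or newer. Because it works from unrounded x̄ and s, the statistic comes out slightly above the hand-rounded 5.59.

```python
from scipy import stats

f1_scores = [0.821, 0.847, 0.835, 0.812, 0.859, 0.828, 0.841, 0.816]

# One-sample statistic against mu_0 = 0.80; alternative="greater" gives the right-tail area.
result = stats.ttest_1samp(f1_scores, popmean=0.80, alternative="greater")
print(f"T = {result.statistic:.2f}, right-tail p = {result.pvalue:.5f}")
```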

Mean and Variance

E[T] = 0 for ν > 1. (Undefined for ν=1, the Cauchy — no finite mean.)

Var[T] = ν/(ν−2) for ν > 2. (Infinite for 1 < ν ≤ 2; undefined for ν ≤ 1.)

For the anchor (ν=7): Var[T] = 7/(7−2) = 7/5 = 1.4 — 40% more variance than the standard Normal (Var=1). As ν → ∞, Var → 1.
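
The variance formula can be compared against SciPy's built-in moments. A minimal sketch; `t_variance` is a hypothetical helper name, and it deliberately refuses ν ≤ 2, where the formula does not apply.

```python
from scipy import stats


def t_variance(nu):
    """Closed-form variance nu / (nu - 2); only valid for nu > 2."""
    if nu <= 2:
        raise ValueError("variance is not finite for nu <= 2")
    return nu / (nu - 2)


for nu in [3, 5, 7, 10, 30, 100]:
    # stats.t.var returns SciPy's value for the same quantity.
    print(f"df={nu:>3}: formula={t_variance(nu):.3f}  scipy={stats.t.var(nu):.3f}")
```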

Relationships to Other Distributions

1. t → N(0,1) as ν → ∞. At ν=30, for most practical purposes, t-table and z-table are nearly interchangeable.

2. T²(ν) ~ F(1, ν). The square of a t-statistic with ν degrees of freedom follows an F-distribution with parameters (1, ν). This connects t-tests to F-tests: a pooled two-sample t-test is equivalent to a one-way ANOVA with two groups.

3. ν=1 is the Cauchy distribution. No finite mean, no finite variance. Never make inferences from n=2 (df=1). The Cauchy has pathologically heavy tails and extreme sensitivity to single observations.
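
All three relationships can be checked numerically. A sketch using SciPy's `t`, `f`, `cauchy`, and `norm` distributions; the evaluation points are arbitrary illustrative choices.

```python
import numpy as np
from scipy import stats

nu, t_value = 7, 2.365

# Relationship 1: the 97.5% quantile approaches the Normal's 1.960 as nu grows.
for big_nu in [30, 300, 3000]:
    print(f"df={big_nu}: t* = {stats.t.ppf(0.975, df=big_nu):.4f}")

# Relationship 2: the two-tailed t area equals the upper-tail F(1, nu) area at t^2.
p_t = 2 * stats.t.sf(t_value, df=nu)
p_f = stats.f.sf(t_value**2, dfn=1, dfd=nu)
print(f"two-tailed t p-value = {p_t:.6f}, F(1, {nu}) p-value at t^2 = {p_f:.6f}")

# Relationship 3: t with df=1 has the same density as the standard Cauchy.
x = np.linspace(-5, 5, 11)
same_density = np.allclose(stats.t.pdf(x, df=1), stats.cauchy.pdf(x))
print(f"t(df=1) matches Cauchy: {same_density}")
```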

CDF and Critical Values

How to read a t-table: find the row for your df, the column for your α and tail choice.

For df=7:

α=0.05, one-tailed: t* = 1.895
α=0.05, two-tailed: t* = 2.365 (the CDF at t=2.365 equals 0.975, so 2.5% is in each tail)
α=0.01, two-tailed: t* = 3.499

T vs Z Critical Value Comparison

α	Tails	z*	t*(df=7)	t*(df=30)
0.10	one	1.282	1.415	1.310
0.05	one	1.645	1.895	1.697
0.05	two	1.960	2.365	2.042
0.01	two	2.576	3.499	2.750

Every t* is larger than z*. The difference is the cost of not knowing σ — and that cost shrinks as n grows.
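
The comparison table can be regenerated from the quantile functions rather than read off a printed t-table. A minimal sketch; the row settings mirror the table and the variable names are illustrative.

```python
from scipy import stats

settings = [(0.10, "one"), (0.05, "one"), (0.05, "two"), (0.01, "two")]

print("alpha  tails  z*     t*(df=7)  t*(df=30)")
table = []
for alpha, tails in settings:
    # A two-tailed test splits alpha across both tails.
    upper = 1 - (alpha if tails == "one" else alpha / 2)
    z_star = stats.norm.ppf(upper)
    t7 = stats.t.ppf(upper, df=7)
    t30 = stats.t.ppf(upper, df=30)
    table.append((z_star, t7, t30))
    print(f"{alpha:<5}  {tails:<5}  {z_star:.3f}  {t7:.3f}     {t30:.3f}")
```

Swapping other df values into the `stats.t.ppf` calls shows how quickly the gap to z* closes as n grows.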

Code

python

import numpy as np
from scipy import stats

f1_scores = [0.821, 0.847, 0.835, 0.812, 0.859, 0.828, 0.841, 0.816]
n = len(f1_scores)
x_bar = np.mean(f1_scores)
s = np.std(f1_scores, ddof=1)
df = n - 1
SE = s / np.sqrt(n)

print(f"n={n}, x̄={x_bar:.4f}, s={s:.4f}, SE={SE:.5f}, df={df}")

# T-statistic against mu_0 = 0.80
mu_0 = 0.80
T = (x_bar - mu_0) / SE
p_one_tail = 1 - stats.t.cdf(T, df=df)
p_two_tail = 2 * (1 - stats.t.cdf(abs(T), df=df))
print(f"T = {T:.3f}, p(one-tail) = {p_one_tail:.5f}, p(two-tail) = {p_two_tail:.5f}")

# Critical values for df=7
for alpha, tails in [(0.05, 'one'), (0.05, 'two'), (0.01, 'two')]:
    tail_prob = alpha if tails == 'one' else alpha/2
    t_crit = stats.t.ppf(1 - tail_prob, df=df)
    z_crit = stats.norm.ppf(1 - tail_prob)
    print(f"α={alpha} ({tails}-tailed): t*={t_crit:.3f} vs z*={z_crit:.3f} (difference: {t_crit - z_crit:.3f})")

# Variance of t-distribution at different df
print("\nVariance of t(df) vs N(0,1) [Var=1.0]:")
for nu in [2, 5, 7, 10, 30, 100]:
    if nu > 2:
        var_t = nu / (nu - 2)
        print(f"  df={nu}: Var={var_t:.3f}")

# Tail probability at z=1.960 for t(df=7) vs Normal
z = 1.960
p_normal = 2 * (1 - stats.norm.cdf(z))
p_t7 = 2 * (1 - stats.t.cdf(z, df=7))
print(f"\nP(|Z|>1.960) under Normal: {p_normal:.4f}")
print(f"P(|T|>1.960) under t(df=7): {p_t7:.4f}  ← actual Type I error if you used z*")

text

n=8, x̄=0.8323, s=0.0162, SE=0.00573, df=7
T = 5.654, p(one-tail) = 0.00039, p(two-tail) = 0.00077
α=0.05 (one-tailed): t*=1.895 vs z*=1.645 (difference: 0.250)
α=0.05 (two-tailed): t*=2.365 vs z*=1.960 (difference: 0.405)
α=0.01 (two-tailed): t*=3.499 vs z*=2.576 (difference: 0.924)

Variance of t(df) vs N(0,1) [Var=1.0]:
  df=5: Var=1.667
  df=7: Var=1.400
  df=10: Var=1.250
  df=30: Var=1.071
  df=100: Var=1.020

P(|Z|>1.960) under Normal: 0.0500
P(|T|>1.960) under t(df=7): 0.0908  ← actual Type I error if you used z*

Honest Limitations

The t-distribution assumes the underlying data are Normal. The statistic tolerates mild departures because X̄ is approximately Normal by the CLT for moderate n, but n=8 is too small to lean on that. With n=8 and severely skewed data (one fold producing F1=0.45 due to a data-split issue), the t-statistic does not follow t(df=7). In that case, bootstrap confidence intervals avoid the normality assumption and are a common alternative, though with only 8 observations they can be unreliable too.
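
For reference, a percentile bootstrap interval for the mean can be sketched in a few lines. This is one simple bootstrap variant, not the only one; the seed and the number of resamples are arbitrary illustrative choices, and the t-based interval is computed alongside for comparison.

```python
import numpy as np
from scipy import stats

f1_scores = np.array([0.821, 0.847, 0.835, 0.812, 0.859, 0.828, 0.841, 0.816])
n = len(f1_scores)

# Illustrative assumptions: seed and number of resamples are arbitrary choices.
rng = np.random.default_rng(0)
n_resamples = 10_000

# Resample the 8 scores with replacement and record each resample's mean.
idx = rng.integers(0, n, size=(n_resamples, n))
boot_means = f1_scores[idx].mean(axis=1)
boot_low, boot_high = np.percentile(boot_means, [2.5, 97.5])

# t-based 95% interval for comparison.
se = f1_scores.std(ddof=1) / np.sqrt(n)
t_star = stats.t.ppf(0.975, df=n - 1)
t_low = f1_scores.mean() - t_star * se
t_high = f1_scores.mean() + t_star * se

print(f"percentile bootstrap 95% CI: [{boot_low:.4f}, {boot_high:.4f}]")
print(f"t-based 95% CI:              [{t_low:.4f}, {t_high:.4f}]")
```

On well-behaved data like the anchor, the percentile interval tends to come out narrower than the t interval at n=8, which is a reminder that the bootstrap is not automatically the more conservative choice for very small samples.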

With df=1 (n=2), the t-distribution is the Cauchy: the mean and variance of T are undefined. Inference from n=2 is close to worthless: the two-tailed 5% critical value is t* = 12.706, so a 95% confidence interval spans about ±12.7 standard errors.

Test Your Understanding

1. For the anchor (df=7), P(|T| > 1.960) = 0.0908 but P(|T| > 2.365) = 0.050. If a research paper reports p=0.042 from a study with n=8, using a two-tailed t-test, what is the minimum T-statistic they must have observed? Is this consistent with T=5.59 from the anchor?
2. A colleague evaluates a model on n=4 folds and computes a 95% CI using the t-distribution. The CI is [0.72, 0.95]. Another colleague uses n=30 folds with the same sample mean and SD and gets [0.81, 0.86]. Explain the width difference quantitatively: what are the two reasons the n=4 CI is wider?
3. The relationship T²(ν) ~ F(1, ν) means a pooled two-sample t-test and a one-way ANOVA are equivalent for two groups. Verify the identity numerically: if T=5.59 with df=7, what F-statistic does this correspond to, and what are its degrees of freedom?
4. Var[T] = ν/(ν−2). As ν → ∞, Var → 1 (matching the Normal). For ν=2, Var is infinite. What does it mean practically for model evaluation that a study with n=3 folds (df=2) has infinite variance in its T-statistic?
5. You compute T=2.0 on an 11-fold evaluation (df=10). The two-tailed p-value is 0.073. Your colleague computes T=2.0 on a 31-fold evaluation (df=30). Their two-tailed p-value is 0.055. Same T, different p-values. Explain why, and what this implies about reporting T-statistics without p-values.