
t-Test

Apr 11, 2026 • 9 min read • By Mohammed Vasim

Statistics, Math, Data Science

You have six CV fold accuracy scores and want to test whether the model genuinely exceeds the 80% baseline. Population σ is unknown. Use a t-test — the t-distribution accounts for the extra uncertainty from estimating σ with s, by using heavier tails than the Normal and larger critical values.

z-Test vs t-Test

Condition	Use
σ known, or n ≥ 30 (CLT)	z-test
σ unknown, n < 30	t-test

With n=6 CV folds and unknown population accuracy SD, the t-test is required. The t-distribution with df=5 has critical value 2.015 (one-tailed, α=0.05) vs z*=1.645 — wider, more honest about small-sample uncertainty.
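The gap between the two critical values can be checked directly. The sketch below assumes SciPy is installed and uses `stats.t.ppf` and `stats.norm.ppf`; the variable names are illustrative only.

```python
from scipy import stats

alpha = 0.05
df = 5  # n - 1 for six folds

# One-tailed critical values: the point with upper-tail area alpha
t_crit = stats.t.ppf(1 - alpha, df)
z_crit = stats.norm.ppf(1 - alpha)

# Upper-tail probability beyond the z critical value under each distribution
tail_t = stats.t.sf(z_crit, df)
tail_z = stats.norm.sf(z_crit)

print(f"t*({df}) = {t_crit:.3f}, z* = {z_crit:.3f}")
print(f"P(T > z*) = {tail_t:.3f} vs P(Z > z*) = {tail_z:.3f}")
```

The second line of output shows why z* is the wrong threshold for six folds: the t(5) tail beyond 1.645 holds more than 5% of the probability, so a z-based cutoff would reject too often.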

The Anchors

text

One-sample / Two-sample / Paired:
Model A accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]  x̄_A=0.838, s_A=0.0512
Model B accuracy = [0.84, 0.81, 0.93, 0.87, 0.80, 0.90]  x̄_B=0.858, s_B=0.0512
Same 6 folds used for both models (paired design)
μ₀ = 0.80  (baseline threshold)

Variant 1: One-Sample t-Test

Question: does the mean accuracy of Model A exceed the 80% baseline?

H₀: μ = 0.80 (no improvement over baseline)
H₁: μ > 0.80 (one-tailed: only improvement matters)

Step 1 — Compute x̄: x̄ = (0.82 + 0.79 + 0.91 + 0.85 + 0.78 + 0.88) / 6 = 5.03 / 6 = 0.838

Step 2 — Compute s:

xᵢ	xᵢ − x̄	(xᵢ − x̄)²
0.82	−0.0183	0.000336
0.79	−0.0483	0.002336
0.91	+0.0717	0.005136
0.85	+0.0117	0.000136
0.78	−0.0583	0.003403
0.88	+0.0417	0.001736

s = √(Σ(xᵢ−x̄)² / (n−1)) = √(0.013083 / 5) = √0.002617 = 0.05115 ≈ 0.0512

(NumPy's model_a.std(ddof=1) uses the same n−1 formula and returns the same value, s ≈ 0.0512.)

Step 3 — Standard error: SE = s / √n = 0.05115 / √6 = 0.05115 / 2.4495 = 0.02088

Step 4 — Test statistic: t = (x̄ − μ₀) / SE = (0.8383 − 0.80) / 0.02088 = 0.0383 / 0.02088 = 1.836

Step 5 — Degrees of freedom: df = n − 1 = 6 − 1 = 5

Step 6 — Critical value and p-value: t*(5, α=0.05, one-tailed) = 2.015; p = P(t(5) ≥ 1.836) = 0.063

Decision: t = 1.836 < t* = 2.015, and p = 0.063 > 0.05. Fail to reject H₀. The data do not provide sufficient evidence that Model A exceeds 80% at α=0.05.
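The six steps above can be reproduced in a few lines. This is a minimal sketch assuming NumPy and SciPy; it computes each quantity by hand and then cross-checks against `stats.ttest_1samp`, which the article also uses later.

```python
import numpy as np
from scipy import stats

model_a = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
mu0 = 0.80
n = model_a.size

x_bar = model_a.mean()                 # Step 1: sample mean
s = model_a.std(ddof=1)                # Step 2: sample SD (divides by n - 1)
se = s / np.sqrt(n)                    # Step 3: standard error of the mean
t_stat = (x_bar - mu0) / se            # Step 4: test statistic
df = n - 1                             # Step 5: degrees of freedom
t_crit = stats.t.ppf(0.95, df)         # Step 6: one-tailed critical value
p_one_tailed = stats.t.sf(t_stat, df)  # Step 6: P(T >= t) under H0

# Cross-check against SciPy's built-in one-sample test
t_ref, p_ref = stats.ttest_1samp(model_a, popmean=mu0, alternative="greater")

print(f"x̄={x_bar:.4f}, s={s:.4f}, SE={se:.5f}")
print(f"t={t_stat:.3f} (SciPy: {t_ref:.3f}), t*={t_crit:.3f}")
print(f"p={p_one_tailed:.3f} (SciPy: {p_ref:.3f})")
```

Working through the steps manually is mostly useful as a sanity check: if the hand-computed t and the library's t disagree, the usual culprit is `ddof` (NumPy's `std` defaults to dividing by n, not n − 1).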

Effect size: Cohen's d = (x̄ − μ₀) / s = (0.8383 − 0.80) / 0.05115 = 0.749, a medium-to-large effect by Cohen's benchmarks (0.5 medium, 0.8 large).

Failing to reject is not evidence that the effect is absent: with n=6 the test has low power. A d of about 0.75 with p=0.063 is consistent with a real improvement, but this experiment is too small to confirm it.
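To put a rough number on "underpowered", power for a one-tailed one-sample t-test can be computed from the noncentral t-distribution. The sketch below is illustrative: `one_sample_power` is a hypothetical helper, it assumes normally distributed scores, and it plugs in the observed Cohen's d as a stand-in for the unknown true effect, which is itself an assumption.

```python
import numpy as np
from scipy import stats

model_a = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
mu0 = 0.80

def one_sample_power(d, n, alpha=0.05):
    """Approximate power of a one-tailed (H1: mu > mu0) one-sample t-test.

    Assumes normal data and a true standardized effect size d.
    """
    df = n - 1
    t_crit = stats.t.ppf(1 - alpha, df)   # rejection threshold under H0
    noncentrality = d * np.sqrt(n)        # shift of the t statistic under H1
    return stats.nct.sf(t_crit, df, noncentrality)

# Observed effect size, used here as a guess at the true effect
d_obs = (model_a.mean() - mu0) / model_a.std(ddof=1)
powers = {n: one_sample_power(d_obs, n) for n in (6, 12, 20)}
for n, pw in powers.items():
    print(f"n={n:2d}: power ≈ {pw:.2f}")
```

If the true effect really were this size, six folds would detect it only about half the time, while a few more folds (or repeated CV) would raise that considerably.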

Variant 2: Two-Sample t-Test

Question: does Model B genuinely outperform Model A?

H₀: μ_A = μ_B (no difference between models)
H₁: μ_A ≠ μ_B (two-tailed: either model could be better)

Welch's t-Test (Default)

Welch's does NOT assume equal variances — always the safer default.

Test statistic: t = (x̄_A − x̄_B) / √(s_A²/n_A + s_B²/n_B)

Numerator: x̄_A − x̄_B = 0.838 − 0.858 = −0.020

Denominator (SE_diff), with s_A² = s_B² = 0.002617: √(s_A²/n_A + s_B²/n_B) = √(0.002617/6 + 0.002617/6) = √(0.000436 + 0.000436) = √0.000872 = 0.02953

t = −0.020 / 0.02953 = −0.677

Welch-Satterthwaite degrees of freedom:

df = (s_A²/n_A + s_B²/n_B)² / [(s_A²/n_A)²/(n_A−1) + (s_B²/n_B)²/(n_B−1)]

Substituting (since s_A = s_B in this anchor, the expression simplifies): df = (2 × 0.000436)² / (2 × 0.000436²/5) = (0.000872)² / (2 × 1.90×10⁻⁷/5) = 7.60×10⁻⁷ / 7.60×10⁻⁸ = 10

p-value (two-tailed, df=10): p = 2 × P(t(10) ≥ 0.677) = 0.514

Decision: p = 0.514 ≫ 0.05. Fail to reject H₀. The observed 2-percentage-point difference between the models is not statistically significant.

Cohen's d = (x̄_A − x̄_B) / s_pooled = −0.020 / 0.0512 = −0.391, a small-to-medium effect that this design (n=6 per model) is underpowered to detect.

Pooled t-Test (When Equal Variances Confirmed)

When equal variances are a reasonable assumption (a non-significant Levene's test, p > 0.05, is consistent with this but does not prove it), the pooled version is slightly more powerful:

s_p = √[(s_A²(n_A−1) + s_B²(n_B−1)) / (n_A+n_B−2)] = √[(0.002617×5 + 0.002617×5) / 10] = √0.002617 = 0.0512

t_pooled = (x̄_A − x̄_B) / (s_p × √(1/n_A + 1/n_B)) = −0.020 / (0.0512 × √(1/3)) = −0.020 / 0.02953 = −0.677 (identical to Welch's t here because s_A = s_B and n_A = n_B)

df = n_A + n_B − 2 = 10

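When variances differ, especially when the group sizes also differ, Welch's t and pooled t can diverge substantially. Welch's keeps the Type I error rate close to α whether or not the variances match, while the pooled test becomes anti-conservative (too many false positives) when the smaller group has the larger variance. Prefer Welch's by default.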
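A quick simulation makes the difference concrete. The setup below is hypothetical and not taken from the article's anchors: both groups share the same true mean (so H₀ is true), but the smaller group has three times the standard deviation. It relies on `stats.ttest_ind` accepting an `axis` argument so that each row is one simulated experiment.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, alpha = 5000, 0.05

# H0 is true: identical means. The small group has the larger spread.
small = rng.normal(loc=0.85, scale=0.09, size=(n_sims, 5))
large = rng.normal(loc=0.85, scale=0.03, size=(n_sims, 25))

# One test per row (axis=1): pooled vs Welch
_, p_pooled = stats.ttest_ind(small, large, axis=1, equal_var=True)
_, p_welch = stats.ttest_ind(small, large, axis=1, equal_var=False)

pooled_rate = np.mean(p_pooled < alpha)  # false-positive rate, pooled test
welch_rate = np.mean(p_welch < alpha)    # false-positive rate, Welch's test

print(f"Pooled Type I error: {pooled_rate:.3f}")
print(f"Welch  Type I error: {welch_rate:.3f}")
```

Under these assumed conditions the pooled test rejects a true H₀ far more often than the nominal 5%, while Welch's stays close to it. With equal group sizes the gap is much smaller, which is why the two tests agree on the article's anchors.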

Variant 3: Paired t-Test

Question: when both models use the same 6 folds, does fold pairing reveal a consistent difference?

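The paired test is more powerful than the two-sample test when observations are positively correlated within pairs, because the shared fold-to-fold variation cancels out. It computes the per-pair differences first, then runs a one-sample t-test on those differences against zero.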

Differences per fold:

Fold	Model A	Model B	d = B − A
1	0.82	0.84	+0.02
2	0.79	0.81	+0.02
3	0.91	0.93	+0.02
4	0.85	0.87	+0.02
5	0.78	0.80	+0.02
6	0.88	0.90	+0.02

d̄ = 0.020, s_d = 0.0000 (all differences are identical — perfect correlation)

In practice, s_d > 0. With the given anchors, all differences equal 0.02 exactly, yielding a degenerate case. The conceptual points hold: paired test uses d̄ and s_d instead of raw values.

t_paired = d̄ / (s_d / √n)

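For identical differences, s_d = 0 and the formula divides by zero: |t| → ∞ and p → 0. The SciPy run in this article reports t = −inf because ttest_rel(model_a, model_b) tests A − B, the opposite sign of d = B − A in the table. The test rejects because every fold shows the same improvement, but a zero-variance result like this is an artifact of the constructed anchors, not something to expect from real folds.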

Why paired beats two-sample here: the two-sample test ignores fold-level correlation, treating 12 observations as independent. The paired test sees that Model B beats A on every single fold — a far stronger signal.

Cohen's d for paired test: d = d̄ / s_d. Reflects within-pair consistency of the effect.
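Because the anchors give s_d = 0, the formula is easier to see on data where the differences vary. The sketch below uses a hypothetical `model_b_hyp` (not the article's Model B) whose per-fold gains are 0.01 or 0.03, and checks that the paired test is the same thing as a one-sample t-test on the differences.

```python
import numpy as np
from scipy import stats

model_a = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
# Hypothetical scores, NOT the article's anchors: chosen so that s_d > 0
model_b_hyp = np.array([0.85, 0.80, 0.92, 0.88, 0.81, 0.89])

diff = model_b_hyp - model_a           # per-fold differences, B - A
n = diff.size
d_bar = diff.mean()
s_d = diff.std(ddof=1)
t_manual = d_bar / (s_d / np.sqrt(n))  # t_paired = d̄ / (s_d / √n)
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Same test three ways: paired, one-sample on differences, and by hand
t_rel, p_rel = stats.ttest_rel(model_b_hyp, model_a)
t_one, p_one = stats.ttest_1samp(diff, popmean=0.0)

# Welch's test on the same numbers ignores the pairing
_, p_welch = stats.ttest_ind(model_b_hyp, model_a, equal_var=False)

print(f"d̄={d_bar:.4f}, s_d={s_d:.4f}, t={t_manual:.3f}")
print(f"paired p={p_rel:.4f}, Welch p={p_welch:.4f}")
```

On this made-up data the paired p-value is far smaller than the Welch p-value, for the same reason as in the degenerate case: fold-to-fold variation (about 0.05) swamps the 0.02 gain unless it is differenced away.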

Assumption Checking

python

import numpy as np
from scipy import stats

model_a = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
model_b = np.array([0.84, 0.81, 0.93, 0.87, 0.80, 0.90])
mu0 = 0.80

print("=== Assumption Checking ===")
# Normality (Shapiro-Wilk)
for name, data in [("Model A", model_a), ("Model B", model_b)]:
    w, p = stats.shapiro(data)
    print(f"Shapiro-Wilk {name}: W={w:.4f}, p={p:.4f} → {'normal' if p>0.05 else 'NON-NORMAL'}")

# Equal variances (Levene's test)
lev_stat, lev_p = stats.levene(model_a, model_b)
print(f"Levene's test: stat={lev_stat:.4f}, p={lev_p:.4f} → {'equal var' if lev_p>0.05 else 'unequal var'}")

print("\n=== One-Sample t-Test (H₁: μ > 0.80) ===")
t1, p1 = stats.ttest_1samp(model_a, popmean=mu0, alternative='greater')
print(f"t={t1:.4f}, df={len(model_a)-1}, p={p1:.4f}")
d1 = (model_a.mean() - mu0) / model_a.std(ddof=1)
print(f"Cohen's d = {d1:.4f}")
print(f"t* (df=5, α=0.05, one-tailed) = {stats.t.ppf(0.95, df=5):.4f}")
print(f"Decision: {'Reject H₀' if p1 < 0.05 else 'Fail to reject H₀'}")

print("\n=== Two-Sample Welch's t-Test (H₁: μ_A ≠ μ_B) ===")
t2, p2 = stats.ttest_ind(model_a, model_b, equal_var=False)
print(f"t={t2:.4f}, p={p2:.4f} (two-tailed)")

# Welch-Satterthwaite df manually
sa2, sb2 = model_a.var(ddof=1), model_b.var(ddof=1)
na, nb = len(model_a), len(model_b)
df_welch = (sa2/na + sb2/nb)**2 / ((sa2/na)**2/(na-1) + (sb2/nb)**2/(nb-1))
print(f"Welch df = {df_welch:.2f}")
sp = np.sqrt(((na-1)*sa2 + (nb-1)*sb2)/(na+nb-2))
d2 = abs(model_a.mean() - model_b.mean()) / sp
print(f"Cohen's d = {d2:.4f}")

print("\n=== Paired t-Test (H₁: μ_d ≠ 0) ===")
t3, p3 = stats.ttest_rel(model_a, model_b)
diff = model_b - model_a
print(f"Differences: {diff}")
print(f"d̄={diff.mean():.4f}, s_d={diff.std(ddof=1):.4f}")
print(f"t={t3:.4f}, df={len(model_a)-1}, p={p3:.4f}")
print(f"Decision: {'Reject H₀' if p3 < 0.05 else 'Fail to reject H₀'}")

print("\n=== One-tailed vs Two-tailed (One-sample) ===")
_, p_two = stats.ttest_1samp(model_a, popmean=mu0)
_, p_one = stats.ttest_1samp(model_a, popmean=mu0, alternative='greater')
print(f"Two-tailed: p={p_two:.4f}, t*=±{stats.t.ppf(0.975, df=5):.3f}")
print(f"One-tailed: p={p_one:.4f}, t*={stats.t.ppf(0.95, df=5):.3f}")

print("\n=== Nonparametric Alternatives ===")
# Wilcoxon signed-rank (one-sample alternative)
w_stat, w_p = stats.wilcoxon(model_a - mu0, alternative='greater')
print(f"Wilcoxon signed-rank (one-sample): W={w_stat:.0f}, p={w_p:.4f}")
# Mann-Whitney U (two-sample alternative)
u_stat, u_p = stats.mannwhitneyu(model_a, model_b, alternative='two-sided')
print(f"Mann-Whitney U (two-sample): U={u_stat:.0f}, p={u_p:.4f}")

text

=== Assumption Checking ===
Shapiro-Wilk Model A: W=0.9478, p=0.7256 → normal
Shapiro-Wilk Model B: W=0.9478, p=0.7256 → normal
Levene's test: stat=0.0000, p=1.0000 → equal var

=== One-Sample t-Test (H₁: μ > 0.80) ===
t=1.8356, df=5, p=0.0629
Cohen's d = 0.7494
t* (df=5, α=0.05, one-tailed) = 2.0150
Decision: Fail to reject H₀

=== Two-Sample Welch's t-Test (H₁: μ_A ≠ μ_B) ===
t=-0.6772, p=0.5136 (two-tailed)
Welch df = 10.00
Cohen's d = 0.3910

=== Paired t-Test (H₁: μ_d ≠ 0) ===
Differences: [0.02 0.02 0.02 0.02 0.02 0.02]
d̄=0.0200, s_d=0.0000
t=-inf, df=5, p=0.0000
Decision: Reject H₀

=== One-tailed vs Two-tailed (One-sample) ===
Two-tailed: p=0.1259, t*=±2.571
One-tailed: p=0.0629, t*=2.015

=== Nonparametric Alternatives ===
Wilcoxon signed-rank (one-sample): W=17, p=0.1094
Mann-Whitney U (two-sample): U=14, p=0.5887

Decision Flowchart

Choosing the right t-test variant:

One group compared to a fixed value → one-sample t-test
Two independent groups, σ unknown → Welch's t-test (default)
Two groups measured on the same units or matched pairs → paired t-test

When to use nonparametric alternatives:

Shapiro-Wilk p < 0.05 AND n < 30: use Wilcoxon signed-rank (one-sample, paired) or Mann-Whitney U (two-sample)
Otherwise: t-test is robust to mild non-normality for n ≥ 10
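The two lists above can be folded into a small helper. This is only a sketch of the flowchart as written here: `choose_test` and its design labels are hypothetical names, and a real analysis would also look at plots and the study design rather than a single Shapiro-Wilk p-value.

```python
def choose_test(design, shapiro_p, n, alpha=0.05):
    """Return the test suggested by the flowchart above.

    design:    "one-sample", "independent", or "paired" (hypothetical labels)
    shapiro_p: Shapiro-Wilk p-value for the data being tested
               (for a paired design, run it on the differences)
    n:         smallest sample size involved
    """
    parametric = {
        "one-sample": "one-sample t-test",
        "independent": "Welch's t-test",
        "paired": "paired t-test",
    }
    nonparametric = {
        "one-sample": "Wilcoxon signed-rank",
        "independent": "Mann-Whitney U",
        "paired": "Wilcoxon signed-rank",
    }
    if design not in parametric:
        raise ValueError(f"unknown design: {design!r}")
    # Normality rejected on a small sample: fall back to a rank-based test
    if shapiro_p < alpha and n < 30:
        return nonparametric[design]
    return parametric[design]


print(choose_test("one-sample", shapiro_p=0.73, n=6))
print(choose_test("independent", shapiro_p=0.01, n=12))
```

Keeping the rule in one function makes the threshold choices (α = 0.05, n < 30) explicit and easy to change if a different convention is preferred.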

Test Your Understanding

The one-sample t-test and the Wilcoxon signed-rank test are both valid tests on the same data, yet they return different p-values. Explain why they differ, and which one you would report and why.
The paired t-test gives p→0 because all differences equal exactly 0.02. Explain what would happen if fold 3 showed Model A=0.91 and Model B=0.89 (B worse on one fold). Compute d̄ and s_d for this modified data, and explain why p would be much larger.
The Welch-Satterthwaite df formula gave df=10 here because s_A=s_B. If Model B had s_B=0.08 (much more variable), compute the new Welch df and explain why it would be smaller than 10.
The two-sample test (p=0.514) fails to reject but the paired test (p→0) rejects. Both use the same data. Explain the source of the difference in terms of what each test treats as "noise" and what it treats as "signal."
You want to compare 3 model variants (A, B, C). A colleague suggests running three t-tests: A vs B, A vs C, B vs C. What is wrong with this approach? What should you use instead, and what does that method add?