Hypothesis Testing

Apr 11, 2026 • 9 min read • By Mohammed Vasim

Statistics • Math • Data Science

A model achieves 83.8% mean CV accuracy. The current baseline is 80%. Is 83.8% genuinely better, or is that 3.8-percentage-point gap noise from six folds? You cannot answer this by staring at the numbers. Hypothesis testing forces precision: it asks how surprising the observed data would be if the model were actually no better than 80%.

The logic is indirect. You do not prove H₁ (the model is better). You try to disprove H₀ (the model equals baseline), and if you can, H₁ becomes the only credible remaining explanation.

The Anchor

Six CV fold accuracy scores:

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

x̄ = 0.838, s = 0.0512 (sample standard deviation, with n − 1 in the denominator), n = 6. Claimed baseline: μ₀ = 0.80.
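The summary statistics can be recomputed directly from the fold scores. A minimal sketch using NumPy: `ddof=1` selects the sample standard deviation (n − 1 in the denominator), which is the version the t-test needs, while NumPy's default `ddof=0` divides by n and comes out smaller.

```python
import numpy as np

accuracy = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])

n = accuracy.size
x_bar = accuracy.mean()
s_sample = accuracy.std(ddof=1)      # divides by n - 1 (used by the t-test)
s_population = accuracy.std(ddof=0)  # divides by n (NumPy's default)

print(f"n = {n}, mean = {x_bar:.4f}")
print(f"sample std (ddof=1)     = {s_sample:.4f}")
print(f"population std (ddof=0) = {s_population:.4f}")
```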

H₀ and H₁

State both before touching the data.

Null hypothesis (H₀): the true mean accuracy equals the baseline — any observed excess is random noise.

H₀: μ = 0.80

Alternative hypothesis (H₁): the true mean accuracy exceeds the baseline (one-tailed).

H₁: μ > 0.80

Why one-tailed here? A lower accuracy would not change the deployment decision — the model would just be rejected. You care only about whether the model exceeds baseline, not whether it matches or falls below. You must choose this before seeing results; choosing the direction after viewing data is p-hacking.

Significance level: α = 0.05 — the maximum false-positive rate you accept (reject a true H₀ no more than 5% of the time in repeated experiments).
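The "repeated experiments" reading of α can be checked by simulation. The sketch below generates many six-fold experiments in which H₀ is true and counts how often the one-tailed t-test rejects anyway. The normal distribution, the spread `sigma = 0.05`, the seed, and the number of experiments are illustrative assumptions, not values from the data above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
alpha = 0.05
mu0 = 0.80
sigma = 0.05                     # assumed fold-to-fold spread (illustrative)
n_folds = 6
n_experiments = 20_000

# Each row is one simulated experiment in which H0 is true (mean = mu0).
samples = rng.normal(loc=mu0, scale=sigma, size=(n_experiments, n_folds))

# One-sided one-sample t-test applied to every row at once.
result = stats.ttest_1samp(samples, popmean=mu0, axis=1, alternative="greater")
false_positive_rate = np.mean(result.pvalue < alpha)

print(f"Share of true-H0 experiments rejected: {false_positive_rate:.3f}")
```

The rejected share should land close to α, which is what "false-positive rate" means here.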

Test Statistic: Step by Step

With n=6 and unknown σ, use a one-sample t-test.

Formula: t = (x̄ − μ₀) / (s / √n)

Step 1, numerator (observed difference): x̄ − μ₀ = 0.83833 − 0.80 = 0.03833

Step 2, standard error: s / √n = 0.05115 / √6 = 0.05115 / 2.4495 = 0.02088

Step 3, test statistic: t = 0.03833 / 0.02088 = 1.836

Step 4, degrees of freedom: df = n − 1 = 6 − 1 = 5

The df formula n−1 reflects that estimating x̄ from the data consumes one degree of freedom — one piece of information is used for the estimate itself, leaving five free to estimate variability.
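The "one degree of freedom is consumed" idea can be seen numerically: deviations from the sample mean always sum to zero, so once five of them are known the sixth is determined. A small sketch on the fold scores:

```python
import numpy as np

accuracy = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
deviations = accuracy - accuracy.mean()

# Deviations from the sample mean sum to zero (up to rounding error),
# so the last one is fixed once the other five are known.
print(f"Sum of deviations: {deviations.sum():.2e}")
implied_last = -deviations[:-1].sum()
print(f"Last deviation:        {deviations[-1]:.6f}")
print(f"Implied by first five: {implied_last:.6f}")

# The sample variance therefore divides by n - 1 = 5 free deviations.
s_squared = (deviations ** 2).sum() / (accuracy.size - 1)
print(f"Sample variance: {s_squared:.6f}")
```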

Component	Formula	Substitution	Value
Numerator	x̄ − μ₀	0.83833 − 0.80	0.03833
Standard error	s/√n	0.05115/√6	0.02088
Test statistic	numerator/SE	0.03833/0.02088	1.836
Degrees of freedom	n−1	6−1	5
Critical value (α=0.05, one-tail)	t*(5, 0.05)	from t-table	2.015

Distribution, Rejection Region, and p-value

p-value: the probability of observing t ≥ 1.836 if H₀ were true (μ = 0.80).

p = P(t(5) ≥ 1.836) = 0.0629

What this means: if the true accuracy were exactly 0.80, there would be a 6.3% chance of a t-statistic of 1.836 or larger across 6 CV folds, that is, of a mean this far above baseline relative to the fold-to-fold spread. The observed result is unusual, but not unusual enough to cross the α=0.05 threshold.

Common misinterpretation: p = 0.0629 does NOT mean "there is a 6.29% probability that the model is no better than baseline." The p-value is not the probability that H₀ is true. In the frequentist framework H₀ is either true or false; it is a fixed fact about the world. The p-value is a property of the data under the assumption that H₀ is true.
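The p-value is a tail area of the t distribution, and it can be cross-checked by simulating a world in which H₀ holds. The sketch below computes it both ways. The simulation's normal distribution, `sigma = 0.05`, seed, and number of draws are illustrative assumptions.

```python
import numpy as np
from scipy import stats

accuracy = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
mu0 = 0.80
n = accuracy.size
df = n - 1

# Observed t-statistic, computed from the data.
t_obs = (accuracy.mean() - mu0) / (accuracy.std(ddof=1) / np.sqrt(n))

# Analytic p-value: upper-tail area of the t distribution with df = 5.
p_analytic = stats.t.sf(t_obs, df)

# Monte Carlo check: simulate experiments in which H0 is true and count how
# often the simulated t-statistic is at least as large as the observed one.
rng = np.random.default_rng(1)
sigma = 0.05  # assumed; the null distribution of t does not depend on it
null_samples = rng.normal(mu0, sigma, size=(200_000, n))
t_null = (null_samples.mean(axis=1) - mu0) / (
    null_samples.std(axis=1, ddof=1) / np.sqrt(n)
)
p_simulated = np.mean(t_null >= t_obs)

print(f"t_obs = {t_obs:.3f}")
print(f"p (analytic)  = {p_analytic:.4f}")
print(f"p (simulated) = {p_simulated:.4f}")
```

Both numbers describe how often data this extreme appear when H₀ is true. Neither says how likely H₀ is.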

Decision

Since p = 0.0629 > α = 0.05: fail to reject H₀.

"Fail to reject" — not "accept H₀." These are different. Accepting H₀ would claim the model performs exactly at baseline. Failing to reject means the data do not provide sufficient evidence against H₀ at this significance level. The model may well exceed 80% — the n=6 folds simply cannot confirm it.

Statistical significance ≠ practical significance: if you had 1000 folds and observed x̄ = 0.802 with a tiny SE, p would be < 0.001. The test would declare "statistically significant," but a 0.2pp improvement over baseline has no practical value. Always pair the decision with effect size.
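A quick numerical version of that scenario, as a sketch. The standard deviation `s = 0.01` is an assumed value chosen to give a "tiny SE"; it is not from this dataset.

```python
import numpy as np
from scipy import stats

# Hypothetical summary statistics (not from the post's data).
n = 1000
x_bar = 0.802
s = 0.01          # assumed fold-to-fold standard deviation
mu0 = 0.80

se = s / np.sqrt(n)
t_stat = (x_bar - mu0) / se
p_value = stats.t.sf(t_stat, df=n - 1)   # one-tailed
cohens_d = (x_bar - mu0) / s

print(f"t = {t_stat:.2f}, p = {p_value:.2e}")
print(f"Cohen's d = {cohens_d:.2f}")
print(f"Raw improvement = {(x_bar - mu0) * 100:.1f} percentage points")
```

The p-value is far below 0.001, yet the raw gain is still 0.2 percentage points. Whether that matters is a deployment question, not a statistical one.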

Effect Size: Cohen's d

Cohen's d = (x̄ − μ₀) / s = (0.83833 − 0.80) / 0.05115 = 0.03833 / 0.05115 = 0.749

Interpretation (Cohen's conventional benchmarks): small ≈ 0.2, medium ≈ 0.5, large ≈ 0.8.

d = 0.749 sits between the medium and large benchmarks, close to large. The effect size suggests a practically meaningful improvement over baseline. The failure to reject more likely reflects low power (n=6) than an absence of effect.

Statistical Power

Power = P(reject H₀ | H₁ is true) = 1 − β.

For this test: n=6, α=0.05, one-tailed, Cohen's d=0.749.

Non-centrality parameter: λ = d × √n = 0.7494 × √6 = 1.836 (the same number as the observed t, because t = d√n)

Power ≈ P(t(5, λ=1.836) > t*(5, α=0.05)) ≈ 0.48

With only 6 folds, there is roughly a 52% chance of missing an effect of this size even if it is real. To achieve 80% power for d=0.749 at α=0.05 (one-tailed): n ≥ 13 folds. Because d is itself estimated from the same six folds, treat these power figures as a rough guide. A significant result with low power is fragile: it either detected a real but imprecisely estimated effect, or it is a false positive from random variation.
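The required number of folds can be found by scanning n until the power reaches the target. A sketch using SciPy's non-central t distribution; `one_sided_power` is a hypothetical helper written for this example, and treating the observed d as the true effect size is an assumption.

```python
import numpy as np
from scipy import stats

accuracy = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
mu0 = 0.80
alpha = 0.05
target_power = 0.80

# Observed effect size, treated here as if it were the true effect (assumption).
d = (accuracy.mean() - mu0) / accuracy.std(ddof=1)


def one_sided_power(n: int, d: float, alpha: float) -> float:
    """Power of a one-sided, one-sample t-test with n observations."""
    df = n - 1
    t_crit = stats.t.ppf(1 - alpha, df)
    ncp = d * np.sqrt(n)                      # non-centrality parameter
    return float(stats.nct.sf(t_crit, df, ncp))


power_now = one_sided_power(accuracy.size, d, alpha)

# Smallest n whose power reaches the target (search capped at 200 folds).
n_required = next(
    n for n in range(2, 201) if one_sided_power(n, d, alpha) >= target_power
)

print(f"d = {d:.3f}")
print(f"power at n = {accuracy.size}: {power_now:.2f}")
print(f"smallest n with power >= {target_power}: {n_required}")
```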

Assumption Checking

python

import numpy as np
from scipy import stats

accuracy = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
mu0 = 0.80

# One-sample t-test
t_stat, p_value = stats.ttest_1samp(accuracy, popmean=mu0, alternative='greater')
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value (one-tailed): {p_value:.4f}")
print(f"Degrees of freedom: {len(accuracy)-1}")

# Normality assumption
stat_sw, p_sw = stats.shapiro(accuracy)
print(f"\nShapiro-Wilk: W={stat_sw:.4f}, p={p_sw:.4f}")
if p_sw > 0.05:
    print("p > 0.05: fail to reject normality — assumption is tenable")
else:
    print("p < 0.05: normality assumption violated — use Wilcoxon instead")

# Effect size
x_bar = accuracy.mean()
s = accuracy.std(ddof=1)
cohens_d = (x_bar - mu0) / s
print(f"\nCohen's d: {cohens_d:.4f} (near-large effect)")

# Manual calculation
n = len(accuracy)
se = s / np.sqrt(n)
t_manual = (x_bar - mu0) / se
print(f"\nManual t: {t_manual:.4f}")
print(f"One-tailed critical value t*(5, 0.05): {stats.t.ppf(0.95, df=n-1):.4f}")

# Power analysis
from scipy.stats import nct
ncp = cohens_d * np.sqrt(n)  # non-centrality parameter
t_crit = stats.t.ppf(0.95, df=n-1)
power = 1 - nct.cdf(t_crit, df=n-1, nc=ncp)
print(f"\nPower: {power:.4f}  (probability of detecting this effect with n={n})")
print(f"Non-centrality parameter λ: {ncp:.4f}")

text

t-statistic: 1.8356
p-value (one-tailed): 0.0629
Degrees of freedom: 5

Shapiro-Wilk: W=0.9502, p=0.7419
p > 0.05: fail to reject normality — assumption is tenable

Cohen's d: 0.7494 (near-large effect)

Manual t: 1.8356
One-tailed critical value t*(5, 0.05): 2.0150

Power: 0.4758  (probability of detecting this effect with n=6)
Non-centrality parameter λ: 1.8356

Non-Parametric Alternative: Wilcoxon Signed-Rank Test

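When n is small (a common rule of thumb is n < 15) or Shapiro-Wilk rejects normality (p < 0.05), consider the Wilcoxon signed-rank test, the usual non-parametric alternative to the one-sample t-test. It operates on the signed ranks of the differences xᵢ − μ₀ rather than their raw values, so it does not assume normality. It does assume that the differences are symmetric about the median. With n = 6 the small-sample rule applies here, so it is worth running alongside the t-test.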

python

from scipy import stats
import numpy as np

accuracy = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
mu0 = 0.80

# Wilcoxon signed-rank test (H₁: median > 0.80)
stat_w, p_w = stats.wilcoxon(accuracy - mu0, alternative='greater')
print(f"Wilcoxon signed-rank: W={stat_w:.1f}, p={p_w:.4f}")

# Compute the differences and their signed ranks manually
diff = accuracy - mu0
abs_diff = np.abs(diff)
ranks = stats.rankdata(abs_diff)
signed_ranks = np.sign(diff) * ranks
print(f"\nDifferences from μ₀=0.80: {np.round(diff, 3)}")
print(f"Ranks of |differences|:    {ranks}")
print(f"Signed ranks:              {signed_ranks}")
print(f"W+ (sum of positive ranks): {signed_ranks[signed_ranks>0].sum():.1f}")

text

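Wilcoxon signed-rank: W=17.0, p=0.1094

Differences from μ₀=0.80: [ 0.02 -0.01  0.11  0.05 -0.02  0.08]
Ranks of |differences|:    [2. 1. 6. 4. 3. 5.]
Signed ranks:              [ 2. -1.  6.  4. -3.  5.]
W+ (sum of positive ranks): 17.0

The Wilcoxon test yields p = 0.109, so it also fails to reject H₀ at α = 0.05, in agreement with the t-test (p = 0.063). Its p-value is larger because, with n = 6, the exact null distribution contains only 2⁶ = 64 equally likely sign patterns. A W+ of 17 or more occurs in 7 of them (p = 7/64), and rejecting at α = 0.05 would require W+ ≥ 19 (3/64 ≈ 0.047). One numerical caveat: the differences for 0.82 and 0.78 both have magnitude 0.02 and would share rank 2.5 on paper, but floating-point subtraction separates them by one unit in the last place, so SciPy ranks them 2 and 3 and reports no tie. With only 6 observations, details like this can move the p-value noticeably.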
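For n = 6 the exact null distribution of W+ is small enough to enumerate by brute force: under H₀ each rank is equally likely to carry a plus or a minus sign. A minimal sketch, assuming untied ranks 1 through 6 and no zero differences; `exact_upper_tail` is a hypothetical helper, not a SciPy function.

```python
from itertools import product


def exact_upper_tail(ranks, w_observed):
    """P(W+ >= w_observed) under H0, each rank positive with probability 1/2."""
    patterns = list(product([0, 1], repeat=len(ranks)))
    hits = sum(
        1
        for signs in patterns
        if sum(r for r, keep in zip(ranks, signs) if keep) >= w_observed
    )
    return hits / len(patterns)


ranks = [1, 2, 3, 4, 5, 6]   # untied ranks for n = 6 (assumption: no ties, no zeros)

for w in range(15, 22):
    print(f"P(W+ >= {w}) = {exact_upper_tail(ranks, w):.4f}")
```

With six folds, W+ has to reach 19 before the one-sided p-value drops below 0.05.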

One-Tailed vs Two-Tailed

	One-Tailed (H₁: μ > 0.80)	Two-Tailed (H₁: μ ≠ 0.80)
Critical value (α=0.05, df=5)	t* = 2.015	t* = ±2.571
p-value for t=1.836	0.0629	0.1259
Decision	Fail to reject	Fail to reject
When to use	Direction specified a priori	No prior direction prediction

Two-tailed tests are more conservative (require a larger t to reject) because they split α across both tails. The one-tailed test concentrates all the α in the direction you specified before seeing data.
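The critical values and p-values in the table can be reproduced with the t distribution's quantile function (`ppf`) and survival function (`sf`). A short sketch on the fold scores:

```python
import numpy as np
from scipy import stats

accuracy = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
mu0, alpha = 0.80, 0.05
n = accuracy.size
df = n - 1
t_obs = (accuracy.mean() - mu0) / (accuracy.std(ddof=1) / np.sqrt(n))

# One-tailed: all of alpha sits in the upper tail.
crit_one = stats.t.ppf(1 - alpha, df)
p_one = stats.t.sf(t_obs, df)

# Two-tailed: alpha is split, alpha/2 in each tail.
crit_two = stats.t.ppf(1 - alpha / 2, df)
p_two = 2 * stats.t.sf(abs(t_obs), df)

print(f"one-tailed: critical t = {crit_one:.3f}, p = {p_one:.4f}")
print(f"two-tailed: critical t = ±{crit_two:.3f}, p = {p_two:.4f}")
```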

Multiple Comparisons Warning

This post tests one hypothesis on one dataset. If you had tested 20 metrics (accuracy, precision, recall, AUC, F1, ...) and reported only the significant ones, the expected number of false positives at α=0.05 would be 0.05 × 20 = 1 when every null hypothesis is true. The family-wise error rate (the probability of at least one false positive) for k=20 independent tests at α=0.05 is 1 − (1−0.05)^20 ≈ 64%. When running multiple tests, use the Bonferroni correction (test each hypothesis at α/k) to control the family-wise error rate, or the Benjamini-Hochberg procedure to control the false discovery rate. See the multiple testing post for a full treatment.
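Both corrections are short enough to write out by hand. The sketch below uses 20 made-up p-values purely for illustration (they are not results from this post) and implements Bonferroni and Benjamini-Hochberg directly in NumPy.

```python
import numpy as np

alpha = 0.05
k = 20

# Family-wise error rate for k independent tests when every null is true.
fwer = 1 - (1 - alpha) ** k
print(f"FWER for {k} tests: {fwer:.3f}")

# Hypothetical p-values for 20 metrics (illustrative only, not from the post).
p_values = np.array([0.001, 0.004, 0.012, 0.021, 0.038, 0.049, 0.07, 0.11,
                     0.15, 0.22, 0.28, 0.33, 0.41, 0.47, 0.55, 0.62,
                     0.70, 0.78, 0.86, 0.93])

# Bonferroni: compare every p-value with alpha / k.
bonferroni_reject = p_values < alpha / k

# Benjamini-Hochberg: find the largest i with p_(i) <= (i / k) * alpha and
# reject the i smallest p-values.
order = np.argsort(p_values)
thresholds = alpha * np.arange(1, k + 1) / k
passed = np.nonzero(p_values[order] <= thresholds)[0]
bh_reject = np.zeros(k, dtype=bool)
if passed.size > 0:
    bh_reject[order[: passed[-1] + 1]] = True

print(f"Uncorrected rejections (p < {alpha}): {(p_values < alpha).sum()}")
print(f"Bonferroni rejections: {bonferroni_reject.sum()}")
print(f"Benjamini-Hochberg rejections: {bh_reject.sum()}")
```

On these made-up values the uncorrected count is the largest, Benjamini-Hochberg keeps fewer, and Bonferroni keeps the fewest, which is the usual ordering.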

Test Your Understanding

The t-test gives p = 0.063 and the Wilcoxon gives p = 0.109. Your colleague says "the p-values differ by almost a factor of two, so the tests contradict each other." Explain why two valid tests can give different p-values on the same data, and what factors should drive the choice of which to use.
Cohen's d = 0.749 (medium-to-large effect) but p = 0.063 (not significant at α=0.05). Explain how both can be true simultaneously. What would need to change to make this effect statistically significant while keeping d ≈ 0.75?
The power for n=6 is about 0.48. This means that if the true effect size is d=0.749, you would fail to detect it about 52% of the time. How many CV folds would you need to achieve 80% power? Explain why "more folds" is not always feasible in practice and what you would report to a stakeholder with only 6 folds available.
You compute p = 0.063 and decide to collect 10 more folds to get a better estimate. After including all 16 folds, p drops to 0.022. Your colleague says "the threshold-crossing at n=16 proves the effect is real." Identify the statistical problem with this sequential testing approach and name the correction methods that address it.
You run the one-sample t-test and get p = 0.04 (just under 0.05). A reviewer asks: "what is the probability that H₀ is true?" State the correct answer, explain why p = 0.04 does not answer that question, and describe what information you would need (beyond the p-value) to estimate P(H₀ is true).