Non-Parametric Tests
You run Shapiro-Wilk on your six CV fold scores and get p=0.043. The normality assumption just failed. Your dataset is too small for the Central Limit Theorem to cover you. Every parametric test you've seen so far — t-test, z-test, ANOVA — assumes normality in the underlying distribution or a large enough sample to invoke the CLT. Non-parametric tests replace that assumption with a weaker one: they work on the ranks of the data rather than the raw values. Fewer assumptions, but at a cost.
Decision Framework
Non-parametric tests are the fallback when parametric assumptions fail — not the default.
Assumption check protocol (in order):
- Shapiro-Wilk test: if p < 0.05, reject normality
- Sample size: if n < 30, CLT doesn't rescue a non-normal distribution
- Outliers: if severe outliers exist, the t-test mean is unreliable
- Scale: if data is ordinal (not interval/ratio), ranks are more meaningful than means
The cost of non-parametric tests: they discard magnitude information by replacing raw values with ranks. When the parametric assumptions hold, this costs you statistical power — you need a larger sample to detect the same effect. A non-parametric test is not automatically "safer"; it just trades one set of assumptions for another.
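The assumption checks listed above can be sketched as a small helper. This is only an illustration: the function name `assumption_report`, the 3 × IQR rule for "severe" outliers, and the way the checks are combined are assumptions of this sketch, not a standard procedure.

```python
import numpy as np
from scipy import stats

def assumption_report(x, ordinal=False, alpha=0.05, min_n=30):
    """Return (use_nonparametric, reasons) for one sample.

    Hypothetical helper: thresholds and the outlier rule are assumptions.
    """
    x = np.asarray(x, dtype=float)
    reasons = []
    # Normality rejected AND too few points for the CLT to help
    _, p_normal = stats.shapiro(x)
    if p_normal < alpha and len(x) < min_n:
        reasons.append(f"Shapiro-Wilk p={p_normal:.3g} with n={len(x)} < {min_n}")
    # Severe outliers: points more than 3 * IQR outside the quartiles
    q1, q3 = np.percentile(x, [25, 75])
    fence = 3.0 * (q3 - q1)
    if np.any((x < q1 - fence) | (x > q3 + fence)):
        reasons.append("outliers beyond 3 * IQR")
    # Ordinal data has no meaningful mean
    if ordinal:
        reasons.append("ordinal scale")
    return bool(reasons), reasons

# Made-up sample with one extreme value, only to exercise the checks
skewed = [1.0, 1.1, 0.9, 1.05, 0.95, 1.0, 1.02, 0.98, 50.0]
use_np, why = assumption_report(skewed)
print(use_np, why)
```

An empty `reasons` list means none of the checks fired, which is not the same as the parametric assumptions being confirmed.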
Anchor Datasets
import numpy as np
from scipy import stats
# One-sample: does median accuracy exceed 0.80?
accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
# Two-sample: is Model B better than Model A?
model_a = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
model_b = [0.84, 0.81, 0.93, 0.87, 0.80, 0.90]
# k-sample: do three models differ?
model_c = [0.78, 0.75, 0.87, 0.81, 0.74, 0.84]
Wilcoxon Signed-Rank Test
Replaces: one-sample t-test or paired t-test.
H₀: the population median = η₀. H₁: median > η₀ (or ≠ η₀ two-sided). The test assumes the differences from η₀ are symmetric about their median.
Test question: does the median CV fold accuracy significantly exceed 0.80?
Algorithm
- Compute differences dᵢ = xᵢ − η₀
- Discard differences = 0 (reduce n accordingly)
- Rank |dᵢ| from smallest to largest; ties get average ranks
- W⁺ = sum of ranks of positive dᵢ; W⁻ = sum of ranks of negative dᵢ
- Under H₀: W⁺ and W⁻ should be roughly equal (both ≈ n(n+1)/4)
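The steps above can be sketched directly with `scipy.stats.rankdata`. The helper name `signed_rank_sums` is hypothetical, and rounding the differences before ranking is an assumption added here so that floating-point noise (0.82 − 0.80 is not exactly 0.02 in binary) does not break ties.

```python
import numpy as np
from scipy.stats import rankdata

def signed_rank_sums(x, eta0, decimals=10):
    """Return (W_plus, W_minus, n) for a one-sample signed-rank test."""
    # Step 1: differences from the hypothesised median (rounded to kill float noise)
    d = np.round(np.asarray(x, dtype=float) - eta0, decimals)
    # Step 2: discard zero differences
    d = d[d != 0]
    # Step 3: rank |d|; rankdata assigns average ranks to ties by default
    ranks = rankdata(np.abs(d))
    # Step 4: sum the ranks by sign
    w_plus = ranks[d > 0].sum()
    w_minus = ranks[d < 0].sum()
    return w_plus, w_minus, len(d)

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
w_plus, w_minus, n = signed_rank_sums(accuracy, 0.80)
print(w_plus, w_minus, n * (n + 1) / 4)  # rank sums and their expected value under H0
```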
Step-by-Step on the Anchor (η₀ = 0.80)
| Fold | xᵢ | dᵢ = xᵢ − 0.80 | \|dᵢ\| | Rank | Signed Rank |
|------|----|-----------------|------|------|-------------|
| 1 | 0.82 | +0.02 | 0.02 | 2.5 | +2.5 |
| 2 | 0.79 | −0.01 | 0.01 | 1 | −1 |
| 3 | 0.91 | +0.11 | 0.11 | 6 | +6 |
| 4 | 0.85 | +0.05 | 0.05 | 4 | +4 |
| 5 | 0.78 | −0.02 | 0.02 | 2.5 | −2.5 |
| 6 | 0.88 | +0.08 | 0.08 | 5 | +5 |
Folds 1 and 5 tie at |d| = 0.02 and would occupy ranks 2 and 3, so both receive the average rank 2.5.
W⁺ = 2.5 + 6 + 4 + 5 = 17.5 (positive differences dominate)
W⁻ = 1 + 2.5 = 3.5
Check: W⁺ + W⁻ = 21 = n(n+1)/2 = 6×7/2 ✓
Under H₀: E[W⁺] = n(n+1)/4 = 10.5. Our W⁺ = 17.5 is above that, which points toward a median above 0.80; the p-value decides whether the gap is large enough with only six folds.
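To see how far into the tail W⁺ = 17.5 sits, the null distribution for n = 6 can be enumerated by hand: under H₀ each of the 2⁶ = 64 sign patterns is equally likely. The sketch below counts them. The function name `exact_upper_tail` is hypothetical, and conditioning on the observed tied ranks is one common convention, not the only one.

```python
from itertools import product

def exact_upper_tail(ranks, w_plus_obs):
    """P(W+ >= w_plus_obs) when every sign pattern is equally likely under H0."""
    hits = 0
    total = 0
    for signs in product((0, 1), repeat=len(ranks)):
        # W+ for this pattern: sum of the ranks that carry a positive sign
        w_plus = sum(r for r, s in zip(ranks, signs) if s)
        hits += w_plus >= w_plus_obs
        total += 1
    return hits / total

# Observed ranks from the table (with the 2.5 / 2.5 tie)
p_tied = exact_upper_tail([2.5, 1, 6, 4, 2.5, 5], 17.5)
# Untied ranks 1..6: tail probability at W+ = 19
p_at_19 = exact_upper_tail([1, 2, 3, 4, 5, 6], 19)
print(p_tied, p_at_19)  # 0.09375 0.046875
```

With six observations, W⁺ has to reach 19 before the one-sided tail probability drops below 0.05.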
Code
import numpy as np
from scipy import stats
accuracy = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
stat, p = stats.wilcoxon(accuracy - 0.80, alternative='greater')
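print(f"Wilcoxon signed-rank: W={stat:.1f}, p={p:.4f}")
With n = 6 there are only 2⁶ = 64 equally likely sign patterns under H₀, so the one-sided p-value can be counted directly. Reaching p < 0.05 requires W⁻ ≤ 2 (W⁺ ≥ 19), which has probability 3/64 = 0.047. Our W⁻ = 3.5 falls short: counting the patterns at least as extreme as the observed one, with the tied ranks, gives 6/64 = 0.094. The value SciPy prints depends on how the tie is handled (floating-point subtraction can split the 0.02 tie, and tied data may fall back to a normal approximation), but every variant lands between about 0.07 and 0.11.
p > 0.05 → Fail to reject H₀. Four of the six folds beat 0.80, but six observations are not enough to show that the median exceeds 0.80 at the 5% level.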
Mann-Whitney U Test
Replaces: independent two-sample t-test (Welch's).
H₀: P(X_A > X_B) = 0.5. H₁: P(X_A > X_B) ≠ 0.5 (two-sided).
Important misconception: Mann-Whitney tests whether one distribution tends to produce larger values than the other — not whether the medians are equal. The median-equality interpretation is valid only when both distributions have the same shape. In practice (different shapes or scales), these hypotheses are different.
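The quantity in H₀, P(X_A > X_B), can be estimated directly by comparing every A score with every B score. A minimal sketch on the anchor data follows; counting ties as half a win is a common convention assumed here.

```python
import numpy as np

model_a = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
model_b = np.array([0.84, 0.81, 0.93, 0.87, 0.80, 0.90])

# Broadcast to a 6 x 6 grid: entry (i, j) compares A's fold i with B's fold j
wins = (model_a[:, None] > model_b[None, :]).sum()
ties = (model_a[:, None] == model_b[None, :]).sum()

# Share of the 36 pairs in which A beats B (ties count as half)
p_a_beats_b = (wins + 0.5 * ties) / (len(model_a) * len(model_b))
print(wins, ties, round(p_a_beats_b, 3))
```

A value near 0.5 means neither model tends to score higher. Nothing in this estimate refers to medians, which is why the median interpretation needs the extra same-shape assumption.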
Algorithm
- Combine all n_A + n_B observations and rank from smallest to largest
- R_A = sum of ranks assigned to Group A observations
- U_A = n_A × n_B + n_A(n_A+1)/2 − R_A
- U_B = n_A × n_B − U_A
- Test statistic: U = min(U_A, U_B) for two-sided test
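The algorithm above, written out with `scipy.stats.rankdata` (the helper name `mann_whitney_u` is hypothetical; the formulas follow the convention used in this post):

```python
import numpy as np
from scipy.stats import rankdata

def mann_whitney_u(a, b):
    """Return (R_A, U_A, U_B) using the rank-sum formulas from the list above."""
    n_a, n_b = len(a), len(b)
    # Rank the pooled sample; the first n_a ranks belong to group A
    ranks = rankdata(np.concatenate([a, b]))
    r_a = ranks[:n_a].sum()
    u_a = n_a * n_b + n_a * (n_a + 1) / 2 - r_a
    u_b = n_a * n_b - u_a
    return r_a, u_a, u_b

model_a = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
model_b = [0.84, 0.81, 0.93, 0.87, 0.80, 0.90]
r_a, u_a, u_b = mann_whitney_u(model_a, model_b)
print(r_a, u_a, u_b, min(u_a, u_b))
```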
Step-by-Step on the Anchor
All 12 values sorted with group labels:
| Rank | Value | Group | Rank | Value | Group |
|---|---|---|---|---|---|
| 1 | 0.78 | A | 7 | 0.85 | A |
| 2 | 0.79 | A | 8 | 0.87 | B |
| 3 | 0.80 | B | 9 | 0.88 | A |
| 4 | 0.81 | B | 10 | 0.90 | B |
| 5 | 0.82 | A | 11 | 0.91 | A |
| 6 | 0.84 | B | 12 | 0.93 | B |
R_A = 1+2+5+7+9+11 = 35
R_B = 3+4+6+8+10+12 = 43
U_A = 6×6 + 6×7/2 − 35 = 36 + 21 − 35 = 22
U_B = 36 − 22 = 14
U = min(22, 14) = 14
With this convention U_A = 22 counts the (A, B) pairs in which B scores higher, and U_B = 14 the pairs in which A scores higher.
Effect size: r = Z / √(n_A + n_B), using the normal approximation of U:
E[U] = n_A n_B / 2 = 18
SD[U] = √(n_A n_B (n_A+n_B+1)/12) = √39 = 6.245
Z = (U_A − 18) / 6.245 = 4 / 6.245 = 0.641
r = 0.641 / √12 = 0.185 (small effect)
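The effect-size arithmetic can be checked in a few lines. This sketch uses the normal approximation without a continuity correction, as in the hand calculation.

```python
import math

n_a = n_b = 6
u_a = 22  # from the step-by-step calculation

mean_u = n_a * n_b / 2                               # E[U] under H0
sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)   # SD[U] under H0, no tie correction
z = (u_a - mean_u) / sd_u
r = z / math.sqrt(n_a + n_b)                         # effect size r = Z / sqrt(N)

print(f"E[U]={mean_u:.0f}, SD[U]={sd_u:.3f}, Z={z:.3f}, r={r:.3f}")
```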
Code
stat_u, p_mw = stats.mannwhitneyu(model_a, model_b, alternative='two-sided')
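print(f"Mann-Whitney U: U={stat_u:.1f}, p={p_mw:.4f}")
Mann-Whitney U: U=14.0, p=0.5887
SciPy (1.7 and later) reports the statistic of the first sample, R_A − n_A(n_A+1)/2 = 14, which is the complement of the U_A = 22 computed by hand (22 + 14 = n_A × n_B = 36). With no ties and n = 6 per group the p-value is exact: 2 × 272/924 = 0.589.
p=0.589 → Fail to reject H₀. The two models do not show a statistically significant difference in fold scores at n=6.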
Kruskal-Wallis Test
Replaces: one-way ANOVA.
H₀: all k populations have the same distribution. H₁: at least one differs.
The analogy to ANOVA: H is the rank-based analogue of the F-statistic — it measures how much the group mean ranks deviate from the overall mean rank, weighted by group size.
Formula
H = (12 / (N(N+1))) × Σᵢ nᵢ (R̄ᵢ − R̄)²
where R̄ = (N+1)/2 is the overall mean rank and R̄ᵢ is the mean rank of group i. Under H₀, H ~ χ²(k−1) approximately (for nᵢ ≥ 5).
Step-by-Step on All Three Models (N=18)
All 18 values sorted with ties handled by average ranks:
| Rank | Value | Group | Rank | Value | Group |
|---|---|---|---|---|---|
| 1 | 0.74 | C | 10.5 | 0.84 | B |
| 2 | 0.75 | C | 10.5 | 0.84 | C |
| 3.5 | 0.78 | A | 12 | 0.85 | A |
| 3.5 | 0.78 | C | 13.5 | 0.87 | B |
| 5 | 0.79 | A | 13.5 | 0.87 | C |
| 6 | 0.80 | B | 15 | 0.88 | A |
| 7.5 | 0.81 | B | 16 | 0.90 | B |
| 7.5 | 0.81 | C | 17 | 0.91 | A |
| 9 | 0.82 | A | 18 | 0.93 | B |
R̄_A = (3.5+5+9+12+15+17)/6 = 61.5/6 = 10.25
R̄_B = (6+7.5+10.5+13.5+16+18)/6 = 71.5/6 = 11.9167
R̄_C = (1+2+3.5+7.5+10.5+13.5)/6 = 38/6 = 6.3333
Overall R̄ = (18+1)/2 = 9.5
H = (12/(18×19)) × [6×(10.25−9.5)² + 6×(11.9167−9.5)² + 6×(6.3333−9.5)²]
= 0.03509 × [6×0.5625 + 6×5.840 + 6×10.028]
= 0.03509 × [3.375 + 35.04 + 60.17]
= 0.03509 × 98.58 = 3.459
Tie correction: each of the four tied pairs contributes t³ − t = 6, so H is divided by 1 − 24/(18³ − 18) = 1 − 24/5814 = 0.9959, giving H = 3.473.
df = k − 1 = 2. χ²(0.05, 2) = 5.991. Since H = 3.473 < 5.991 → fail to reject.
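The rank table and the H computation above can be reproduced from the formula. This is a sketch: the variable names are mine, and the tie correction shown is the standard division by 1 − Σ(t³ − t)/(N³ − N).

```python
import numpy as np
from scipy import stats

groups = {
    "A": [0.82, 0.79, 0.91, 0.85, 0.78, 0.88],
    "B": [0.84, 0.81, 0.93, 0.87, 0.80, 0.90],
    "C": [0.78, 0.75, 0.87, 0.81, 0.74, 0.84],
}

pooled = np.concatenate(list(groups.values()))
ranks = stats.rankdata(pooled)          # average ranks for ties
N = len(pooled)
overall_mean_rank = (N + 1) / 2

# Sum of n_i * (mean rank of group i - overall mean rank)^2
h = 0.0
start = 0
for name, values in groups.items():
    n_i = len(values)
    mean_rank = ranks[start:start + n_i].mean()
    h += n_i * (mean_rank - overall_mean_rank) ** 2
    start += n_i
h *= 12 / (N * (N + 1))

# Tie correction: t is the size of each group of tied values
_, counts = np.unique(pooled, return_counts=True)
tie_factor = 1 - (counts ** 3 - counts).sum() / (N ** 3 - N)
h_corrected = h / tie_factor

p = stats.chi2.sf(h_corrected, df=len(groups) - 1)
print(f"H={h:.4f}, tie-corrected H={h_corrected:.4f}, p={p:.4f}")
```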
Post-Hoc: Dunn's Test
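Post-hoc pairwise comparisons after a significant Kruskal-Wallis use Dunn's test. It compares the mean ranks of each pair of groups, reusing the pooled ranking from Kruskal-Wallis rather than re-ranking each pair as separate Mann-Whitney tests would, and then applies a multiple-comparison correction.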
import pandas as pd
from scikit_posthocs import posthoc_dunn
# posthoc_dunn takes a DataFrame plus the names of the value and group columns
df = pd.DataFrame({
    'score': list(model_a) + list(model_b) + list(model_c),
    'model': ['A']*6 + ['B']*6 + ['C']*6,
})
result = posthoc_dunn(df, val_col='score', group_col='model', p_adjust='bonferroni')
print(result)
# Bonferroni-corrected pairwise p-values (run only after significant KW):
A B C
A 1.000000 1.000000 0.612023
B 1.000000 1.000000 0.210434
C 0.612023 0.210434 1.000000
# All pairs non-significant (consistent with the non-significant KW result)
Code
stat_kw, p_kw = stats.kruskal(model_a, model_b, model_c)
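print(f"Kruskal-Wallis: H={stat_kw:.4f}, p={p_kw:.4f}")
Kruskal-Wallis: H=3.4734, p=0.1761
SciPy applies the tie correction, so H is slightly larger than the uncorrected value from the formula.
p=0.176 → Fail to reject H₀. With n=6 per group and the fold pairing ignored, the test has little power to detect differences of this size.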
Friedman Test
Replaces: repeated-measures ANOVA.
When: the SAME subjects (folds) are measured under all k conditions. One-way Kruskal-Wallis ignores the within-block correlation; Friedman exploits it.
Algorithm:
- For each block (fold), rank the k observations from 1 to k
- R̄ⱼ = mean rank of treatment j across all n blocks
- Q = (12n / (k(k+1))) × Σⱼ (R̄ⱼ − (k+1)/2)²
Under H₀, Q ~ χ²(k−1) approximately.
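A minimal sketch of the algorithm, with folds as rows and models as columns. It uses the formula above with no tie correction, which is fine here because no fold has tied scores.

```python
import numpy as np
from scipy import stats

# rows = folds (blocks), columns = models A, B, C
scores = np.array([
    [0.82, 0.84, 0.78],
    [0.79, 0.81, 0.75],
    [0.91, 0.93, 0.87],
    [0.85, 0.87, 0.81],
    [0.78, 0.80, 0.74],
    [0.88, 0.90, 0.84],
])
n, k = scores.shape

# Step 1: rank the k models within each fold
ranks = np.array([stats.rankdata(row) for row in scores])
# Step 2: mean rank of each model across folds
mean_ranks = ranks.mean(axis=0)
# Step 3: Friedman statistic and its chi-square p-value
q = 12 * n / (k * (k + 1)) * ((mean_ranks - (k + 1) / 2) ** 2).sum()
p = stats.chi2.sf(q, df=k - 1)

print(mean_ranks, f"Q={q:.4f}, p={p:.4f}")
```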
Step-by-Step (Folds as Blocks)
| Fold | A | B | C | Rank A | Rank B | Rank C |
|---|---|---|---|---|---|---|
| 1 | 0.82 | 0.84 | 0.78 | 2 | 3 | 1 |
| 2 | 0.79 | 0.81 | 0.75 | 2 | 3 | 1 |
| 3 | 0.91 | 0.93 | 0.87 | 2 | 3 | 1 |
| 4 | 0.85 | 0.87 | 0.81 | 2 | 3 | 1 |
| 5 | 0.78 | 0.80 | 0.74 | 2 | 3 | 1 |
| 6 | 0.88 | 0.90 | 0.84 | 2 | 3 | 1 |
Every fold gives the same ranking: C < A < B — a perfect pattern.
R̄_A = 12/6 = 2.0, R̄_B = 18/6 = 3.0, R̄_C = 6/6 = 1.0
(k+1)/2 = 2.0
Q = (12×6 / (3×4)) × [(2.0−2.0)² + (3.0−2.0)² + (1.0−2.0)²] = (72/12) × [0 + 1 + 1] = 6 × 2 = 12.000
df = 2. χ²(0.05, 2) = 5.991. Q = 12 >> 5.991 → p = 0.0025. Reject H₀.
The Friedman test is significant where Kruskal-Wallis was not — because Friedman removes the fold-to-fold variability (the between-block variance) from the error term. The consistent C < A < B ranking across every fold provides strong evidence of a real ordering.
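One way to see the fold-to-fold variability that blocking removes is to subtract each fold's mean from its three scores. This is only an illustration of the idea: Friedman itself works on within-fold ranks, not on centred values.

```python
import numpy as np

# rows = folds, columns = models A, B, C
scores = np.array([
    [0.82, 0.84, 0.78],
    [0.79, 0.81, 0.75],
    [0.91, 0.93, 0.87],
    [0.85, 0.87, 0.81],
    [0.78, 0.80, 0.74],
    [0.88, 0.90, 0.84],
])

fold_means = scores.mean(axis=1, keepdims=True)
centered = scores - fold_means          # each fold's difficulty removed

# Spread between folds (what Kruskal-Wallis has to fight against)
between_fold_sd = fold_means.std(ddof=1)
# Spread of each model's centred scores across folds (what is left after blocking)
within_model_sd = centered.std(axis=0, ddof=1)

print(f"between-fold SD: {between_fold_sd:.4f}")
print("per-model offset from fold mean:", centered.mean(axis=0).round(4))
print("SD of centred scores per model:", within_model_sd.round(6))
```

On this data the centred scores are constant for each model, so all of the noise that hides the ordering in the pooled ranking comes from differences between folds.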
Code
stat_fr, p_fr = stats.friedmanchisquare(model_a, model_b, model_c)
print(f"Friedman: Q={stat_fr:.4f}, p={p_fr:.4f}")
Friedman: Q=12.0000, p=0.0025
Spearman's Rank Correlation
Replaces: Pearson r when data is non-normal or ordinal.
Spearman ρ is Pearson r computed on the ranks of the values, not the values themselves. It measures monotonic association rather than linear association.
For model_a and model_b, each fold in model_b scores exactly 0.02 above model_a, so their rank orderings are identical — a perfect monotonic relationship.
r_s, p_s = stats.spearmanr(model_a, model_b)
print(f"Spearman r: {r_s:.4f}, p={p_s:.4f}")
Spearman r: 1.0000, p=0.0000
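The claim that Spearman ρ is Pearson r on ranks can be checked directly, and a curved but monotonic transform shows the difference between monotonic and linear association. The `np.exp(20 * model_a)` series is a made-up illustration, not part of the anchor data.

```python
import numpy as np
from scipy import stats

model_a = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
model_b = np.array([0.84, 0.81, 0.93, 0.87, 0.80, 0.90])

# Spearman by hand: Pearson r on the ranks
rho_manual, _ = stats.pearsonr(stats.rankdata(model_a), stats.rankdata(model_b))
rho_scipy, _ = stats.spearmanr(model_a, model_b)

# A monotonic but non-linear relationship (illustrative transform)
curved = np.exp(20 * model_a)
rho_curved, _ = stats.spearmanr(model_a, curved)
r_curved, _ = stats.pearsonr(model_a, curved)

print(f"manual rho={rho_manual:.4f}, scipy rho={rho_scipy:.4f}")
print(f"curved: Spearman={rho_curved:.4f}, Pearson={r_curved:.4f}")
```

Spearman stays at 1 for the curved series because the ordering is unchanged, while Pearson drops below 1 because the relationship is no longer a straight line.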
The full treatment of Spearman ρ and its relationship to Pearson r is in the correlation post of this series.
Full Code (All Tests)
import numpy as np
from scipy import stats
accuracy = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
model_a = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
model_b = np.array([0.84, 0.81, 0.93, 0.87, 0.80, 0.90])
model_c = np.array([0.78, 0.75, 0.87, 0.81, 0.74, 0.84])
# One-sample: median > 0.80?
w_stat, w_p = stats.wilcoxon(accuracy - 0.80, alternative='greater')
print(f"Wilcoxon signed-rank: W={w_stat:.1f}, p={w_p:.4f}")
# Two-sample
u_stat, u_p = stats.mannwhitneyu(model_a, model_b, alternative='two-sided')
print(f"Mann-Whitney U: U={u_stat:.1f}, p={u_p:.4f}")
# k-sample
h_stat, h_p = stats.kruskal(model_a, model_b, model_c)
print(f"Kruskal-Wallis: H={h_stat:.4f}, p={h_p:.4f}")
# Repeated measures (blocks = folds)
fr_stat, fr_p = stats.friedmanchisquare(model_a, model_b, model_c)
print(f"Friedman: Q={fr_stat:.4f}, p={fr_p:.4f}")
# Spearman correlation
r_s, p_s = stats.spearmanr(model_a, model_b)
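print(f"Spearman r: {r_s:.4f}, p={p_s:.4f}")
Output (the Wilcoxon line is not shown because W and p depend on how the installed SciPy version handles the 0.02 tie; the one-sided p falls between about 0.07 and 0.11):
Mann-Whitney U: U=14.0, p=0.5887
Kruskal-Wallis: H=3.4734, p=0.1761
Friedman: Q=12.0000, p=0.0025
Spearman r: 1.0000, p=0.0000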
Parametric vs Non-Parametric Equivalents
| Situation | Parametric Test | Non-Parametric Equivalent |
|---|---|---|
| One sample vs constant | One-sample t-test | Wilcoxon signed-rank |
| Two paired groups | Paired t-test | Wilcoxon signed-rank |
| Two independent groups | Welch's t-test | Mann-Whitney U |
| k independent groups | One-way ANOVA | Kruskal-Wallis |
| k related groups | Repeated-measures ANOVA | Friedman test |
| Correlation | Pearson r | Spearman ρ |
Test Your Understanding
- Shapiro-Wilk returns p=0.12 on your accuracy data (n=6). A colleague says "just use a t-test, normality isn't rejected." A second colleague says "n=6 is too small, the Shapiro-Wilk test itself has low power for small samples, so you can't rely on it to confirm normality." Who is right, and what should you do?
- Your Mann-Whitney U test returns p=0.59 for n=6 vs n=6. Is this evidence that the two models perform equally? What would you need to establish equivalence rather than just "no significant difference"?
- Kruskal-Wallis gives H≈3.5, p≈0.18 (not significant), but Friedman gives Q=12.0, p=0.0025 (significant) on the exact same data and same three models. Explain why the Friedman test is more sensitive here. What variance is each test using in its denominator?
- Model B consistently scores exactly 0.02 above Model A on every fold (perfect rank correlation). Does this mean Model B is practically better? What additional analysis would you need before recommending Model B in production?
- You have 50 fold accuracy scores per model (n=50). Shapiro-Wilk returns p=0.04. A reviewer argues you should use non-parametric tests because normality was rejected. Construct the counter-argument based on the CLT, power considerations, and the cost of discarding magnitude information at n=50.