Non-Parametric Tests

You run Shapiro-Wilk on your six CV fold scores and get p=0.043. The normality assumption just failed. Your dataset is too small for the Central Limit Theorem to cover you. Every parametric test you've seen so far — t-test, z-test, ANOVA — assumes normality in the underlying distribution or a large enough sample to invoke the CLT. Non-parametric tests replace that assumption with a weaker one: they work on the ranks of the data rather than the raw values. Fewer assumptions, but at a cost.

Decision Framework

Non-parametric tests are the fallback when parametric assumptions fail — not the default.

Assumption check protocol (in order):

Shapiro-Wilk test: if p < 0.05, reject normality
Sample size: if n < 30, CLT doesn't rescue a non-normal distribution
Outliers: if severe outliers exist, the t-test mean is unreliable
Scale: if data is ordinal (not interval/ratio), ranks are more meaningful than means
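The first three checks can be scripted; the fourth (measurement scale) is a judgment about how the data was collected. A minimal sketch, assuming a hypothetical helper `check_assumptions` and an assumed 3×IQR rule of thumb for "severe" outliers (neither comes from the protocol above):

```python
import numpy as np
from scipy import stats

def check_assumptions(x, alpha=0.05, min_n=30):
    """Hypothetical helper: run the scriptable checks on one sample."""
    x = np.asarray(x, dtype=float)
    _, p_normal = stats.shapiro(x)              # Shapiro-Wilk normality test
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    # Assumed rule of thumb: "severe" = more than 3 * IQR beyond the quartiles
    severe = (x < q1 - 3 * iqr) | (x > q3 + 3 * iqr)
    return {
        "n": int(x.size),
        "shapiro_p": float(p_normal),
        "normality_rejected": bool(p_normal < alpha),
        "small_sample": bool(x.size < min_n),
        "n_severe_outliers": int(severe.sum()),
    }

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
report = check_assumptions(accuracy)
print(report)
```

The dictionary only reports the checks; deciding which test to run still depends on reading them together, as the protocol describes.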

The cost of non-parametric tests: they discard magnitude information by replacing raw values with ranks. When the parametric assumptions hold, this costs you statistical power — you need a larger sample to detect the same effect. A non-parametric test is not automatically "safer"; it just trades one set of assumptions for another.
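The power cost can be estimated by simulation. The sketch below draws normally distributed samples (so the t-test's assumptions hold) and counts how often each test rejects a false H₀. The sample size, shift, repetition count, and seed are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)        # fixed seed so the sketch is repeatable
n, shift, reps, alpha = 10, 1.0, 1000, 0.05

reject_t = 0
reject_w = 0
for _ in range(reps):
    x = rng.normal(loc=shift, scale=1.0, size=n)   # normal data, true mean = shift
    reject_t += int(stats.ttest_1samp(x, 0.0).pvalue < alpha)
    reject_w += int(stats.wilcoxon(x).pvalue < alpha)

power_t = reject_t / reps             # share of runs where the t-test rejects
power_w = reject_w / reps             # share of runs where Wilcoxon rejects
print(f"t-test power:   {power_t:.3f}")
print(f"Wilcoxon power: {power_w:.3f}")
```

On normal data the t-test typically rejects somewhat more often than the signed-rank test at the same n. Rerunning with a heavy-tailed generator is one way to see the ranking change.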

Anchor Datasets

python

import numpy as np
from scipy import stats

# One-sample: does median accuracy exceed 0.80?
accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

# Two-sample: is Model B better than Model A?
model_a = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
model_b = [0.84, 0.81, 0.93, 0.87, 0.80, 0.90]

# k-sample: do three models differ?
model_c = [0.78, 0.75, 0.87, 0.81, 0.74, 0.84]

Wilcoxon Signed-Rank Test

Replaces: one-sample t-test or paired t-test.

H₀: the population median = η₀. H₁: median > η₀ (or ≠ η₀ two-sided). The test assumes the differences xᵢ − η₀ are symmetrically distributed about their median.

Test question: does the median CV fold accuracy significantly exceed 0.80?

Algorithm

Compute differences dᵢ = xᵢ − η₀
Discard differences = 0 (reduce n accordingly)
Rank |dᵢ| from smallest to largest; ties get average ranks
W⁺ = sum of ranks of positive dᵢ; W⁻ = sum of ranks of negative dᵢ
Under H₀: W⁺ and W⁻ should be roughly equal (both ≈ n(n+1)/4)

Step-by-Step on the Anchor (η₀ = 0.80)

Fold	xᵢ	dᵢ = xᵢ − 0.80	|dᵢ|	Rank	Signed Rank
1	0.82	+0.02	0.02	2.5	+2.5
2	0.79	−0.01	0.01	1	−1
3	0.91	+0.11	0.11	6	+6
4	0.85	+0.05	0.05	4	+4
5	0.78	−0.02	0.02	2.5	−2.5
6	0.88	+0.08	0.08	5	+5

Folds 1 and 5 tie at |d| = 0.02 and would occupy ranks 2 and 3, so both get the average rank 2.5.

W⁺ = 2.5 + 6 + 4 + 5 = 17.5 (positive differences dominate)
W⁻ = 1 + 2.5 = 3.5

Check: W⁺ + W⁻ = 21 = n(n+1)/2 = 6×7/2 ✓

Under H₀: E[W⁺] = n(n+1)/4 = 10.5. Our W⁺=17.5 is well above that — evidence the median exceeds 0.80.
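The hand calculation can be reproduced in a few lines, and with n = 6 the null distribution is small enough to enumerate directly. This is a sketch rather than SciPy's own procedure: the differences are rounded to two decimals (an added step) so that the two |d| = 0.02 values are an exact tie in floating point, and the p-value comes from enumerating all 2⁶ sign patterns.

```python
import numpy as np
from itertools import product
from scipy import stats

accuracy = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])

# Round so that |0.82 - 0.80| and |0.78 - 0.80| compare as exactly equal
d = np.round(accuracy - 0.80, 2)
d = d[d != 0]                            # discard zero differences
ranks = stats.rankdata(np.abs(d))        # tied values receive average ranks
w_plus = ranks[d > 0].sum()
w_minus = ranks[d < 0].sum()
print(f"W+ = {w_plus}, W- = {w_minus}")  # W+ = 17.5, W- = 3.5

# Under H0 every rank is equally likely to carry a + or - sign, so the exact
# one-sided p-value is the share of sign patterns with W+ at least as large.
n = len(ranks)
count = sum(
    1 for signs in product([0, 1], repeat=n)
    if np.dot(signs, ranks) >= w_plus
)
p_exact = count / 2**n
print(f"P(W+ >= {w_plus}) = {count}/{2**n} = {p_exact:.4f}")  # 6/64 = 0.0938
```

Only 6 of the 64 sign patterns give a W⁺ of 17.5 or more, so the exact one-sided p-value with the tie kept is about 0.094.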

Code

python

import numpy as np
from scipy import stats

accuracy = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
stat, p = stats.wilcoxon(accuracy - 0.80, alternative='greater')
print(f"Wilcoxon signed-rank: W={stat:.1f}, p={p:.4f}")

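Wilcoxon signed-rank: W=17.0, p=0.1094

SciPy reports W=17.0 rather than the hand-computed 17.5. In floating point, 0.82 − 0.80 is slightly smaller in magnitude than 0.78 − 0.80, so the tie at |d| = 0.02 is broken and the two folds get ranks 2 and 3 instead of 2.5 each. With n = 6 there are only 2⁶ = 64 equally likely sign patterns under H₀, and P(W⁺ ≥ 17) = 7/64 ≈ 0.109.

p=0.109 > 0.05 → Fail to reject H₀. Six folds do not give enough evidence that the median accuracy exceeds 0.80. At n = 6, one-sided significance at the 5% level requires W⁺ ≥ 19 (p = 3/64 ≈ 0.047).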

Mann-Whitney U Test

Replaces: independent two-sample t-test (Welch's).

H₀: P(X_A > X_B) = 0.5. H₁: P(X_A > X_B) ≠ 0.5 (two-sided).

Important misconception: Mann-Whitney tests whether one distribution tends to produce larger values than the other — not whether the medians are equal. The median-equality interpretation is valid only when both distributions have the same shape. In practice (different shapes or scales), these hypotheses are different.
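The quantity in H₀, P(X_A > X_B), can be estimated directly by comparing every A score with every B score. A small sketch on the anchor data:

```python
import numpy as np

model_a = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
model_b = np.array([0.84, 0.81, 0.93, 0.87, 0.80, 0.90])

# Broadcasting builds the full 6 x 6 grid of (A, B) comparisons
a_wins = int((model_a[:, None] > model_b[None, :]).sum())
b_wins = int((model_a[:, None] < model_b[None, :]).sum())
n_pairs = model_a.size * model_b.size

print(f"A > B in {a_wins} of {n_pairs} pairs, P(A > B) ~ {a_wins / n_pairs:.3f}")
print(f"B > A in {b_wins} of {n_pairs} pairs, P(B > A) ~ {b_wins / n_pairs:.3f}")
```

A beats B in 14 of the 36 pairs and B beats A in 22. These two counts are the two U statistics, which is why the test speaks to "tends to produce larger values" rather than to medians.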

Algorithm

Combine all n_A + n_B observations and rank from smallest to largest
R_A = sum of ranks assigned to Group A observations
U_A = n_A × n_B + n_A(n_A+1)/2 − R_A
U_B = n_A × n_B − U_A
Test statistic: U = min(U_A, U_B) for two-sided test
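The steps above can be sketched with `scipy.stats.rankdata`, using the same U_A convention as the formulas in this section:

```python
import numpy as np
from scipy import stats

model_a = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
model_b = np.array([0.84, 0.81, 0.93, 0.87, 0.80, 0.90])
n_a, n_b = len(model_a), len(model_b)

# Step 1: rank the pooled sample (average ranks would be used for ties)
ranks = stats.rankdata(np.concatenate([model_a, model_b]))

# Step 2: rank sum of group A (its values come first in the pooled array)
r_a = ranks[:n_a].sum()

# Steps 3-5: the two U statistics and the two-sided test statistic
u_a = n_a * n_b + n_a * (n_a + 1) / 2 - r_a
u_b = n_a * n_b - u_a
u = min(u_a, u_b)
print(f"R_A={r_a}, U_A={u_a}, U_B={u_b}, U={u}")
```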

Step-by-Step on the Anchor

All 12 values sorted with group labels:

Rank	Value	Group	Rank	Value	Group
1	0.78	A	7	0.85	A
2	0.79	A	8	0.87	B
3	0.80	B	9	0.88	A
4	0.81	B	10	0.90	B
5	0.82	A	11	0.91	A
6	0.84	B	12	0.93	B

R_A = 1+2+5+7+9+11 = 35
R_B = 3+4+6+8+10+12 = 43

U_A = 6×6 + 6×7/2 − 35 = 36 + 21 − 35 = 22
U_B = 36 − 22 = 14
U = min(22, 14) = 14

Effect size: r = Z / √(n_A + n_B), using the normal approximation of U:

E[U] = n_A n_B / 2 = 18
SD[U] = √(n_A n_B (n_A+n_B+1)/12) = √39 = 6.245
Z = (U_A − 18) / 6.245 = 4 / 6.245 = 0.640
r = 0.640 / √12 = 0.185 (small effect)
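The effect-size arithmetic as a short script, using the U_A value from the hand calculation. No tie correction is applied to SD[U], which is fine here because the pooled sample has no ties:

```python
import math

n_a = n_b = 6
u_a = 22.0                                   # from the hand calculation

mean_u = n_a * n_b / 2                       # E[U] under H0
sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
z = (u_a - mean_u) / sd_u                    # normal approximation
r = z / math.sqrt(n_a + n_b)                 # effect size r = Z / sqrt(N)
print(f"Z={z:.3f}, r={r:.3f}")
```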

Code

python

stat_u, p_mw = stats.mannwhitneyu(model_a, model_b, alternative='two-sided')
print(f"Mann-Whitney U: U={stat_u:.1f}, p={p_mw:.4f}")

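Mann-Whitney U: U=14.0, p=0.5887

SciPy reports the U statistic for the first sample (model_a): R_A − n_A(n_A+1)/2 = 35 − 21 = 14, the number of (A, B) pairs in which A scores higher. Here it equals min(U_A, U_B) from the hand calculation. With at most 8 observations per group and no ties, SciPy uses the exact null distribution rather than the normal approximation.

p=0.589 → Fail to reject H₀. The two models do not show a statistically significant difference in fold scores at n=6.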

Kruskal-Wallis Test

Replaces: one-way ANOVA.

H₀: all k populations have the same distribution. H₁: at least one differs.

The analogy to ANOVA: H is the rank-based analogue of the F-statistic — it measures how much the group mean ranks deviate from the overall mean rank, weighted by group size.

Formula

H = (12 / (N(N+1))) × Σᵢ nᵢ (R̄ᵢ − R̄)²

where R̄ = (N+1)/2 is the overall mean rank and R̄ᵢ is the mean rank of group i. Under H₀, H ~ χ²(k−1) approximately (for nᵢ ≥ 5).

Step-by-Step on All Three Models (N=18)

All 18 values sorted with ties handled by average ranks:

Rank	Value	Group	Rank	Value	Group
1	0.74	C	10.5	0.84	B
2	0.75	C	10.5	0.84	C
3.5	0.78	A	12	0.85	A
3.5	0.78	C	13.5	0.87	B
5	0.79	A	13.5	0.87	C
6	0.80	B	15	0.88	A
7.5	0.81	B	16	0.90	B
7.5	0.81	C	17	0.91	A
9	0.82	A	18	0.93	B

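R̄_A = (3.5+5+9+12+15+17)/6 = 61.5/6 = 10.25
R̄_B = (6+7.5+10.5+13.5+16+18)/6 = 71.5/6 = 11.92
R̄_C = (1+2+3.5+7.5+10.5+13.5)/6 = 38/6 = 6.33
Overall R̄ = (18+1)/2 = 9.5

H = (12/(18×19)) × [6×(10.25−9.5)² + 6×(11.92−9.5)² + 6×(6.33−9.5)²]
  = 0.03509 × [6×0.5625 + 6×5.840 + 6×10.028]
  = 0.03509 × [3.375 + 35.04 + 60.17]
  = 0.03509 × 98.58
  = 3.459

The squared deviations use the unrounded mean ranks (11.9167 and 6.3333). Because the data contain ties, H is usually divided by the correction factor 1 − Σ(t³ − t)/(N³ − N), where t is the size of each tied group. The four tied pairs give 1 − 24/5814 = 0.9959, so the corrected value is H = 3.459 / 0.9959 = 3.473. This corrected value is what scipy.stats.kruskal reports.

df = k − 1 = 2. χ²(0.05, 2) = 5.991. Since H = 3.47 < 5.991 → fail to reject.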
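The same calculation in code, as a sketch that follows the mean-rank formula above and then applies the tie correction via `scipy.stats.tiecorrect`:

```python
import numpy as np
from scipy import stats

groups = {
    "A": [0.82, 0.79, 0.91, 0.85, 0.78, 0.88],
    "B": [0.84, 0.81, 0.93, 0.87, 0.80, 0.90],
    "C": [0.78, 0.75, 0.87, 0.81, 0.74, 0.84],
}
values = np.concatenate(list(groups.values()))
labels = np.repeat(list(groups.keys()), [len(v) for v in groups.values()])

ranks = stats.rankdata(values)          # joint ranking, average ranks for ties
N = len(values)
overall_mean_rank = (N + 1) / 2

# H = 12 / (N (N + 1)) * sum_i n_i * (mean rank of group i - overall mean rank)^2
h = 12 / (N * (N + 1)) * sum(
    (labels == g).sum() * (ranks[labels == g].mean() - overall_mean_rank) ** 2
    for g in groups
)
h_corrected = h / stats.tiecorrect(ranks)   # divide by the tie-correction factor
p = stats.chi2.sf(h_corrected, df=len(groups) - 1)
print(f"H={h:.4f}, tie-corrected H={h_corrected:.4f}, p={p:.4f}")
```

The tie-corrected value is the one to compare against `stats.kruskal`.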

Post-Hoc: Dunn's Test

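Post-hoc pairwise comparisons after a significant Kruskal-Wallis use Dunn's test: a z-test on the difference in mean ranks for each pair of groups, computed from the joint ranking of all groups (not a fresh Mann-Whitney ranking per pair), with a multiple-comparison correction.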

python

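import numpy as np
import pandas as pd
from scikit_posthocs import posthoc_dunn

# Long format: one row per score, with a column naming the model it came from
df = pd.DataFrame({
    'score': np.concatenate([model_a, model_b, model_c]),
    'model': ['A']*6 + ['B']*6 + ['C']*6,
})
result = posthoc_dunn(df, val_col='score', group_col='model', p_adjust='bonferroni')
print(result)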

# Bonferroni-corrected pairwise p-values (run only after significant KW):
          A         B         C
A  1.000000  1.000000  0.612023
B  1.000000  1.000000  0.210434
C  0.612023  0.210434  1.000000
# All pairs non-significant (consistent with the non-significant Kruskal-Wallis result)
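For readers who want to see what the pairwise statistic is made of, here is a hand-rolled sketch of Dunn's z with a Bonferroni adjustment. It assumes the usual textbook form (mean-rank differences from one joint ranking, with a tie adjustment in the variance). It is not the scikit-posthocs source, and its printed values may differ slightly from the table above.

```python
import numpy as np
from itertools import combinations
from scipy import stats

groups = {
    "A": [0.82, 0.79, 0.91, 0.85, 0.78, 0.88],
    "B": [0.84, 0.81, 0.93, 0.87, 0.80, 0.90],
    "C": [0.78, 0.75, 0.87, 0.81, 0.74, 0.84],
}
values = np.concatenate(list(groups.values()))
labels = np.repeat(list(groups.keys()), [len(v) for v in groups.values()])
ranks = stats.rankdata(values)             # one joint ranking of all 18 scores
N = len(values)

# Variance term with a tie adjustment: sum of (t^3 - t) over groups of tied values
_, tie_counts = np.unique(values, return_counts=True)
tie_term = np.sum(tie_counts**3 - tie_counts) / (12 * (N - 1))
base_var = N * (N + 1) / 12 - tie_term

pairs = list(combinations(groups, 2))
adjusted = {}
for g1, g2 in pairs:
    r1, r2 = ranks[labels == g1], ranks[labels == g2]
    se = np.sqrt(base_var * (1 / len(r1) + 1 / len(r2)))
    z = (r1.mean() - r2.mean()) / se
    p_raw = 2 * stats.norm.sf(abs(z))                   # two-sided p-value
    adjusted[(g1, g2)] = min(1.0, p_raw * len(pairs))   # Bonferroni adjustment
    print(f"{g1} vs {g2}: z={z:+.3f}, adjusted p={adjusted[(g1, g2)]:.3f}")
```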

Code

python

stat_kw, p_kw = stats.kruskal(model_a, model_b, model_c)
print(f"Kruskal-Wallis: H={stat_kw:.4f}, p={p_kw:.4f}")

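Kruskal-Wallis: H=3.4734, p=0.1761

p=0.176 → Fail to reject H₀. With only six scores per group, and fold-to-fold variation within each model (a range of about 0.13) much larger than the gaps between models (0.02 to 0.06), the test has little power to separate the three models.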

Friedman Test

Replaces: repeated-measures ANOVA.

When: the SAME subjects (folds) are measured under all k conditions. One-way Kruskal-Wallis ignores the within-block correlation; Friedman exploits it.

Algorithm:

For each block (fold), rank the k observations from 1 to k
R̄ⱼ = mean rank of treatment j across all n blocks
Q = (12n / (k(k+1))) × Σⱼ (R̄ⱼ − (k+1)/2)²

Under H₀, Q ~ χ²(k−1) approximately.
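A sketch of the algorithm with NumPy and `scipy.stats.rankdata`, ranking within each fold (row) and then applying the Q formula above:

```python
import numpy as np
from scipy import stats

model_a = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
model_b = [0.84, 0.81, 0.93, 0.87, 0.80, 0.90]
model_c = [0.78, 0.75, 0.87, 0.81, 0.74, 0.84]

scores = np.column_stack([model_a, model_b, model_c])  # rows = folds, columns = models
n, k = scores.shape

block_ranks = stats.rankdata(scores, axis=1)   # rank the k models within each fold
mean_ranks = block_ranks.mean(axis=0)          # mean rank of each model across folds

q = 12 * n / (k * (k + 1)) * np.sum((mean_ranks - (k + 1) / 2) ** 2)
p = stats.chi2.sf(q, df=k - 1)                 # chi-square approximation
print(f"mean ranks={mean_ranks}, Q={q:.3f}, p={p:.4f}")
```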

Step-by-Step (Folds as Blocks)

Fold	A	B	C	Rank A	Rank B	Rank C
1	0.82	0.84	0.78	2	3	1
2	0.79	0.81	0.75	2	3	1
3	0.91	0.93	0.87	2	3	1
4	0.85	0.87	0.81	2	3	1
5	0.78	0.80	0.74	2	3	1
6	0.88	0.90	0.84	2	3	1

Every fold gives the same ranking: C < A < B — a perfect pattern.

R̄_A = 12/6 = 2.0, R̄_B = 18/6 = 3.0, R̄_C = 6/6 = 1.0
(k+1)/2 = 2.0

Q = (12×6 / (3×4)) × [(2.0−2.0)² + (3.0−2.0)² + (1.0−2.0)²]
  = (72/12) × [0 + 1 + 1]
  = 6 × 2 = 12.000

df = 2. χ²(0.05, 2) = 5.991. Q = 12 >> 5.991 → p = 0.0025. Reject H₀.

The Friedman test is significant where Kruskal-Wallis was not — because Friedman removes the fold-to-fold variability (the between-block variance) from the error term. The consistent C < A < B ranking across every fold provides strong evidence of a real ordering.

Code

python

stat_fr, p_fr = stats.friedmanchisquare(model_a, model_b, model_c)
print(f"Friedman: Q={stat_fr:.4f}, p={p_fr:.4f}")

Friedman: Q=12.0000, p=0.0025

Spearman's Rank Correlation

Replaces: Pearson r when data is non-normal or ordinal.

Spearman ρ is Pearson r computed on the ranks of the values, not the values themselves. It measures monotonic association rather than linear association.
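The "Pearson on ranks" definition is easy to verify. In this sketch `y` is a hypothetical monotonic but non-linear transform of the anchor scores, chosen only to separate monotonic from linear association:

```python
import numpy as np
from scipy import stats

x = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
y = np.exp(10 * x)        # hypothetical: strictly increasing in x, but not linear

rho, _ = stats.spearmanr(x, y)
r_on_ranks, _ = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))
r_raw, _ = stats.pearsonr(x, y)

print(f"Spearman rho:     {rho:.4f}")
print(f"Pearson on ranks: {r_on_ranks:.4f}")
print(f"Pearson on raw:   {r_raw:.4f}")
```

Spearman and Pearson-on-ranks agree and equal 1 because the ordering is preserved, while Pearson on the raw values falls below 1 because the relationship is curved.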

For model_a and model_b, each fold in model_b scores exactly 0.02 above model_a, so their rank orderings are identical — a perfect monotonic relationship.

python

r_s, p_s = stats.spearmanr(model_a, model_b)
print(f"Spearman r: {r_s:.4f}, p={p_s:.4f}")

Spearman r: 1.0000, p=0.0000

Full Spearman treatment and its relationship to Pearson r is in the correlation post of this series.

Full Code (All Tests)

python

import numpy as np
from scipy import stats

accuracy = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
model_a  = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
model_b  = np.array([0.84, 0.81, 0.93, 0.87, 0.80, 0.90])
model_c  = np.array([0.78, 0.75, 0.87, 0.81, 0.74, 0.84])

# One-sample: median > 0.80?
w_stat, w_p = stats.wilcoxon(accuracy - 0.80, alternative='greater')
print(f"Wilcoxon signed-rank: W={w_stat:.1f}, p={w_p:.4f}")

# Two-sample
u_stat, u_p = stats.mannwhitneyu(model_a, model_b, alternative='two-sided')
print(f"Mann-Whitney U: U={u_stat:.1f}, p={u_p:.4f}")

# k-sample
h_stat, h_p = stats.kruskal(model_a, model_b, model_c)
print(f"Kruskal-Wallis: H={h_stat:.4f}, p={h_p:.4f}")

# Repeated measures (blocks = folds)
fr_stat, fr_p = stats.friedmanchisquare(model_a, model_b, model_c)
print(f"Friedman: Q={fr_stat:.4f}, p={fr_p:.4f}")

# Spearman correlation
r_s, p_s = stats.spearmanr(model_a, model_b)
print(f"Spearman r: {r_s:.4f}, p={p_s:.4f}")

Wilcoxon signed-rank: W=17.0, p=0.1094
Mann-Whitney U: U=14.0, p=0.5887
Kruskal-Wallis: H=3.4734, p=0.1761
Friedman: Q=12.0000, p=0.0025
Spearman r: 1.0000, p=0.0000

Parametric vs Non-Parametric Equivalents

Situation	Parametric Test	Non-Parametric Equivalent
One sample vs constant	One-sample t-test	Wilcoxon signed-rank
Two paired groups	Paired t-test	Wilcoxon signed-rank
Two independent groups	Welch's t-test	Mann-Whitney U
k independent groups	One-way ANOVA	Kruskal-Wallis
k related groups	Repeated-measures ANOVA	Friedman test
Correlation	Pearson r	Spearman ρ

Test Your Understanding

Shapiro-Wilk returns p=0.12 on your accuracy data (n=6). A colleague says "just use a t-test, normality isn't rejected." A second colleague says "n=6 is too small, the Shapiro-Wilk test itself has low power for small samples — you can't rely on it to confirm normality." Who is right, and what should you do?
Your Mann-Whitney U test is far from significant (p well above 0.05) for n=6 vs n=6. Is this evidence that the two models perform equally? What would you need to establish equivalence rather than just "no significant difference"?
Kruskal-Wallis is not significant (p≈0.18), but Friedman gives Q=12.0, p≈0.0025 (significant) on the exact same data and same three models. Explain why the Friedman test is more sensitive here. What variance is each test using in its denominator?
Model B consistently scores exactly 0.02 above Model A on every fold (perfect rank correlation). Does this mean Model B is practically better? What additional analysis would you need before recommending Model B in production?
You have 50 fold accuracy scores per model (n=50). Shapiro-Wilk returns p=0.04. A reviewer argues you should use non-parametric tests because normality was rejected. Construct the counter-argument based on the CLT, power considerations, and the cost of discarding magnitude information at n=50.
