Non-Parametric Tests

Apr 14, 2026 · 13 min read · By Mohammed Vasim
Tags: Statistics, Math, Data Science

You run Shapiro-Wilk on your six CV fold scores and get p=0.043. The normality assumption just failed. Your dataset is too small for the Central Limit Theorem to cover you. Every parametric test you've seen so far — t-test, z-test, ANOVA — assumes normality in the underlying distribution or a large enough sample to invoke the CLT. Non-parametric tests replace that assumption with a weaker one: they work on the ranks of the data rather than the raw values. Fewer assumptions, but at a cost.

Decision Framework

Non-parametric tests are the fallback when parametric assumptions fail — not the default.

Assumption check protocol (in order):

  1. Shapiro-Wilk test: if p < 0.05, reject normality
  2. Sample size: if n < 30 (a rule of thumb, not a hard cutoff), the CLT is unlikely to rescue a non-normal distribution
  3. Outliers: if severe outliers exist, the t-test mean is unreliable
  4. Scale: if data is ordinal (not interval/ratio), ranks are more meaningful than means

Figure: decision tree, parametric vs non-parametric. Shapiro-Wilk normal (p ≥ 0.05) → parametric OK. Not normal (p < 0.05) and n ≥ 30 (CLT) → parametric OK. Not normal and n < 30 → non-parametric.
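
The checklist can be written down as a small helper. This is a minimal sketch, not a substitute for looking at the data: the function name `choose_test_family`, its parameters, and the hard cutoffs (α = 0.05, n ≥ 30) are illustrative assumptions that mirror the rules of thumb above.

```python
def choose_test_family(shapiro_p, n, severe_outliers=False, ordinal=False,
                       alpha=0.05, clt_n=30):
    """Return 'parametric' or 'non-parametric' following the checklist.

    All thresholds are rules of thumb (assumptions), not hard statistical laws.
    """
    # Ordinal data or severe outliers: ranks are more trustworthy than means.
    if ordinal or severe_outliers:
        return "non-parametric"
    # Normality not rejected: the parametric test is acceptable.
    if shapiro_p >= alpha:
        return "parametric"
    # Normality rejected: only a large sample lets the CLT cover you.
    return "parametric" if n >= clt_n else "non-parametric"


print(choose_test_family(shapiro_p=0.043, n=6))   # non-parametric
print(choose_test_family(shapiro_p=0.04, n=50))   # parametric
```

The first call is the opening scenario (six folds, p=0.043); the second is the n=50 case where the CLT argument applies.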

The cost of non-parametric tests: they discard magnitude information by replacing raw values with ranks. When the parametric assumptions hold, this costs you statistical power — you need a larger sample to detect the same effect. A non-parametric test is not automatically "safer"; it just trades one set of assumptions for another.

Anchor Datasets

python
import numpy as np
from scipy import stats

# One-sample: does median accuracy exceed 0.80?
accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

# Two-sample: is Model B better than Model A?
model_a = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
model_b = [0.84, 0.81, 0.93, 0.87, 0.80, 0.90]

# k-sample: do three models differ?
model_c = [0.78, 0.75, 0.87, 0.81, 0.74, 0.84]

Wilcoxon Signed-Rank Test

Replaces: one-sample t-test or paired t-test.

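H₀: the population median = η₀. H₁: median > η₀ (or ≠ η₀ two-sided). The test assumes the distribution (or, for paired data, the distribution of the differences) is symmetric about its median.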

Test question: does the median CV fold accuracy significantly exceed 0.80?

Algorithm

  1. Compute differences dᵢ = xᵢ − η₀
  2. Discard differences = 0 (reduce n accordingly)
  3. Rank |dᵢ| from smallest to largest; ties get average ranks
  4. W⁺ = sum of ranks of positive dᵢ; W⁻ = sum of ranks of negative dᵢ
  5. Under H₀: W⁺ and W⁻ should be roughly equal (both ≈ n(n+1)/4)
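
The five steps can be sketched in plain Python. The helper names `average_ranks` and `signed_rank_sums` are made up for this illustration, and rounding the differences to two decimals is an assumption added so that floating-point noise does not hide genuine ties.

```python
def average_ranks(values):
    """Rank values 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def signed_rank_sums(sample, eta0, decimals=2):
    """Return (W_plus, W_minus) for the hypothesised median eta0."""
    # Step 1: differences (rounded: an assumption to make ties explicit)
    diffs = [round(x - eta0, decimals) for x in sample]
    # Step 2: discard zero differences
    diffs = [d for d in diffs if d != 0]
    # Step 3: rank the absolute differences, averaging ties
    ranks = average_ranks([abs(d) for d in diffs])
    # Step 4: sum the ranks by sign
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return w_plus, w_minus


accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
w_plus, w_minus = signed_rank_sums(accuracy, eta0=0.80)
print(w_plus, w_minus)  # 17.5 3.5
```

Step 5 is then a sanity check: the two sums add up to n(n+1)/2, and under H₀ each is expected to be near n(n+1)/4.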

Step-by-Step on the Anchor (η₀ = 0.80)

| Fold | xᵢ | dᵢ = xᵢ − 0.80 | \|dᵢ\| | Rank | Signed Rank |
|------|----|-----------------|------|------|-------------|
| 1 | 0.82 | +0.02 | 0.02 | 2.5 | +2.5 |
| 2 | 0.79 | −0.01 | 0.01 | 1 | −1 |
| 3 | 0.91 | +0.11 | 0.11 | 6 | +6 |
| 4 | 0.85 | +0.05 | 0.05 | 4 | +4 |
| 5 | 0.78 | −0.02 | 0.02 | 2.5 | −2.5 |
| 6 | 0.88 | +0.08 | 0.08 | 5 | +5 |

Folds 1 and 5 are tied at |d| = 0.02 and occupy rank positions 2 and 3, so both get the average rank 2.5.

W⁺ = 2.5 + 6 + 4 + 5 = 17.5 (positive differences dominate)
W⁻ = 1 + 2.5 = 3.5

Check: W⁺ + W⁻ = 21 = n(n+1)/2 = 6×7/2 ✓

Under H₀: E[W⁺] = n(n+1)/4 = 10.5. Our W⁺=17.5 is well above that — evidence the median exceeds 0.80.

Figure: Wilcoxon signed-rank differences on a number line around 0, labeled with their signed ranks (−1, −2.5, +2.5, +4, +5, +6); W⁺ = 17.5, W⁻ = 3.5.

Code

python
import numpy as np
from scipy import stats

accuracy = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
stat, p = stats.wilcoxon(accuracy - 0.80, alternative='greater')
print(f"Wilcoxon signed-rank: W={stat:.1f}, p={p:.4f}")
Wilcoxon signed-rank: W=17.5, p=0.0469

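Taken at face value, p=0.047 < 0.05 rejects H₀ at the 5% level. Treat this as borderline at best. With n=6 there are only 2⁶ = 64 equally likely sign patterns under H₀, and for the tied ranks in the table above 6 of them give W⁺ ≥ 17.5, an exact one-sided p of 6/64 ≈ 0.094. The value SciPy prints depends on how it handles ties (exact distribution vs normal approximation) and on floating-point rounding in accuracy - 0.80, so rerun the test on your own version before claiming significance.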
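
With a sample this small, the null distribution of W⁺ can be enumerated directly as a cross-check. The sketch below takes the tied ranks from the table as given (an assumption: a library working on raw floating-point differences may not see the tie) and is not a reimplementation of `stats.wilcoxon`.

```python
from itertools import product

# Ranks of |d| for folds 1-6, copied from the step-by-step table (ties averaged)
ranks = [2.5, 1.0, 6.0, 4.0, 2.5, 5.0]
w_plus_observed = 17.5

# Under H0 each difference is positive or negative with probability 1/2,
# so all 2^n sign patterns are equally likely.
n_patterns = 2 ** len(ranks)
as_extreme = 0
for signs in product([0, 1], repeat=len(ranks)):
    w_plus = sum(r for r, s in zip(ranks, signs) if s)
    if w_plus >= w_plus_observed:
        as_extreme += 1

p_exact = as_extreme / n_patterns
print(f"{as_extreme}/{n_patterns} sign patterns -> p = {p_exact:.4f}")
# 6/64 sign patterns -> p = 0.0938
```

Because the null distribution has only 64 equally likely outcomes, the p-value moves in steps of 1/64, which is why results near 0.05 on six folds are sensitive to how ties are handled.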

Mann-Whitney U Test

Replaces: independent two-sample t-test (Welch's).

H₀: P(X_A > X_B) = 0.5. H₁: P(X_A > X_B) ≠ 0.5 (two-sided).

Important misconception: Mann-Whitney tests whether one distribution tends to produce larger values than the other — not whether the medians are equal. The median-equality interpretation is valid only when both distributions have the same shape. In practice (different shapes or scales), these hypotheses are different.
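
A tiny made-up example makes the distinction concrete. The two samples below are invented for illustration (they are not from the anchor datasets) and share the same median, yet one tends to produce larger values. The helper `prob_superiority` is a hypothetical name for the quantity in H₀, with ties counted as half.

```python
from statistics import median


def prob_superiority(xs, ys):
    """Estimate P(X > Y) + 0.5 * P(X = Y) over all (x, y) pairs."""
    wins = sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in xs for y in ys)
    return wins / (len(xs) * len(ys))


# Invented samples: identical medians, different shapes
x = [4, 4.5, 5, 20, 30]   # long right tail
y = [1, 2, 5, 5.5, 6]     # long left tail

print(median(x), median(y))      # 5 5
print(prob_superiority(x, y))    # 0.66
```

The medians are equal, but a randomly drawn x beats a randomly drawn y about two times out of three, which is the kind of difference Mann-Whitney is sensitive to.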

Algorithm

  1. Combine all n_A + n_B observations and rank from smallest to largest
  2. R_A = sum of ranks assigned to Group A observations
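  3. U_A = n_A × n_B + n_A(n_A+1)/2 − R_A (with no ties, the number of (A, B) pairs in which the B value is larger)
  4. U_B = n_A × n_B − U_A (the number of pairs in which the A value is larger)
  5. Test statistic: U = min(U_A, U_B) for two-sided test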
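
The algorithm can be sketched in plain Python as follows. The helper names are illustrative, and the pair count at the end is an added cross-check rather than part of the algorithm: with no ties, each U is simply a count of pairwise wins.

```python
def average_ranks(values):
    """Rank values 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def mann_whitney_u(a, b):
    """Return (R_A, U_A, U_B) using the formulas from the steps above."""
    ranks = average_ranks(list(a) + list(b))   # step 1: pooled ranking
    r_a = sum(ranks[:len(a)])                  # step 2: rank sum of group A
    u_a = len(a) * len(b) + len(a) * (len(a) + 1) / 2 - r_a   # step 3
    u_b = len(a) * len(b) - u_a                # step 4
    return r_a, u_a, u_b


model_a = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
model_b = [0.84, 0.81, 0.93, 0.87, 0.80, 0.90]

r_a, u_a, u_b = mann_whitney_u(model_a, model_b)
print(r_a, u_a, u_b, min(u_a, u_b))   # 35.0 22.0 14.0 14.0

# Cross-check: count the pairs in which the A value is larger
a_wins = sum(x > y for x in model_a for y in model_b)
print(a_wins)   # 14
```

Textbooks and libraries differ on which of the two values they label as the statistic for group A, so it is worth checking the convention before comparing a hand calculation with software output.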

Step-by-Step on the Anchor

All 12 values sorted with group labels:

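| Rank | Value | Group | Rank | Value | Group |
|------|-------|-------|------|-------|-------|
| 1 | 0.78 | A | 7 | 0.85 | A |
| 2 | 0.79 | A | 8 | 0.87 | B |
| 3 | 0.80 | B | 9 | 0.88 | A |
| 4 | 0.81 | B | 10 | 0.90 | B |
| 5 | 0.82 | A | 11 | 0.91 | A |
| 6 | 0.84 | B | 12 | 0.93 | B |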

R_A = 1+2+5+7+9+11 = 35
R_B = 3+4+6+8+10+12 = 43

U_A = 6×6 + 6×7/2 − 35 = 36 + 21 − 35 = 22
U_B = 36 − 22 = 14
U = min(22, 14) = 14

Figure: Mann-Whitney combined rank plot (A = blue, B = green). Group A holds ranks 1, 2, 5, 7, 9, 11; Group B holds ranks 3, 4, 6, 8, 10, 12.

Effect size: r = Z / √(n_A + n_B), using the normal approximation of U:

E[U] = n_A n_B / 2 = 18
SD[U] = √(n_A n_B (n_A+n_B+1)/12) = √39 ≈ 6.245
Z = (U_A − 18) / 6.245 = 4 / 6.245 ≈ 0.641
r = 0.641 / √12 ≈ 0.185 (small effect)
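
The same arithmetic as a short function. The name `mw_effect_size` is hypothetical, and the sketch assumes no ties and no continuity correction, matching the hand calculation.

```python
import math


def mw_effect_size(u, n_a, n_b):
    """Return (Z, r) from the normal approximation of U (no tie correction)."""
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u - mean_u) / sd_u
    r = z / math.sqrt(n_a + n_b)
    return z, r


z, r = mw_effect_size(u=22, n_a=6, n_b=6)
print(f"Z={z:.3f}, r={r:.3f}")   # Z=0.641, r=0.185
```

Passing the other U value (14) flips the sign of both Z and r but leaves their magnitude unchanged, so the effect size does not depend on which group is labeled first.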

Code

python
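stat_u, p_mw = stats.mannwhitneyu(model_a, model_b, alternative='two-sided')
print(f"Mann-Whitney U: U={stat_u:.1f}, p={p_mw:.4f}")
Mann-Whitney U: U=14.0, p=0.5887

SciPy reports the U statistic for its first argument, computed as R_A − n_A(n_A+1)/2 = 35 − 21 = 14, which here equals min(U_A, U_B) from the hand calculation. With no ties and n=6 per group, recent SciPy versions (method='auto') take the p-value from the exact null distribution: 2 × 272/924 ≈ 0.589.

p=0.589 → Fail to reject H₀. The two models do not show a statistically significant difference in fold scores at n=6.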

Kruskal-Wallis Test

Replaces: one-way ANOVA.

H₀: all k populations have the same distribution. H₁: at least one differs.

The analogy to ANOVA: H is the rank-based analogue of the F-statistic — it measures how much the group mean ranks deviate from the overall mean rank, weighted by group size.

Formula

H = (12 / (N(N+1))) × Σᵢ nᵢ (R̄ᵢ − R̄)²

where R̄ = (N+1)/2 is the overall mean rank and R̄ᵢ is the mean rank of group i. Under H₀, H ~ χ²(k−1) approximately (for nᵢ ≥ 5).

Step-by-Step on All Three Models (N=18)

All 18 values sorted with ties handled by average ranks:

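| Rank | Value | Group | Rank | Value | Group |
|------|-------|-------|------|-------|-------|
| 1 | 0.74 | C | 10.5 | 0.84 | B |
| 2 | 0.75 | C | 10.5 | 0.84 | C |
| 3.5 | 0.78 | A | 12 | 0.85 | A |
| 3.5 | 0.78 | C | 13.5 | 0.87 | B |
| 5 | 0.79 | A | 13.5 | 0.87 | C |
| 6 | 0.80 | B | 15 | 0.88 | A |
| 7.5 | 0.81 | B | 16 | 0.90 | B |
| 7.5 | 0.81 | C | 17 | 0.91 | A |
| 9 | 0.82 | A | 18 | 0.93 | B |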

R̄_A = (3.5+5+9+12+15+17)/6 = 61.5/6 = 10.25
R̄_B = (6+7.5+10.5+13.5+16+18)/6 = 71.5/6 ≈ 11.917
R̄_C = (1+2+3.5+7.5+10.5+13.5)/6 = 38/6 ≈ 6.333
Overall R̄ = (18+1)/2 = 9.5

H = (12/(18×19)) × [6×(10.25−9.5)² + 6×(11.917−9.5)² + 6×(6.333−9.5)²]
  = 0.03509 × [6×0.5625 + 6×5.840 + 6×10.028]
  = 0.03509 × [3.375 + 35.04 + 60.17]
  = 0.03509 × 98.58
  ≈ 3.459

df = k − 1 = 2. χ²(0.05, 2) = 5.991. Since H ≈ 3.459 < 5.991 → fail to reject. This hand calculation omits the tie correction; SciPy divides H by a tie-correction factor, so its reported H can differ slightly.
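
The formula can be checked with a few lines of plain Python. This is a sketch under stated assumptions: `kruskal_wallis` is a made-up helper that returns H both without and with the standard tie correction 1 − Σ(t³ − t)/(N³ − N), and the p-value shortcut e^(−H/2) is only valid for df = 2.

```python
import math


def average_ranks(values):
    """Rank values 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def kruskal_wallis(groups):
    """Return (H without tie correction, H with tie correction)."""
    pooled = [x for g in groups for x in g]
    n_total = len(pooled)
    ranks = average_ranks(pooled)
    overall_mean_rank = (n_total + 1) / 2
    h, start = 0.0, 0
    for g in groups:
        mean_rank = sum(ranks[start:start + len(g)]) / len(g)
        h += len(g) * (mean_rank - overall_mean_rank) ** 2
        start += len(g)
    h *= 12 / (n_total * (n_total + 1))
    # Tie correction: count how many times each value occurs in the pool
    counts = {}
    for x in pooled:
        counts[x] = counts.get(x, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in counts.values()) / (n_total ** 3 - n_total)
    return h, h / correction


model_a = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
model_b = [0.84, 0.81, 0.93, 0.87, 0.80, 0.90]
model_c = [0.78, 0.75, 0.87, 0.81, 0.74, 0.84]

h_raw, h_tie = kruskal_wallis([model_a, model_b, model_c])
print(f"{h_raw:.3f} {h_tie:.3f}")   # 3.459 3.473

# Chi-square tail probability; this closed form holds for df = 2 only
p_tie = math.exp(-h_tie / 2)
print(f"{p_tie:.3f}")
```

With four tied pairs out of 18 values the correction is small, and either version of H stays well below the 5.991 critical value.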

Post-Hoc: Dunn's Test

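Post-hoc pairwise comparisons after a significant Kruskal-Wallis typically use Dunn's test: it compares the groups' mean ranks from the pooled Kruskal-Wallis ranking (rather than re-ranking each pair, as pairwise Mann-Whitney tests would) and applies a multiple-comparison correction.

python
import pandas as pd
from scikit_posthocs import posthoc_dunn

# Long format: one row per observation, with its model label
df = pd.DataFrame({"score": list(model_a) + list(model_b) + list(model_c),
                   "model": ['A']*6 + ['B']*6 + ['C']*6})
result = posthoc_dunn(df, val_col="score", group_col="model", p_adjust='bonferroni')
print(result)
# Bonferroni-corrected pairwise p-values (run only after significant KW):
#           A         B         C
# A  1.000000  1.000000  0.612023
# B  1.000000  1.000000  0.210434
# C  0.612023  0.210434  1.000000
# All pairs non-significant (consistent with KW p=0.177)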
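
To show what a Dunn-style comparison computes, here is a plain-Python sketch of the z-statistic on mean ranks with a Bonferroni adjustment. It assumes the textbook form of the standard error with a tie term and takes the mean ranks and tie counts from the Kruskal-Wallis table above; scikit-posthocs may differ in implementation details, so treat the numbers as approximate.

```python
import math
from itertools import combinations

# Mean ranks and sizes from the pooled Kruskal-Wallis ranking (N = 18)
mean_rank = {"A": 61.5 / 6, "B": 71.5 / 6, "C": 38 / 6}
n = {"A": 6, "B": 6, "C": 6}
N = 18
tie_sizes = [2, 2, 2, 2]   # 0.78, 0.81, 0.84 and 0.87 each appear twice

# Variance term shared by every pair (tie-adjusted)
tie_term = sum(t ** 3 - t for t in tie_sizes) / (12 * (N - 1))
base_var = N * (N + 1) / 12 - tie_term

pairs = list(combinations(mean_rank, 2))
p_adjusted = {}
for i, j in pairs:
    se = math.sqrt(base_var * (1 / n[i] + 1 / n[j]))
    z = (mean_rank[i] - mean_rank[j]) / se
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))   # normal tail, both sides
    # Bonferroni: multiply by the number of comparisons, cap at 1
    p_adjusted[(i, j)] = min(1.0, p_two_sided * len(pairs))

for pair, p in p_adjusted.items():
    print(pair, round(p, 3))
```

The key design point is that every pair reuses the ranks from the single pooled ranking, which keeps the pairwise comparisons consistent with the omnibus test.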

Code

python
stat_kw, p_kw = stats.kruskal(model_a, model_b, model_c)
print(f"Kruskal-Wallis: H={stat_kw:.4f}, p={p_kw:.4f}")
Kruskal-Wallis: H=3.4596, p=0.1772

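p=0.177 → Fail to reject H₀. With n=6 per group, the test has little power to detect differences this small relative to the fold-to-fold spread, so a non-significant result is not evidence that the models perform equally.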

Friedman Test

Replaces: repeated-measures ANOVA.

When: the SAME subjects (folds) are measured under all k conditions. One-way Kruskal-Wallis ignores the within-block correlation; Friedman exploits it.

Algorithm:

  1. For each block (fold), rank the k observations from 1 to k
  2. R̄ⱼ = mean rank of treatment j across all n blocks
  3. Q = (12n / (k(k+1))) × Σⱼ (R̄ⱼ − (k+1)/2)²

Under H₀, Q ~ χ²(k−1) approximately.
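
A plain-Python sketch of the three steps, using the anchor datasets with folds as blocks. The helper name `friedman_q` is made up, the sketch applies no tie correction (there are no within-fold ties here), and the p-value shortcut e^(−Q/2) is valid for df = 2 only.

```python
import math


def average_ranks(values):
    """Rank values 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def friedman_q(blocks):
    """blocks: one row per block, one column per treatment."""
    n, k = len(blocks), len(blocks[0])
    rank_sums = [0.0] * k
    for row in blocks:                      # step 1: rank within each block
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    mean_ranks = [s / n for s in rank_sums]  # step 2: mean rank per treatment
    q = 12 * n / (k * (k + 1)) * sum((m - (k + 1) / 2) ** 2 for m in mean_ranks)
    return mean_ranks, q                     # step 3


model_a = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
model_b = [0.84, 0.81, 0.93, 0.87, 0.80, 0.90]
model_c = [0.78, 0.75, 0.87, 0.81, 0.74, 0.84]

blocks = list(zip(model_a, model_b, model_c))   # one (A, B, C) row per fold
mean_ranks, q = friedman_q(blocks)
print(mean_ranks, q)                 # [2.0, 3.0, 1.0] 12.0
print(f"{math.exp(-q / 2):.4f}")     # 0.0025
```

Ranking inside each fold is what separates this from Kruskal-Wallis: a fold that is hard for every model no longer drags all three scores down the pooled ranking.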

Step-by-Step (Folds as Blocks)

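| Fold | A | B | C | Rank A | Rank B | Rank C |
|------|------|------|------|--------|--------|--------|
| 1 | 0.82 | 0.84 | 0.78 | 2 | 3 | 1 |
| 2 | 0.79 | 0.81 | 0.75 | 2 | 3 | 1 |
| 3 | 0.91 | 0.93 | 0.87 | 2 | 3 | 1 |
| 4 | 0.85 | 0.87 | 0.81 | 2 | 3 | 1 |
| 5 | 0.78 | 0.80 | 0.74 | 2 | 3 | 1 |
| 6 | 0.88 | 0.90 | 0.84 | 2 | 3 | 1 |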

Every fold gives the same ranking: C < A < B — a perfect pattern.

R̄_A = 12/6 = 2.0
R̄_B = 18/6 = 3.0
R̄_C = 6/6 = 1.0
(k+1)/2 = 2.0

Q = (12×6 / (3×4)) × [(2.0−2.0)² + (3.0−2.0)² + (1.0−2.0)²] = (72/12) × [0 + 1 + 1] = 6 × 2 = 12.000

df = 2. χ²(0.05, 2) = 5.991. Q = 12 >> 5.991 → p ≈ 0.0025 (for df = 2 the χ² tail probability is e^(−Q/2) = e^(−6)). Reject H₀.

The Friedman test is significant where Kruskal-Wallis was not — because Friedman removes the fold-to-fold variability (the between-block variance) from the error term. The consistent C < A < B ranking across every fold provides strong evidence of a real ordering.
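
One way to see the fold effect is to subtract each fold's mean score from its three model scores. The sketch below is only an illustration of the blocking idea (it is not part of either test), and the rounding is an assumption added to suppress floating-point noise.

```python
model_a = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
model_b = [0.84, 0.81, 0.93, 0.87, 0.80, 0.90]
model_c = [0.78, 0.75, 0.87, 0.81, 0.74, 0.84]

# Remove the fold (block) effect: centre each fold on its own mean
centered = []
for a, b, c in zip(model_a, model_b, model_c):
    fold_mean = (a + b + c) / 3
    centered.append(tuple(round(x - fold_mean, 3) for x in (a, b, c)))

# Every fold shows the same within-fold pattern once its mean is removed
print(set(centered))   # {(0.007, 0.027, -0.033)}

# Fold-to-fold spread of one model vs the largest gap between models
fold_spread = round(max(model_a) - min(model_a), 2)
model_gap = round(model_b[0] - model_c[0], 2)
print(fold_spread, model_gap)   # 0.13 0.06
```

A pooled ranking has to find a 0.06 gap inside 0.13 of fold-to-fold variation, while a within-fold ranking sees only the perfectly consistent pattern.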

Code

python
stat_fr, p_fr = stats.friedmanchisquare(model_a, model_b, model_c)
print(f"Friedman: Q={stat_fr:.4f}, p={p_fr:.4f}")
Friedman: Q=12.0000, p=0.0025

Spearman's Rank Correlation

Replaces: Pearson r when data is non-normal or ordinal.

Spearman ρ is Pearson r computed on the ranks of the values, not the values themselves. It measures monotonic association rather than linear association.

For model_a and model_b, each fold in model_b scores exactly 0.02 above model_a, so their rank orderings are identical — a perfect monotonic relationship.

python
r_s, p_s = stats.spearmanr(model_a, model_b)
print(f"Spearman r: {r_s:.4f}, p={p_s:.4f}")
Spearman r: 1.0000, p=0.0000
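
The "Pearson on ranks" definition can be verified in plain Python. The helpers below are illustrative, and the cubic example is invented to show the monotonic-versus-linear distinction.

```python
import math


def average_ranks(values):
    """Rank values 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's rho: Pearson r applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


model_a = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
model_b = [0.84, 0.81, 0.93, 0.87, 0.80, 0.90]
print(spearman(model_a, model_b))   # 1.0

# Invented monotonic but non-linear relationship: y = x^3
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
print(round(pearson(x, y), 3), spearman(x, y))   # 0.943 1.0
```

In the cubic case Pearson r falls below 1 because the relationship is not a straight line, while Spearman ρ stays at 1 because the ordering is preserved.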

The full treatment of Spearman's ρ and its relationship to Pearson r is in the correlation post of this series.

Full Code (All Tests)

python
import numpy as np
from scipy import stats

accuracy = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
model_a  = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
model_b  = np.array([0.84, 0.81, 0.93, 0.87, 0.80, 0.90])
model_c  = np.array([0.78, 0.75, 0.87, 0.81, 0.74, 0.84])

# One-sample: median > 0.80?
w_stat, w_p = stats.wilcoxon(accuracy - 0.80, alternative='greater')
print(f"Wilcoxon signed-rank: W={w_stat:.1f}, p={w_p:.4f}")

# Two-sample
u_stat, u_p = stats.mannwhitneyu(model_a, model_b, alternative='two-sided')
print(f"Mann-Whitney U: U={u_stat:.1f}, p={u_p:.4f}")

# k-sample
h_stat, h_p = stats.kruskal(model_a, model_b, model_c)
print(f"Kruskal-Wallis: H={h_stat:.4f}, p={h_p:.4f}")

# Repeated measures (blocks = folds)
fr_stat, fr_p = stats.friedmanchisquare(model_a, model_b, model_c)
print(f"Friedman: Q={fr_stat:.4f}, p={fr_p:.4f}")

# Spearman correlation
r_s, p_s = stats.spearmanr(model_a, model_b)
print(f"Spearman r: {r_s:.4f}, p={p_s:.4f}")
Wilcoxon signed-rank: W=17.5, p=0.0469
Mann-Whitney U: U=14.0, p=0.5887
Kruskal-Wallis: H=3.4596, p=0.1772
Friedman: Q=12.0000, p=0.0025
Spearman r: 1.0000, p=0.0000

Parametric vs Non-Parametric Equivalents

| Situation | Parametric Test | Non-Parametric Equivalent |
|-----------|-----------------|---------------------------|
| One sample vs constant | One-sample t-test | Wilcoxon signed-rank |
| Two paired groups | Paired t-test | Wilcoxon signed-rank |
| Two independent groups | Welch's t-test | Mann-Whitney U |
| k independent groups | One-way ANOVA | Kruskal-Wallis |
| k related groups | Repeated-measures ANOVA | Friedman test |
| Correlation | Pearson r | Spearman ρ |

Test Your Understanding

  1. Shapiro-Wilk returns p=0.12 on your accuracy data (n=6). A colleague says "just use a t-test, normality isn't rejected." A second colleague says "n=6 is too small, the Shapiro-Wilk test itself has low power for small samples — you can't rely on it to confirm normality." Who is right, and what should you do?

  2. Your Mann-Whitney U test returns a clearly non-significant p-value for n=6 vs n=6. Is this evidence that the two models perform equally? What would you need to establish equivalence rather than just "no significant difference"?

  3. Kruskal-Wallis gives H=3.46, p=0.177 (not significant), but Friedman gives Q=12.0, p=0.003 (significant) on the exact same data and same three models. Explain why the Friedman test is more sensitive here. What variance is each test using in its denominator?

  4. Model B consistently scores exactly 0.02 above Model A on every fold (perfect rank correlation). Does this mean Model B is practically better? What additional analysis would you need before recommending Model B in production?

  5. You have 50 fold accuracy scores per model (n=50). Shapiro-Wilk returns p=0.04. A reviewer argues you should use non-parametric tests because normality was rejected. Construct the counter-argument based on the CLT, power considerations, and the cost of discarding magnitude information at n=50.
