ANOVA Assumptions
Every statistical method assumes something about the world. ANOVA assumes three things about your data: independence, normality of residuals, and homogeneity of variance across groups. Use ANOVA when these hold, or when you have enough data that moderate departures from them stop mattering. Violate them without checking, and your F-statistic might be garbage.
Here is what each assumption actually means for model evaluation data, how to test each one, and what to do when they fail.
The Dataset
Continuing the architecture comparison from post 14: CNN, ResNet, and Transformer models, each evaluated on 5 random seeds. The residuals (differences between each observation and its group mean) are what the normality and variance assumptions apply to.
CNN residuals: [-0.003, +0.006, -0.007, -0.001, +0.005]
ResNet residuals: [-0.005, +0.013, -0.014, +0.001, +0.005]
Transformer residuals: [-0.006, +0.006, 0.000, +0.011, -0.011]
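In symbols, the setup can be sketched as follows. The notation ($y_{ij}$ for the score of architecture $i$ on seed $j$, $\mu_i$ for the true group mean, $\varepsilon_{ij}$ for the error term) is introduced here for illustration and may differ from the notation in post 14:

```latex
\begin{align}
y_{ij} &= \mu_i + \varepsilon_{ij}, \qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2) \text{ independently} \\
e_{ij} &= y_{ij} - \bar{y}_{i\cdot}, \qquad \bar{y}_{i\cdot} = \frac{1}{n_i} \sum_{j=1}^{n_i} y_{ij}
\end{align}
```

The first line packs all three assumptions into one statement: the errors are independent, Normal, and share a single variance $\sigma^2$. The second line defines the observed residuals $e_{ij}$ used in the checks. For example, the first CNN score is 0.782 and the CNN mean is 0.785, which gives the first residual in the list above: $0.782 - 0.785 = -0.003$.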
The Three Assumptions
1. Independence
Observations must be independent of each other. Each seed produces one score per architecture. As long as seed 1 for CNN does not influence seed 2 for CNN, independence holds.
What breaks independence for model evaluation:
- Using the same data split across experiments (the scores are not truly from independent random seeds)
- Using k-fold cross-validation without correction (the training sets of different folds overlap, so the fold scores are correlated)
- Temporal autocorrelation in streaming data experiments
Independence cannot be tested statistically — it must be ensured through careful experimental design.
2. Normality
Residuals should be approximately normally distributed. This matters most with small samples like the 5 seeds per architecture here.
ANOVA is fairly robust to moderate violations, especially with equal or near-equal sample sizes, because the CLT helps. With $n = 5$ per group and unknown distributions, normality matters more than it would with larger groups.
3. Homogeneity of Variance (Homoscedasticity)
All groups should have equal variances:
$$\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_k^2$$
This is the most critical assumption when sample sizes differ across groups. With equal $n$, ANOVA is fairly robust to moderate variance differences.
Checking Normality: Shapiro-Wilk Test
The Shapiro-Wilk test is the standard normality test for small samples. It tests the null hypothesis that the residuals are normally distributed.
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

cnn_scores = np.array([0.782, 0.791, 0.778, 0.784, 0.790])
resnet_scores = np.array([0.831, 0.849, 0.822, 0.837, 0.841])
trans_scores = np.array([0.801, 0.813, 0.807, 0.818, 0.796])

# Compute residuals: each score minus its own group mean
residuals = np.concatenate([
    cnn_scores - cnn_scores.mean(),
    resnet_scores - resnet_scores.mean(),
    trans_scores - trans_scores.mean()
])

stat, p = stats.shapiro(residuals)
print(f"Shapiro-Wilk: W={stat:.4f}, p={p:.4f}")
print(f"{'Residuals appear normal (fail to reject H0)' if p > 0.05 else 'Evidence of non-normality'}")

# Q-Q plot: sample quantiles of the residuals against theoretical Normal quantiles
fig, ax = plt.subplots()
(osm, osr), (slope, intercept, r) = stats.probplot(residuals, dist="norm")
ax.plot(osm, osr, 'o', color='#3b82f6')
ax.plot(osm, slope * np.array(osm) + intercept, color='#dc2626', linewidth=2)
ax.set_xlabel('Theoretical Quantiles')
ax.set_ylabel('Sample Quantiles')
ax.set_title('Q-Q Plot for Normality')
plt.tight_layout()
plt.savefig('qq_plot.png', dpi=120)
plt.show()
```

```
Shapiro-Wilk: W=0.9523, p=0.4441
Residuals appear normal (fail to reject H0)
```
Interpretation: $p = 0.44 > 0.05$, so we fail to reject normality. The residuals are consistent with a Normal distribution. Points on the Q-Q plot should fall close to the diagonal line; curvature indicates non-normality.
What to do if normality fails:
- Transformation: Log, square root, or Box-Cox transformation can symmetrize skewed residuals. For performance metrics bounded between 0 and 1, the logit transform ($\log(p / (1 - p))$) often helps.
- Kruskal-Wallis: The non-parametric alternative, which does not assume normality (see code below).
- Increase sample size: The CLT progressively relaxes the normality requirement as $n$ grows.
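As a rough sketch of the transformation route, the logit can be applied to the scores and the normality check repeated on the transformed scale. The snippet uses scipy.special.logit and expit; the variable names are chosen for this example only, and the data already passes the check, so this only shows the mechanics.

```python
import numpy as np
from scipy import stats
from scipy.special import logit, expit

# Scores copied from the article's dataset, one array per architecture
groups = {
    "CNN": np.array([0.782, 0.791, 0.778, 0.784, 0.790]),
    "ResNet": np.array([0.831, 0.849, 0.822, 0.837, 0.841]),
    "Transformer": np.array([0.801, 0.813, 0.807, 0.818, 0.796]),
}

# logit(p) = log(p / (1 - p)) maps the bounded (0, 1) scale onto the real line
transformed = {name: logit(scores) for name, scores in groups.items()}

# Residuals on the transformed scale: each value minus its own group mean
logit_residuals = np.concatenate([t - t.mean() for t in transformed.values()])

w_stat, p_value = stats.shapiro(logit_residuals)
print(f"Shapiro-Wilk on logit scale: W={w_stat:.4f}, p={p_value:.4f}")

# expit inverts logit, so transformed values can be mapped back to the 0-1 scale
back = expit(transformed["CNN"])
print("Round trip recovers the CNN scores:", np.allclose(back, groups["CNN"]))
```

If a transformation is used, the ANOVA and every assumption check should be run on the transformed values, and the conclusions then apply to that scale.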
Checking Homogeneity of Variance: Levene's Test
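Levene's test is more robust to non-normality than Bartlett's test. The Brown-Forsythe variant (using medians instead of means) is even more robust. Note that scipy.stats.levene centers on the median by default (center='median'), so the call below is already the Brown-Forsythe variant; pass center='mean' for the original Levene test.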
```python
stat, p = stats.levene(cnn_scores, resnet_scores, trans_scores)
print(f"Levene's test: F={stat:.4f}, p={p:.4f}")
print(f"{'Variances appear equal (fail to reject H0)' if p > 0.05 else 'Variances differ significantly'}")

# Check variances directly (ddof=1 gives the unbiased sample variance)
for name, scores in [("CNN", cnn_scores), ("ResNet", resnet_scores), ("Transformer", trans_scores)]:
    var = scores.var(ddof=1)
    print(f" {name}: var={var:.8f}, std={np.sqrt(var):.5f}")

# Max/min variance ratio (rule of thumb: <3 is acceptable)
variances = [cnn_scores.var(ddof=1), resnet_scores.var(ddof=1), trans_scores.var(ddof=1)]
ratio = max(variances) / min(variances)
print(f"Max/Min variance ratio: {ratio:.2f} (should be < 3)")
```

```
Levene's test: F=0.6507, p=0.5392
Variances appear equal (fail to reject H0)
 CNN: var=0.00003000, std=0.00548
 ResNet: var=0.00010400, std=0.01020
 Transformer: var=0.00007850, std=0.00886
Max/Min variance ratio: 3.47 (should be < 3)
```

Interpretation: Levene's test is non-significant ($p = 0.54$), so there is no statistical evidence that the variances differ. However, the max/min variance ratio of about 3.5 is above the rule-of-thumb threshold of 3. With only 5 observations per group, Levene's test has low power to detect variance differences, so the ratio suggests some caution is warranted.
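To see how much the choice of variance test matters on this dataset, the three tests mentioned above can be run side by side. This is a small sketch using SciPy's levene (with its center argument) and bartlett; the labels and variable names are made up for the example.

```python
import numpy as np
from scipy import stats

cnn = np.array([0.782, 0.791, 0.778, 0.784, 0.790])
resnet = np.array([0.831, 0.849, 0.822, 0.837, 0.841])
trans = np.array([0.801, 0.813, 0.807, 0.818, 0.796])

# center="mean" is the original Levene test; center="median" (SciPy's default)
# is the Brown-Forsythe variant; Bartlett's test assumes Normal data in each group
lev_mean = stats.levene(cnn, resnet, trans, center="mean")
lev_median = stats.levene(cnn, resnet, trans, center="median")
bartlett = stats.bartlett(cnn, resnet, trans)

results = [
    ("Levene (mean-centered)", lev_mean),
    ("Brown-Forsythe (median)", lev_median),
    ("Bartlett", bartlett),
]
for label, res in results:
    print(f"{label:<26} stat={res.statistic:.4f}, p={res.pvalue:.4f}")
```

With five observations per group, none of the three is likely to reject, which is a statement about power rather than evidence that the variances are equal.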
What to do if variance homogeneity fails:
- Welch's ANOVA: Does not assume equal variances. Use this as the default when variance homogeneity is questionable.
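- Log or Box-Cox transformation: Stabilizes variance when the spread grows with the mean (the log transform fits the case where the standard deviation is proportional to the mean).
- Kruskal-Wallis: Non-parametric and does not assume normality, though very different spreads across groups still complicate its interpretation.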
```python
# Standard one-way ANOVA via statsmodels (pooled variance, so this is not Welch's ANOVA)
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Build long-format DataFrame
arch = ["CNN"] * 5 + ["ResNet"] * 5 + ["Transformer"] * 5
scores = np.concatenate([cnn_scores, resnet_scores, trans_scores])
df = pd.DataFrame({'Architecture': arch, 'F1': scores})

model = ols('F1 ~ C(Architecture)', data=df).fit()
print(anova_lm(model))

# Kruskal-Wallis (non-parametric alternative when assumptions fail)
H_stat, p_kruskal = stats.kruskal(cnn_scores, resnet_scores, trans_scores)
print(f"\nKruskal-Wallis (no normality assumption): H={H_stat:.4f}, p={p_kruskal:.4f}")
```

```
                  df    sum_sq   mean_sq        F   PR(>F)
C(Architecture)  2.0  0.006543  0.003272  46.1882  2.3e-06
Residual        12.0  0.000850  0.000071      NaN      NaN

Kruskal-Wallis (no normality assumption): H=12.5000, p=0.0019
```
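Welch's ANOVA itself can be computed directly from the group means and variances. The following is a minimal sketch rather than a library implementation: welch_anova is a hypothetical helper written for this post, using the standard Welch weights (group size divided by group variance) and the usual approximation for the denominator degrees of freedom.

```python
import numpy as np
from scipy import stats

def welch_anova(*groups):
    """Welch's one-way ANOVA. Returns (F, df1, df2, p). Hypothetical helper."""
    k = len(groups)
    n = np.array([len(g) for g in groups], dtype=float)
    means = np.array([np.mean(g) for g in groups])
    variances = np.array([np.var(g, ddof=1) for g in groups])

    # Each group is weighted by its precision: n_i / s_i^2
    w = n / variances
    w_total = w.sum()
    weighted_mean = (w * means).sum() / w_total

    # Correction term that also drives the denominator degrees of freedom
    lam = (((1 - w / w_total) ** 2) / (n - 1)).sum()

    numerator = (w * (means - weighted_mean) ** 2).sum() / (k - 1)
    denominator = 1 + 2 * (k - 2) / (k ** 2 - 1) * lam

    f_stat = numerator / denominator
    df1 = k - 1
    df2 = (k ** 2 - 1) / (3 * lam)
    p = stats.f.sf(f_stat, df1, df2)
    return f_stat, df1, df2, p

cnn = np.array([0.782, 0.791, 0.778, 0.784, 0.790])
resnet = np.array([0.831, 0.849, 0.822, 0.837, 0.841])
trans = np.array([0.801, 0.813, 0.807, 0.818, 0.796])

f_stat, df1, df2, p = welch_anova(cnn, resnet, trans)
print(f"Welch's ANOVA: F({df1}, {df2:.2f}) = {f_stat:.2f}, p = {p:.6f}")
```

The denominator degrees of freedom come out below the 12 used by standard ANOVA, which is the price paid for not pooling the variances. Recent statsmodels releases appear to offer the same test as statsmodels.stats.oneway.anova_oneway with use_var="unequal"; check the documentation of the installed version before relying on it.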
How Robust Is ANOVA?
Relatively robust to:
- Moderate non-normality, especially with balanced designs and reasonably large groups
- Slight variance heterogeneity with equal $n$
- Non-normality in large samples (the CLT progressively relaxes the normality requirement)
Not robust to:
- Severe non-normality with small samples per group
- Large variance differences combined with unequal $n$; this combination can seriously inflate Type I error
- Outliers, which inflate the within-group mean square and reduce power
- Independence violations: no robustness here at all
The most dangerous scenario for ANOVA: small unequal samples, unequal variances, and non-Normal residuals simultaneously.
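The inflation from unequal variances plus unequal sample sizes is easy to see in a small simulation. This sketch assumes Normal data with identical group means, so every rejection is a false positive; the group sizes and standard deviations are illustrative choices, not values from the architecture experiment, and type_i_error_rate is a hypothetical helper.

```python
import numpy as np
from scipy import stats

def type_i_error_rate(sizes, sds, n_sims=2000, alpha=0.05, seed=0):
    """Fraction of simulations in which standard ANOVA rejects a true null."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        # All groups share the same mean (0), so any rejection is a Type I error
        samples = [rng.normal(0.0, sd, size=n) for n, sd in zip(sizes, sds)]
        _, p = stats.f_oneway(*samples)
        rejections += int(p < alpha)
    return rejections / n_sims

# Baseline: equal sizes and equal variances, should sit near the nominal 5%
balanced = type_i_error_rate(sizes=(15, 15, 15), sds=(1.0, 1.0, 1.0))

# Problem case: the smallest group has the largest variance
unbalanced = type_i_error_rate(sizes=(5, 15, 30), sds=(3.0, 1.0, 1.0))

print(f"Balanced, equal variances:     {balanced:.3f}")
print(f"Unequal n, unequal variances:  {unbalanced:.3f}")
```

The direction matters: putting the large variance in the smallest group makes the test too liberal, while putting it in the largest group tends to make the test too conservative instead.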
Q-Q Plot Interpretation
Q-Q (quantile-quantile) plots are the most informative visual check for normality:
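- Points on diagonal: Residuals are consistent with a Normal distribution
- S-curve (lower tail above the line, upper tail below it): Light-tailed distribution (residuals concentrated near the mean)
- Reverse S-curve (lower tail below the line, upper tail above it): Heavy-tailed distribution (more extreme residuals than Normal)
- Upper tail bends above the line: Right skew
- Lower tail bends below the line: Left skew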
For the architecture data with only 15 residuals total, the Q-Q plot has limited resolution but still shows whether any residual is an obvious outlier.
The Practical Workflow
Before running ANOVA on any model evaluation dataset:
- Plot box plots by group — reveals skewness, outliers, and variance differences visually
- Run Levene's test — formal check of variance homogeneity; use Welch's ANOVA if it rejects
- Compute residuals and run Shapiro-Wilk — formal normality check
- Plot Q-Q plot of residuals — visual normality check
- If both tests pass: proceed with standard ANOVA
- If either fails: use Welch's ANOVA (variance) or Kruskal-Wallis (non-normality), or apply transformation
Report assumption checks in your analysis — it builds credibility and forces you to think about what could go wrong.
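The checklist can be condensed into a small decision helper. This is a sketch under stated assumptions: recommend_test is a hypothetical function, the 0.05 threshold is a convention rather than a rule, and the variance check is applied first so that unequal spreads are not mistaken for non-normality in the pooled residuals. It does not replace looking at the plots.

```python
import numpy as np
from scipy import stats

def recommend_test(groups, alpha=0.05):
    """Hypothetical helper mirroring the checklist; returns the name of a test."""
    # Variance homogeneity (SciPy's levene centers on the median by default)
    _, p_levene = stats.levene(*groups)
    if p_levene <= alpha:
        return "Welch's ANOVA"

    # Normality of the pooled residuals (each value minus its own group mean)
    residuals = np.concatenate([np.asarray(g) - np.mean(g) for g in groups])
    _, p_shapiro = stats.shapiro(residuals)
    if p_shapiro <= alpha:
        return "Kruskal-Wallis (or transform and re-check)"

    return "standard ANOVA"

architecture_scores = [
    np.array([0.782, 0.791, 0.778, 0.784, 0.790]),  # CNN
    np.array([0.831, 0.849, 0.822, 0.837, 0.841]),  # ResNet
    np.array([0.801, 0.813, 0.807, 0.818, 0.796]),  # Transformer
]
choice = recommend_test(architecture_scores)
print(f"Recommended: {choice}")
```

A passing result from this helper on very small groups says little, since both tests have low power there.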
Related Concepts
ANOVA assumptions connect directly to the normality assumption in the t-test (post 7) and the equal-variance assumption that motivates Welch's t-test over the pooled t-test. The Kruskal-Wallis test mentioned here is the ANOVA analog of the Mann-Whitney U test for two-sample comparisons. Assumption checking is an application of hypothesis testing concepts (post 3) — each diagnostic test (Shapiro-Wilk, Levene's) is itself a hypothesis test with its own Type I and Type II errors. Different ANOVA designs that relax the independence assumption (repeated measures) are covered in post 16.
Honest Limitations
Both formal checks used here, Shapiro-Wilk and Levene's, have limited power with small samples. With $n = 5$ per group, neither test can reliably detect moderate violations. The appropriate response to small $n$ is not "the tests passed so I am safe"; it is "I cannot detect violations with this sample size, so I should use the more robust alternative (Welch's or Kruskal-Wallis) by default." Assumption checks add value when samples are large enough for the tests to have power; with very small samples, default to robust methods.
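The power problem can be made concrete with a simulation. This sketch assumes Normal data and a true variance ratio of 4 (standard deviations 1, 1, and 2), values chosen purely for illustration; levene_power is a hypothetical helper. It estimates how often Levene's test detects that difference at two group sizes.

```python
import numpy as np
from scipy import stats

def levene_power(n_per_group, sds=(1.0, 1.0, 2.0), n_sims=1000, alpha=0.05, seed=0):
    """Share of simulations in which Levene's test rejects when variances truly differ."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        samples = [rng.normal(0.0, sd, size=n_per_group) for sd in sds]
        _, p = stats.levene(*samples)
        hits += int(p < alpha)
    return hits / n_sims

power_small = levene_power(5)    # group size used in this post
power_large = levene_power(100)  # a much larger experiment

print(f"Power with n=5 per group:   {power_small:.2f}")
print(f"Power with n=100 per group: {power_large:.2f}")
```

The same pattern can be checked for Shapiro-Wilk by swapping in a non-Normal generator and stats.shapiro.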
Test Your Understanding
- You add a fourth architecture whose scores are far more variable than those of the other three. Reason through Levene's test: what direction do you expect the max/min variance ratio to go, and what test would you now recommend?
- Shapiro-Wilk has low power with small samples. What does this imply for the phrase "Shapiro-Wilk p = 0.44 confirms normality" — is this a valid conclusion?
- A colleague applies a transformation (for example, a logit) to all F1 scores before ANOVA. What assumption is this transformation most likely trying to fix, and how would you verify whether it helped?
- You have 3 architecture groups with unequal sample sizes, and Levene's test rejects the null hypothesis. Why is this combination particularly problematic for standard ANOVA, and what is the recommended fix?
- The Kruskal-Wallis test uses ranks rather than raw scores. Does this mean it has no assumptions? What assumptions does Kruskal-Wallis make, and when would it also fail?