
ANOVA Assumptions

Apr 11, 2026•12 min read•By Mohammed Vasim

Statistics, Math, Data Science

ANOVA uses the F-distribution to compute p-values. That distribution applies only when your data satisfies four conditions. Violate them and the F-statistic no longer follows the F-distribution exactly, so the true Type I error rate drifts from the stated α: you reject H₀ more or less often than intended. In the worst case, conclusions about model differences are spurious.

Not all violations are equally damaging:

Severe: non-independence — a structural problem that cannot be fixed after the fact, only by redesign
Moderate: non-normality with small n — the F-distribution tails matter when CLT hasn't kicked in
Mild: unequal variances with equal group sizes — ANOVA is surprisingly robust here

Here is what each assumption means, how to test it, and what to do when it fails.

The Dataset

Three model variants — Model A, Model B, Model C — evaluated on 8 cross-validation folds each. The dependent variable is F1 score.

python

model_A = [0.821, 0.847, 0.835, 0.812, 0.859, 0.828, 0.841, 0.816]
model_B = [0.791, 0.803, 0.789, 0.812, 0.798, 0.785, 0.801, 0.794]
model_C = [0.863, 0.879, 0.855, 0.871, 0.868, 0.852, 0.875, 0.861]
# n=8 per group, k=3 groups, N=24 total

Assumption 1: Independence of Observations

Each observation must be independent of all others. Knowing the value of one F1 score gives no information about any other.

Non-independence usually inflates Type I error: you reject H₀ more often than α says. The most common violation in ML is pseudoreplication, which means treating repeated measurements of the same unit as if they were independent observations. Specific ways this happens:

Overlapping CV folds (the same training data appears in multiple folds)
Measuring the same model multiple times on the same test set
Time series data where consecutive observations are autocorrelated

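Independence is mostly a design question rather than something a test can confirm. Ask instead: "Could the value of one observation influence another?" For the anchor dataset, each fold is treated as a distinct data partition, so the assumption holds by design. One caveat: if all three models are scored on the same eight folds, the scores are paired by fold, and a repeated-measures design is the better fit.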

If you suspect temporal autocorrelation in residuals, the Durbin-Watson test checks for it. A value near 2.0 signals no autocorrelation; values below 1 or above 3 are warning signs.
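As a rough illustration, the Durbin-Watson statistic can be computed directly from residuals that are ordered in time. The helper name `durbin_watson` and both residual sequences below are hypothetical, chosen only to show the two warning-sign regions; the statistic only makes sense if the order of the residuals reflects the order of collection.

```python
import numpy as np

def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Hypothetical residual sequences, listed in time order
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # sign flips every step (negative autocorrelation)
trending = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]     # long runs of the same sign (positive autocorrelation)

print(f"Alternating residuals: DW = {durbin_watson(alternating):.2f}")
print(f"Trending residuals:    DW = {durbin_watson(trending):.2f}")
```

The alternating sequence lands above 3 and the trending one below 1, matching the warning zones described above. statsmodels provides an equivalent function, `statsmodels.stats.stattools.durbin_watson`, if that package is already a dependency.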

When independence is violated: redesign the study. If the dependence comes from repeated measurements or autocorrelation, repeated-measures ANOVA or linear mixed models can model it explicitly. There is no post-hoc correction that makes ordinary one-way ANOVA valid for dependent data.

Assumption 2: Normality Within Groups

Within each group, the observations should be approximately normally distributed. This is not a requirement on the overall distribution — only within-group distributions matter.

With large n (≥30 per group), CLT makes ANOVA robust. With n=8 per group as here, the F-distribution approximation can break down if data is severely non-normal.

The Shapiro-Wilk test is the standard check for small samples. It tests H₀: data is normally distributed. An important caveat: failing to reject does not prove normality — only that you lack evidence against it. With n=8, Shapiro-Wilk has low power.
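The low-power caveat can be checked with a small simulation. The sketch below draws samples from an exponential distribution, which is clearly non-normal, and counts how often Shapiro-Wilk rejects at α = 0.05. The function name `shapiro_rejection_rate`, the choice of the exponential distribution, the seed, and the number of simulations are all assumptions made for illustration.

```python
import numpy as np
from scipy import stats

def shapiro_rejection_rate(n, n_sims=500, alpha=0.05, seed=0):
    """Fraction of simulated exponential samples of size n that Shapiro-Wilk rejects."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        sample = rng.exponential(scale=1.0, size=n)  # non-normal by construction
        _, p = stats.shapiro(sample)
        rejections += int(p < alpha)
    return rejections / n_sims

power_n8 = shapiro_rejection_rate(8)
power_n50 = shapiro_rejection_rate(50)
print(f"Rejection rate at n=8:  {power_n8:.2f}")
print(f"Rejection rate at n=50: {power_n50:.2f}")
```

Because the data are non-normal in every simulated sample, the rejection rate is an estimate of the test's power against this particular alternative. Other alternatives (heavier tails, mild skew) would give different numbers.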

When normality fails:

Kruskal-Wallis (non-parametric ANOVA equivalent): use when Shapiro-Wilk p < 0.05 for any group, especially with n < 15.
Log transformation: often makes right-skewed data closer to normal.
Increase n: at n ≥ 30 per group, CLT makes the normality assumption largely moot.
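A minimal sketch of the log-transformation remedy, assuming a strictly positive, right-skewed sample. The data here are hypothetical: they are built from exponentiated normal quantiles so that the skew is guaranteed, and they are unrelated to the F1 scores above.

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed sample: exponentiated normal quantiles (lognormal shape)
n = 20
probs = (np.arange(1, n + 1) - 0.5) / n
skewed = np.exp(stats.norm.ppf(probs))

# Shapiro-Wilk before and after the log transform
w_raw, p_raw = stats.shapiro(skewed)
w_log, p_log = stats.shapiro(np.log(skewed))

print(f"Raw data:        W={w_raw:.3f}, p={p_raw:.4f}")
print(f"Log-transformed: W={w_log:.3f}, p={p_log:.4f}")
```

The log only helps when the skew is to the right and all values are positive. Re-run the normality check on the transformed values rather than assuming the transform worked.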

Assumption 3: Homoscedasticity (Equal Variances)

All groups must have equal within-group variance: σ²_A = σ²_B = σ²_C.

ANOVA uses a pooled variance estimate (MS_within). If variances differ widely, the pooled estimate is wrong for each group — inflating or deflating the F-ratio. This matters most when group sizes are unequal; with equal n, ANOVA tolerates moderate variance differences.
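To make the pooled estimate concrete, the sketch below recomputes MS_within and the F-ratio by hand for the three groups and compares the result with `scipy.stats.f_oneway`. The variable names are illustrative choices, not part of any library API.

```python
import numpy as np
from scipy import stats

model_A = [0.821, 0.847, 0.835, 0.812, 0.859, 0.828, 0.841, 0.816]
model_B = [0.791, 0.803, 0.789, 0.812, 0.798, 0.785, 0.801, 0.794]
model_C = [0.863, 0.879, 0.855, 0.871, 0.868, 0.852, 0.875, 0.861]
groups = [np.array(g) for g in (model_A, model_B, model_C)]

k = len(groups)
N = sum(len(g) for g in groups)
grand_mean = np.mean(np.concatenate(groups))

# MS_within: pooled variance, i.e. within-group sums of squares over N - k
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)
ms_within = ss_within / (N - k)

# MS_between: spread of the group means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ms_between = ss_between / (k - 1)

f_manual = ms_between / ms_within
f_scipy, _ = stats.f_oneway(*groups)
print(f"MS_within = {ms_within:.6f}")
print(f"F (by hand) = {f_manual:.2f}, F (scipy) = {f_scipy:.2f}")
```

With equal group sizes, MS_within is simply the average of the three sample variances. That is why a single group with a much larger variance pulls the shared denominator away from what the other groups would suggest on their own.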

Levene's test (H₀: all group variances equal) is the preferred check — robust to non-normality. Bartlett's test is sensitive to non-normality and should be avoided if normality is in doubt.
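A short sketch of both variance tests on the anchor data, assuming SciPy is installed. One detail worth knowing: `scipy.stats.levene` centers on the group median by default, which is the more robust variant, while `center="mean"` gives Levene's original formulation.

```python
from scipy import stats

model_A = [0.821, 0.847, 0.835, 0.812, 0.859, 0.828, 0.841, 0.816]
model_B = [0.791, 0.803, 0.789, 0.812, 0.798, 0.785, 0.801, 0.794]
model_C = [0.863, 0.879, 0.855, 0.871, 0.868, 0.852, 0.875, 0.861]
groups = [model_A, model_B, model_C]

# Median-centered Levene (SciPy's default), robust to skewed data
stat_median, p_median = stats.levene(*groups, center="median")
# Mean-centered Levene, the original formulation
stat_mean, p_mean = stats.levene(*groups, center="mean")
# Bartlett's test, which relies on normality within each group
stat_bart, p_bart = stats.bartlett(*groups)

print(f"Levene (median): stat={stat_median:.3f}, p={p_median:.3f}")
print(f"Levene (mean):   stat={stat_mean:.3f}, p={p_mean:.3f}")
print(f"Bartlett:        stat={stat_bart:.3f}, p={p_bart:.3f}")
```

When the three tests disagree, the disagreement itself is informative: a Bartlett rejection without a Levene rejection often points at non-normality rather than unequal variances.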

Rule of thumb: if max(s²) / min(s²) > 4, variances may be problematically unequal.

When homoscedasticity fails:

Welch's ANOVA: does not assume equal variances — the correct default when Levene's p < 0.05.
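Brown-Forsythe F* test: another alternative to the classical F-test that does not assume equal variances. It is distinct from the Brown-Forsythe variance test, which is Levene's test centered on group medians.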
Robustness note: with equal n (as here, n=8 per group), ANOVA tolerates unequal variances moderately well. The dangerous combination is unequal n AND unequal variances — use Welch's ANOVA in that case.
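The "dangerous combination" can be seen in a small simulation. The sketch below generates data where all true means are equal, so every rejection is a false positive, and compares the rejection rate of ordinary one-way ANOVA under equal and unequal group sizes. The function name `type1_rate`, the group sizes, the standard deviations, and the seed are hypothetical settings picked for illustration.

```python
import numpy as np
from scipy import stats

def type1_rate(sizes, sds, n_sims=2000, alpha=0.05, seed=0):
    """Share of simulations where one-way ANOVA rejects although all true means are equal."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        samples = [rng.normal(loc=0.0, scale=sd, size=n) for n, sd in zip(sizes, sds)]
        _, p = stats.f_oneway(*samples)
        rejections += int(p < alpha)
    return rejections / n_sims

# Same standard deviations in both runs; only the allocation of n changes
rate_equal_n = type1_rate(sizes=[20, 20, 20], sds=[1.0, 3.0, 3.0])
rate_unequal_n = type1_rate(sizes=[40, 10, 10], sds=[1.0, 3.0, 3.0])

print(f"Equal n, unequal variances:   false-positive rate = {rate_equal_n:.3f}")
print(f"Unequal n, unequal variances: false-positive rate = {rate_unequal_n:.3f}")
```

In the second run the small groups carry the large variances, so the pooled MS_within is dominated by the large low-variance group and understates the noise in the group means. Reversing the pairing (large group, large variance) tends to make the test conservative instead.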

Assumption 4: Interval or Ratio Scale Measurement

The dependent variable must be on an interval or ratio scale — not nominal or ordinal. ANOVA computes group means and variances; these operations require numeric data with equal intervals between values.

The most common ML violation: treating ordinal ratings (1–5 star ratings, severity levels) as continuous and running ANOVA. The arithmetic mean of "3 stars" and "5 stars" is not necessarily "4 stars" in any meaningful sense.

For the anchor: F1 score is a ratio scale (a score of 0.860 is genuinely twice 0.430 and the interval between 0.820 and 0.830 equals the interval between 0.850 and 0.860). This assumption is satisfied for most standard ML metrics: accuracy, precision, recall, F1, AUC.

When violated: use Kruskal-Wallis or ordinal regression for an ordinal dependent variable, and a chi-square test for a nominal one.
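A brief sketch of both alternatives, using made-up data. The star ratings and the error-category counts below are hypothetical and are not part of the anchor dataset; they only show which SciPy call fits which measurement scale.

```python
from scipy import stats

# Hypothetical 1-5 star ratings for three model variants (ordinal scale)
ratings_A = [3, 4, 3, 5, 4, 3, 4, 4]
ratings_B = [2, 3, 2, 3, 1, 2, 3, 2]
ratings_C = [4, 5, 5, 4, 5, 3, 5, 4]

h_stat, p_ordinal = stats.kruskal(ratings_A, ratings_B, ratings_C)
print(f"Kruskal-Wallis on ordinal ratings: H={h_stat:.3f}, p={p_ordinal:.4f}")

# Hypothetical counts of error categories per model (nominal scale)
# rows: models A, B, C; columns: three unordered error types
error_counts = [
    [30, 10, 10],
    [20, 15, 15],
    [25, 5, 20],
]
chi2, p_nominal, dof, expected = stats.chi2_contingency(error_counts)
print(f"Chi-square on nominal counts: chi2={chi2:.3f}, dof={dof}, p={p_nominal:.4f}")
```

Kruskal-Wallis uses only the ordering of the ratings, so it never averages "3 stars" and "5 stars". The chi-square test uses only category counts, so it needs no ordering at all.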

Checking All Assumptions Programmatically

python

import numpy as np
from scipy import stats

model_A = [0.821, 0.847, 0.835, 0.812, 0.859, 0.828, 0.841, 0.816]
model_B = [0.791, 0.803, 0.789, 0.812, 0.798, 0.785, 0.801, 0.794]
model_C = [0.863, 0.879, 0.855, 0.871, 0.868, 0.852, 0.875, 0.861]
groups = [model_A, model_B, model_C]
names = ["Model A", "Model B", "Model C"]

print("=== Assumption 1: Independence ===")
print("  Cannot be tested statistically. Verify by design.")
print("  These are independent CV folds → assumption satisfied.")

print("\n=== Assumption 2: Normality (Shapiro-Wilk) ===")
for name, group in zip(names, groups):
    stat, p = stats.shapiro(group)
    status = "NOT VIOLATED" if p > 0.05 else "VIOLATED"
    print(f"  {name}: W={stat:.4f}, p={p:.4f} → {status}")

print("\n=== Assumption 3: Homoscedasticity ===")
stat_lev, p_lev = stats.levene(*groups)  # default center="median"
print(f"  Levene's test: stat={stat_lev:.4f}, p={p_lev:.4f}")
variances = [np.var(g, ddof=1) for g in groups]  # sample variances
print("  Variances: " + ", ".join(f"{v:.6f}" for v in variances))
print(f"  Max/min variance ratio: {max(variances) / min(variances):.2f}")
status_lev = "NOT VIOLATED" if p_lev > 0.05 else "VIOLATED"
print(f"  Levene's: {status_lev}")

stat_bar, p_bar = stats.bartlett(*groups)
print(f"  Bartlett's test: stat={stat_bar:.4f}, p={p_bar:.4f} (sensitive to non-normality)")

print("\n=== Assumption 4: Scale of Measurement ===")
print("  F1 scores are ratio scale → assumption satisfied.")

print("\n=== Run ANOVA ===")
f_stat, p_anova = stats.f_oneway(*groups)
print(f"  F = {f_stat:.3f}, p = {p_anova:.6f}")

text

=== Assumption 1: Independence ===
  Cannot be tested statistically. Verify by design.
  These are independent CV folds → assumption satisfied.

=== Assumption 2: Normality (Shapiro-Wilk) ===
  Model A: W=0.9598, p=0.8151 → NOT VIOLATED
  Model B: W=0.9344, p=0.5524 → NOT VIOLATED
  Model C: W=0.9556, p=0.7686 → NOT VIOLATED

=== Assumption 3: Homoscedasticity ===
  Levene's test: stat=2.5111, p=0.1052
  Variances: 0.000262, 0.000076, 0.000090
  Max/min variance ratio: 3.46
  Levene's: NOT VIOLATED
  Bartlett's test: stat=3.2025, p=0.2016 (sensitive to non-normality)

=== Assumption 4: Scale of Measurement ===
  F1 scores are ratio scale → assumption satisfied.

=== Run ANOVA ===
  F = 66.584, p = 0.000000

What to Do When Assumptions Fail

1. Independence violated

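Redesign where possible. No after-the-fact correction makes ordinary one-way ANOVA valid for dependent data. If observations are repeated measurements or autocorrelated (time series), use repeated-measures ANOVA or linear mixed models, which model the dependence explicitly. If folds overlap, fix the CV split first.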

2. Normality violated (Shapiro-Wilk p < 0.05 for any group)

n ≥ 30 per group: ANOVA is robust via CLT — proceed.
n < 30: switch to Kruskal-Wallis. Code:

python

stat_kw, p_kw = stats.kruskal(*groups)
print(f"Kruskal-Wallis: H={stat_kw:.3f}, p={p_kw:.6f}")
print("  Same as ANOVA here — significant differences exist")
print("  (Both tests agree when data is well-behaved)")

text

Kruskal-Wallis: H=19.705, p=0.000053
  Same as ANOVA here — significant differences exist
  (Both tests agree when data is well-behaved)

Kruskal-Wallis rank-transforms the data before testing — it loses information (lower power than ANOVA when normality holds), so don't default to it out of caution when normality passes. Consider log transformation if residuals are right-skewed.
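The rank transformation can be made explicit. The sketch below ranks all 24 scores together, computes H from the group rank sums, applies the standard tie correction, and checks the result against `scipy.stats.kruskal`. Variable names are illustrative.

```python
import numpy as np
from scipy import stats

model_A = [0.821, 0.847, 0.835, 0.812, 0.859, 0.828, 0.841, 0.816]
model_B = [0.791, 0.803, 0.789, 0.812, 0.798, 0.785, 0.801, 0.794]
model_C = [0.863, 0.879, 0.855, 0.871, 0.868, 0.852, 0.875, 0.861]
groups = [model_A, model_B, model_C]
sizes = [len(g) for g in groups]
N = sum(sizes)

# Rank all scores together; tied values share the average rank
ranks = stats.rankdata(np.concatenate(groups))
rank_groups = np.split(ranks, np.cumsum(sizes)[:-1])
rank_sums = [r.sum() for r in rank_groups]

# H compares each group's rank sum with what identical distributions would give
h_uncorrected = 12 / (N * (N + 1)) * sum(rs ** 2 / n for rs, n in zip(rank_sums, sizes)) - 3 * (N + 1)

# Tie correction: divide by 1 - sum(t^3 - t) / (N^3 - N), where t counts tied values
_, tie_counts = np.unique(ranks, return_counts=True)
h_manual = h_uncorrected / (1 - np.sum(tie_counts ** 3 - tie_counts) / (N ** 3 - N))

h_scipy, p_scipy = stats.kruskal(*groups)
print(f"Rank sums (A, B, C): {rank_sums}")
print(f"H (by hand) = {h_manual:.3f}, H (scipy) = {h_scipy:.3f}")
```

Once the scores are replaced by ranks, the size of the gaps between them no longer matters: only the ordering does. That is the information the test gives up in exchange for not assuming normality.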

3. Homoscedasticity violated (Levene's p < 0.05)

Equal group sizes: ANOVA is moderately robust — proceed with caution.
Unequal group sizes: use Welch's ANOVA, which does not pool variances. Each group is weighted by its own variance estimate and the denominator degrees of freedom are adjusted downward, which costs a little power when variances really are equal but keeps the Type I error rate close to α.

python

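# Welch's ANOVA by hand; a library routine such as pingouin's welch_anova can serve as a cross-check
k = len(groups)
ns = np.array([len(g) for g in groups])
means = np.array([np.mean(g) for g in groups])
vars_ = np.array([np.var(g, ddof=1) for g in groups])
weights = ns / vars_  # precision weights n_i / s_i^2
grand_mean_w = np.sum(weights * means) / np.sum(weights)
F_welch_num = np.sum(weights * (means - grand_mean_w) ** 2) / (k - 1)
lam = np.sum((1 - weights / np.sum(weights)) ** 2 / (ns - 1))
F_welch = F_welch_num / (1 + 2 * (k - 2) * lam / (k**2 - 1))
df2 = (k**2 - 1) / (3 * lam)  # adjusted denominator degrees of freedom
p_welch = stats.f.sf(F_welch, k - 1, df2)
print(f"Welch's F = {F_welch:.1f}, df = ({k - 1}, {df2:.1f}), p = {p_welch:.6f}")

text

Welch's F = 109.5, df = (2, 13.4), p = 0.000000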

4. Scale violated

Use Kruskal-Wallis for ordinal dependent variables, chi-square for nominal.

Assumption Trace Table

Assumption	Test	Values from Anchor	Verdict
Independence	Design review	8 independent CV folds	Satisfied
Normality — Model A	Shapiro-Wilk	W=0.9598, p=0.8151	Pass
Normality — Model B	Shapiro-Wilk	W=0.9344, p=0.5524	Pass
Normality — Model C	Shapiro-Wilk	W=0.9556, p=0.7686	Pass
Homoscedasticity	Levene's test	stat=2.5111, p=0.1052	Pass
Variance ratio	max/min	0.000262 / 0.000076 ≈ 3.46	Borderline
Scale of measurement	Domain knowledge	F1 = ratio scale	Satisfied

Assumptions Summary

Assumption	Test	Passes on Anchor?	Action if Violated
Independence	Design review	Yes (independent folds)	Repeated-measures ANOVA or mixed models
Normality	Shapiro-Wilk per group	Yes (p > 0.05 all groups)	Kruskal-Wallis (small n); proceed (large n)
Homoscedasticity	Levene's test	Yes (p=0.105)	Welch's ANOVA
Interval/ratio scale	Domain knowledge	Yes (F1 = ratio scale)	Kruskal-Wallis or chi-square

Robustness Summary

Violation	Impact with Equal n	Impact with Unequal n	Fix
Mild non-normality	Negligible (n ≥ 30)	Moderate	Kruskal-Wallis if n < 15
Unequal variances	Moderate	Severe	Welch's ANOVA
Non-independence	Severe	Severe	Redesign

Related Concepts

Assumption checking is itself an application of hypothesis testing: each diagnostic test (Shapiro-Wilk, Levene's) has its own H₀, its own Type I and II errors, and its own power curve. That is why a Shapiro-Wilk "pass" at n=8 means very little: the test may lack the power to detect a violation even if one exists. The normality assumption in ANOVA is the same one made by the two-sample t-test, and the equal-variance assumption is the same concern that motivates Welch's t-test. Kruskal-Wallis generalizes the Mann-Whitney U test to more than two groups, just as ANOVA generalizes the t-test. Repeated-measures ANOVA and linear mixed models extend ANOVA to cases where the independence assumption does not hold; they model the correlation structure between repeated observations rather than pretending it isn't there.

Honest Limitations

All three statistical assumption tests — Shapiro-Wilk, Levene's, and Bartlett's — have low power at small n. With n=8 per group, neither Shapiro-Wilk nor Levene's can reliably detect moderate violations. The correct response is not "the tests passed, so I am safe" — it is "I cannot detect violations at this sample size, so I should default to robust alternatives (Welch's ANOVA or Kruskal-Wallis) unless normality and variance homogeneity are guaranteed by the data-generating process." Assumption checks add real value when samples are large enough for the tests to have power; at very small n, treat them as weak evidence and build in robustness by design.

Test Your Understanding

You add a fourth model group with a much larger variance than the others. Levene's test rejects at p=0.03. Group sizes are unequal (n=8, 8, 8, 5). Why is the combination of unequal variances and unequal n particularly damaging for standard ANOVA, and what test should you run instead?
Shapiro-Wilk returns p=0.44 for Model A. A colleague says "this confirms normality." Is that a valid conclusion? What does p=0.44 actually tell you?
You transform all F1 scores using logit(F1) = log(F1 / (1 − F1)) before running ANOVA. Which assumption is this transformation most likely trying to satisfy, and how would you verify whether it helped?
A study measures the same 8 models multiple times on the same test set and treats each measurement as an independent observation. Which ANOVA assumption does this violate, and what is the correct analysis?
Kruskal-Wallis rank-transforms the data before testing. Does this mean Kruskal-Wallis has no assumptions? What assumptions does it make, and in what scenario would it also fail?
