
Chi-Square Goodness-of-Fit Test

Apr 11, 2026•10 min read•By Mohammed Vasim

Statistics, Math, Data Science

Your multiclass classifier predicts animal categories — cat, dog, bird, fish — across 1000 test examples. Is the model guessing uniformly at random? Or does it match the real-world population distribution? The goodness-of-fit test answers whether the gap between what you observed and what you expected is signal or noise.

What the Test Asks

The goodness-of-fit (GoF) test asks: does the observed frequency distribution in one categorical variable match a specified theoretical distribution? This has nothing to do with the relationship between two variables — that's the independence test (previous post). Here you have one variable (predicted class), and you are asking how well its counts align with a hypothesized distribution.

ML use cases:

Class imbalance check: does the model's output distribution match the expected class distribution? A model predicting "cat" 60% of the time when cats are only 25% of the data has a structural problem.
Data drift detection: does the distribution of predictions on new data match the training-time distribution? GoF tests the null hypothesis "no drift."
Fairness testing: are errors spread across demographic groups the way equal error rates would imply, i.e., in proportion to each group's size (uniform only when the groups are equally sized)?

The Anchor

python

classes = ["cat", "dog", "bird", "fish"]
observed = [280, 320, 250, 150]    # actual predicted counts
expected_uniform = [250, 250, 250, 250]   # if model predicted uniformly
expected_population = [300, 280, 260, 160]  # if model matched population distribution
# n = 1000

Hypotheses

GoF is always a right-tailed (one-sided) test: χ² can never be negative, and a deviation in either direction pushes it up, so only large values count as evidence against H₀.

Test 1 — Uniform distribution (is the model just guessing?):

H₀: P(cat) = P(dog) = P(bird) = P(fish) = 0.25
H₁: The model's output is not uniformly distributed

Test 2 — Population distribution (does the model match real-world class frequencies?):

H₀: P(cat) = 0.30, P(dog) = 0.28, P(bird) = 0.26, P(fish) = 0.16
H₁: The model's output does not match the population distribution

Computing Expected Frequencies

Formula: Eᵢ = n × pᵢ

For uniform (pᵢ = 0.25 for all):

E(cat) = 1000 × 0.25 = 250
E(dog) = 1000 × 0.25 = 250
E(bird) = 1000 × 0.25 = 250
E(fish) = 1000 × 0.25 = 250
Verify: ΣEᵢ = 1000 ✓

For population distribution:

E(cat) = 1000 × 0.30 = 300
E(dog) = 1000 × 0.28 = 280
E(bird) = 1000 × 0.26 = 260
E(fish) = 1000 × 0.16 = 160
Verify: ΣEᵢ = 300 + 280 + 260 + 160 = 1000 ✓

Minimum expected frequency check: a common rule of thumb requires every Eᵢ ≥ 5 for the χ² approximation to be reliable. Both distributions clear that threshold; the smallest expected count is 160.

Compared with the uniform expectation, dog is overpredicted and fish is severely underpredicted: the model favors the common classes.

The Chi-Square Statistic — Step by Step

χ² = Σ [(Oᵢ − Eᵢ)² / Eᵢ]

Test 1 — Against uniform distribution:

Category	Oᵢ	Eᵢ	Oᵢ − Eᵢ	(Oᵢ − Eᵢ)²	(Oᵢ − Eᵢ)²/Eᵢ
cat	280	250	+30	900	3.600
dog	320	250	+70	4900	19.600
bird	250	250	0	0	0.000
fish	150	250	−100	10000	40.000
Total	1000	1000	0	—	63.200

Verify: deviations sum to zero, as they must. χ²_uniform = 63.200

Test 2 — Against population distribution:

Category	Oᵢ	Eᵢ	Oᵢ − Eᵢ	(Oᵢ − Eᵢ)²	(Oᵢ − Eᵢ)²/Eᵢ
cat	280	300	−20	400	1.333
dog	320	280	+40	1600	5.714
bird	250	260	−10	100	0.385
fish	150	160	−10	100	0.625
Total	1000	1000	0	—	8.057

χ²_pop = 8.057 — the model is far from uniform but much closer to the population distribution.
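The two tables above can be reproduced in a few lines of NumPy. This is a minimal sketch that keeps each category's contribution visible before summing; the helper name `chi_square_contributions` is an illustrative choice, not a library function.

```python
import numpy as np

def chi_square_contributions(observed, expected):
    """Per-category terms (O - E)^2 / E; their sum is the chi-square statistic."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return (observed - expected) ** 2 / expected

observed = [280, 320, 250, 150]  # cat, dog, bird, fish
contrib_uniform = chi_square_contributions(observed, [250, 250, 250, 250])
contrib_pop = chi_square_contributions(observed, [300, 280, 260, 160])

for name, contrib in [("uniform", contrib_uniform), ("population", contrib_pop)]:
    print(name, np.round(contrib, 3), "chi2 =", round(float(contrib.sum()), 3))
```

Printing the contributions rather than only the total makes it easy to check each table row against the hand calculation.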

Degrees of Freedom and Decision

df = k − 1, where k is the number of categories. For both tests: df = 4 − 1 = 3.

Intuition: with k categories and the constraint ΣPᵢ = 1, only k−1 probabilities are free. The last is determined by the others. Same logic as Bessel's correction for variance.

Exception (estimated parameters): if the distribution parameters were estimated from the same data (e.g., fitting a Poisson and estimating λ from the observations), subtract one df per estimated parameter: df = k − 1 − m. For Poisson with estimated λ: df = k − 2. Failing to subtract inflates both the critical value and the p-value, so the test becomes conservative: it rejects too rarely and loses power to detect a poor fit.
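In SciPy, the df adjustment can be expressed through the `ddof` argument of `scipy.stats.chisquare`, which computes the p-value with k − 1 − ddof degrees of freedom. The sketch below is only an illustration under stated assumptions: the Poisson samples are simulated (hypothetical data, not from the classifier example), and the choice of six bins is arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)            # hypothetical data, for illustration only
samples = rng.poisson(lam=2.0, size=500)

lam_hat = samples.mean()                  # one parameter estimated from the data (m = 1)

# Bin the values into 0, 1, 2, 3, 4 and "5 or more" (k = 6 categories)
k = 6
observed = np.array(
    [(samples == i).sum() for i in range(k - 1)] + [(samples >= k - 1).sum()]
)
probs = np.append(
    stats.poisson.pmf(np.arange(k - 1), lam_hat),  # P(X = 0..4)
    stats.poisson.sf(k - 2, lam_hat),              # P(X >= 5)
)
expected = samples.size * probs

chi2_stat, p_naive = stats.chisquare(observed, f_exp=expected)        # df = k - 1 = 5
_, p_adjusted = stats.chisquare(observed, f_exp=expected, ddof=1)     # df = k - 1 - 1 = 4

print(f"chi2={chi2_stat:.3f}, p(df=5)={p_naive:.4f}, p(df=4)={p_adjusted:.4f}")
```

The statistic is identical in both calls; only the reference distribution changes, and the p-value computed with the reduced df is never larger than the unadjusted one.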

Critical value at α=0.05, df=3: χ²* = 7.815

Test	χ²	Critical value	Decision
vs Uniform	63.200	7.815	Reject H₀
vs Population	8.057	7.815	Reject H₀ (barely)
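The critical value and the two decisions can be checked with `scipy.stats.chi2`. This is a small sketch that takes the χ² values from the tables above as given; the variable names are illustrative.

```python
from scipy import stats

alpha, df = 0.05, 3
critical = stats.chi2.ppf(1 - alpha, df)      # upper-tail critical value

decisions = {}
for name, chi2_stat in [("vs Uniform", 63.200), ("vs Population", 8.057)]:
    p_value = stats.chi2.sf(chi2_stat, df)    # right-tail area beyond the statistic
    reject = chi2_stat > critical
    decisions[name] = (p_value, reject)
    verdict = "Reject H0" if reject else "Fail to reject H0"
    print(f"{name}: chi2={chi2_stat:.3f}, critical={critical:.3f}, p={p_value:.4g} -> {verdict}")
```

Comparing the statistic with the critical value and comparing the p-value with α are equivalent ways of reaching the same decision.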

Effect Size: Cramér's V

V = √(χ² / (n × (k − 1)))

Uniform test: V = √(63.200 / (1000 × 3)) = √0.02107 = 0.145 (small to medium effect)

Population test: V = √(8.057 / (1000 × 3)) = √0.002686 = 0.052 (tiny effect)

Although the population test is statistically significant (p=0.045), V=0.052 indicates a negligible practical deviation: the model nearly matches the population distribution. The significance largely reflects the sample size; at n=1000 the test has enough power to detect even small departures from H₀.
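To see how sample size drives significance while leaving the effect size untouched, the sketch below holds the observed proportions fixed and rescales n. The n=100 and n=10000 cases are hypothetical rescalings of the anchor data, not additional experiments.

```python
import numpy as np
from scipy import stats

obs_props = np.array([0.28, 0.32, 0.25, 0.15])   # observed proportions from the anchor
pop_props = np.array([0.30, 0.28, 0.26, 0.16])   # hypothesized population proportions
k = len(obs_props)

results = {}
for n in (100, 1000, 10000):                     # 100 and 10000 are hypothetical
    observed = n * obs_props
    expected = n * pop_props
    chi2_stat = ((observed - expected) ** 2 / expected).sum()
    p_value = stats.chi2.sf(chi2_stat, df=k - 1)
    v = np.sqrt(chi2_stat / (n * (k - 1)))       # same V formula as in the text
    results[n] = (chi2_stat, p_value, v)
    print(f"n={n:>5}: chi2={chi2_stat:.3f}, p={p_value:.4f}, V={v:.3f}")
```

With the proportions held fixed, χ² grows linearly with n while V stays constant, which is why V is the better guide to practical importance.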

Standardized Residuals — Where Is the Misfit?

When χ² is significant, identify which categories drive the deviation:

Standardized residual = (Oᵢ − Eᵢ) / √Eᵢ

Residuals beyond ±2 flag unusual cells, and each squared residual equals that category's contribution to χ². For the uniform test:

Category	Residual	Flag
cat	(280−250)/√250 = +1.90	Borderline
dog	(320−250)/√250 = +4.43	Very large
bird	(250−250)/√250 = 0.00	None
fish	(150−250)/√250 = −6.32	Largest — main driver

Dog (+4.43) and fish (−6.32) are the problem categories. The model severely overpredicts dogs and underpredicts fish. Bird contributes nothing to χ².

Fish (−6.32) dominates — the model systematically underpredicts the rarest class.

Code

python

from scipy import stats
import numpy as np

observed = np.array([280, 320, 250, 150])
n = observed.sum()
categories = ["cat", "dog", "bird", "fish"]

# Test 1: uniform distribution
p_uniform = np.array([0.25, 0.25, 0.25, 0.25])
expected_uniform = n * p_uniform

chi2_uniform, p_uniform_test = stats.chisquare(observed, f_exp=expected_uniform)
V_uniform = np.sqrt(chi2_uniform / (n * (len(observed) - 1)))

# Test 2: population distribution
p_pop = np.array([0.30, 0.28, 0.26, 0.16])
expected_pop = n * p_pop

chi2_pop, p_pop_test = stats.chisquare(observed, f_exp=expected_pop)
V_pop = np.sqrt(chi2_pop / (n * (len(observed) - 1)))

# Standardized residuals (uniform)
std_resid = (observed - expected_uniform) / np.sqrt(expected_uniform)

print("GoF vs Uniform Distribution:")
for cat, o, e, r in zip(categories, observed, expected_uniform, std_resid):
    print(f"  {cat}: O={o}, E={e:.0f}, resid={r:+.2f}")
print(f"  chi2={chi2_uniform:.3f}, df=3, p={p_uniform_test:.5f}, V={V_uniform:.3f}")

print("\nGoF vs Population Distribution:")
for cat, o, e in zip(categories, observed, expected_pop):
    r = (o - e) / np.sqrt(e)
    print(f"  {cat}: O={o}, E={e:.0f}, resid={r:+.2f}")
print(f"  chi2={chi2_pop:.3f}, df=3, p={p_pop_test:.4f}, V={V_pop:.3f}")

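# Report the rule-of-thumb check instead of assuming it passes
min_expected = min(expected_uniform.min(), expected_pop.min())
print(f"\nMinimum expected frequency check: min(E_uniform)={expected_uniform.min()}, min(E_pop)={expected_pop.min()}")
if min_expected >= 5:
    print("  (Both > 5: chi-square approximation is valid)")
else:
    print("  (Some E < 5: chi-square approximation is unreliable)")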

text

GoF vs Uniform Distribution:
  cat: O=280, E=250, resid=+1.90
  dog: O=320, E=250, resid=+4.43
  bird: O=250, E=250, resid=+0.00
  fish: O=150, E=250, resid=-6.32
  chi2=63.200, df=3, p=0.00000, V=0.145

GoF vs Population Distribution:
  cat: O=280, E=300, resid=-1.15
  dog: O=320, E=280, resid=+2.39
  bird: O=250, E=260, resid=-0.62
  fish: O=150, E=160, resid=-0.79
  chi2=8.057, df=3, p=0.0448, V=0.052

Minimum expected frequency check: min(E_uniform)=250.0, min(E_pop)=160.0
  (Both > 5: chi-square approximation is valid)

When Assumptions Fail

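Cell counts below 5: if any Eᵢ < 5, the χ² approximation is unreliable. Fix: combine categories (if domain knowledge makes the merge meaningful), or use an exact multinomial test. Note that scipy.stats.multinomial is a distribution object rather than a ready-made test; its pmf supplies the outcome probabilities from which an exact p-value can be computed.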
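One way to build an exact test from `scipy.stats.multinomial` is to enumerate every possible count vector and sum the probabilities of those no more likely than the observed one. The sketch below assumes that definition of "as or more extreme"; the function name `exact_multinomial_pvalue` and the small sample are hypothetical, and full enumeration is only practical for small n and few categories.

```python
from itertools import product

import numpy as np
from scipy import stats

def exact_multinomial_pvalue(observed, probs):
    """Sum the probabilities of all outcomes at most as likely as the observed one."""
    observed = np.asarray(observed)
    n, k = int(observed.sum()), len(observed)
    # Every count vector of length k that sums to n
    outcomes = np.array([
        head + (n - sum(head),)
        for head in product(range(n + 1), repeat=k - 1)
        if sum(head) <= n
    ])
    pmf = stats.multinomial.pmf(outcomes, n=n, p=probs)
    p_obs = stats.multinomial.pmf(observed, n=n, p=probs)
    # Small relative tolerance so floating-point ties are counted
    return float(pmf[pmf <= p_obs * (1 + 1e-9)].sum())

# Hypothetical small sample: n = 15, so each uniform expected count is 3.75 (< 5)
small_observed = [6, 5, 3, 1]
p_exact = exact_multinomial_pvalue(small_observed, [0.25, 0.25, 0.25, 0.25])
print(f"exact p-value = {p_exact:.4f}")
```

For larger samples the enumeration grows quickly, and a Monte Carlo approximation based on simulated multinomial draws is a common substitute.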

Estimated parameters: if distribution parameters were estimated from the same data (e.g., fitting a Poisson(λ) and estimating λ from those observations), then df = k − 1 − m, where m is the number of estimated parameters. For Poisson with estimated λ: df = k − 2. Failing to subtract inflates both the critical value and the p-value, so the test becomes conservative: it rejects too rarely and loses power to detect a poor fit.

Results Summary

Test	H₀	χ²	df	p-value	Cramér's V	Decision
GoF (uniform)	Output is uniform	63.200	3	<0.001	0.145	Reject
GoF (population)	Output matches population	8.057	3	0.045	0.052	Reject (barely)

GoF vs Chi-Square Independence Test

Aspect	GoF Test	Independence Test
Question	Does one variable match a distribution?	Are two variables associated?
Variables	One categorical	Two categorical
Expected frequencies	n × theoretical probability	Row total × column total / grand total
df	k − 1	(rows − 1) × (cols − 1)
Typical use	Model output vs prior distribution	Confusion matrix between two classifiers
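The rows of the comparison table map onto two different SciPy calls. The sketch below is illustrative only: the GoF part reuses the anchor counts, while the 3×4 contingency table is made-up data (three hypothetical batches by four predicted classes) included just to show the different expected-frequency rule and df.

```python
import numpy as np
from scipy import stats

# GoF: one variable, expected counts come from hypothesized probabilities
observed = np.array([280, 320, 250, 150])
expected = observed.sum() * np.array([0.30, 0.28, 0.26, 0.16])
gof_stat, gof_p = stats.chisquare(observed, f_exp=expected)
gof_df = len(observed) - 1                       # k - 1

# Independence: two variables, expected counts come from the table margins
table = np.array([                               # hypothetical counts
    [60, 70, 50, 20],
    [40, 50, 60, 50],
    [50, 60, 55, 35],
])
ind_stat, ind_p, ind_df, ind_expected = stats.chi2_contingency(table)

# Row total x column total / grand total, as in the table above
manual_expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()

print(f"GoF:          chi2={gof_stat:.3f}, df={gof_df}")
print(f"Independence: chi2={ind_stat:.3f}, df={ind_df}")
```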

Test Your Understanding

You run the same model on a new production batch of 500 examples and observe [130, 165, 125, 80]. Test the null hypothesis that the output matches the population distribution (p = [0.30, 0.28, 0.26, 0.16]). Compute χ² by hand and state the decision at α=0.05.
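The fish class has only 2 examples in a smaller experiment: observed = [14, 18, 12, 2], n=46. Does the small observed count make the chi-square approximation unreliable? Check the expected counts under both hypothesized distributions, and state what you would do if any Eᵢ fell below 5.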
A colleague tests whether confidence scores follow a Normal distribution. They estimate μ and σ from the 1000 test examples, bin the scores into 10 buckets, and use df=9. What is the correct df, and what goes wrong if they use df=9?
For the uniform GoF test (χ²=63.2, V=0.145), the effect is "small to medium." Why is V=0.145 considered small even though χ²=63.2 is enormous and p≈0?
Bird had O=E=250 in both tests and contributed 0 to χ². Does this mean the model's bird predictions are correct? What additional evidence would you need to be confident?