Chi-Square Test
Your model makes predictions in three categories: Positive, Negative, and Neutral. You have two versions of the model — call them Model A and Model B. Are the prediction distributions the same, or does one version systematically differ in how it assigns categories? Continuous measurements call for t-tests. Categorical counts call for the chi-square test.
The chi-square test asks: do observed counts match expected counts under a hypothesis? It handles the question of whether two categorical variables are independent, whether distributions differ across groups, and whether data follows a specified distribution. Whenever you have counts in categories and want to know whether an apparent pattern could be chance variation, chi-square is the natural starting point.
The Dataset
You deploy both model versions on 200 held-out inputs (100 per model) and record the predicted category:
| | Positive | Negative | Neutral | Total |
|---|---|---|---|---|
| Model A | 42 | 38 | 20 | 100 |
| Model B | 55 | 28 | 17 | 100 |
| Total | 97 | 66 | 37 | 200 |
Are Model A and Model B distributing predictions the same way across categories?
The Basic Idea
The chi-square statistic measures how far observed frequencies are from expected frequencies:
$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$
where $O_i$ is the observed count and $E_i$ is the expected count under $H_0$.
Large $\chi^2$ means large discrepancy, which is evidence against $H_0$. Under $H_0$, this statistic approximately follows a chi-square distribution with degrees of freedom determined by the table structure.
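As a minimal sketch of the formula above, the statistic can be computed directly from two arrays of counts. The helper name `chi_square_statistic` and the example counts are illustrative assumptions, not part of the analysis in this post.

```python
import numpy as np

def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all cells (hypothetical helper)."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return float(np.sum((observed - expected) ** 2 / expected))

# Illustrative counts: 60 observations in three categories,
# compared against a uniform expectation of 20 per category.
stat = chi_square_statistic([25, 20, 15], [20, 20, 20])
print(f"chi-square = {stat:.2f}")  # 1.25 + 0 + 1.25 = 2.50
```

The further the observed counts drift from the expected ones, the larger the printed value becomes.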
Hypotheses
$H_0$: Model A and Model B produce the same prediction distribution across categories (the variables "model version" and "predicted category" are independent)
$H_1$: The prediction distributions differ
Significance level: $\alpha = 0.05$
Phase 1: Expected Frequencies
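Under independence, the expected count for each cell is:
$$E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{n}$$
Computing each cell (both models have a row total of 100, so the two rows are identical):
$$E_{\text{Positive}} = \frac{100 \times 97}{200} = 48.5, \quad E_{\text{Negative}} = \frac{100 \times 66}{200} = 33.0, \quad E_{\text{Neutral}} = \frac{100 \times 37}{200} = 18.5$$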
All expected frequencies exceed 5, so the chi-square approximation is valid.
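The expected counts can also be obtained in one step as the outer product of the row and column totals divided by the grand total. This is a small sketch using the table from this post; the variable names are my own.

```python
import numpy as np

observed = np.array([
    [42, 38, 20],  # Model A: Positive, Negative, Neutral
    [55, 28, 17],  # Model B: Positive, Negative, Neutral
])

row_totals = observed.sum(axis=1)  # one total per model
col_totals = observed.sum(axis=0)  # one total per category
n = observed.sum()                 # grand total

# E_ij = (row i total) * (column j total) / n
expected = np.outer(row_totals, col_totals) / n
print(expected)
print("smallest expected count:", expected.min())
```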
Phase 2: $(O - E)^2 / E$ Per Cell
| Cell | $O$ | $E$ | $O - E$ | $(O - E)^2$ | $(O - E)^2 / E$ |
|---|---|---|---|---|---|
| A, Positive | 42 | 48.5 | -6.5 | 42.25 | 0.871 |
| A, Negative | 38 | 33.0 | +5.0 | 25.00 | 0.758 |
| A, Neutral | 20 | 18.5 | +1.5 | 2.25 | 0.122 |
| B, Positive | 55 | 48.5 | +6.5 | 42.25 | 0.871 |
| B, Negative | 28 | 33.0 | -5.0 | 25.00 | 0.758 |
| B, Neutral | 17 | 18.5 | -1.5 | 2.25 | 0.122 |
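The table above can be reproduced with element-wise array arithmetic. The sketch below hard-codes the observed and expected counts from this post and sums the unrounded contributions.

```python
import numpy as np

observed = np.array([[42, 38, 20], [55, 28, 17]], dtype=float)
expected = np.array([[48.5, 33.0, 18.5], [48.5, 33.0, 18.5]])

# Per-cell contribution (O - E)^2 / E, same layout as the table
contributions = (observed - expected) ** 2 / expected
chi2_stat = contributions.sum()

print(np.round(contributions, 3))
print(f"chi-square = {chi2_stat:.4f}")
```

Summing the unrounded contributions gives roughly 3.50; adding the three-decimal values from the table by hand lands within about 0.002 of that.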
Phase 3: Degrees of Freedom and Decision
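$$df = (r - 1)(c - 1) = (2 - 1)(3 - 1) = 2$$
where $r = 2$ rows (models) and $c = 3$ columns (categories).
Critical value: $\chi^2_{0.05,\,2} = 5.991$
Decision: $\chi^2 = 2(0.871 + 0.758 + 0.122) \approx 3.50 < 5.991$, so we fail to reject $H_0$.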
The data is consistent with Model A and Model B producing the same prediction distribution. The observed differences in counts are within normal random variation.
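Rather than reading the critical value from a printed table, it can be looked up from the chi-square distribution in SciPy. The sketch below assumes a significance level of 0.05 and takes the statistic as 3.50 (rounded), so treat the numbers as illustrative.

```python
from scipy import stats

alpha = 0.05      # assumed significance level
dof = 2           # (2 - 1) * (3 - 1)
chi2_stat = 3.50  # rounded statistic from the per-cell sums

# Critical value: the point with alpha probability in the upper tail
critical = stats.chi2.ppf(1 - alpha, dof)
# p-value: upper-tail probability of the observed statistic
p_value = stats.chi2.sf(chi2_stat, dof)

print(f"critical value = {critical:.3f}")
print(f"p-value = {p_value:.3f}")
print("reject H0" if chi2_stat > critical else "fail to reject H0")
```

Comparing the statistic with the critical value and comparing the p-value with `alpha` are two views of the same decision.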
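| Phase | Formula | Values | Result |
|---|---|---|---|
| Expected cells | $E_{ij} = R_i C_j / n$ | all 6 cells computed | min $E = 18.5$ |
| Chi-square stat | $\sum (O - E)^2 / E$ | sum of 6 cell contributions | $\chi^2 \approx 3.50$ |
| Degrees of freedom | $(r - 1)(c - 1)$ | $(2 - 1)(3 - 1)$ | $df = 2$ |
| Decision | $\chi^2 > \chi^2_{\text{crit}}$? | $3.50 > 5.991$? No | Fail to reject $H_0$ |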
Effect Size: Cramér's V
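Statistical significance does not equal practical importance. Cramér's V normalizes the chi-square for sample size and table dimensions:
$$V = \sqrt{\frac{\chi^2}{n \cdot \min(r - 1,\, c - 1)}} = \sqrt{\frac{3.50}{200 \cdot 1}} \approx 0.132$$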
| Cramér's V | Interpretation |
|---|---|
| $V \approx 0.1$ | Small effect |
| $V \approx 0.3$ | Medium effect |
| $V \approx 0.5$ | Large effect |
$V = 0.132$ is a small effect, consistent with the non-significant chi-square. Even if the test had reached significance with more data, the practical association would be modest.
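Cramér's V can be wrapped in a small function so it is easy to reuse on other tables. The name `cramers_v` is a hypothetical helper; it relies on `scipy.stats.chi2_contingency` for the statistic and passes `correction=False` so that a 2×2 table would not be Yates-adjusted.

```python
import numpy as np
from scipy import stats

def cramers_v(table):
    """Cramér's V for an r x c table of counts (hypothetical helper)."""
    table = np.asarray(table)
    chi2_stat, _, _, _ = stats.chi2_contingency(table, correction=False)
    n = table.sum()
    k = min(table.shape) - 1  # equals min(r - 1, c - 1)
    return float(np.sqrt(chi2_stat / (n * k)))

table = np.array([[42, 38, 20], [55, 28, 17]])
v = cramers_v(table)
print(f"V = {v:.3f}")
```

Because the chi-square statistic is divided by `n`, the result depends on the proportions in the table rather than on how many observations were collected.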
Fisher's Exact Test: When Expected Counts Are Small
If any expected cell count is below 5, the chi-square approximation is unreliable. Fisher's exact test computes the exact probability of the observed table (and more extreme tables) under $H_0$, without approximation. Use it when:
- Any $E_{ij} < 5$
- Sample size is small overall
- The table is 2×2
```python
import numpy as np
from scipy import stats

# Model prediction contingency table (rows: models, columns: categories)
contingency_table = np.array([
    [42, 38, 20],  # Model A: Positive, Negative, Neutral
    [55, 28, 17],  # Model B: Positive, Negative, Neutral
])

# Chi-square test of independence; also returns the expected counts
chi2, p, dof, expected = stats.chi2_contingency(contingency_table)
print(f"Chi-square statistic: {chi2:.4f}")
print(f"p-value: {p:.4f}")
print(f"Degrees of freedom: {dof}")
print("\nExpected frequencies:")
print(expected)

# Cramér's V: effect size normalized by sample size and table dimensions
n = contingency_table.sum()
min_dim = min(contingency_table.shape) - 1
cramers_v = np.sqrt(chi2 / (n * min_dim))
print(f"\nCramer's V: {cramers_v:.4f}")

# Standardized (Pearson) residuals: which cells drive the chi-square most?
residuals = (contingency_table - expected) / np.sqrt(expected)
print("\nStandardized residuals:")
print(np.round(residuals, 3))

# Fisher's exact test on a 2x2 table (use if expected cells are small).
# The table collapses the categories to Positive vs. everything else.
small_table = np.array([[42, 58], [55, 45]])
odds_ratio, p_fisher = stats.fisher_exact(small_table)
print(f"\nFisher's exact test (2x2 subset, Pos vs rest): p={p_fisher:.4f}")
```

```
Chi-square statistic: 3.5007
p-value: 0.1737
Degrees of freedom: 2

Expected frequencies:
[[48.5 33.  18.5]
 [48.5 33.  18.5]]

Cramer's V: 0.1323

Standardized residuals:
[[-0.933  0.87   0.349]
 [ 0.933 -0.87  -0.349]]

Fisher's exact test (2x2 subset, Pos vs rest): p=0.0837
```
The Assumptions
- Categorical data: Counts in categories, not measurements
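- Independent observations: Each prediction is counted in exactly one cell, and no input contributes more than one prediction to the table
- Adequate expected frequencies: At least 80% of cells should have $E \geq 5$, and no cell should have $E < 1$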
If expected counts are too small: combine categories (e.g., merge Neutral into Negative/Positive), increase sample size, or use Fisher's exact test.
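The expected-frequency rule of thumb is easy to automate before running the test. The sketch below uses `scipy.stats.contingency.expected_freq`; the function name `expected_counts_ok`, its parameters, and the second (tiny) table are illustrative assumptions.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

def expected_counts_ok(table, min_count=5, min_share=0.8, floor=1):
    """Rule of thumb: at least 80% of expected counts >= 5, none below 1."""
    expected = expected_freq(np.asarray(table))
    share_large = np.mean(expected >= min_count)
    return bool(share_large >= min_share and expected.min() >= floor)

# Table from this post: every expected count is at least 18.5
print(expected_counts_ok([[42, 38, 20], [55, 28, 17]]))

# Made-up tiny table: expected counts are 2.5, 1.5, and 1.0 per row
print(expected_counts_ok([[3, 1, 1], [2, 2, 1]]))
```

When the check fails, the options listed above apply: merge categories, collect more data, or switch to an exact test.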
Three Flavors
Goodness of Fit: One variable. Does this model's prediction distribution match a theoretical distribution (e.g., uniform across classes)?
Test of Independence: Two variables, contingency table. Are model version and prediction category independent?
Homogeneity: Are prediction distributions identical across multiple model versions? (Same mechanics as independence, different question.)
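For the goodness-of-fit flavor, a minimal sketch with `scipy.stats.chisquare` is shown below. It tests Model A's counts from this post against a uniform distribution across the three classes; choosing uniform as the reference distribution is an assumption made for illustration.

```python
from scipy import stats

# Model A's predicted-category counts (Positive, Negative, Neutral)
observed = [42, 38, 20]

# With f_exp omitted, scipy.stats.chisquare compares against equal
# expected counts in every category (here 100 / 3 each).
stat, p_value = stats.chisquare(observed)
print(f"chi-square = {stat:.2f}, p = {p_value:.4f}")
```

To test against a non-uniform reference, pass the expected counts through the `f_exp` argument; they should sum to the same total as the observed counts.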
Related Concepts
The chi-square test extends hypothesis testing (post 3) to categorical outcomes, where t-tests and Z-tests do not apply. It uses the chi-square distribution, which arises as the sum of squared standard normals — connected to the CLT (post 1) via the Normal approximation for counts. The goodness-of-fit test (post 13) applies the same statistic to single-variable questions like "does my model's output distribution match the true label distribution?" ANOVA (post 14) handles the continuous analog: comparing means across multiple groups instead of comparing count distributions.
Honest Limitations
The chi-square test detects association but says nothing about direction, magnitude, or cause. A significant chi-square means the prediction distributions differ; it does not tell you which category drives the difference. Use standardized residuals (shown in the output above) to identify which cells contribute most. Also, chi-square applies only to raw counts, not to proportions, means, or scores. The statistic scales with the total count, so feeding it percentages or rates in place of counts misstates the sample size and distorts the result: percentages from a sample of 30, for example, behave as if you had 100 observations.
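One refinement worth knowing, which goes slightly beyond what this post computes: the residuals $(O - E)/\sqrt{E}$ are Pearson residuals, and dividing them by $\sqrt{(1 - \text{row share})(1 - \text{column share})}$ gives adjusted residuals that are approximately standard normal under $H_0$. The sketch below applies that adjustment to the table from this post; the variable names are my own.

```python
import numpy as np

observed = np.array([[42, 38, 20], [55, 28, 17]], dtype=float)
n = observed.sum()
row_share = observed.sum(axis=1, keepdims=True) / n  # shape (2, 1)
col_share = observed.sum(axis=0, keepdims=True) / n  # shape (1, 3)
expected = n * row_share * col_share

# Pearson residuals: (O - E) / sqrt(E)
pearson = (observed - expected) / np.sqrt(expected)
# Adjusted residuals: rescaled to be roughly N(0, 1) under independence
adjusted = pearson / np.sqrt((1 - row_share) * (1 - col_share))

print(np.round(adjusted, 2))
```

Cells whose adjusted residual exceeds about 2 in absolute value are the usual candidates for driving a significant result; here none do, which matches the non-significant test.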
Test Your Understanding
- You add a third model (Model C) with counts [48, 35, 17] in the same three categories. Recompute the expected frequencies for the 3×3 table and the new degrees of freedom.
- A small experiment has only 30 predictions per model, resulting in expected cell counts of [14.5, 9.9, 5.6]. Can you safely use the chi-square test? What would you do instead?
- Cramér's V = 0.132 for the comparison above. If you doubled the sample to 400 total predictions with the same proportions, would V change? Would the chi-square statistic change?
- The standardized residual for cell (Model A, Positive) is -0.933. What does this tell you about how Model A's positive predictions compare to what independence predicts?
- Two models predict binary labels (Positive vs Not-Positive) with counts [[42, 58], [55, 45]]. Compute the 2×2 chi-square by hand and verify it equals the sum of $(O - E)^2 / E$ across all four cells.