Chi-Square Test

Apr 11, 2026 · 8 min read · By mohammed.vasim
Statistics, Math, Data Science

Your model makes predictions in three categories: Positive, Negative, and Neutral. You have two versions of the model — call them Model A and Model B. Are the prediction distributions the same, or does one version systematically differ in how it assigns categories? Continuous measurements call for t-tests. Categorical counts call for the chi-square test.

The chi-square test asks: do observed counts match expected counts under a hypothesis? It handles the question of whether two categorical variables are independent, whether distributions differ across groups, and whether data follows a specified distribution. Any time you have counts in categories and a question about whether the pattern is random, chi-square is the natural tool.

The Dataset

You deploy both model versions on 200 held-out inputs (100 per model) and record the predicted category:

| | Positive | Negative | Neutral | Total |
|---|---|---|---|---|
| Model A | 42 | 38 | 20 | 100 |
| Model B | 55 | 28 | 17 | 100 |
| Total | 97 | 66 | 37 | 200 |

Are Model A and Model B distributing predictions the same way across categories?

The Basic Idea

The chi-square statistic measures how far observed frequencies are from expected frequencies:

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

where $O_i$ is the observed count and $E_i$ is the expected count under $H_0$.

Large $\chi^2$ means large discrepancy, which is evidence against $H_0$. Under $H_0$, this statistic approximately follows a chi-square distribution with degrees of freedom determined by the table structure.
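To make the formula concrete, here is a minimal sketch of the statistic as a plain Python function. The function name `chi_square_statistic` and the toy counts are illustrative assumptions, not part of the worked example in this post.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over paired observed and expected counts."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical toy counts: 60 observations spread over three categories,
# compared against an even 20/20/20 split.
observed = [18, 22, 20]
expected = [20.0, 20.0, 20.0]

stat = chi_square_statistic(observed, expected)
perfect_fit = chi_square_statistic(expected, expected)  # no discrepancy at all

print(f"chi-square = {stat:.3f}")
print(f"chi-square for a perfect fit = {perfect_fit:.3f}")
```

The statistic is zero only when every observed count equals its expected count, and every deviation, in either direction, pushes it up.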

Hypotheses

$H_0$: Model A and Model B produce the same prediction distribution across categories (the variables "model version" and "predicted category" are independent)

$H_1$: The prediction distributions differ

Significance level: $\alpha = 0.05$

Phase 1: Expected Frequencies

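Under independence, the expected count for each cell is:

$$E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{n}$$

Computing each cell (both rows total 100, so both models share the same expected counts):

$$E_{\text{Positive}} = \frac{100 \times 97}{200} = 48.5, \quad E_{\text{Negative}} = \frac{100 \times 66}{200} = 33.0, \quad E_{\text{Neutral}} = \frac{100 \times 37}{200} = 18.5$$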

All expected frequencies exceed 5, so the chi-square approximation is valid.
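The same expected counts can be computed in one step as an outer product of the row and column totals. This is a sketch using NumPy; the variable names are my own.

```python
import numpy as np

# Observed counts from the table above
observed = np.array([
    [42, 38, 20],  # Model A: Positive, Negative, Neutral
    [55, 28, 17],  # Model B: Positive, Negative, Neutral
])

row_totals = observed.sum(axis=1)  # one total per model
col_totals = observed.sum(axis=0)  # one total per category
n = observed.sum()                 # grand total

# E[i, j] = row_totals[i] * col_totals[j] / n
expected = np.outer(row_totals, col_totals) / n

print(expected)
print("smallest expected count:", expected.min())
```

Because the two models have equal row totals, the two rows of `expected` are identical.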

[Figure: observed counts (amber) vs. expected counts under $H_0$ (gray) for each model and category.]

Phase 2: $(O - E)^2 / E$ Per Cell

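| Cell | $O$ | $E$ | $O - E$ | $(O - E)^2$ | $(O - E)^2 / E$ |
|---|---|---|---|---|---|
| A, Positive | 42 | 48.5 | -6.5 | 42.25 | 0.871 |
| A, Negative | 38 | 33.0 | +5.0 | 25.00 | 0.758 |
| A, Neutral | 20 | 18.5 | +1.5 | 2.25 | 0.122 |
| B, Positive | 55 | 48.5 | +6.5 | 42.25 | 0.871 |
| B, Negative | 28 | 33.0 | -5.0 | 25.00 | 0.758 |
| B, Neutral | 17 | 18.5 | -1.5 | 2.25 | 0.122 |

Summing the six contributions:

$$\chi^2 = 0.871 + 0.758 + 0.122 + 0.871 + 0.758 + 0.122 = 3.502$$

This sum uses the rounded contributions; carrying full precision gives 3.501.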
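The per-cell arithmetic can be reproduced with a short loop. This is a sketch in plain Python using the observed and expected counts from the tables above; the dictionary layout is just one convenient way to hold them.

```python
# (model, category) -> (observed, expected)
cells = {
    ("A", "Positive"): (42, 48.5),
    ("A", "Negative"): (38, 33.0),
    ("A", "Neutral"): (20, 18.5),
    ("B", "Positive"): (55, 48.5),
    ("B", "Negative"): (28, 33.0),
    ("B", "Neutral"): (17, 18.5),
}

total = 0.0
for (model, category), (o, e) in cells.items():
    contribution = (o - e) ** 2 / e  # this cell's share of the statistic
    total += contribution
    print(f"{model}, {category:<8} O={o:>2} E={e:>4} "
          f"O-E={o - e:+.1f} contribution={contribution:.3f}")

print(f"chi-square (full precision) = {total:.4f}")
```

Summing at full precision and rounding only at the end avoids the small drift that comes from adding up values already rounded to three decimals.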

[Figure: chi-square distribution with df = 2. The observed statistic ($\chi^2 = 3.502$) lies below the critical value (5.991), inside the non-rejection region.]

Phase 3: Degrees of Freedom and Decision

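$$df = (r - 1)(c - 1) = (2 - 1)(3 - 1) = 2$$

where $r = 2$ rows (models) and $c = 3$ columns (categories).

Critical value: $\chi^2_{0.05,\,2} = 5.991$

Decision: $\chi^2 = 3.502 < 5.991$, so we fail to reject $H_0$.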

The data is consistent with Model A and Model B producing the same prediction distribution. The observed differences in counts are within normal random variation.
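The critical value and the matching p-value can be looked up with `scipy.stats.chi2` rather than a printed table. The sketch below assumes a 0.05 significance level and uses the hand-computed statistic of 3.502 with 2 degrees of freedom.

```python
from scipy.stats import chi2

chi2_stat = 3.502  # hand-computed statistic from the worked example
df = 2             # (2 - 1) * (3 - 1)
alpha = 0.05       # assumed significance level

critical = chi2.ppf(1 - alpha, df)  # upper-tail cutoff for this alpha
p_value = chi2.sf(chi2_stat, df)    # P(chi-square >= chi2_stat) under H0
reject = chi2_stat > critical

print(f"critical value: {critical:.3f}")
print(f"p-value: {p_value:.4f}")
print("reject H0" if reject else "fail to reject H0")
```

Comparing the statistic to the critical value and comparing the p-value to the significance level are two views of the same decision, so they always agree.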

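| Phase | Formula | Values | Result |
|---|---|---|---|
| Expected cells | (row total × column total) / $n$ | all 6 cells computed | min $E = 18.5$ |
| Chi-square stat | $\sum (O - E)^2 / E$ | sum of 6 cell contributions | $\chi^2 = 3.502$ |
| Degrees of freedom | $(r - 1)(c - 1)$ | $(2 - 1)(3 - 1)$ | $df = 2$ |
| Decision | $\chi^2 > \chi^2_{\text{crit}}$? | $3.502 > 5.991$? | Fail to reject $H_0$ |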

Effect Size: Cramér's V

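Statistical significance does not equal practical importance. Cramér's V normalizes the chi-square for sample size and table dimensions:

$$V = \sqrt{\frac{\chi^2}{n \cdot \min(r - 1,\ c - 1)}}$$

where $n$ is the total count and $r$ and $c$ are the numbers of rows and columns.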

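| Cramér's V | Interpretation |
|---|---|
| $\approx 0.10$ | Small effect |
| $\approx 0.30$ | Medium effect |
| $\approx 0.50$ | Large effect |

These are Cohen's conventional benchmarks for the case $\min(r - 1,\ c - 1) = 1$, which applies to this table.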

$V = \sqrt{3.502 / (200 \times 1)} \approx 0.132$ is a small effect, consistent with the non-significant chi-square. Even if the test had achieved significance with more data, the practical association is modest.

Fisher's Exact Test: When Expected Counts Are Small

If any expected cell count is below 5, the chi-square approximation is unreliable. Fisher's exact test computes the exact probability of the observed table (and more extreme tables) under $H_0$, without approximation. Use it when:

  • Any expected count $E_{ij} < 5$
  • Sample size is small overall
  • The table is 2×2

```python
import numpy as np
from scipy import stats

# Model prediction contingency table (rows: models, columns: categories)
contingency_table = np.array([
    [42, 38, 20],   # Model A: Positive, Negative, Neutral
    [55, 28, 17]    # Model B: Positive, Negative, Neutral
])

# Pearson chi-square test of independence. SciPy applies Yates' correction
# only when df == 1, so this 2x3 table is tested uncorrected.
chi2, p, dof, expected = stats.chi2_contingency(contingency_table)

print(f"Chi-square statistic: {chi2:.4f}")
print(f"p-value: {p:.4f}")
print(f"Degrees of freedom: {dof}")
print("\nExpected frequencies:")
print(expected)

# Cramer's V: chi-square normalized by sample size and table dimensions
n = contingency_table.sum()
min_dim = min(contingency_table.shape) - 1
cramers_v = np.sqrt(chi2 / (n * min_dim))
print(f"\nCramer's V: {cramers_v:.4f}")

# Standardized (Pearson) residuals, (O - E) / sqrt(E):
# which cells drive the chi-square most?
residuals = (contingency_table - expected) / np.sqrt(expected)
print("\nStandardized residuals:")
print(np.round(residuals, 3))

# Fisher's exact test on a 2x2 table (use if expected cells are small).
# Here the categories are collapsed to Positive vs. everything else.
small_table = np.array([[42, 58], [55, 45]])
odds_ratio, p_fisher = stats.fisher_exact(small_table)
print(f"\nFisher's exact test (2x2 subset, Pos vs rest): p={p_fisher:.4f}")
```

```
Chi-square statistic: 3.5007
p-value: 0.1737
Degrees of freedom: 2

Expected frequencies:
[[48.5 33.  18.5]
 [48.5 33.  18.5]]

Cramer's V: 0.1323

Standardized residuals:
[[-0.933  0.87   0.349]
 [ 0.933 -0.87  -0.349]]

Fisher's exact test (2x2 subset, Pos vs rest): p=0.0837
```

SciPy reports 3.5007 rather than the hand-computed 3.502 because it sums the cell contributions at full precision instead of adding values already rounded to three decimals.

The Assumptions

  1. Categorical data: Counts in categories, not measurements
  2. Independent observations: Each prediction belongs to exactly one cell
  3. Adequate expected frequencies: At least 80% of cells should have $E \geq 5$, and no cell should have $E < 1$

If expected counts are too small: combine categories (e.g., merge Neutral into Negative/Positive), increase sample size, or use Fisher's exact test.
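The expected-frequency assumption is easy to check mechanically before running the test. The sketch below encodes one common rule of thumb (at least 80% of expected counts at or above 5, and none below 1); the function name, its default thresholds, and the small-sample table are illustrative assumptions.

```python
import numpy as np


def expected_counts_ok(expected, min_count=5.0, min_fraction=0.8, floor=1.0):
    """Return True if the expected counts satisfy the rule of thumb.

    At least `min_fraction` of cells must have an expected count of
    `min_count` or more, and no cell may fall below `floor`.
    """
    expected = np.asarray(expected, dtype=float)
    fraction_large_enough = (expected >= min_count).mean()
    return bool(fraction_large_enough >= min_fraction and expected.min() >= floor)


# Expected counts from the worked example
article_expected = [[48.5, 33.0, 18.5], [48.5, 33.0, 18.5]]

# Hypothetical small-sample table where the approximation is doubtful
small_expected = [[6.0, 3.5, 0.5], [6.0, 3.5, 0.5]]

print(expected_counts_ok(article_expected))
print(expected_counts_ok(small_expected))
```

When the check fails, the remedies listed above apply: merge sparse categories, collect more data, or switch to an exact test.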

Three Flavors

Goodness of Fit: One variable. Does this model's prediction distribution match a theoretical distribution (e.g., uniform across classes)?

Test of Independence: Two variables, contingency table. Are model version and prediction category independent?

Homogeneity: Are prediction distributions identical across multiple model versions? (Same mechanics as independence, different question.)
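Of the three flavors, goodness of fit is the one whose setup differs from the worked example, since there is only one row of counts and the expected counts come from a stated distribution instead of the table margins. A minimal sketch with `scipy.stats.chisquare`, testing Model A's predictions against a uniform split across the three classes (the uniform null is chosen here purely for illustration):

```python
from scipy import stats

# Model A's predicted-category counts: Positive, Negative, Neutral
model_a_counts = [42, 38, 20]

# With no f_exp argument, chisquare tests against equal expected counts
# in every category (here 100 / 3 per class).
result = stats.chisquare(model_a_counts)

# Degrees of freedom for goodness of fit: number of categories minus 1
dof = len(model_a_counts) - 1

print(f"chi-square = {result.statistic:.3f}, df = {dof}, p = {result.pvalue:.4f}")
```

To test against a non-uniform distribution, pass expected counts through the `f_exp` argument; they need to sum to the same total as the observed counts.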

The chi-square test extends hypothesis testing (post 3) to categorical outcomes, where t-tests and Z-tests do not apply. It uses the chi-square distribution, which arises as the sum of squared standard normals — connected to the CLT (post 1) via the Normal approximation for counts. The goodness-of-fit test (post 13) applies the same statistic to single-variable questions like "does my model's output distribution match the true label distribution?" ANOVA (post 14) handles the continuous analog: comparing means across multiple groups instead of comparing count distributions.

Honest Limitations

The chi-square test detects association but says nothing about direction, magnitude, or cause. A significant chi-square means the prediction distributions differ; it does not tell you which category is driving the difference. Use standardized residuals (shown in the output above) to identify which cells contribute most. Also, chi-square applies only to raw counts, not proportions, means, or scores. Feeding in percentages or rates as if they were counts replaces the real sample size with an arbitrary one (100 per row for percentages), which distorts the test statistic and produces misleading results.
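The warning about percentages can be seen numerically. In the sketch below, the same row percentages are fed to a hand-rolled Pearson chi-square three ways: as if they were counts, and rescaled to two hypothetical true sample sizes (50 and 1000 predictions per model). The helper `pearson_chi2` and both sample sizes are assumptions made for illustration.

```python
import numpy as np


def pearson_chi2(table):
    """Pearson chi-square statistic for an r x c table (no correction)."""
    table = np.asarray(table, dtype=float)
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    return float(((table - expected) ** 2 / expected).sum())


# Row percentages for the two models (each row sums to 100)
percentages = np.array([[42.0, 38.0, 20.0], [55.0, 28.0, 17.0]])

# Treating percentages as counts silently assumes 100 predictions per model
as_if_counts = pearson_chi2(percentages)

# Same proportions at hypothetical true sample sizes
# (fractional counts are fine for this illustration)
true_small = pearson_chi2(percentages / 100 * 50)
true_large = pearson_chi2(percentages / 100 * 1000)

print(f"percentages as counts: {as_if_counts:.3f}")
print(f"true n = 50 per model: {true_small:.3f}")
print(f"true n = 1000 per model: {true_large:.3f}")
```

The statistic scales with the sample size, so percentages overstate the evidence when the real sample is smaller than 100 per group and understate it when the real sample is larger.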

Test Your Understanding

  1. You add a third model (Model C) with counts [48, 35, 17] in the same three categories. Recompute the expected frequencies for the 3×3 table and the new degrees of freedom.
  2. A small experiment has only 30 predictions per model, resulting in expected cell counts of [14.5, 9.9, 5.6]. Can you safely use the chi-square test? What would you do instead?
  3. Cramér's V = 0.132 for the comparison above. If you doubled the sample to 400 total predictions with the same proportions, would V change? Would the chi-square statistic change?
  4. The standardized residual for cell (Model A, Positive) is about -0.933. What does this tell you about how Model A's positive predictions compare to what independence predicts?
  5. Two models predict binary labels (Positive vs Not-Positive) with counts [[42, 58], [55, 45]]. Compute the 2×2 chi-square by hand and verify it equals the sum of $(O - E)^2 / E$ across all four cells.
