Goodness of Fit Test
Your model predicts sentiment in three classes — Positive, Negative, and Neutral — and you want to know whether the predicted class distribution matches the true label distribution in your dataset. You expected roughly 40% Positive, 35% Negative, 25% Neutral based on historical annotation data. What came out was different. The goodness of fit test answers whether that discrepancy is meaningful or just noise.
What It Tests
The goodness of fit test asks: does the observed distribution of categorical data match a hypothesized distribution?
- $H_0$: The data follows the specified distribution (predicted proportions match expected proportions)
- $H_1$: The data does not follow the specified distribution
You can test against any hypothesized distribution — a uniform distribution (is the model just guessing?), historical class proportions, or theoretically derived probabilities.
The Dataset
Your model processes $n = 200$ validation examples and produces:
| Class | Observed ($O_i$) | Expected proportion ($p_i$) | Expected ($E_i = n p_i$) |
|---|---|---|---|
| Positive | 88 | 0.40 | 80 |
| Negative | 62 | 0.35 | 70 |
| Neutral | 50 | 0.25 | 50 |
| Total | 200 | 1.00 | 200 |
All expected frequencies exceed 5, so the chi-square approximation is valid.
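The expected counts and the rule-of-thumb check can be reproduced in a few lines. This is a minimal sketch: the threshold of 5 is the conventional rule of thumb, and the variable names are illustrative rather than part of any library.

```python
import numpy as np

observed = np.array([88, 62, 50])              # Positive, Negative, Neutral
expected_probs = np.array([0.40, 0.35, 0.25])  # historical proportions

n = observed.sum()
expected = n * expected_probs                  # E_i = n * p_i

# Rule of thumb: every expected count should be at least 5
assumption_ok = bool(np.all(expected >= 5))
print(expected, assumption_ok)
```

If the check fails, the chi-square approximation becomes unreliable and one of the alternatives discussed under "When Assumptions Fail" is a better choice.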
The Test Statistic
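$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$
Computing each term:
$$\frac{(88 - 80)^2}{80} = 0.8000, \quad \frac{(62 - 70)^2}{70} = 0.9143, \quad \frac{(50 - 50)^2}{50} = 0.0000$$
$$\chi^2 = 0.8000 + 0.9143 + 0.0000 = 1.7143$$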
Degrees of Freedom and Decision
$$df = k - 1 - m = 3 - 1 - 0 = 2$$
where $k = 3$ categories and $m = 0$ estimated parameters (the expected proportions are given, not estimated from the same data).
Critical value: $\chi^2_{0.05,\,2} = 5.991$
Decision: $\chi^2 = 1.714 < 5.991$, so we fail to reject $H_0$.
The model's prediction distribution is consistent with the expected class proportions. The overproduction of Positive predictions and underproduction of Negative predictions are within normal random variation at $\alpha = 0.05$.
| Phase | Formula | Values | Result |
|---|---|---|---|
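| Expected counts | $E_i = n p_i$ | $200 \times (0.40,\ 0.35,\ 0.25)$ | 80, 70, 50 |
| Per-cell contributions | $(O_i - E_i)^2 / E_i$ | $64/80,\ 64/70,\ 0/50$ | 0.800, 0.914, 0.000 |
| Chi-square | $\sum$ contributions | $0.800 + 0.914 + 0.000$ | 1.714 |
| Decision | $\chi^2 < \chi^2_{0.05,\,2}$ | $1.714 < 5.991$ | Fail to reject $H_0$ |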
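The critical value and p-value behind this decision can be looked up from the chi-square distribution instead of a printed table. The sketch below assumes $\alpha = 0.05$ and 2 degrees of freedom, and uses `scipy.stats.chi2`; the variable names are illustrative.

```python
from scipy import stats

chi2_stat = 1.7143   # test statistic for the sentiment example
df = 2               # k - 1 - m = 3 - 1 - 0
alpha = 0.05         # assumed significance level

critical_value = stats.chi2.ppf(1 - alpha, df)   # upper-tail critical value
p_value = stats.chi2.sf(chi2_stat, df)           # P(chi-square with df >= chi2_stat)
reject = chi2_stat > critical_value

print(f"critical value: {critical_value:.3f}, p-value: {p_value:.4f}, reject H0: {reject}")
```

Comparing the statistic with the critical value and comparing the p-value with $\alpha$ are equivalent ways of reaching the same decision.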
Effect Size: Cramér's V for Goodness of Fit
For goodness of fit, Cramér's V measures how far the observed distribution deviates from the expected, normalized by sample size:
$$V = \sqrt{\frac{\chi^2}{n(k - 1)}} = \sqrt{\frac{1.7143}{200 \times 2}} \approx 0.065$$
A V of 0.065 is tiny — the model's prediction distribution is very close to the expected proportions.
Estimated Parameters Reduce df
When you estimate distribution parameters from the same data to compute expected frequencies, you lose one degree of freedom per estimated parameter. Example: testing whether model confidence scores follow a Normal distribution, where you estimate $\mu$ and $\sigma$ from the scores. This loses 2 df:
$$df = k - 1 - 2 = k - 3$$
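Failing to subtract the estimated-parameter df means comparing the statistic against the wrong reference distribution: the critical value from $\chi^2_{k-1}$ is larger than it should be, so the test becomes overly conservative, rejects too rarely, and loses power.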
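In SciPy, the df adjustment can be expressed through the `ddof` argument of `scipy.stats.chisquare`. The sketch below is illustrative only: the confidence scores are simulated (hypothetical), and the choice of 10 equal-probability bins under the fitted Normal is an assumption, not something taken from the example above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
scores = rng.normal(loc=0.7, scale=0.1, size=500)   # hypothetical confidence scores

# Estimate both Normal parameters from the same data (m = 2)
mu_hat, sigma_hat = scores.mean(), scores.std(ddof=1)

# k equal-probability bins under the fitted Normal, so each E_i = n / k
k = 10
inner_edges = stats.norm.ppf(np.arange(1, k) / k, loc=mu_hat, scale=sigma_hat)
observed = np.bincount(np.digitize(scores, inner_edges), minlength=k)
expected = np.full(k, scores.size / k)

# ddof=2 makes scipy use df = k - 1 - 2 instead of the default k - 1
stat, p_correct = stats.chisquare(observed, expected, ddof=2)
_, p_naive = stats.chisquare(observed, expected)

print(f"chi2={stat:.3f}, p(df={k - 3})={p_correct:.4f}, p(df={k - 1})={p_naive:.4f}")
```

The statistic is identical in both calls. Only the reference distribution changes, so the p-value computed with $df = k - 3$ is never larger than the one computed with $df = k - 1$.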
Python Code
import numpy as np
from scipy import stats
# Sentiment model: observed vs expected from historical proportions
observed_counts = np.array([88, 62, 50]) # Positive, Negative, Neutral
expected_probs = np.array([0.40, 0.35, 0.25]) # historical distribution
n = observed_counts.sum()
expected_counts = n * expected_probs
chi2_stat, p_value = stats.chisquare(observed_counts, expected_counts)
df = len(observed_counts) - 1
print(f"Observed: {observed_counts}")
print(f"Expected: {expected_counts}")
print(f"Chi-square: {chi2_stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"df: {df}")
print(f"Conclusion: {'Reject H0' if p_value < 0.05 else 'Fail to reject H0'}")
# Cramer's V
cramers_v = np.sqrt(chi2_stat / (n * (len(observed_counts) - 1)))
print(f"Cramer's V: {cramers_v:.4f}")
# Per-cell contributions
contributions = (observed_counts - expected_counts)**2 / expected_counts
for cls, obs, exp, contrib in zip(["Positive", "Negative", "Neutral"],
                                  observed_counts, expected_counts, contributions):
    print(f" {cls}: O={obs}, E={exp:.1f}, (O-E)^2/E={contrib:.4f}")
Output:
Observed: [88 62 50]
Expected: [80. 70. 50.]
Chi-square: 1.7143
p-value: 0.4244
df: 2
Conclusion: Fail to reject H0
Cramer's V: 0.0655
Positive: O=88, E=80.0, (O-E)^2/E=0.8000
Negative: O=62, E=70.0, (O-E)^2/E=0.9143
Neutral: O=50, E=50.0, (O-E)^2/E=0.0000
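The same function can test the other hypothesized distributions mentioned earlier. As a small sketch, omitting `f_exp` makes `scipy.stats.chisquare` test against a uniform distribution, which here asks whether the model spreads its predictions evenly across the three classes.

```python
import numpy as np
from scipy import stats

observed_counts = np.array([88, 62, 50])   # Positive, Negative, Neutral

# With f_exp omitted, chisquare assumes equal expected counts in every category
chi2_uniform, p_uniform = stats.chisquare(observed_counts)

print(f"Chi-square vs uniform: {chi2_uniform:.2f}, p-value: {p_uniform:.4f}")
```

Against the uniform hypothesis the statistic is far larger than against the historical proportions, so the conclusion depends entirely on which distribution you treat as the null.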
When Assumptions Fail
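If expected frequencies are too small (any cell with $E_i < 5$):
- Combine categories: merge a sparse category (for example, Neutral) into the most closely related one to increase expected counts.
- Exact test: use the exact binomial test when there are two categories, or an exact multinomial test when there are more. Fisher's exact test applies to 2×2 contingency tables rather than to a one-variable goodness of fit.
- Bootstrap p-value: sample from the expected distribution repeatedly and compare the observed chi-square statistic with the simulated ones.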
# Bootstrap chi-square p-value for small expected counts
np.random.seed(42)
n_bootstrap = 10000
chi2_obs = 1.714
bootstrap_chi2 = []
for _ in range(n_bootstrap):
    simulated = np.random.multinomial(n, expected_probs)
    chi2_sim = np.sum((simulated - expected_counts)**2 / expected_counts)
    bootstrap_chi2.append(chi2_sim)
p_bootstrap = np.mean(np.array(bootstrap_chi2) >= chi2_obs)
print(f"Bootstrap p-value: {p_bootstrap:.4f}")
Output:
Bootstrap p-value: 0.4212
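For the two-category case, an exact p-value is also available without simulation. The sketch below uses `scipy.stats.binomtest` (available in SciPy 1.7 and later) on a hypothetical sample of 8 predictions; the counts and the hypothesized proportion of 0.5 are made up for illustration and do not come from the sentiment dataset.

```python
from scipy import stats

# Hypothetical small sample: 8 predictions, 7 Positive and 1 Negative
n_positive, n_total = 7, 8
p_expected = 0.5   # hypothesized share of Positive (assumed for illustration)

# Expected counts are 4 and 4, below the rule-of-thumb threshold of 5,
# so an exact test is preferable to the chi-square approximation
result = stats.binomtest(n_positive, n_total, p_expected, alternative="two-sided")

print(f"Exact p-value: {result.pvalue:.4f}")
```

Because the p-value is computed from the binomial distribution directly, it does not depend on any minimum expected count.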
The Large-Sample Problem
With very large samples, even trivially small deviations from the expected distribution become significant. If the observed proportions stay the same while $n$ grows, the chi-square statistic scales linearly with $n$, so the same small effect eventually exceeds the critical value and becomes statistically significant. A statistically significant result with Cramér's V near 0.06 is still practically negligible. Always look at effect size alongside p-values.
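A short sketch makes the scaling concrete. It keeps the observed proportions from the sentiment example fixed and recomputes the test at larger sample sizes; the sizes 2,000 and 20,000 are hypothetical values chosen only to illustrate the effect.

```python
import numpy as np
from scipy import stats

observed_props = np.array([0.44, 0.31, 0.25])   # 88/200, 62/200, 50/200
expected_probs = np.array([0.40, 0.35, 0.25])

results = []
for n in (200, 2_000, 20_000):                   # hypothetical sample sizes
    observed = observed_props * n
    expected = expected_probs * n
    chi2_stat, p_value = stats.chisquare(observed, expected)
    cramers_v = np.sqrt(chi2_stat / (n * (len(observed) - 1)))
    results.append((n, chi2_stat, p_value, cramers_v))
    print(f"n={n:>6}: chi2={chi2_stat:8.2f}, p={p_value:.2e}, V={cramers_v:.4f}")
```

The chi-square statistic grows tenfold each time $n$ does, while Cramér's V stays at the same value, which is why the effect size is the better guide to practical importance.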
Related Concepts
The goodness of fit test is a one-variable application of the chi-square framework introduced in post 12, which covers the two-variable independence test. The test statistic, degrees of freedom, and effect size calculation are the same family of tools. The goodness of fit test connects backward to the CLT (post 1): the chi-square distribution arises as the sum of squared Normal deviations, so the test's validity depends on large enough expected counts for the Normal approximation to hold. It connects forward to model calibration: testing whether predicted probability deciles match observed outcome rates is a chi-square goodness of fit test applied to model reliability.
Honest Limitations
The goodness of fit test only tells you whether distributions match — it does not tell you which class is problematic. Use standardized residuals to identify which categories drive the discrepancy. The test also assumes the expected proportions are known without error. If you estimated expected proportions from a separate dataset that is itself noisy, the uncertainty in those estimates should be propagated — a more complex analysis than a simple goodness of fit test provides.
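Residuals for the sentiment example can be computed as follows. This is a sketch under the assumption that the expected proportions are known exactly, so the variance of each observed count under $H_0$ is $n p_i (1 - p_i)$; the cutoff of roughly ±2 is a common convention rather than a strict rule.

```python
import numpy as np

observed = np.array([88, 62, 50])              # Positive, Negative, Neutral
expected_probs = np.array([0.40, 0.35, 0.25])
n = observed.sum()
expected = n * expected_probs

# Pearson residuals: their squares sum to the chi-square statistic
pearson = (observed - expected) / np.sqrt(expected)

# Standardized residuals divide by the standard deviation of O_i under H0,
# sqrt(n * p_i * (1 - p_i)); values beyond about +/-2 flag a category
standardized = (observed - expected) / np.sqrt(n * expected_probs * (1 - expected_probs))

for cls, r, z in zip(["Positive", "Negative", "Neutral"], pearson, standardized):
    print(f"{cls}: Pearson={r:+.3f}, standardized={z:+.3f}")
```

Here no category exceeds the ±2 convention, which agrees with the overall failure to reject $H_0$.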
Test Your Understanding
- You increase the sample to 500 examples. The proportions remain the same: 220 Positive, 155 Negative, 125 Neutral. Compute the chi-square statistic and compare it to the result from $n = 200$. What does this reveal about the relationship between sample size and chi-square?
- Your model is expected to produce a uniform distribution (equal thirds: 33.3% per class). Instead you observe [100, 65, 35]. Test this hypothesis with $n = 200$ and $\alpha = 0.05$.
- A colleague estimates the expected probabilities from the same 200-example test set (rather than using historical data). They use $df = 2$. What is the correct df in this case, and what goes wrong if $df = 2$ is used?
- The Neutral class had $O = E = 50$ and contributed 0 to the chi-square. Does this mean the model's Neutral predictions are perfect? What would you additionally check?
- You want to test whether your model's confidence scores (binned into deciles) follow a uniform distribution. There are 10 bins with 150 observations each. What are the expected count per bin, the degrees of freedom, and the critical value at $\alpha = 0.05$?