Classification Performance Metrics
Accuracy is one number. Classification produces four outcomes. The entire story of model quality — who you're catching, who you're missing, and what false alarms cost — lives in the confusion matrix and the curves derived from it. This post computes every metric by hand on a loan default dataset, then shows when each metric misleads.
Anchor dataset: 20-sample loan default predictions from a logistic regression model.
```python
import numpy as np

y_true = np.array([1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0])
y_prob = np.array([0.95,0.89,0.78,0.71,0.65,0.42,0.38,0.22,
                   0.81,0.63,0.45,0.41,0.35,0.28,0.19,0.15,0.11,0.08,0.05,0.03])
# 8 actual defaults, 12 actual non-defaults
```
Step 1: Building the Confusion Matrix at Threshold 0.5
Apply threshold 0.5 manually:
- Predicted positive (ŷ=1): `y_prob >= 0.5` → samples with prob [0.95, 0.89, 0.78, 0.71, 0.65, 0.81, 0.63] = 7 samples
  - True defaults among them: [0.95✓, 0.89✓, 0.78✓, 0.71✓, 0.65✓] = 5 TP
  - True non-defaults: [0.81✗, 0.63✗] = 2 FP
- Predicted negative (ŷ=0): 13 samples
  - True defaults missed: [0.42, 0.38, 0.22] = 3 FN
  - True non-defaults: 10 TN
```python
from sklearn.metrics import confusion_matrix

y_pred = (y_prob >= 0.5).astype(int)  # flag every score at or above 0.5
cm = confusion_matrix(y_true, y_pred)  # rows = actual, columns = predicted
print(cm)
```
```
[[10  2]
 [ 3  5]]
```
| | Predicted: No Default (0) | Predicted: Default (1) |
|---|---|---|
| Actual: No Default (0) | 10 (TN) | 2 (FP, False Alarm) |
| Actual: Default (1) | 3 (FN, Missed) | 5 (TP) |
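The manual tally can also be reproduced in plain Python, without scikit-learn. This is a minimal sketch: the helper name `confusion_counts` is made up for illustration, and it uses the same "flag if score >= threshold" rule as the snippet above.

```python
# Labels and scores copied from the anchor dataset.
y_true = [1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0]
y_prob = [0.95,0.89,0.78,0.71,0.65,0.42,0.38,0.22,
          0.81,0.63,0.45,0.41,0.35,0.28,0.19,0.15,0.11,0.08,0.05,0.03]

def confusion_counts(labels, scores, threshold):
    """Hypothetical helper: return (TP, FP, FN, TN) when scores >= threshold are flagged."""
    tp = fp = fn = tn = 0
    for label, score in zip(labels, scores):
        flagged = score >= threshold
        if flagged and label == 1:
            tp += 1          # flagged, and really defaulted
        elif flagged and label == 0:
            fp += 1          # flagged, but repaid (false alarm)
        elif not flagged and label == 1:
            fn += 1          # not flagged, but defaulted (miss)
        else:
            tn += 1          # not flagged, and repaid
    return tp, fp, fn, tn

tp, fp, fn, tn = confusion_counts(y_true, y_prob, 0.5)
print(tp, fp, fn, tn)  # 5 2 3 10
```

Note the ordering: scikit-learn's matrix reads TN, FP, FN, TP row by row, while this helper returns TP first.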
Step 2: All Metrics from the Confusion Matrix
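| Metric | Formula | Computation | Value |
|---|---|---|---|
| Accuracy | (TP + TN) / (TP + TN + FP + FN) | (5 + 10) / 20 | 0.750 |
| Precision | TP / (TP + FP) | 5 / 7 | 0.714 |
| Recall (Sensitivity) | TP / (TP + FN) | 5 / 8 | 0.625 |
| Specificity | TN / (TN + FP) | 10 / 12 | 0.833 |
| F1 Score | 2 · Precision · Recall / (Precision + Recall) | 2 · 0.714 · 0.625 / (0.714 + 0.625) | 0.667 |
| Miss Rate (FNR) | FN / (FN + TP) | 3 / 8 | 0.375 |
| Fall-out (FPR) | FP / (FP + TN) | 2 / 12 | 0.167 |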
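Every value in the table follows from the four counts alone. A minimal sketch using the standard definitions (the dictionary keys are just labels chosen here, not library names):

```python
# Counts from the confusion matrix at threshold 0.5.
tp, fp, fn, tn = 5, 2, 3, 10

metrics = {
    "accuracy":    (tp + tn) / (tp + tn + fp + fn),
    "precision":   tp / (tp + fp),
    "recall":      tp / (tp + fn),
    "specificity": tn / (tn + fp),
    "f1":          2 * tp / (2 * tp + fp + fn),  # algebraically equal to 2PR / (P + R)
    "miss_rate":   fn / (fn + tp),
    "fall_out":    fp / (fp + tn),
}

for name, value in metrics.items():
    print(f"{name:<12}{value:.3f}")
```

Writing F1 directly in terms of counts avoids compounding rounding error from the already-rounded precision and recall.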
Business interpretation for loan default:
- Precision = 0.714: of 7 loans we flagged as high-risk, 5 were actual defaults. 2 customers were denied loans unnecessarily.
- Recall = 0.625: of 8 actual defaults, we caught 5. We missed 3 defaulters who received loans and likely won't repay them.
- Which matters more? A bank losing money on missed defaults (FN) typically cares more about Recall. A customer discrimination lawsuit from false alarms (FP) shifts priority to Precision. The right metric depends on the asymmetry of the business cost.
- Accuracy = 75% is misleading: if 2% of loans default and you always predict "no default," accuracy = 98%. But recall = 0% — you've detected nothing.
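The accuracy trap in the last bullet is easy to reproduce. The sketch below assumes a hypothetical portfolio of 1,000 loans with a 2% default rate and a "model" that never flags anyone; none of these numbers come from the anchor dataset.

```python
# Hypothetical portfolio: 1,000 loans, 2% of which default.
n_loans = 1000
n_defaults = 20
y_true = [1] * n_defaults + [0] * (n_loans - n_defaults)

# A degenerate "model" that always predicts "no default".
y_pred = [0] * n_loans

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

accuracy = correct / n_loans
recall = caught / n_defaults
print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")  # accuracy = 0.98, recall = 0.00
```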
Step 3: The Precision-Recall Tradeoff
As you lower the threshold, you flag more samples as positive (recall rises, precision usually falls). As you raise it, fewer are flagged (precision usually rises, recall falls):
| Threshold | TP | FP | FN | TN | Precision | Recall |
|---|---|---|---|---|---|---|
| 0.3 | 7 | 5 | 1 | 7 | 7/12 = 0.583 | 7/8 = 0.875 |
| 0.5 | 5 | 2 | 3 | 10 | 5/7 = 0.714 | 5/8 = 0.625 |
| 0.7 | 4 | 1 | 4 | 11 | 4/5 = 0.800 | 4/8 = 0.500 |
| 0.9 | 1 | 0 | 7 | 12 | 1/1 = 1.000 | 1/8 = 0.125 |
At threshold 0.9, precision = 1.0 because the only flagged sample (score 0.95) is a true default, but recall collapses to 0.125: we miss 7 of 8 defaulters. At threshold 0.7, one non-defaulter (score 0.81) is still flagged, so precision is 0.800 with recall 0.500. At threshold 0.3, we catch 7 of 8 defaulters but also flag 5 non-defaulters.
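The sweep can be checked by recomputing the counts at each threshold directly from the scores. A minimal sketch in plain Python (the function name `precision_recall_at` is made up for illustration, and scores at or above the threshold count as flagged):

```python
# Labels and scores copied from the anchor dataset.
y_true = [1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0]
y_prob = [0.95,0.89,0.78,0.71,0.65,0.42,0.38,0.22,
          0.81,0.63,0.45,0.41,0.35,0.28,0.19,0.15,0.11,0.08,0.05,0.03]

def precision_recall_at(threshold):
    """Hypothetical helper: return (TP, FP, FN, precision, recall) at one threshold."""
    flagged = [label for label, score in zip(y_true, y_prob) if score >= threshold]
    tp = sum(flagged)                 # flagged samples that are real defaults
    fp = len(flagged) - tp            # flagged samples that repaid
    fn = sum(y_true) - tp             # defaults that were not flagged
    precision = tp / (tp + fp) if flagged else float("nan")  # undefined if nothing is flagged
    recall = tp / (tp + fn)
    return tp, fp, fn, precision, recall

for t in (0.3, 0.5, 0.7, 0.9):
    tp, fp, fn, precision, recall = precision_recall_at(t)
    print(f"t={t}: TP={tp} FP={fp} FN={fn} precision={precision:.3f} recall={recall:.3f}")
```

Recall can only fall as the threshold rises, while precision has no such guarantee, which is why it is worth recomputing rather than assuming.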
[Figure: Precision-Recall curve with the operating points for thresholds 0.3, 0.5, 0.7, and 0.9 marked; both axes run from 0 to 1.]
Step 4: ROC Curve and AUC
The ROC curve plots True Positive Rate (Recall) vs False Positive Rate at each threshold:
| Threshold | FPR = FP/(FP+TN) | TPR = TP/(TP+FN) |
|---|---|---|
| 1.0 | 0/12 = 0.000 | 0/8 = 0.000 |
| 0.9 | 0/12 = 0.000 | 1/8 = 0.125 |
| 0.7 | 1/12 = 0.083 | 4/8 = 0.500 |
| 0.5 | 2/12 = 0.167 | 5/8 = 0.625 |
| 0.3 | 5/12 = 0.417 | 7/8 = 0.875 |
| 0.0 | 12/12 = 1.000 | 8/8 = 1.000 |
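The table samples only a handful of thresholds. To trace the full curve, the same (FPR, TPR) pair can be computed at every distinct score and the area accumulated with the trapezoidal rule. This is a sketch of that idea in plain Python, not scikit-learn's implementation:

```python
# Labels and scores copied from the anchor dataset.
y_true = [1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0]
y_prob = [0.95,0.89,0.78,0.71,0.65,0.42,0.38,0.22,
          0.81,0.63,0.45,0.41,0.35,0.28,0.19,0.15,0.11,0.08,0.05,0.03]

n_pos = sum(y_true)
n_neg = len(y_true) - n_pos

# Start at (0, 0): a threshold above every score flags nothing.
points = [(0.0, 0.0)]
for threshold in sorted(set(y_prob), reverse=True):
    tp = sum(1 for label, score in zip(y_true, y_prob) if score >= threshold and label == 1)
    fp = sum(1 for label, score in zip(y_true, y_prob) if score >= threshold and label == 0)
    points.append((fp / n_neg, tp / n_pos))  # (FPR, TPR) at this threshold

# Trapezoidal rule over consecutive (FPR, TPR) points.
auc = sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))
print(f"AUC-ROC (trapezoid): {auc:.4f}")  # AUC-ROC (trapezoid): 0.8333
```

Because all 20 scores are distinct, each step of the curve is purely vertical or purely horizontal, so the trapezoidal sum is exact here.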
```python
from sklearn.metrics import roc_auc_score

auc = roc_auc_score(y_true, y_prob)  # uses the raw scores, not thresholded predictions
print(f"AUC-ROC: {auc:.4f}")
```
```
AUC-ROC: 0.8333
```
[Figure: ROC curve for the model, with the diagonal of a random classifier (AUC = 0.5) as a dashed reference line.]
AUC = 0.833 means: if you randomly pick one defaulter and one non-defaulter from the dataset, there's an 83.3% chance the model assigns a higher probability to the defaulter (80 of the 8 × 12 = 96 possible pairs are ranked correctly). AUC is threshold-independent: it measures the model's discriminative ability across all possible thresholds.
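The pairwise reading of AUC can be verified by brute force: enumerate every (defaulter, non-defaulter) pair and count how often the defaulter gets the higher score. A minimal sketch, counting ties as half a win (the usual convention; there are no ties in this dataset):

```python
# Labels and scores copied from the anchor dataset.
y_true = [1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0]
y_prob = [0.95,0.89,0.78,0.71,0.65,0.42,0.38,0.22,
          0.81,0.63,0.45,0.41,0.35,0.28,0.19,0.15,0.11,0.08,0.05,0.03]

positives = [score for label, score in zip(y_true, y_prob) if label == 1]
negatives = [score for label, score in zip(y_true, y_prob) if label == 0]

# Every (defaulter, non-defaulter) pair: 8 * 12 = 96 of them.
pairs = [(p, n) for p in positives for n in negatives]
wins = sum(1 for p, n in pairs if p > n)   # defaulter ranked above non-defaulter
ties = sum(1 for p, n in pairs if p == n)

auc = (wins + 0.5 * ties) / len(pairs)
print(wins, len(pairs), round(auc, 4))  # 80 96 0.8333
```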
Step 5: F1 Score and the Beta-F Score
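F1 is the harmonic mean of Precision and Recall:

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \cdot 0.714 \cdot 0.625}{0.714 + 0.625} = 0.667$$

The harmonic mean is lower than the arithmetic mean ($(0.714 + 0.625)/2 = 0.670$) and is dominated by whichever is smaller: a model with Precision=0.99 but Recall=0.10 gets F1=0.18, not a flattering 0.55.

When Recall matters more than Precision (catching defaulters is critical), use $F_\beta$ with $\beta > 1$:

$$F_\beta = \frac{(1 + \beta^2) \cdot \text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$

$F_2$ (Recall weighted 2× more):

$$F_2 = \frac{5 \cdot 0.714 \cdot 0.625}{4 \cdot 0.714 + 0.625} = 0.641$$

$F_{0.5}$ (Precision weighted 2× more):

$$F_{0.5} = \frac{1.25 \cdot 0.714 \cdot 0.625}{0.25 \cdot 0.714 + 0.625} = 0.694$$

$F_2 < F_1$ because low recall is penalized harder. $F_{0.5} > F_1$ because the model's precision of 0.714 is respectable.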
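The three scores can be computed side by side from the standard F-beta definition. A minimal sketch (the function `f_beta` is written here for illustration; scikit-learn ships its own `fbeta_score` that works on label arrays instead):

```python
def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall; beta > 1 favors recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Precision and recall at threshold 0.5 (TP=5, FP=2, FN=3).
precision, recall = 5 / 7, 5 / 8

for beta in (0.5, 1, 2):
    print(f"beta={beta}: F = {f_beta(precision, recall, beta):.3f}")
# beta=0.5: F = 0.694
# beta=1: F = 0.667
# beta=2: F = 0.641
```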
Code Summary
```python
from sklearn.metrics import classification_report

# y_true and y_pred as defined in Step 1 (threshold 0.5)
print(classification_report(y_true, y_pred, target_names=['No Default', 'Default']))
```
```
              precision    recall  f1-score   support

  No Default       0.77      0.83      0.80        12
     Default       0.71      0.62      0.67         8

    accuracy                           0.75        20
   macro avg       0.74      0.73      0.73        20
weighted avg       0.75      0.75      0.75        20
```
Metric Selection Guide
| Business Question | Metric to Use |
|---|---|
| How often is our model right overall? | Accuracy (only if classes are balanced) |
| Of our flagged loans, how many default? | Precision |
| Of all actual defaults, how many did we catch? | Recall |
| Balance between precision and recall? | F1 |
| Catching defaults is critical (FN is costly)? | $F_2$ or Recall |
| Comparing models across thresholds? | AUC-ROC |
| Severe class imbalance (rare defaults)? | AUC-PR |
Related Concepts and Honest Limitations
The confusion matrix and all derived metrics depend on the chosen threshold. On this dataset, F1 is 0.667 at threshold 0.5 but 0.700 at threshold 0.3 (TP=7, FP=5, FN=1), even though the model and its AUC-ROC are unchanged. Always visualize the Precision-Recall and ROC curves before committing to a threshold: it should be chosen from the business cost ratio of FP to FN, not arbitrarily set to 0.5.
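One way to make "choose by cost ratio" concrete is to score every candidate threshold by total cost and keep the cheapest. The sketch below uses made-up unit costs (4:1 and 1:4) purely for illustration, and `best_threshold` is a hypothetical helper, not a library function:

```python
# Labels and scores copied from the anchor dataset.
y_true = [1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0]
y_prob = [0.95,0.89,0.78,0.71,0.65,0.42,0.38,0.22,
          0.81,0.63,0.45,0.41,0.35,0.28,0.19,0.15,0.11,0.08,0.05,0.03]

def best_threshold(fn_cost, fp_cost):
    """Hypothetical helper: return (threshold, cost) minimizing fn_cost*FN + fp_cost*FP."""
    best = None
    for threshold in sorted(set(y_prob)):  # every distinct score is a candidate
        fn = sum(1 for label, score in zip(y_true, y_prob) if label == 1 and score < threshold)
        fp = sum(1 for label, score in zip(y_true, y_prob) if label == 0 and score >= threshold)
        cost = fn_cost * fn + fp_cost * fp
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best

# Assumed costs: a missed default is 4x as expensive as a false alarm, then the reverse.
print(best_threshold(fn_cost=4, fp_cost=1))  # (0.22, 6)
print(best_threshold(fn_cost=1, fp_cost=4))  # (0.89, 6)
```

When misses dominate the cost, the cheapest threshold drops far below 0.5; when false alarms dominate, it climbs well above it.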
AUC-ROC looks strong here, but it can flatter a model on severely imbalanced data. With 99% non-defaults, a model can reach AUC = 0.95 and still be nearly useless for finding the rare defaults: FPR is normalized by the very large negative class, so even a small FPR can produce far more false alarms than true positives, and precision stays low. Post 07 covers this with the AUC-PR metric for imbalanced classification.
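The effect of imbalance on precision can be seen by holding one ROC operating point fixed and changing only the default rate. The operating point below (TPR = 0.80, FPR = 0.05) and the function name are assumptions chosen for illustration; they are not derived from the model in this post.

```python
def precision_at(tpr, fpr, prevalence, n=100_000):
    """Expected precision at an operating point (TPR, FPR) for a given positive rate."""
    positives = prevalence * n
    negatives = n - positives
    tp = tpr * positives      # defaults correctly flagged
    fp = fpr * negatives      # non-defaults flagged by mistake
    return tp / (tp + fp)

# Same hypothetical operating point, two different default rates.
balanced = precision_at(tpr=0.80, fpr=0.05, prevalence=0.40)
rare = precision_at(tpr=0.80, fpr=0.05, prevalence=0.01)
print(f"40% defaults: precision = {balanced:.3f}")  # 40% defaults: precision = 0.914
print(f" 1% defaults: precision = {rare:.3f}")      #  1% defaults: precision = 0.139
```

The ROC point is identical in both cases, which is why ROC-based summaries alone cannot reveal the drop in precision.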
Test Your Understanding
- The confusion matrix gives TP=5, FP=2, FN=3, TN=10. If the bank loses $50k per missed defaulter (FN) and $5k per false alarm (FP), what is the total expected cost at threshold 0.5? At threshold 0.3 (where TP=7, FP=5, FN=1, TN=7)?
- AUC-ROC = 0.833 means an 83.3% chance that a randomly drawn defaulter has a higher predicted probability than a randomly drawn non-defaulter. If you shuffle the predicted probabilities randomly (destroying the model), what would AUC-ROC be?
- A model achieves Precision=0.90 and Recall=0.30. Its F1 is 0.45. A second model has Precision=0.60 and Recall=0.60. Its F1 is 0.60, although both models have the same arithmetic mean of 0.60. Which model does the harmonic mean favor, and why is this the right choice?
- We computed $F_2$ and $F_{0.5}$. As $\beta \to \infty$, what value does $F_\beta$ converge to? As $\beta \to 0$?
- The miss rate (FNR = 0.375) and recall (TPR = 0.625) sum to 1.0. Is this always true? Prove it from the formulas.