Classification Performance Metrics

Jun 26, 2026 · 7 min read · By Mohammed Vasim
Machine Learning · AI · Data Science

Accuracy is one number. Classification produces four outcomes. The entire story of model quality — who you're catching, who you're missing, and what false alarms cost — lives in the confusion matrix and the curves derived from it. This post computes every metric by hand on a loan default dataset, then shows when each metric misleads.

Anchor dataset: 20-sample loan default predictions from a logistic regression model.

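```python
import numpy as np

# Ground-truth labels: 1 = default, 0 = no default
y_true = np.array([1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0])
# Predicted default probabilities from the logistic regression model
y_prob = np.array([0.95,0.89,0.78,0.71,0.65,0.42,0.38,0.22,
                   0.81,0.63,0.45,0.41,0.35,0.28,0.19,0.15,0.11,0.08,0.05,0.03])
# 8 actual defaults, 12 actual non-defaults
```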

Step 1: Building the Confusion Matrix at Threshold 0.5

Apply threshold 0.5 manually:

  • Predicted positive (ŷ=1): y_prob ≥ 0.5 → samples with prob [0.95, 0.89, 0.78, 0.71, 0.65, 0.81, 0.63] = 7 samples
    • True defaults among them: [0.95✓, 0.89✓, 0.78✓, 0.71✓, 0.65✓] = 5 TP
    • True non-defaults: [0.81✗, 0.63✗] = 2 FP
  • Predicted negative (ŷ=0): 13 samples
    • True defaults missed: [0.42, 0.38, 0.22] = 3 FN
    • True non-defaults: 10 TN
```python
from sklearn.metrics import confusion_matrix

# Label a sample as a predicted default when its probability reaches 0.5
y_pred = (y_prob >= 0.5).astype(int)
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

```
[[10  2]
 [ 3  5]]
```

Confusion Matrix at Threshold 0.5

| | Predicted: No Default (0) | Predicted: Default (1) |
|---|---|---|
| Actual: No Default (0) | 10 (TN) | 2 (FP, false alarm) |
| Actual: Default (1) | 3 (FN, missed) | 5 (TP) |
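The manual count above can be reproduced without scikit-learn. This is a minimal sketch that repeats the anchor dataset so it runs on its own; the variable names `tp`, `fp`, `fn`, and `tn` are chosen here for illustration.

```python
import numpy as np

# Anchor dataset, repeated so this block is self-contained.
y_true = np.array([1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0])
y_prob = np.array([0.95,0.89,0.78,0.71,0.65,0.42,0.38,0.22,
                   0.81,0.63,0.45,0.41,0.35,0.28,0.19,0.15,0.11,0.08,0.05,0.03])

y_pred = (y_prob >= 0.5).astype(int)

# Each cell counts the samples matching one (actual, predicted) combination.
tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # defaults we flagged
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # non-defaults we flagged
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # defaults we missed
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # non-defaults we cleared

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")  # TP=5 FP=2 FN=3 TN=10
```

With labels 0 and 1, scikit-learn lays the matrix out as `[[TN, FP], [FN, TP]]`, which is why the printed array reads 10, 2, 3, 5.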

Step 2: All Metrics from the Confusion Matrix

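| Metric | Formula | Computation | Value |
|---|---|---|---|
| Accuracy | $\frac{TP + TN}{TP + FP + FN + TN}$ | $\frac{5 + 10}{20}$ | 0.750 |
| Precision | $\frac{TP}{TP + FP}$ | $\frac{5}{7}$ | 0.714 |
| Recall (Sensitivity) | $\frac{TP}{TP + FN}$ | $\frac{5}{8}$ | 0.625 |
| Specificity | $\frac{TN}{TN + FP}$ | $\frac{10}{12}$ | 0.833 |
| F1 Score | $\frac{2 \cdot P \cdot R}{P + R}$ | $\frac{2 \times 0.714 \times 0.625}{0.714 + 0.625}$ | 0.667 |
| Miss Rate (FNR) | $\frac{FN}{FN + TP}$ | $\frac{3}{8}$ | 0.375 |
| Fall-out (FPR) | $\frac{FP}{FP + TN}$ | $\frac{2}{12}$ | 0.167 |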
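Every value in the table follows from the four counts alone. The sketch below hard-codes the counts from Step 1 and applies the standard definitions; the dictionary layout is just a convenience for printing.

```python
# Counts from the confusion matrix at threshold 0.5.
tp, fp, fn, tn = 5, 2, 3, 10

precision = tp / (tp + fp)
recall = tp / (tp + fn)

metrics = {
    "accuracy": (tp + tn) / (tp + fp + fn + tn),
    "precision": precision,
    "recall": recall,
    "specificity": tn / (tn + fp),
    "f1": 2 * precision * recall / (precision + recall),
    "miss_rate": fn / (fn + tp),   # FNR
    "fall_out": fp / (fp + tn),    # FPR
}

for name, value in metrics.items():
    print(f"{name:<12} {value:.3f}")
```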

Business interpretation for loan default:

  • Precision = 0.714: of 7 loans we flagged as high-risk, 5 were actual defaults. 2 customers were denied loans unnecessarily.
  • Recall = 0.625: of 8 actual defaults, we caught 5. We missed 3 defaulters who received loans and likely won't repay them.
  • Which matters more? A bank losing money on missed defaults (FN) typically cares more about Recall. A customer discrimination lawsuit from false alarms (FP) shifts priority to Precision. The right metric depends on the asymmetry of the business cost.
  • Accuracy = 75% is misleading: if 2% of loans default and you always predict "no default," accuracy = 98%. But recall = 0% — you've detected nothing.
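The accuracy trap in the last bullet is easy to reproduce. The sketch below assumes a hypothetical portfolio of 1,000 loans with a 2% default rate (the portfolio size is an assumption made for illustration, not data from this post) and a "model" that always predicts "no default."

```python
import numpy as np

# Hypothetical portfolio: 1,000 loans, 2% of which default.
n_loans, n_defaults = 1000, 20
y_true = np.zeros(n_loans, dtype=int)
y_true[:n_defaults] = 1

# A degenerate model that never flags anyone.
y_pred = np.zeros(n_loans, dtype=int)

accuracy = float(np.mean(y_true == y_pred))
caught = int(np.sum((y_true == 1) & (y_pred == 1)))
recall = caught / n_defaults

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")  # accuracy=0.98 recall=0.00
```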

Step 3: The Precision-Recall Tradeoff

As you lower the threshold, you flag more samples as positive (higher recall, lower precision). As you raise it, fewer are flagged (higher precision, lower recall):

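| Threshold | TP | FP | FN | TN | Precision | Recall |
|---|---|---|---|---|---|---|
| 0.3 | 7 | 5 | 1 | 7 | 7/12 = 0.583 | 7/8 = 0.875 |
| 0.5 | 5 | 2 | 3 | 10 | 5/7 = 0.714 | 5/8 = 0.625 |
| 0.7 | 4 | 1 | 4 | 11 | 4/5 = 0.800 | 4/8 = 0.500 |
| 0.9 | 1 | 0 | 7 | 12 | 1/1 = 1.000 | 1/8 = 0.125 |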

At threshold 0.9, precision = 1.0 because the only flagged sample (probability 0.95) is a true default, but recall collapses to 0.125. At threshold 0.7, one non-defaulter (probability 0.81) is still flagged, so precision is 0.8 while recall is 0.5. At threshold 0.3, we catch 7 of 8 defaulters but also flag 5 non-defaulters.
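To check threshold counts like these directly against the data, a small sweep is enough. This is a sketch; the helper name `counts_at` is hypothetical, and the dataset is repeated so the block runs standalone.

```python
import numpy as np

y_true = np.array([1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0])
y_prob = np.array([0.95,0.89,0.78,0.71,0.65,0.42,0.38,0.22,
                   0.81,0.63,0.45,0.41,0.35,0.28,0.19,0.15,0.11,0.08,0.05,0.03])

def counts_at(threshold):
    """Return (TP, FP, FN, TN) when flagging every sample with prob >= threshold."""
    flagged = y_prob >= threshold
    actual = y_true == 1
    tp = int(np.sum(actual & flagged))
    fp = int(np.sum(~actual & flagged))
    fn = int(np.sum(actual & ~flagged))
    tn = int(np.sum(~actual & ~flagged))
    return tp, fp, fn, tn

rows = {}
for t in (0.3, 0.5, 0.7, 0.9):
    tp, fp, fn, tn = counts_at(t)
    # Precision is undefined when nothing is flagged; guard against 0/0.
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn)
    rows[t] = (tp, fp, fn, tn, precision, recall)
    print(f"t={t:.1f} TP={tp} FP={fp} FN={fn} TN={tn} "
          f"precision={precision:.3f} recall={recall:.3f}")
```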

[Figure: Precision-Recall curve (x-axis: Recall, y-axis: Precision) with markers at thresholds 0.3, 0.5, 0.7, and 0.9]

Step 4: ROC Curve and AUC

The ROC curve plots True Positive Rate (Recall) vs False Positive Rate at each threshold:

| Threshold | FPR = FP/(FP+TN) | TPR = TP/(TP+FN) |
|---|---|---|
| 1.0 | 0/12 = 0.000 | 0/8 = 0.000 |
| 0.9 | 0/12 = 0.000 | 1/8 = 0.125 |
| 0.7 | 1/12 = 0.083 | 4/8 = 0.500 |
| 0.5 | 2/12 = 0.167 | 5/8 = 0.625 |
| 0.3 | 5/12 = 0.417 | 7/8 = 0.875 |
| 0.0 | 12/12 = 1.000 | 8/8 = 1.000 |
```python
from sklearn.metrics import roc_auc_score

# AUC is computed from the raw probabilities, not from thresholded predictions
auc = roc_auc_score(y_true, y_prob)
print(f"AUC-ROC: {auc:.4f}")
```

```
AUC-ROC: 0.8333
```

[Figure: ROC curve (x-axis: False Positive Rate, y-axis: True Positive Rate); the dashed diagonal marks a random classifier (AUC = 0.5)]
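The curve itself can be traced by hand: sort the samples by descending probability, then step up for every default and right for every non-default. The sketch below does this with numpy and integrates with the trapezoid rule. It assumes there are no tied scores, which holds for this dataset; tied scores would need to be grouped into a single step.

```python
import numpy as np

y_true = np.array([1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0])
y_prob = np.array([0.95,0.89,0.78,0.71,0.65,0.42,0.38,0.22,
                   0.81,0.63,0.45,0.41,0.35,0.28,0.19,0.15,0.11,0.08,0.05,0.03])

# Walk down the ranking from the most to the least confident prediction.
order = np.argsort(-y_prob)
sorted_true = y_true[order]

# Cumulative share of positives (TPR) and negatives (FPR) flagged so far,
# with the (0, 0) starting point prepended.
tpr = np.concatenate([[0.0], np.cumsum(sorted_true) / sorted_true.sum()])
fpr = np.concatenate([[0.0], np.cumsum(1 - sorted_true) / (1 - sorted_true).sum()])

# Trapezoid rule over the ROC points.
area = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
print(f"AUC by trapezoid rule: {area:.4f}")
```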

AUC = 0.833 means: if you randomly pick one defaulter and one non-defaulter from the dataset, there's an 83.3% chance the model assigns a higher probability to the defaulter. Concretely, of the 8 × 12 = 96 possible (defaulter, non-defaulter) pairs, the model ranks 80 correctly. AUC is threshold-independent: it measures the model's discriminative ability across all possible thresholds.
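The pairwise reading of AUC can be verified by brute force: compare every defaulter's score against every non-defaulter's score and count how often the defaulter ranks higher. A minimal sketch, counting ties as half a win by the usual convention (there are none in this dataset):

```python
import numpy as np

y_true = np.array([1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0])
y_prob = np.array([0.95,0.89,0.78,0.71,0.65,0.42,0.38,0.22,
                   0.81,0.63,0.45,0.41,0.35,0.28,0.19,0.15,0.11,0.08,0.05,0.03])

pos = y_prob[y_true == 1]  # scores of actual defaulters
neg = y_prob[y_true == 0]  # scores of actual non-defaulters

# Score difference for every (defaulter, non-defaulter) pair via broadcasting.
diff = pos[:, None] - neg[None, :]
n_pairs = diff.size
wins = float(np.sum(diff > 0) + 0.5 * np.sum(diff == 0))

pairwise_auc = wins / n_pairs
print(f"{wins:.0f} of {n_pairs} pairs ranked correctly -> AUC = {pairwise_auc:.4f}")
```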

Step 5: F1 Score and the Beta-F Score

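F1 is the harmonic mean of Precision and Recall:

$$F_1 = \frac{2 \cdot P \cdot R}{P + R} = \frac{2 \times 0.714 \times 0.625}{0.714 + 0.625} = 0.667$$

The harmonic mean never exceeds the arithmetic mean ($\frac{P + R}{2}$) and is dominated by whichever value is smaller: a model with Precision=0.99 but Recall=0.10 gets F1=0.18, not a flattering 0.55.

When Recall matters more than Precision (catching defaulters is critical), use $F_\beta$ with $\beta > 1$:

$$F_\beta = \frac{(1 + \beta^2) \cdot P \cdot R}{\beta^2 \cdot P + R}$$

$F_2$ (Recall weighted 2× more):

$$F_2 = \frac{5 \times 0.714 \times 0.625}{4 \times 0.714 + 0.625} = 0.641$$

$F_{0.5}$ (Precision weighted 2× more):

$$F_{0.5} = \frac{1.25 \times 0.714 \times 0.625}{0.25 \times 0.714 + 0.625} = 0.694$$

$F_2 < F_1$ because low recall is penalized harder. $F_{0.5} > F_1$ because the model's precision of 0.714 is respectable.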
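The F-beta family is a one-line function. The sketch below uses the standard definition with the exact precision and recall from Step 2; the function name `f_beta` is chosen here for illustration.

```python
def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall; beta > 1 favors recall."""
    b2 = beta ** 2
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0  # both precision and recall are zero
    return (1 + b2) * precision * recall / denominator

precision, recall = 5 / 7, 5 / 8  # from the confusion matrix at threshold 0.5

f1 = f_beta(precision, recall, 1.0)
f2 = f_beta(precision, recall, 2.0)
f_half = f_beta(precision, recall, 0.5)

print(f"F1={f1:.3f} F2={f2:.3f} F0.5={f_half:.3f}")  # F1=0.667 F2=0.641 F0.5=0.694

# The lopsided model from the text: high precision cannot rescue low recall.
print(f"{f_beta(0.99, 0.10, 1.0):.2f}")  # 0.18
```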

Code Summary

```python
from sklearn.metrics import classification_report

# y_pred is the threshold-0.5 prediction vector from Step 1
print(classification_report(y_true, y_pred, target_names=['No Default', 'Default']))
```

```
              precision    recall  f1-score   support

  No Default       0.77      0.83      0.80        12
     Default       0.71      0.62      0.67         8

    accuracy                           0.75        20
   macro avg       0.74      0.73      0.73        20
weighted avg       0.75      0.75      0.75        20
```

Metric Selection Guide

| Business Question | Metric to Use |
|---|---|
| How often is our model right overall? | Accuracy (only if classes are balanced) |
| Of our flagged loans, how many default? | Precision |
| Of all actual defaults, how many did we catch? | Recall |
| Balance between precision and recall? | F1 |
| Catching defaults is critical (FN is costly)? | $F_2$ or Recall |
| Comparing models across thresholds? | AUC-ROC |
| Severe class imbalance (rare defaults)? | AUC-PR |

The confusion matrix and all derived metrics depend on the chosen threshold. The model above has AUC-ROC = 0.833 regardless of threshold, yet its F1 is 0.667 at threshold 0.5 and 0.700 at threshold 0.35 (TP=7, FP=5, FN=1). Always visualize the Precision-Recall and ROC curves before committing to a threshold: the threshold should be chosen by the business cost ratio of FP to FN, not arbitrarily set to 0.5.
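One way to turn a cost ratio into a threshold is to price every candidate threshold and keep the cheapest. The sketch below assumes a missed default costs 5 times as much as a false alarm; `COST_FN` and `COST_FP` are hypothetical relative costs chosen for illustration, not figures from this post. On 20 samples the result is only illustrative, since a threshold picked this way should be validated on held-out data.

```python
import numpy as np

y_true = np.array([1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0])
y_prob = np.array([0.95,0.89,0.78,0.71,0.65,0.42,0.38,0.22,
                   0.81,0.63,0.45,0.41,0.35,0.28,0.19,0.15,0.11,0.08,0.05,0.03])

# Hypothetical relative costs (assumption): a missed default is 5x a false alarm.
COST_FN = 5.0
COST_FP = 1.0

# Every distinct score is a candidate threshold (flag when prob >= threshold).
thresholds = np.unique(y_prob)
costs = []
for t in thresholds:
    flagged = y_prob >= t
    fn = int(np.sum((y_true == 1) & ~flagged))
    fp = int(np.sum((y_true == 0) & flagged))
    costs.append(COST_FN * fn + COST_FP * fp)

best = int(np.argmin(costs))
best_threshold = float(thresholds[best])
best_cost = float(costs[best])

cost_at_half = COST_FN * 3 + COST_FP * 2  # FN=3, FP=2 at threshold 0.5
print(f"cheapest threshold={best_threshold:.2f} cost={best_cost:.0f} "
      f"(vs {cost_at_half:.0f} at 0.5)")
```

Under this assumed ratio the cheapest threshold sits well below 0.5, because each extra false alarm is cheap relative to a missed defaulter.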

AUC-ROC = 0.833 looks strong here, but on a severely imbalanced dataset (99% non-default), a model can reach AUC = 0.95 and still be nearly useless in practice. FPR is computed over the very large negative class, so even a small FPR can translate into far more false alarms than true detections, and precision stays low. Post 07 covers this with the AUC-PR metric for imbalanced classification.

Test Your Understanding

  1. The confusion matrix gives TP=5, FP=2, FN=3, TN=10. If the bank loses $50k per missed defaulter (FN) and $5k per false alarm (FP), what is the total expected cost at threshold 0.5? At threshold 0.3 (where TP=7, FP=5, FN=1, TN=7)?

  2. AUC-ROC = 0.833 means an 83.3% chance that a randomly drawn defaulter has a higher predicted probability than a randomly drawn non-defaulter. If you shuffle the predicted probabilities randomly (destroying the model), what would AUC-ROC be?

  3. A model achieves Precision=0.90 and Recall=0.30. Its F1 is 0.45. A second model has Precision=0.60 and Recall=0.60. Its F1 is 0.60, even though both models share the same arithmetic mean (0.60). Which model does the harmonic mean favor, and why is this the right choice?

  4. We computed $F_2$ and $F_{0.5}$. As $\beta \to \infty$, what value does $F_\beta$ converge to? As $\beta \to 0$?

  5. The miss rate (FNR = 0.375) and recall (TPR = 0.625) sum to 1.0. Is this always true? Prove it from the formulas.
