
Classification Performance Metrics

Jun 26, 2026•7 min read•By Mohammed Vasim

Machine Learning · AI · Data Science

Accuracy is one number. Classification produces four outcomes. The entire story of model quality — who you're catching, who you're missing, and what false alarms cost — lives in the confusion matrix and the curves derived from it. This post computes every metric by hand on a loan default dataset, then shows when each metric misleads.

Anchor dataset: 20-sample loan default predictions from a logistic regression model.

python

import numpy as np

y_true = np.array([1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0])
y_prob = np.array([0.95,0.89,0.78,0.71,0.65,0.42,0.38,0.22,
                   0.81,0.63,0.45,0.41,0.35,0.28,0.19,0.15,0.11,0.08,0.05,0.03])
# 8 actual defaults, 12 actual non-defaults

Step 1: Building the Confusion Matrix at Threshold 0.5

Apply threshold 0.5 manually:

Predicted positive (ŷ=1): y_prob ≥ 0.5 → samples with prob [0.95, 0.89, 0.78, 0.71, 0.65, 0.81, 0.63] = 7 samples
- True defaults among them: [0.95✓, 0.89✓, 0.78✓, 0.71✓, 0.65✓] = 5 TP
- True non-defaults: [0.81✗, 0.63✗] = 2 FP
Predicted negative (ŷ=0): 13 samples
- True defaults missed: [0.42, 0.38, 0.22] = 3 FN
- True non-defaults: 10 TN

python

from sklearn.metrics import confusion_matrix

y_pred = (y_prob >= 0.5).astype(int)
cm = confusion_matrix(y_true, y_pred)
print(cm)

[[10  2]
 [ 3  5]]
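The same four counts can be tallied without scikit-learn. The following is a minimal sketch in plain Python; the helper name `confusion_counts` is made up for this illustration, and it uses the same `>=` rule as the code above.

```python
# Anchor dataset from the post, written as plain lists (no NumPy needed here).
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y_prob = [0.95, 0.89, 0.78, 0.71, 0.65, 0.42, 0.38, 0.22,
          0.81, 0.63, 0.45, 0.41, 0.35, 0.28, 0.19, 0.15, 0.11, 0.08, 0.05, 0.03]


def confusion_counts(y_true, y_prob, threshold=0.5):
    """Return (tp, fp, fn, tn), predicting 1 when prob >= threshold."""
    tp = fp = fn = tn = 0
    for label, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1  # flagged, and really defaulted
        elif predicted == 1 and label == 0:
            fp += 1  # flagged, but did not default (false alarm)
        elif predicted == 0 and label == 1:
            fn += 1  # not flagged, but defaulted (missed)
        else:
            tn += 1  # not flagged, and did not default
    return tp, fp, fn, tn


tp, fp, fn, tn = confusion_counts(y_true, y_prob)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")  # TP=5 FP=2 FN=3 TN=10
```

Note that scikit-learn prints the matrix as `[[TN FP], [FN TP]]`, so the tuple order here differs from the printed layout.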

[Figure: confusion matrix at threshold 0.5, rows = Actual, columns = Predicted. No Default (0) row: TN = 10, FP (False Alarm) = 2. Default (1) row: FN (Missed) = 3, TP = 5.]

Step 2: All Metrics from the Confusion Matrix

Metric	Formula	Computation	Value
Accuracy	$(TP + TN) / (TP + TN + FP + FN)$	$(5 + 10) / 20$	0.750
Precision	$TP / (TP + FP)$	$5 / (5 + 2)$	0.714
Recall (Sensitivity)	$TP / (TP + FN)$	$5 / (5 + 3)$	0.625
Specificity	$TN / (TN + FP)$	$10 / (10 + 2)$	0.833
F1 Score	$2 P \cdot R / (P + R)$	$2 \times 0.714 \times 0.625 / 1.339$	0.667
Miss Rate (FNR)	$FN / (FN + TP)$	$3 / (3 + 5)$	0.375
Fall-out (FPR)	$FP / (FP + TN)$	$2 / (2 + 10)$	0.167
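As a quick check on the table, the formulas can be applied directly to the four counts. This is a small sketch, not a library API: `metrics_from_counts` is a hypothetical helper, and it assumes every denominator is non-zero (true for these counts).

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Apply the table's formulas to raw confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
        "miss_rate": fn / (fn + tp),   # FNR = 1 - recall
        "fall_out": fp / (fp + tn),    # FPR = 1 - specificity
    }


# Counts at threshold 0.5 from Step 1.
metrics = metrics_from_counts(tp=5, fp=2, fn=3, tn=10)
for name, value in metrics.items():
    print(f"{name:<12}{value:.3f}")
```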

Business interpretation for loan default:

Precision = 0.714: of 7 loans we flagged as high-risk, 5 were actual defaults. 2 customers were denied loans unnecessarily.
Recall = 0.625: of 8 actual defaults, we caught 5. We missed 3 defaulters who received loans and likely won't repay them.
Which matters more? A bank losing money on missed defaults (FN) typically cares more about Recall. A customer discrimination lawsuit from false alarms (FP) shifts priority to Precision. The right metric depends on the asymmetry of the business cost.
Accuracy = 0.750 can mislead on its own, especially with imbalanced classes: if only 2% of loans default and you always predict "no default," accuracy is 98% but recall is 0%. You've detected nothing.
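The always-predict-"no default" case is easy to reproduce. The sketch below uses a hypothetical portfolio of 1,000 loans with a 2% default rate; those sizes are assumptions chosen only to match the percentages in the text.

```python
# Hypothetical portfolio: 1,000 loans, 2% of which default.
n_loans = 1000
n_defaults = 20
y_true = [1] * n_defaults + [0] * (n_loans - n_defaults)

# A "model" that never flags anyone.
y_pred = [0] * n_loans

correct = sum(t == p for t, p in zip(y_true, y_pred))
caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

accuracy = correct / n_loans
recall = caught / n_defaults
print(f"accuracy={accuracy:.2f} recall={recall:.2f}")  # accuracy=0.98 recall=0.00
```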

Step 3: The Precision-Recall Tradeoff

Lowering the threshold flags more samples as positive, so recall can only rise or stay flat while precision usually falls. Raising it flags fewer samples, so recall falls and precision usually rises:

Threshold	TP	FP	FN	TN	Precision	Recall
0.3	7	5	1	7	7/12 = 0.583	7/8 = 0.875
0.5	5	2	3	10	5/7 = 0.714	5/8 = 0.625
0.7	4	1	4	11	4/5 = 0.800	4/8 = 0.500
0.9	1	0	7	12	1/1 = 1.000	1/8 = 0.125

At threshold 0.9, precision = 1.0 because the only flagged sample (0.95) is a true default, but recall falls to 0.125: 7 of 8 defaulters are missed. At threshold 0.7, one non-defaulter (0.81) is still flagged, so precision is 0.800 with recall 0.500. At threshold 0.3, the model catches 7 of 8 defaulters but also flags 5 non-defaulters.
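To recompute the counts at any set of thresholds, a sweep like the following can be used. It is a plain-Python sketch on the anchor dataset; the function name `sweep` and the tuple layout are choices made for this example, and a sample is flagged when its probability is greater than or equal to the threshold.

```python
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y_prob = [0.95, 0.89, 0.78, 0.71, 0.65, 0.42, 0.38, 0.22,
          0.81, 0.63, 0.45, 0.41, 0.35, 0.28, 0.19, 0.15, 0.11, 0.08, 0.05, 0.03]


def sweep(y_true, y_prob, thresholds):
    """Return one (threshold, tp, fp, fn, tn, precision, recall) row per threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    rows = []
    for t in thresholds:
        tp = sum(1 for y, p in zip(y_true, y_prob) if p >= t and y == 1)
        fp = sum(1 for y, p in zip(y_true, y_prob) if p >= t and y == 0)
        fn = n_pos - tp
        tn = n_neg - fp
        # Precision is undefined when nothing is flagged.
        precision = tp / (tp + fp) if (tp + fp) else float("nan")
        recall = tp / n_pos
        rows.append((t, tp, fp, fn, tn, precision, recall))
    return rows


rows = sweep(y_true, y_prob, [0.3, 0.5, 0.7, 0.9])
for t, tp, fp, fn, tn, precision, recall in rows:
    print(f"t={t:.1f}  TP={tp} FP={fp} FN={fn} TN={tn}  P={precision:.3f} R={recall:.3f}")
```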

[Figure: precision-recall tradeoff, with points marked at thresholds t = 0.3, 0.5, 0.7, and 0.9.]

Step 4: ROC Curve and AUC

The ROC curve plots True Positive Rate (Recall) vs False Positive Rate at each threshold:

Threshold	FPR = FP/(FP+TN)	TPR = TP/(TP+FN)
1.0	0/12 = 0.000	0/8 = 0.000
0.9	0/12 = 0.000	1/8 = 0.125
0.7	1/12 = 0.083	4/8 = 0.500
0.5	2/12 = 0.167	5/8 = 0.625
0.3	5/12 = 0.417	7/8 = 0.875
0.0	12/12 = 1.000	8/8 = 1.000
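Each row above is one point on the ROC curve. As a rough sketch of where the area under it comes from, the code below sweeps every distinct score as a threshold and sums trapezoids between consecutive points. The helper names `roc_points` and `trapezoid_area` are made up for this illustration and are not scikit-learn functions.

```python
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y_prob = [0.95, 0.89, 0.78, 0.71, 0.65, 0.42, 0.38, 0.22,
          0.81, 0.63, 0.45, 0.41, 0.35, 0.28, 0.19, 0.15, 0.11, 0.08, 0.05, 0.03]


def roc_points(y_true, y_prob):
    """Return (FPR, TPR) points, from the strictest threshold to the loosest."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is flagged
    for t in sorted(set(y_prob), reverse=True):
        tp = sum(1 for y, p in zip(y_true, y_prob) if p >= t and y == 1)
        fp = sum(1 for y, p in zip(y_true, y_prob) if p >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


points = roc_points(y_true, y_prob)
print(f"{len(points)} ROC points, area under curve = {trapezoid_area(points):.4f}")
```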

python

from sklearn.metrics import roc_auc_score

auc = roc_auc_score(y_true, y_prob)
print(f"AUC-ROC: {auc:.4f}")

AUC-ROC: 0.8333

[Figure: ROC curve (True Positive Rate vs False Positive Rate) with the dashed diagonal of a random classifier (AUC = 0.5) for reference.]

AUC = 0.833 means: if you randomly pick one defaulter and one non-defaulter from the dataset, there's an 83.3% chance the model assigns a higher probability to the defaulter (80 of the 96 possible pairs are ranked correctly). AUC is threshold-independent: it measures the model's discriminative ability across all possible thresholds.
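The pairwise reading of AUC can be checked by brute force: compare every defaulter's score with every non-defaulter's score and count how often the defaulter ranks higher. This is a minimal sketch; `pairwise_auc` is a hypothetical helper, and counting ties as half a win is the usual convention (there are no ties in this dataset).

```python
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y_prob = [0.95, 0.89, 0.78, 0.71, 0.65, 0.42, 0.38, 0.22,
          0.81, 0.63, 0.45, 0.41, 0.35, 0.28, 0.19, 0.15, 0.11, 0.08, 0.05, 0.03]


def pairwise_auc(y_true, y_prob):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correctly ranked pair
    return wins / (len(pos) * len(neg))


print(f"pairwise AUC: {pairwise_auc(y_true, y_prob):.4f}")
```

The double loop is quadratic in the number of samples, which is fine for 20 rows but not for large datasets; `roc_auc_score` is the practical choice there.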

Step 5: F1 Score and the Beta-F Score

F1 is the harmonic mean of Precision and Recall:

$F_{1} = \frac{2 \times P \times R}{P + R} = \frac{2 \times 0.714 \times 0.625}{0.714 + 0.625} = \frac{0.893}{1.339} = 0.667$

The harmonic mean is lower than the arithmetic mean ( $(0.714 + 0.625) /2 = 0.670$ ) and is dominated by whichever is smaller — a model with Precision=0.99 but Recall=0.10 gets F1=0.18, not a flattering 0.55.

When Recall matters more than Precision (catching defaulters is critical), use $F_{\beta}$ with $\beta > 1$:

$F_{\beta} = (1 + \beta^{2}) \cdot \frac{P \times R}{\beta^{2} \times P + R}$

$F_{2}$ (Recall weighted 2× more):

$F_{2} = \frac{5 \times 0.714 \times 0.625}{4 \times 0.714 + 0.625} = \frac{2.232}{3.481} = 0.641$

$F_{0.5}$ (Precision weighted 2× more):

$F_{0.5} = \frac{1.25 \times 0.714 \times 0.625}{0.25 \times 0.714 + 0.625} = \frac{0.558}{0.804} = 0.694$

$F_{2} = 0.641 < F_{1} = 0.667$ because recall (0.625) is the weaker of the two numbers and $F_{2}$ weights it more heavily. $F_{0.5} = 0.694 > F_{1}$ because it leans toward the stronger number, precision (0.714).
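The three scores above come from one formula with different weights. Here is a short sketch that evaluates it; `f_beta` is a hand-written helper for illustration (scikit-learn offers `fbeta_score`, which works from labels rather than from precision and recall).

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: beta > 1 favors recall, beta < 1 favors precision."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall at threshold 0.5: TP=5, FP=2, FN=3.
precision, recall = 5 / 7, 5 / 8

for beta in (0.5, 1.0, 2.0):
    print(f"F_{beta:g} = {f_beta(precision, recall, beta):.3f}")
# F_0.5 = 0.694
# F_1 = 0.667
# F_2 = 0.641
```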

Code Summary

python

from sklearn.metrics import classification_report

print(classification_report(y_true, y_pred, target_names=['No Default', 'Default']))

              precision    recall  f1-score   support

  No Default       0.77      0.83      0.80        12
     Default       0.71      0.62      0.67         8

    accuracy                           0.75        20
   macro avg       0.74      0.73      0.73        20
weighted avg       0.75      0.75      0.75        20

Metric Selection Guide

Business Question	Metric to Use
How often is our model right overall?	Accuracy (only if classes are balanced)
Of our flagged loans, how many default?	Precision
Of all actual defaults, how many did we catch?	Recall
Balance between precision and recall?	F1
Catching defaults is critical (FN is costly)?	$F_{2}$ or Recall
Comparing models across thresholds?	AUC-ROC
Severe class imbalance (rare defaults)?	AUC-PR

The confusion matrix and all derived metrics depend on the chosen threshold. On this dataset, F1 is 0.667 at threshold 0.5 but 0.700 at threshold 0.35 (TP=7, FP=5, FN=1), even though the underlying scores, and therefore the AUC-ROC of 0.833, are unchanged. Always visualize the Precision-Recall and ROC curves before committing to a threshold: the threshold should be chosen by the business cost ratio of FP to FN, not arbitrarily set to 0.5.
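One way to turn a cost ratio into a threshold is to score every candidate threshold by its total cost and keep the cheapest. The sketch below does that on the anchor dataset. The costs `COST_FN = 10` and `COST_FP = 1` are hypothetical placeholder units, not figures from a real bank, and `total_cost` is a helper written for this example.

```python
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y_prob = [0.95, 0.89, 0.78, 0.71, 0.65, 0.42, 0.38, 0.22,
          0.81, 0.63, 0.45, 0.41, 0.35, 0.28, 0.19, 0.15, 0.11, 0.08, 0.05, 0.03]

COST_FN = 10.0  # hypothetical cost of one missed defaulter
COST_FP = 1.0   # hypothetical cost of one false alarm


def total_cost(y_true, y_prob, threshold):
    """Cost of the errors made when flagging every prob >= threshold."""
    fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < threshold)
    fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= threshold)
    return fn * COST_FN + fp * COST_FP


# Only thresholds equal to an observed score can change the predictions.
candidates = sorted(set(y_prob), reverse=True)
best = min(candidates, key=lambda t: total_cost(y_true, y_prob, t))
best_cost = total_cost(y_true, y_prob, best)
print(f"lowest-cost threshold: {best:.2f}, cost: {best_cost:.0f}")
# lowest-cost threshold: 0.22, cost: 6
```

With misses priced at ten times a false alarm, the cheapest threshold sits far below 0.5. On real data the threshold would normally be picked on a validation set, since tuning it on the evaluation data gives an optimistic cost estimate.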

AUC-ROC = 0.833 looks solid here, but AUC-ROC can flatter a model on a severely imbalanced dataset (99% non-default). Because FPR is measured against the very large non-default class, a model can reach AUC = 0.95 while most of the loans it flags are still false alarms, which leaves it close to useless for finding the rare defaults. Post 07 covers this with the AUC-PR metric for imbalanced classification.
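A toy construction makes the imbalance effect concrete. Everything in the sketch below is synthetic and hypothetical: 990 non-defaults with evenly spread scores and 10 defaults that all score 0.95. It is meant only to show that a high AUC-ROC and a low precision can coexist.

```python
# Hypothetical, synthetic scores (not from the post's dataset).
neg_scores = [i / 990 for i in range(990)]  # 990 non-defaults, spread over [0, 1)
pos_scores = [0.95] * 10                    # 10 defaults, all scored 0.95

# AUC-ROC as the fraction of (default, non-default) pairs ranked correctly.
wins = sum(1 for p in pos_scores for n in neg_scores if p > n)
auc = wins / (len(pos_scores) * len(neg_scores))

# Flag everything at or above 0.95, the highest threshold that catches every default.
threshold = 0.95
tp = sum(1 for p in pos_scores if p >= threshold)
fp = sum(1 for n in neg_scores if n >= threshold)
precision = tp / (tp + fp)

print(f"AUC={auc:.3f} precision={precision:.3f} (TP={tp}, FP={fp})")
# AUC=0.951 precision=0.169 (TP=10, FP=49)
```

Only about 5% of non-defaults outrank the defaults, yet 5% of 990 is 49 false alarms against 10 true positives.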

Test Your Understanding

The confusion matrix gives TP=5, FP=2, FN=3, TN=10. If the bank loses $50k per missed defaulter (FN) and $5k per false alarm (FP), what is the total expected cost at threshold 0.5? At threshold 0.3 (where TP=7, FP=5, FN=1, TN=7)?
AUC-ROC = 0.833 means an 83.3% chance that a randomly drawn defaulter has a higher predicted probability than a randomly drawn non-defaulter. If you shuffle the predicted probabilities randomly (destroying the model), what would AUC-ROC be?
A model achieves Precision=0.90 and Recall=0.30, so its F1 is 0.45. A second model has Precision=0.60 and Recall=0.60, so its F1 is 0.60. Both have the same arithmetic mean of precision and recall (0.60). Which model does the harmonic mean favor, and why is this the right choice?
We computed $F_{2} = 0.641$ and $F_{0.5} = 0.694$. As $\beta \to \infty$, what value does $F_{\beta}$ converge to? As $\beta \to 0$?
The miss rate (FNR = 0.375) and recall (TPR = 0.625) sum to 1.0. Is this always true? Prove it from the formulas.
