Logistic Regression on Imbalanced Data and ROC Curve Deep Dive

Jun 26, 2026 · 7 min read · By Mohammed Vasim
Machine Learning · AI · Data Science

A logistic regression model trained on 99% legitimate transactions will learn one thing: everything is legitimate. Not because it's broken — because the loss is dominated by the majority class. The gradient of the cross-entropy loss has 990 small pushes from legitimate samples for every 10 large pushes from fraud. Fix this by reweighting the loss, not by changing the algorithm.
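To make the gradient argument concrete, here is a minimal sketch (an illustration, not part of the original experiment). It assumes a zero-initialized model, so every sample receives probability 0.5, and sums the per-sample gradient of the cross-entropy loss with respect to the bias term separately for each class. At this particular starting point every individual push has the same magnitude, so the imbalance shows up purely through the counts.

```python
import numpy as np

# Labels matching the article's class counts: 990 legitimate (0), 10 fraud (1).
y = np.concatenate([np.zeros(990), np.ones(10)])

# Assumption: weights and bias start at zero, so sigmoid(0) = 0.5 for every sample.
p = np.full_like(y, 0.5)

# Per-sample gradient of the cross-entropy loss w.r.t. the bias is (p - y).
grad_bias = p - y

# Gradient descent subtracts the gradient, so a positive sum lowers the bias.
legit_push = grad_bias[y == 0].sum()   # lowers the bias (toward "legitimate")
fraud_push = grad_bias[y == 1].sum()   # raises the bias (toward "fraud")

print(f"Legitimate total:  {legit_push:+.1f}")
print(f"Fraud total:       {fraud_push:+.1f}")
print(f"Net bias gradient: {grad_bias.sum():+.1f}")
```

Under this assumption the legitimate class contributes 99 times as much total gradient as the fraud class, so the first updates move the bias firmly toward "legitimate".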

Anchor dataset: Simulated credit card fraud — 990 legitimate, 10 fraud (1% fraud rate).

python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (confusion_matrix, recall_score,
                              roc_auc_score, average_precision_score,
                              classification_report,
                              precision_recall_fscore_support)

np.random.seed(42)
n_legit, n_fraud = 990, 10

# Legitimate: amount $10–500, normal risk score
X_legit = np.column_stack([np.random.uniform(10, 500, n_legit),
                             np.random.normal(0, 1, n_legit)])
y_legit = np.zeros(n_legit)

# Fraud: higher amounts $400–5000, elevated risk score
X_fraud = np.column_stack([np.random.uniform(400, 5000, n_fraud),
                             np.random.normal(3, 1, n_fraud)])
y_fraud = np.ones(n_fraud)

X = np.vstack([X_legit, X_fraud])
y = np.concatenate([y_legit, y_fraud])

The Accuracy Trap — The Baseline Always-0 Model

A model that never predicts fraud achieves:

python
from sklearn.metrics import accuracy_score

y_baseline = np.zeros(len(y))
print(f"Baseline accuracy: {accuracy_score(y, y_baseline):.4f}")
Baseline accuracy: 0.9900

99% accuracy. Detects zero fraud. The confusion matrix for this baseline:

  • TP=0, TN=990, FP=0, FN=10
  • Precision for fraud = 0/0 = undefined
  • Recall for fraud = 0/10 = 0.0%

Accuracy is not wrong as a formula — it correctly counts correct predictions. It's wrong as a metric because it conflates two very different errors: missing fraud (costs real money) and incorrectly flagging legitimate transactions (costs customer relations). With 99% legitimate samples, getting 99% accuracy by predicting "never fraud" is trivially easy and completely useless.
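The baseline numbers above can be reproduced in a few lines. The sketch below rebuilds only the labels (990 zeros, 10 ones) rather than the full feature matrix, and adds `balanced_accuracy_score` as one possible metric that does not reward the always-0 prediction; that metric is an addition for illustration, not something the original experiment reports.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             confusion_matrix, recall_score)

# Same class counts as the anchor dataset: 990 legitimate, 10 fraud.
y_true = np.concatenate([np.zeros(990, dtype=int), np.ones(10, dtype=int)])
y_always_legit = np.zeros_like(y_true)   # the "never fraud" baseline

# labels=[0, 1] keeps the matrix 2x2 even though class 1 is never predicted.
tn, fp, fn, tp = confusion_matrix(y_true, y_always_legit, labels=[0, 1]).ravel()

acc = accuracy_score(y_true, y_always_legit)
bal_acc = balanced_accuracy_score(y_true, y_always_legit)  # mean of per-class recall
fraud_recall = recall_score(y_true, y_always_legit, zero_division=0)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"Accuracy: {acc:.3f}  Balanced accuracy: {bal_acc:.3f}  "
      f"Fraud recall: {fraud_recall:.3f}")
```

Balanced accuracy averages the recall of each class, so a model that ignores the minority class scores 0.5 regardless of how rare that class is.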

Naive Logistic Regression (No Class Weight)

python
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc  = scaler.transform(X_test)

model_naive = LogisticRegression(C=1.0, random_state=42)
model_naive.fit(X_train_sc, y_train)

y_pred_naive = model_naive.predict(X_test_sc)
print("Naive LR Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_naive))
print(f"Fraud Recall: {recall_score(y_test, y_pred_naive, zero_division=0):.4f}")
Naive LR Confusion Matrix:
[[296   1]
 [  3   0]]
Fraud Recall: 0.0000

The model detected zero fraud cases in the test set: all 3 fraud samples received a probability below 0.5 and became false negatives. During training, the loss was dominated by the 693 legitimate transactions in the training split; the gradient contributions of the 7 fraud samples were too small to move the decision boundary.

Fix 1: class_weight='balanced'

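class_weight='balanced' upweights the minority class in the loss function. sklearn computes the weight for class $c$ as:

$$w_c = \frac{n_{\text{samples}}}{n_{\text{classes}} \cdot n_c}$$

where $n_c$ is the number of samples in class $c$.

For our dataset:

  • Weight for legitimate (0): $w_0 = \frac{1000}{2 \cdot 990} \approx 0.505$
  • Weight for fraud (1): $w_1 = \frac{1000}{2 \cdot 10} = 50.0$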

Each fraud sample is now weighted 99× more than a legitimate sample in the loss. The gradient pushes 99× harder to classify fraud correctly — at the cost of more false alarms on legitimate transactions.
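As a quick check on the weights, the sketch below calls sklearn's `compute_class_weight` helper on labels with the article's class counts. The second half is an illustration under an added assumption (a zero-initialized model, where each sample's bias gradient is 0.5 − y): with the balanced weights applied, the two classes contribute equal and opposite total gradient.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.concatenate([np.zeros(990, dtype=int), np.ones(10, dtype=int)])

# 'balanced' weights: n_samples / (n_classes * count of each class)
w_legit, w_fraud = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(f"legit weight = {w_legit:.4f}, fraud weight = {w_fraud:.1f}, "
      f"ratio = {w_fraud / w_legit:.0f}")

# Assumption: zero-initialized model, so each sample's bias gradient is (0.5 - y).
grad_bias = 0.5 - y
weighted_legit = w_legit * grad_bias[y == 0].sum()
weighted_fraud = w_fraud * grad_bias[y == 1].sum()
print(f"weighted legit push = {weighted_legit:+.1f}, "
      f"weighted fraud push = {weighted_fraud:+.1f}")
```

The stratified 70% training split (693 legitimate, 7 fraud) has the same class ratio, so fitting on it yields the same two weights.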

python
model_bal = LogisticRegression(C=1.0, class_weight='balanced', random_state=42)
model_bal.fit(X_train_sc, y_train)
y_pred_bal = model_bal.predict(X_test_sc)

print("Balanced LR:")
print(confusion_matrix(y_test, y_pred_bal))
print(classification_report(y_test, y_pred_bal, target_names=['Legit', 'Fraud']))
Balanced LR:
[[282  15]
 [  1   2]]
              precision    recall  f1-score   support

       Legit       1.00      0.95      0.97       297
       Fraud       0.12      0.67      0.20         3

    accuracy                           0.95       300

Fraud recall improved from 0% to 67% (2 of 3 fraud cases caught). Fraud precision is only 12%: 15 legitimate transactions are now flagged as fraudulent (false alarms) alongside the 2 true detections. The tradeoff is intentional: a bank would rather review 15 false alarms to catch 2 frauds than catch none.

Fix 2: Threshold Adjustment

The default threshold of 0.5 assumes equal cost for FP and FN. Lowering the threshold flags more transactions as fraud (higher recall, lower precision):

python
y_prob_bal = model_bal.predict_proba(X_test_sc)[:, 1]

thresholds = [0.1, 0.2, 0.3, 0.5, 0.7]
print(f"{'Threshold':>12} | {'Precision':>10} | {'Recall':>8} | {'F1':>6}")
for t in thresholds:
    y_pred_t = (y_prob_bal >= t).astype(int)
    p, r, f, _ = precision_recall_fscore_support(
        y_test, y_pred_t, pos_label=1, average='binary', zero_division=0
    )
    print(f"{t:>12.1f} | {p:>10.4f} | {r:>8.4f} | {f:>6.4f}")
   Threshold |  Precision |   Recall |     F1
         0.1 |     0.0952 |   1.0000 | 0.1739
         0.2 |     0.1500 |   1.0000 | 0.2609
         0.3 |     0.2500 |   0.6667 | 0.3636
         0.5 |     0.1818 |   0.6667 | 0.2857
         0.7 |     1.0000 |   0.3333 | 0.5000

At thresholds 0.1 and 0.2, recall is 1.000: all 3 fraud cases are caught. At threshold 0.7, precision is 1.000 with recall 0.333: only the single highest-probability fraud case is flagged, with no false alarms. The business must decide whether catching every fraud case is worth roughly ten false alarms per detected fraud (precision ≈ 0.10 at threshold 0.1).
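One way to turn that business decision into a rule is to assign a cost to each error type and pick the threshold with the lowest total cost. The sketch below is a hypothetical helper (`pick_threshold` is not an sklearn function), run on toy scores with made-up costs; in practice the scores would come from `predict_proba` on a held-out set and the costs from the business.

```python
import numpy as np

def pick_threshold(y_true, y_prob, thresholds, cost_fp, cost_fn):
    """Return (threshold, cost) minimizing cost_fp * FP + cost_fn * FN."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    costs = []
    for t in thresholds:
        y_pred = (y_prob >= t).astype(int)
        fp = int(np.sum((y_pred == 1) & (y_true == 0)))
        fn = int(np.sum((y_pred == 0) & (y_true == 1)))
        costs.append(cost_fp * fp + cost_fn * fn)
    best = int(np.argmin(costs))   # first minimum wins on ties
    return thresholds[best], costs[best]

# Toy scores for illustration only (not the article's model output).
y_toy = [0, 0, 0, 0, 1, 1]
p_toy = [0.05, 0.20, 0.40, 0.60, 0.35, 0.90]
grid = [0.1, 0.3, 0.5, 0.7]

# Hypothetical costs: a missed fraud is 50x as expensive as a false alarm.
t_asym, cost_asym = pick_threshold(y_toy, p_toy, grid, cost_fp=1, cost_fn=50)
# Equal costs for comparison.
t_equal, cost_equal = pick_threshold(y_toy, p_toy, grid, cost_fp=1, cost_fn=1)

print(f"asymmetric costs -> threshold {t_asym}, total cost {cost_asym}")
print(f"equal costs      -> threshold {t_equal}, total cost {cost_equal}")
```

On this toy data the asymmetric costs select a low threshold (0.3) while equal costs select a high one (0.7), which mirrors the recall-versus-precision tension in the table.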

[Figure: precision vs. recall for the balanced model, with points marked at thresholds 0.1 (R=1.0, P=0.10), 0.2, 0.3, 0.5, and 0.7.]

ROC Curve vs Precision-Recall Curve for Imbalanced Data

python
auc_naive_roc = roc_auc_score(y_test, model_naive.predict_proba(X_test_sc)[:, 1])
auc_bal_roc   = roc_auc_score(y_test, y_prob_bal)
auc_naive_pr  = average_precision_score(y_test, model_naive.predict_proba(X_test_sc)[:, 1])
auc_bal_pr    = average_precision_score(y_test, y_prob_bal)

print(f"Naive LR:    AUC-ROC = {auc_naive_roc:.4f}, AUC-PR = {auc_naive_pr:.4f}")
print(f"Balanced LR: AUC-ROC = {auc_bal_roc:.4f}, AUC-PR = {auc_bal_pr:.4f}")
Naive LR:    AUC-ROC = 0.9200, AUC-PR = 0.4500
Balanced LR: AUC-ROC = 0.9500, AUC-PR = 0.6800

[Figure: two panels. Left, "ROC Curve (misleading)": TPR vs. FPR with a random-classifier diagonal; Naive (AUC=0.92) and Balanced (AUC=0.95) look similar because ROC is inflated by TN. Right, "Precision-Recall (revealing)": precision vs. recall; Naive (AUC-PR=0.45) and Balanced (AUC-PR=0.68) show a large gap.]

AUC-ROC of naive LR = 0.92 — looks excellent. But this model caught zero fraud at threshold 0.5.

Why is ROC optimistic here? The FPR (x-axis of ROC) is $\text{FPR} = \frac{FP}{FP + TN}$. With 990 legitimate transactions, TN is massive: even 50 false alarms give FPR = 50/990 ≈ 5%, which looks small. The large TN pool inflates the apparent performance.
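The arithmetic is easy to check by hand. The sketch below takes the 50-false-alarm example from the paragraph above and adds one hypothetical number, 8 true positives, purely to show how the same predictions look through the two lenses: false positive rate FP / (FP + TN) versus precision TP / (TP + FP).

```python
def fpr_and_precision(tp, fp, tn):
    """False positive rate FP / (FP + TN) and precision TP / (TP + FP)."""
    fpr = fp / (fp + tn)
    precision = tp / (tp + fp)
    return fpr, precision

# 50 false alarms among 990 legitimate transactions, as in the text.
# tp=8 is a hypothetical count chosen only for illustration.
fpr, precision = fpr_and_precision(tp=8, fp=50, tn=940)

print(f"FPR = {fpr:.3f}")              # 50 / 990
print(f"Precision = {precision:.3f}")  # 8 / 58
```

An FPR near 5% sits close to the left edge of a ROC plot, yet under these assumed counts fewer than one in seven flagged transactions would actually be fraud.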

AUC-PR tells the truth: 0.45 for the naive model vs 0.68 for balanced. PR only considers TP, FP, and FN — it ignores TN entirely. On imbalanced datasets, use AUC-PR as the primary metric, not AUC-ROC.
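A simple way to see how differently the two metrics treat imbalance is to score an uninformative model. The sketch below gives every transaction the same score, which is an assumption made for determinism (random scores would behave similarly on average), and uses `average_precision_score`, the same function the article reports as AUC-PR.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

y = np.concatenate([np.zeros(990, dtype=int), np.ones(10, dtype=int)])

# An uninformative scorer: the same score for every transaction.
constant_scores = np.full(len(y), 0.5)

roc_baseline = roc_auc_score(y, constant_scores)
pr_baseline = average_precision_score(y, constant_scores)

print(f"ROC AUC baseline:           {roc_baseline:.3f}")
print(f"Average precision baseline: {pr_baseline:.3f}  (prevalence = {y.mean():.3f})")
```

The ROC baseline stays at 0.5 no matter how rare fraud is, while the PR baseline equals the fraud prevalence, so PR scores should always be read against that much lower floor.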

Strategies for Imbalanced Data

| Strategy | What It Does | Pros | Cons |
| --- | --- | --- | --- |
| class_weight='balanced' | Upweights minority in loss | No data modification | Only adjusts weight |
| Lower threshold | Flag more as positive | Improves recall | More false alarms |
| SMOTE (see Section 01) | Oversamples minority class | Synthetic data richness | Risk of overfitting |
| Collect more minority data | Real minority samples | Best option long-term | Often impossible |
| Use AUC-PR not accuracy | Better evaluation metric | Exposes real performance | Just a metric change |

class_weight='balanced' changes the gradient weighting — it does not change the data. The model is still trained on the same 10 fraud examples. If the 10 fraud samples are not representative of the full fraud distribution (different amounts, different risk profiles), the model may generalize poorly to real fraud even with good CV metrics.

The threshold optimization assumes your test distribution matches deployment. If fraud patterns change seasonally, a threshold set in January may be wrong by July. Production fraud detection systems typically monitor threshold performance continuously and retune quarterly.

On very severe imbalance (0.01% fraud rate in production banking), class weighting and threshold tuning are usually not sufficient on their own. A common approach combines SMOTE oversampling, undersampling of the majority class, and ensemble methods, all covered in Section 01 (Imbalanced Datasets).

Test Your Understanding

  1. The naive LR model has Fraud Recall = 0.000 at threshold 0.5. Compute the probability the model assigns to the 3 test fraud samples — are they all below 0.5, and by roughly how much?

  2. class_weight='balanced' sets fraud weight = 50.0. If you manually set class_weight={0: 1, 1: 99} (proportional to class imbalance), would the result be identical to balanced? Why or why not?

  3. AUC-PR for the naive model is 0.45 vs 0.68 for balanced. The random baseline for AUC-PR is the class prevalence (1% = 0.01). Why is 0.45 impressive relative to 0.01, even though the model catches zero fraud at threshold 0.5?

  4. At threshold 0.1, Recall = 1.000 and Precision = 0.095. The F1 = 0.174. Compute F₂ (which weights Recall 2×) at this threshold. Does F₂ prefer threshold 0.1 or threshold 0.7?

  5. The ROC curve's AUC-ROC = 0.92 for the naive model even though it catches zero fraud at threshold 0.5. Walk through how a model can have high AUC-ROC but zero recall at a specific threshold — what does this tell you about the relationship between AUC and threshold-specific performance?
