Logistic Regression on Imbalanced Data and ROC Curve Deep Dive

Jun 26, 2026•7 min read•By Mohammed Vasim

Machine Learning • AI • Data Science

A logistic regression model trained on 99% legitimate transactions will learn one thing: everything is legitimate. Not because it's broken — because the loss is dominated by the majority class. The gradient of the cross-entropy loss has 990 small pushes from legitimate samples for every 10 large pushes from fraud. Fix this by reweighting the loss, not by changing the algorithm.
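To make the "small pushes versus large pushes" picture concrete, the sketch below evaluates the per-sample gradient of the log loss with respect to the logit, which is $p_i - y_i$. It assumes a bias-only model that predicts the 1% base rate for every transaction; that model is an illustration, not the one trained later in this post.

```python
import numpy as np

# Labels only: 990 legitimate (0) and 10 fraud (1), as in the anchor dataset.
y = np.concatenate([np.zeros(990), np.ones(10)])

# Assumption for illustration: a bias-only model that outputs the base rate
# (0.01) for every sample, i.e. "everything is almost certainly legitimate".
p = np.full_like(y, y.mean())

# For log loss, the gradient w.r.t. each sample's logit is (p_i - y_i).
residual = p - y
push_legit = residual[y == 0]   # 990 pushes of +0.01 each
push_fraud = residual[y == 1]   # 10 pushes of -0.99 each

print(f"per-sample push: legit={push_legit[0]:+.2f}, fraud={push_fraud[0]:+.2f}")
print(f"total push:      legit={push_legit.sum():+.2f}, fraud={push_fraud.sum():+.2f}")
```

The two totals cancel (+9.90 against -9.90), so predicting the base rate everywhere is a stationary point for the bias term. Unless the features or the loss weights tip the balance, the 990 small pushes offset the 10 large ones.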

Anchor dataset: Simulated credit card fraud — 990 legitimate, 10 fraud (1% fraud rate).

python

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (confusion_matrix, recall_score,
                              roc_auc_score, average_precision_score,
                              classification_report,
                              precision_recall_fscore_support)

np.random.seed(42)
n_legit, n_fraud = 990, 10

# Legitimate: amount $10–500, normal risk score
X_legit = np.column_stack([np.random.uniform(10, 500, n_legit),
                             np.random.normal(0, 1, n_legit)])
y_legit = np.zeros(n_legit)

# Fraud: higher amounts $400–5000, elevated risk score
X_fraud = np.column_stack([np.random.uniform(400, 5000, n_fraud),
                             np.random.normal(3, 1, n_fraud)])
y_fraud = np.ones(n_fraud)

X = np.vstack([X_legit, X_fraud])
y = np.concatenate([y_legit, y_fraud])

The Accuracy Trap — The Baseline Always-0 Model

A model that never predicts fraud achieves:

python

from sklearn.metrics import accuracy_score

y_baseline = np.zeros(len(y))
print(f"Baseline accuracy: {accuracy_score(y, y_baseline):.4f}")

Baseline accuracy: 0.9900

99% accuracy. Detects zero fraud. The confusion matrix for this baseline:

TP=0, TN=990, FP=0, FN=10
Precision for fraud = 0/0 = undefined
Recall for fraud = 0/10 = 0.0%

Accuracy is not wrong as a formula — it correctly counts correct predictions. It's wrong as a metric because it conflates two very different errors: missing fraud (costs real money) and incorrectly flagging legitimate transactions (costs customer relations). With 99% legitimate samples, getting 99% accuracy by predicting "never fraud" is trivially easy and completely useless.
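The baseline numbers above can be reproduced with scikit-learn, and putting a price on each error type shows what accuracy hides. This is a minimal sketch: the two cost constants are hypothetical values chosen for illustration, not figures from this post.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = np.concatenate([np.zeros(990, dtype=int), np.ones(10, dtype=int)])
y_pred = np.zeros_like(y_true)  # always predict "legitimate"

# labels=[0, 1] fixes the layout so ravel() returns tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

# Hypothetical per-error costs (assumptions, not figures from this post).
COST_MISSED_FRAUD = 500.0   # cost of one false negative
COST_FALSE_ALARM = 5.0      # cost of one false positive

accuracy = (tp + tn) / len(y_true)
total_cost = fn * COST_MISSED_FRAUD + fp * COST_FALSE_ALARM

# zero_division=0 reports 0.0 instead of warning when nothing is predicted positive.
precision = precision_score(y_true, y_pred, zero_division=0)
recall = recall_score(y_true, y_pred, zero_division=0)

print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")
print(f"cost under the assumed prices: {total_cost:.0f}")
```

Accuracy weights both error types equally, while the cost line weights them by whatever prices the business assigns. Under any prices where a missed fraud is expensive, the always-0 model pays for all 10 misses.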

Naive Logistic Regression (No Class Weight)

python

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc  = scaler.transform(X_test)

model_naive = LogisticRegression(C=1.0, random_state=42)
model_naive.fit(X_train_sc, y_train)

y_pred_naive = model_naive.predict(X_test_sc)
print("Naive LR Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_naive))
print(f"Fraud Recall: {recall_score(y_test, y_pred_naive, zero_division=0):.4f}")

Naive LR Confusion Matrix:
[[296   1]
 [  3   0]]
Fraud Recall: 0.0000

The model detected zero fraud cases in the test set: all 3 fraud samples received a predicted probability below 0.5 and became false negatives. During training, the loss was dominated by the 693 legitimate transactions in the training split; the gradient contributions of the 7 fraud samples were too small to move the decision boundary.
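It helps to check how few fraud samples each split actually contains. The sketch below repeats the stratified 70/30 split on the labels alone; the single feature column is a placeholder assumption, since only the class counts matter here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.concatenate([np.zeros(990, dtype=int), np.ones(10, dtype=int)])
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature; only labels matter here

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# bincount returns [n_legit, n_fraud] for integer labels 0 and 1.
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print("train [legit, fraud]:", train_counts)
print("test  [legit, fraud]:", test_counts)
```

With stratification the 99:1 ratio is preserved in both splits, leaving 7 fraud samples to learn from and 3 to evaluate on. Every test-set metric in this post therefore moves in steps of one third of the fraud cases.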

Fix 1: class_weight='balanced'

class_weight='balanced' upweights the minority class in the loss function. sklearn computes the weight for class $k$ as:

$\text{weight}_k = \frac{n_{\text{samples}}}{n_{\text{classes}} \times n_{\text{samples},k}}$

For our dataset (the stratified training split keeps the same 99:1 ratio, so it gives the same weights):

Weight for legitimate (0): $1000 / (2 \times 990) \approx 0.505$
Weight for fraud (1): $1000 / (2 \times 10) = 50.0$

Each fraud sample is now weighted 99× more than a legitimate sample in the loss. The gradient pushes 99× harder to classify fraud correctly — at the cost of more false alarms on legitimate transactions.
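The weight arithmetic can be checked against scikit-learn's own helper, `compute_class_weight`. This is a small sketch on the label vector alone; the variable names are illustrative.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.concatenate([np.zeros(990, dtype=int), np.ones(10, dtype=int)])
classes = np.array([0, 1])

# Manual version of n_samples / (n_classes * n_samples_k).
counts = np.bincount(y)
manual = len(y) / (len(classes) * counts)

# The same computation via scikit-learn's helper.
sk = compute_class_weight(class_weight="balanced", classes=classes, y=y)

print("manual :", manual.round(3))
print("sklearn:", sk.round(3))
print("ratio fraud/legit:", round(manual[1] / manual[0], 2))
print("total weight per class:", counts * manual)
```

The last line shows what "balanced" means in practice: after weighting, each class carries the same total weight (500 each), so the 10 fraud samples together count as much in the loss as the 990 legitimate ones.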

python

model_bal = LogisticRegression(C=1.0, class_weight='balanced', random_state=42)
model_bal.fit(X_train_sc, y_train)
y_pred_bal = model_bal.predict(X_test_sc)

print("Balanced LR:")
print(confusion_matrix(y_test, y_pred_bal))
print(classification_report(y_test, y_pred_bal, target_names=['Legit', 'Fraud']))

Balanced LR:
[[282  15]
 [  1   2]]

              precision    recall  f1-score   support
       Legit       1.00      0.95      0.97       297
       Fraud       0.12      0.67      0.20         3

    accuracy                           0.95       300

Fraud Recall improved from 0% to 67% (catching 2 of 3 fraud cases). Precision for Fraud dropped to 12% — 15 legitimate transactions are now flagged as fraudulent (false alarms). The tradeoff is intentional: a bank prefers to review 15 false alarms per 2 caught frauds over catching zero fraud.

Fix 2: Threshold Adjustment

The default threshold of 0.5 assumes equal cost for FP and FN. Lowering the threshold flags more transactions as fraud (higher recall, lower precision):

python

y_prob_bal = model_bal.predict_proba(X_test_sc)[:, 1]

thresholds = [0.1, 0.2, 0.3, 0.5, 0.7]
print(f"{'Threshold':>12} | {'Precision':>10} | {'Recall':>8} | {'F1':>6}")
for t in thresholds:
    y_pred_t = (y_prob_bal >= t).astype(int)
    p, r, f, _ = precision_recall_fscore_support(
        y_test, y_pred_t, pos_label=1, average='binary', zero_division=0
    )
    print(f"{t:>12.1f} | {p:>10.4f} | {r:>8.4f} | {f:>6.4f}")

   Threshold | Precision |   Recall |     F1
         0.1 |    0.0952 |   1.0000 | 0.1739
         0.2 |    0.1500 |   1.0000 | 0.2609
         0.3 |    0.2500 |   0.6667 | 0.3636
         0.5 |    0.1176 |   0.6667 | 0.2000
         0.7 |    1.0000 |   0.3333 | 0.5000

At thresholds 0.1 and 0.2, Recall = 1.000: all 3 fraud cases are caught, but precision at 0.1 is only 0.095, roughly ten false alarms for every caught fraud. At threshold 0.7, Precision = 1.000 with Recall = 0.333: only the single highest-probability fraud case is flagged. The business must decide whether catching every fraud is worth that volume of false alarms.
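One way to turn that business decision into a number is the standard expected-cost rule: flag a transaction when the expected cost of missing it exceeds the expected cost of a false alarm. The sketch below assumes the model's probabilities are calibrated, which is generally not true after `class_weight='balanced'` reweighting. The function name and cost ratios are hypothetical.

```python
def cost_optimal_threshold(cost_fp: float, cost_fn: float) -> float:
    """Threshold on P(fraud) that minimizes expected cost.

    Flag a transaction when p * cost_fn > (1 - p) * cost_fp,
    which rearranges to p > cost_fp / (cost_fp + cost_fn).
    Assumes p is a calibrated probability.
    """
    if cost_fp <= 0 or cost_fn <= 0:
        raise ValueError("costs must be positive")
    return cost_fp / (cost_fp + cost_fn)


# Hypothetical cost ratios, for illustration only.
for cost_fp, cost_fn in [(1, 1), (1, 4), (1, 9)]:
    t = cost_optimal_threshold(cost_fp, cost_fn)
    print(f"cost_fp={cost_fp}, cost_fn={cost_fn} -> threshold={t:.2f}")
```

Equal costs recover the default 0.5. A missed fraud that costs nine times as much as a false alarm pushes the threshold down to 0.1.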

[Figure: precision-recall operating points of the balanced model at thresholds 0.1, 0.2, 0.3, 0.5, and 0.7, starting from t=0.1 (R=1.0, P=0.10).]

ROC Curve vs Precision-Recall Curve for Imbalanced Data

python

auc_naive_roc = roc_auc_score(y_test, model_naive.predict_proba(X_test_sc)[:, 1])
auc_bal_roc   = roc_auc_score(y_test, y_prob_bal)
auc_naive_pr  = average_precision_score(y_test, model_naive.predict_proba(X_test_sc)[:, 1])
auc_bal_pr    = average_precision_score(y_test, y_prob_bal)

print(f"Naive LR:    AUC-ROC = {auc_naive_roc:.4f}, AUC-PR = {auc_naive_pr:.4f}")
print(f"Balanced LR: AUC-ROC = {auc_bal_roc:.4f}, AUC-PR = {auc_bal_pr:.4f}")

Naive LR:    AUC-ROC = 0.9200, AUC-PR = 0.4500
Balanced LR: AUC-ROC = 0.9500, AUC-PR = 0.6800

[Figure: two panels. Left: ROC curves (FPR vs TPR) for Naive (AUC=0.92) and Balanced (AUC=0.95) against the random diagonal, annotated "Both look similar! ROC inflated by TN". Right: precision-recall curves (Recall vs Precision) for Naive (AUC-PR=0.45) and Balanced (AUC-PR=0.68), annotated "Large gap visible! PR exposes real difference".]

AUC-ROC of naive LR = 0.92 — looks excellent. But this model caught zero fraud at threshold 0.5.

Why is ROC optimistic here? The FPR (x-axis of ROC) is $FP / (FP + TN)$. With 990 legitimate transactions, TN is massive: even 50 false alarms give FPR = 50/990 ≈ 5%, which looks small. The large TN pool inflates the apparent performance.
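The same 50 false alarms look very different once they are divided by the number of flagged transactions instead of the number of negatives. The sketch below works through that arithmetic, assuming all 10 fraud cases are caught; the helper name is illustrative.

```python
def fpr_and_precision(tp: int, fp: int, n_negative: int) -> tuple[float, float]:
    """Return (false positive rate, precision) for the given counts."""
    fpr = fp / n_negative        # FP / (FP + TN): denominator is all negatives
    precision = tp / (tp + fp)   # denominator is everything that was flagged
    return fpr, precision


# 990 legitimate transactions and 50 false alarms, as in the text.
# Assumption for illustration: all 10 fraud cases are caught.
fpr, precision = fpr_and_precision(tp=10, fp=50, n_negative=990)
print(f"FPR = {fpr:.3f}, precision = {precision:.3f}")
```

An FPR of about 0.05 sits near the left edge of a ROC plot, yet only one in six flagged transactions is actually fraud.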

AUC-PR tells the truth: 0.45 for the naive model vs 0.68 for balanced. PR only considers TP, FP, and FN — it ignores TN entirely. On imbalanced datasets, use AUC-PR as the primary metric, not AUC-ROC.
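The two metrics also sit on different scales, which matters when reading the numbers above. The sketch below scores a no-skill model that gives every transaction the same score, on labels with the 990:10 split. The constant score of 0.5 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

y_true = np.concatenate([np.zeros(990, dtype=int), np.ones(10, dtype=int)])

# A "no-skill" scorer: the same score for every transaction, so it cannot rank.
constant_scores = np.full(len(y_true), 0.5)

roc_baseline = roc_auc_score(y_true, constant_scores)
pr_baseline = average_precision_score(y_true, constant_scores)

print(f"no-skill AUC-ROC = {roc_baseline:.2f}")
print(f"no-skill AUC-PR  = {pr_baseline:.2f} (prevalence = {y_true.mean():.2f})")
```

The no-skill AUC-ROC is 0.5 regardless of class balance, while the no-skill AUC-PR equals the fraud prevalence (0.01 here). An AUC-PR value should therefore be judged against the prevalence, not against 0.5.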

Strategies for Imbalanced Data

Strategy	What It Does	Pros	Cons
`class_weight='balanced'`	Upweights minority in loss	No data modification	Only adjusts weight
Lower threshold	Flag more as positive	Improves recall	More false alarms
SMOTE (see Section 01)	Oversamples minority class	Synthetic data richness	Risk of overfitting
Collect more minority data	Real minority samples	Best option long-term	Often impossible
Use AUC-PR not accuracy	Better evaluation metric	Exposes real performance	Just a metric change

Related Concepts and Honest Limitations

class_weight='balanced' changes the gradient weighting; it does not change the data. The model is still trained on the same handful of fraud examples (7 of the 10 after the 70/30 split). If those samples are not representative of the full fraud distribution (different amounts, different risk profiles), the model may generalize poorly to real fraud even with good CV metrics.

The threshold optimization assumes your test distribution matches deployment. If fraud patterns change seasonally, a threshold set in January may be wrong by July. Production fraud detection systems typically monitor threshold performance continuously and retune quarterly.

Under very severe imbalance (for example, a 0.01% fraud rate in production banking), class weighting and threshold tuning alone are often not sufficient. A common approach combines SMOTE oversampling, undersampling of the majority class, and ensemble methods, covered in Section 01 (Imbalanced Datasets).
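Of the techniques named above, random undersampling of the majority class is the simplest to sketch with NumPy alone. The helper below is a hypothetical illustration, not a library API: its name, the `ratio` parameter, and the toy feature are all assumptions. In practice it would be applied to the training split only.

```python
import numpy as np


def undersample_majority(X, y, ratio=10, seed=0):
    """Keep every minority (label 1) row and at most `ratio` majority rows per minority row."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = np.flatnonzero(y == 0)
    n_keep = min(len(majority_idx), ratio * len(minority_idx))
    kept_majority = rng.choice(majority_idx, size=n_keep, replace=False)
    # Shuffle so the two classes are not stored in contiguous blocks.
    idx = rng.permutation(np.concatenate([minority_idx, kept_majority]))
    return X[idx], y[idx]


# Toy data with the same 990:10 split; the single feature is a placeholder.
y = np.concatenate([np.zeros(990, dtype=int), np.ones(10, dtype=int)])
X = np.arange(len(y), dtype=float).reshape(-1, 1)

X_small, y_small = undersample_majority(X, y, ratio=10)
print("class counts after undersampling:", np.bincount(y_small))
```

Undersampling discards legitimate transactions but creates no new fraud examples, so it changes the class ratio without adding information about fraud.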

Test Your Understanding

The naive LR model has Fraud Recall = 0.000 at threshold 0.5. Compute the probability the model assigns to the 3 test fraud samples — are they all below 0.5, and by roughly how much?
class_weight='balanced' sets fraud weight = 50.0. If you manually set class_weight={0: 1, 1: 99} (proportional to class imbalance), would the result be identical to balanced? Why or why not?
AUC-PR for the naive model is 0.45 vs 0.68 for balanced. The random baseline for AUC-PR is the class prevalence (1% = 0.01). Why is 0.45 impressive relative to 0.01, even though the model catches zero fraud at threshold 0.5?
At threshold 0.1, Recall = 1.000 and Precision = 0.095. The F1 = 0.174. Compute F₂ (which weights Recall 2×) at this threshold. Does F₂ prefer threshold 0.1 or threshold 0.7?
AUC-ROC is 0.92 for the naive model even though it catches zero fraud at threshold 0.5. Walk through how a model can have high AUC-ROC but zero recall at a specific threshold. What does this tell you about the relationship between AUC and threshold-specific performance?
