← View series: machine learning
~/blog
Logistic Regression: Full Implementation
Theory is complete. This post runs logistic regression end-to-end on a real binary classification dataset: data loading, preprocessing, fitting, evaluation, hyperparameter sensitivity, solver comparison, and deployment. Every number is verifiable.
Anchor dataset: Breast Cancer Wisconsin — predict malignant (0) vs benign (1) from 30 tumor features.
from sklearn.datasets import load_breast_cancer
import pandas as pd
import numpy as np
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target # 0=malignant, 1=benign
print(f"Shape: {df.shape}")
print(df['target'].value_counts())
print(df[['mean radius', 'mean texture', 'mean area']].describe().round(2))Shape: (569, 31)
1 357
0 212
Name: target, dtype: int64
mean radius mean texture mean area
count 569.00 569.00 569.00
mean 14.13 19.29 654.89
std 3.52 4.30 351.91
min 6.98 9.71 143.50
max 28.11 39.28 2501.00
Class distribution: 212 malignant (37.3%) and 357 benign (62.7%) — mild imbalance, but not severe enough to require special treatment for this dataset. Feature ranges span orders of magnitude (radius ~7–28, area ~143–2501), making StandardScaler essential.
Train/Test Split and Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Train: {X_train.shape}, Test: {X_test.shape}")
print(f"Train class ratio: {y_train.mean():.3f}")
print(f"Test class ratio: {y_test.mean():.3f}")Train: (455, 30), Test: (114, 30)
Train class ratio: 0.627
Test class ratio: 0.632
stratify=y ensures both splits preserve the 63/37 class ratio. Without stratify, a random split could produce a test set with a different ratio — inflating or deflating all class-conditional metrics.
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)The scaler is fit on training data only — test set is transformed using training mean and std.
Fitting the Model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(C=1.0, solver='lbfgs', max_iter=10000, random_state=42)
model.fit(X_train_sc, y_train)
print(f"Intercept: {model.intercept_[0]:.4f}")
coef_series = pd.Series(np.abs(model.coef_[0]), index=data.feature_names)
print("\nTop 5 coefficients by magnitude:")
print(coef_series.nlargest(5).round(4))Intercept: 0.3524
Top 5 coefficients by magnitude:
worst concave points 2.8731
worst perimeter 2.3142
mean concave points 1.9823
worst radius 1.7156
mean area 1.5601
The largest absolute coefficients identify the most influential features. worst concave points = 2.87 means a 1 standard deviation increase in this feature changes the log-odds of benign by 2.87 — the strongest signal in the model.
Predictions and Probabilities
y_pred = model.predict(X_test_sc)
y_prob = model.predict_proba(X_test_sc)[:, 1] # P(benign)
for i in range(5):
print(f"Sample {i}: P(benign)={y_prob[i]:.4f}, predicted={y_pred[i]}, true={y_test[i]}")Sample 0: P(benign)=0.9823, predicted=1, true=1
Sample 1: P(benign)=0.0041, predicted=0, true=0
Sample 2: P(benign)=0.8916, predicted=1, true=1
Sample 3: P(benign)=0.9991, predicted=1, true=1
Sample 4: P(benign)=0.0018, predicted=0, true=0
The model is confident on all 5 shown samples — probabilities near 0 or 1. predict_proba()[:,1] extracts the probability of the positive class (benign = 1).
Full Evaluation
from sklearn.metrics import (confusion_matrix, classification_report,
roc_auc_score, average_precision_score)
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
print()
print(classification_report(y_test, y_pred, target_names=['Malignant', 'Benign']))
print(f"AUC-ROC: {roc_auc_score(y_test, y_prob):.4f}")
print(f"AUC-PR: {average_precision_score(y_test, y_prob):.4f}")Confusion Matrix:
[[40 2]
[ 1 71]]
precision recall f1-score support
Malignant 0.98 0.95 0.96 42
Benign 0.97 0.99 0.98 72
accuracy 0.97 114
AUC-ROC: 0.9981
AUC-PR: 0.9989
Interpreting the confusion matrix:
- TP = 71: benign tumors correctly identified as benign
- TN = 40: malignant tumors correctly identified as malignant
- FP = 2: benign tumors misclassified as malignant (unnecessary biopsy)
- FN = 1: malignant tumor called benign — the most serious error
FN = 1 is the critical failure: one cancer patient would receive a false all-clear. Medical screening systems tune the threshold toward higher recall (lower threshold), accepting more FP to ensure near-zero FN.
Hyperparameter: Regularization Strength C
C = 1/λ in sklearn. Smaller C = stronger regularization. Note the inverse convention from Ridge/Lasso:
from sklearn.model_selection import cross_val_score
C_values = [0.001, 0.01, 0.1, 1, 10, 100]
for C in C_values:
m = LogisticRegression(C=C, solver='lbfgs', max_iter=10000)
scores = cross_val_score(m, X_train_sc, y_train, cv=5, scoring='roc_auc')
print(f"C={C:>7}: CV AUC = {scores.mean():.4f} ± {scores.std():.4f}")C= 0.001: CV AUC = 0.9841 ± 0.0082
C= 0.01: CV AUC = 0.9944 ± 0.0049
C= 0.1: CV AUC = 0.9970 ± 0.0030
C= 1: CV AUC = 0.9975 ± 0.0031
C= 10: CV AUC = 0.9974 ± 0.0030
C= 100: CV AUC = 0.9973 ± 0.0030
<text x="60" y="200" font-size="8" fill="#64748b" text-anchor="middle">0.001</text>
<text x="128" y="200" font-size="8" fill="#64748b" text-anchor="middle">0.01</text>
<text x="196" y="200" font-size="8" fill="#64748b" text-anchor="middle">0.1</text>
<text x="265" y="200" font-size="8" fill="#64748b" text-anchor="middle">1</text>
<text x="333" y="200" font-size="8" fill="#64748b" text-anchor="middle">10</text>
<text x="402" y="200" font-size="8" fill="#64748b" text-anchor="middle">100</text>
<polyline points="60,130 128,80 196,38 265,25 333,27 402,28" fill="none" stroke="#3b82f6" stroke-width="2.5"/>
<circle cx="60" cy="130" r="4" fill="#3b82f6"/>
<circle cx="128" cy="80" r="4" fill="#3b82f6"/>
<circle cx="196" cy="38" r="4" fill="#3b82f6"/>
<circle cx="265" cy="25" r="5" fill="#22c55e" stroke="#22c55e"/>
<circle cx="333" cy="27" r="4" fill="#3b82f6"/>
<circle cx="402" cy="28" r="4" fill="#3b82f6"/>
<line x1="265" y1="15" x2="265" y2="185" stroke="#22c55e" stroke-width="1" stroke-dasharray="3,3"/>
<text x="270" y="50" font-size="9" fill="#22c55e">optimal C=1</text>
<text x="65" y="135" font-size="8" fill="#64748b">0.984</text>
<text x="133" y="85" font-size="8" fill="#64748b">0.994</text>
<text x="200" y="43" font-size="8" fill="#64748b">0.997</text>
<text x="246" y="20" font-size="8" fill="#22c55e">0.9975</text>
At C=0.001 (strong regularization): AUC drops to 0.984 — the model is over-constrained. At C=1–100: performance plateaus — the data is well-conditioned and very little additional regularization helps. Optimal C≈1 (sklearn default).
Solver Options
Different solvers implement the same logistic regression objective but differ in numerical method and supported penalties:
for solver in ['lbfgs', 'liblinear', 'saga']:
m = LogisticRegression(solver=solver, max_iter=10000, random_state=42)
m.fit(X_train_sc, y_train)
acc = m.score(X_test_sc, y_test)
print(f"solver={solver:12s}: accuracy={acc:.4f}")solver=lbfgs : accuracy=0.9737
solver=liblinear : accuracy=0.9737
solver=saga : accuracy=0.9737
All three converge to the same solution on this well-conditioned dataset. Choose based on dataset size: lbfgs for small-medium (default), saga for large datasets or ElasticNet penalty, liblinear for L1 penalty on small datasets.
Saving and Loading
import joblib
# Save both model and scaler — they must travel together
joblib.dump({'model': model, 'scaler': scaler}, 'breast_cancer_lr.pkl')
# Load and infer on a new sample
bundle = joblib.load('breast_cancer_lr.pkl')
new_sample = X_test[:1] # use one test sample for demo
new_scaled = bundle['scaler'].transform(new_sample)
prob = bundle['model'].predict_proba(new_scaled)[0, 1]
print(f"P(benign) = {prob:.4f}")P(benign) = 0.9823
Always save the scaler alongside the model. If you load the model and apply a new or unfitted scaler, predictions will be wrong — the model was trained on features with specific mean and std, and those must be reproduced at inference time.
End-to-End Pipeline Summary
| Step | Code | Key Output |
|---|---|---|
| Load data | load_breast_cancer() | (569, 30) features |
| Split (stratified) | train_test_split(stratify=y) | 455 train, 114 test |
| Scale | StandardScaler | mean≈0, std≈1 per feature |
| Fit | LogisticRegression(C=1) | intercept + 30 coefficients |
| Evaluate | confusion matrix + AUC | AUC-ROC=0.998, acc=0.97 |
| Tune C | cross_val_score | Optimal C≈1 |
| Deploy | joblib.dump/load | Sub-ms inference |
Related Concepts and Honest Limitations
AUC-ROC = 0.9981 and accuracy = 97% on this dataset are exceptionally high — the 30 features include many strong predictors of cancer (concave points, perimeter, area). Real clinical datasets are harder. This implementation also skips one important step for clinical use: calibration. The model's predict_proba output of 0.98 should genuinely mean 98% of samples with that score are benign — sklearn.calibration.CalibratedClassifierCV checks and corrects this.
The FN=1 error highlights the gap between ML metrics and deployment risk. In a 10,000-sample clinical screen, 1% FNR means 100 missed cancers. The decision threshold should be set by the acceptable FNR for the use case — not by whatever maximizes accuracy or F1.
Test Your Understanding
-
The model uses C=1.0. The AUC at C=0.001 is 0.9841 and at C=1 is 0.9975. The difference is 0.013 AUC points. Is this clinically meaningful for a cancer screening model? What metric would you focus on instead?
-
stratify=ywas used in the split. Suppose the original dataset has 212 malignant and 357 benign. After an 80/20 split with stratify, exactly how many malignant and benign samples are in the test set? -
FN=1 in the test set (1 malignant predicted as benign). If you lower the threshold from 0.5 to 0.2, what happens to FN and FP? Which confusion matrix entries could increase and which could only decrease?
-
The
sagasolver andlbfgssolver both achieve 97.37% accuracy but use different optimization algorithms. If training data doubled to 1,000 samples, which solver would you prefer and why? -
A colleague saves only the model (without the scaler) to
model.pkl. At inference time, they apply the raw features directly. Qualitatively, what happens to the model's predictions?