Back to blog
← View series: machine learning

~/blog

Logistic Regression: Full Implementation

Jun 26, 20266 min readBy Mohammed Vasim
Machine LearningAIData Science

Theory is complete. This post runs logistic regression end-to-end on a real binary classification dataset: data loading, preprocessing, fitting, evaluation, hyperparameter sensitivity, solver comparison, and deployment. Every number is verifiable.

Anchor dataset: Breast Cancer Wisconsin — predict malignant (0) vs benign (1) from 30 tumor features.

python
from sklearn.datasets import load_breast_cancer
import pandas as pd
import numpy as np

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target  # 0=malignant, 1=benign

print(f"Shape: {df.shape}")
print(df['target'].value_counts())
print(df[['mean radius', 'mean texture', 'mean area']].describe().round(2))
Shape: (569, 31) 1 357 0 212 Name: target, dtype: int64 mean radius mean texture mean area count 569.00 569.00 569.00 mean 14.13 19.29 654.89 std 3.52 4.30 351.91 min 6.98 9.71 143.50 max 28.11 39.28 2501.00

Class distribution: 212 malignant (37.3%) and 357 benign (62.7%) — mild imbalance, but not severe enough to require special treatment for this dataset. Feature ranges span orders of magnitude (radius ~7–28, area ~143–2501), making StandardScaler essential.

Train/Test Split and Preprocessing

python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Train: {X_train.shape}, Test: {X_test.shape}")
print(f"Train class ratio: {y_train.mean():.3f}")
print(f"Test class ratio:  {y_test.mean():.3f}")
Train: (455, 30), Test: (114, 30) Train class ratio: 0.627 Test class ratio: 0.632

stratify=y ensures both splits preserve the 63/37 class ratio. Without stratify, a random split could produce a test set with a different ratio — inflating or deflating all class-conditional metrics.

python
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc  = scaler.transform(X_test)

The scaler is fit on training data only — test set is transformed using training mean and std.

Fitting the Model

python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(C=1.0, solver='lbfgs', max_iter=10000, random_state=42)
model.fit(X_train_sc, y_train)

print(f"Intercept: {model.intercept_[0]:.4f}")
coef_series = pd.Series(np.abs(model.coef_[0]), index=data.feature_names)
print("\nTop 5 coefficients by magnitude:")
print(coef_series.nlargest(5).round(4))
Intercept: 0.3524 Top 5 coefficients by magnitude: worst concave points 2.8731 worst perimeter 2.3142 mean concave points 1.9823 worst radius 1.7156 mean area 1.5601

The largest absolute coefficients identify the most influential features. worst concave points = 2.87 means a 1 standard deviation increase in this feature changes the log-odds of benign by 2.87 — the strongest signal in the model.

Predictions and Probabilities

python
y_pred = model.predict(X_test_sc)
y_prob = model.predict_proba(X_test_sc)[:, 1]  # P(benign)

for i in range(5):
    print(f"Sample {i}: P(benign)={y_prob[i]:.4f}, predicted={y_pred[i]}, true={y_test[i]}")
Sample 0: P(benign)=0.9823, predicted=1, true=1 Sample 1: P(benign)=0.0041, predicted=0, true=0 Sample 2: P(benign)=0.8916, predicted=1, true=1 Sample 3: P(benign)=0.9991, predicted=1, true=1 Sample 4: P(benign)=0.0018, predicted=0, true=0

The model is confident on all 5 shown samples — probabilities near 0 or 1. predict_proba()[:,1] extracts the probability of the positive class (benign = 1).

Full Evaluation

python
from sklearn.metrics import (confusion_matrix, classification_report,
                              roc_auc_score, average_precision_score)

cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
print()
print(classification_report(y_test, y_pred, target_names=['Malignant', 'Benign']))
print(f"AUC-ROC: {roc_auc_score(y_test, y_prob):.4f}")
print(f"AUC-PR:  {average_precision_score(y_test, y_prob):.4f}")
Confusion Matrix: [[40 2] [ 1 71]] precision recall f1-score support Malignant 0.98 0.95 0.96 42 Benign 0.97 0.99 0.98 72 accuracy 0.97 114 AUC-ROC: 0.9981 AUC-PR: 0.9989

Interpreting the confusion matrix:

  • TP = 71: benign tumors correctly identified as benign
  • TN = 40: malignant tumors correctly identified as malignant
  • FP = 2: benign tumors misclassified as malignant (unnecessary biopsy)
  • FN = 1: malignant tumor called benign — the most serious error

FN = 1 is the critical failure: one cancer patient would receive a false all-clear. Medical screening systems tune the threshold toward higher recall (lower threshold), accepting more FP to ensure near-zero FN.

Hyperparameter: Regularization Strength C

C = 1/λ in sklearn. Smaller C = stronger regularization. Note the inverse convention from Ridge/Lasso:

python
from sklearn.model_selection import cross_val_score

C_values = [0.001, 0.01, 0.1, 1, 10, 100]
for C in C_values:
    m = LogisticRegression(C=C, solver='lbfgs', max_iter=10000)
    scores = cross_val_score(m, X_train_sc, y_train, cv=5, scoring='roc_auc')
    print(f"C={C:>7}: CV AUC = {scores.mean():.4f} ± {scores.std():.4f}")
C= 0.001: CV AUC = 0.9841 ± 0.0082 C= 0.01: CV AUC = 0.9944 ± 0.0049 C= 0.1: CV AUC = 0.9970 ± 0.0030 C= 1: CV AUC = 0.9975 ± 0.0031 C= 10: CV AUC = 0.9974 ± 0.0030 C= 100: CV AUC = 0.9973 ± 0.0030 C (log scale) CV AUC-ROC <text x="60" y="200" font-size="8" fill="#64748b" text-anchor="middle">0.001</text> <text x="128" y="200" font-size="8" fill="#64748b" text-anchor="middle">0.01</text> <text x="196" y="200" font-size="8" fill="#64748b" text-anchor="middle">0.1</text> <text x="265" y="200" font-size="8" fill="#64748b" text-anchor="middle">1</text> <text x="333" y="200" font-size="8" fill="#64748b" text-anchor="middle">10</text> <text x="402" y="200" font-size="8" fill="#64748b" text-anchor="middle">100</text> <polyline points="60,130 128,80 196,38 265,25 333,27 402,28" fill="none" stroke="#3b82f6" stroke-width="2.5"/> <circle cx="60" cy="130" r="4" fill="#3b82f6"/> <circle cx="128" cy="80" r="4" fill="#3b82f6"/> <circle cx="196" cy="38" r="4" fill="#3b82f6"/> <circle cx="265" cy="25" r="5" fill="#22c55e" stroke="#22c55e"/> <circle cx="333" cy="27" r="4" fill="#3b82f6"/> <circle cx="402" cy="28" r="4" fill="#3b82f6"/> <line x1="265" y1="15" x2="265" y2="185" stroke="#22c55e" stroke-width="1" stroke-dasharray="3,3"/> <text x="270" y="50" font-size="9" fill="#22c55e">optimal C=1</text> <text x="65" y="135" font-size="8" fill="#64748b">0.984</text> <text x="133" y="85" font-size="8" fill="#64748b">0.994</text> <text x="200" y="43" font-size="8" fill="#64748b">0.997</text> <text x="246" y="20" font-size="8" fill="#22c55e">0.9975</text>

At C=0.001 (strong regularization): AUC drops to 0.984 — the model is over-constrained. At C=1–100: performance plateaus — the data is well-conditioned and very little additional regularization helps. Optimal C≈1 (sklearn default).

Solver Options

Different solvers implement the same logistic regression objective but differ in numerical method and supported penalties:

python
for solver in ['lbfgs', 'liblinear', 'saga']:
    m = LogisticRegression(solver=solver, max_iter=10000, random_state=42)
    m.fit(X_train_sc, y_train)
    acc = m.score(X_test_sc, y_test)
    print(f"solver={solver:12s}: accuracy={acc:.4f}")
solver=lbfgs : accuracy=0.9737 solver=liblinear : accuracy=0.9737 solver=saga : accuracy=0.9737

All three converge to the same solution on this well-conditioned dataset. Choose based on dataset size: lbfgs for small-medium (default), saga for large datasets or ElasticNet penalty, liblinear for L1 penalty on small datasets.

Saving and Loading

python
import joblib

# Save both model and scaler — they must travel together
joblib.dump({'model': model, 'scaler': scaler}, 'breast_cancer_lr.pkl')

# Load and infer on a new sample
bundle = joblib.load('breast_cancer_lr.pkl')
new_sample = X_test[:1]                                        # use one test sample for demo
new_scaled = bundle['scaler'].transform(new_sample)
prob = bundle['model'].predict_proba(new_scaled)[0, 1]
print(f"P(benign) = {prob:.4f}")
P(benign) = 0.9823

Always save the scaler alongside the model. If you load the model and apply a new or unfitted scaler, predictions will be wrong — the model was trained on features with specific mean and std, and those must be reproduced at inference time.

End-to-End Pipeline Summary

StepCodeKey Output
Load dataload_breast_cancer()(569, 30) features
Split (stratified)train_test_split(stratify=y)455 train, 114 test
ScaleStandardScalermean≈0, std≈1 per feature
FitLogisticRegression(C=1)intercept + 30 coefficients
Evaluateconfusion matrix + AUCAUC-ROC=0.998, acc=0.97
Tune Ccross_val_scoreOptimal C≈1
Deployjoblib.dump/loadSub-ms inference

AUC-ROC = 0.9981 and accuracy = 97% on this dataset are exceptionally high — the 30 features include many strong predictors of cancer (concave points, perimeter, area). Real clinical datasets are harder. This implementation also skips one important step for clinical use: calibration. The model's predict_proba output of 0.98 should genuinely mean 98% of samples with that score are benign — sklearn.calibration.CalibratedClassifierCV checks and corrects this.

The FN=1 error highlights the gap between ML metrics and deployment risk. In a 10,000-sample clinical screen, 1% FNR means 100 missed cancers. The decision threshold should be set by the acceptable FNR for the use case — not by whatever maximizes accuracy or F1.

Test Your Understanding

  1. The model uses C=1.0. The AUC at C=0.001 is 0.9841 and at C=1 is 0.9975. The difference is 0.013 AUC points. Is this clinically meaningful for a cancer screening model? What metric would you focus on instead?

  2. stratify=y was used in the split. Suppose the original dataset has 212 malignant and 357 benign. After an 80/20 split with stratify, exactly how many malignant and benign samples are in the test set?

  3. FN=1 in the test set (1 malignant predicted as benign). If you lower the threshold from 0.5 to 0.2, what happens to FN and FP? Which confusion matrix entries could increase and which could only decrease?

  4. The saga solver and lbfgs solver both achieve 97.37% accuracy but use different optimization algorithms. If training data doubled to 1,000 samples, which solver would you prefer and why?

  5. A colleague saves only the model (without the scaler) to model.pkl. At inference time, they apply the raw features directly. Qualitatively, what happens to the model's predictions?

Comments (0)

No comments yet. Be the first to comment!

Leave a comment