Can Linear Regression Solve Classification?
Before learning logistic regression, you should know exactly why linear regression fails at classification. Not "it's not designed for it" — that's circular. The concrete failure modes: predictions that violate probability bounds, a decision boundary that shifts when you add one outlier, and a loss function that cannot discriminate confidence from indecision.
Anchor dataset: Predict loan default (1 = default, 0 = no default) from income.
import numpy as np
# 8 samples: income ($k), y = 1 if default
X = np.array([25, 32, 45, 60, 75, 85, 95, 110]).reshape(-1, 1)
y = np.array([1, 1, 1, 0, 0, 0, 0, 0])
# Pattern: lower income → default

Naive Attempt: Linear Regression on a Binary Target
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X, y)
print(f"Intercept: {model.intercept_:.4f}")
print(f"Coef: {model.coef_[0]:.6f}")

Intercept: 1.3392
Coef: -0.014638

The equation is $\hat{y} = 1.3392 - 0.014638 \cdot \text{income}$. Now compute predictions for each sample:

| Income ($k) | y | ŷ | Problem |
|---|---|---|---|
| 25 | 1 | 1.34 − 0.37 = 0.97 | none |
| 32 | 1 | 1.34 − 0.47 = 0.87 | none |
| 45 | 1 | 1.34 − 0.66 = 0.68 | none |
| 60 | 0 | 1.34 − 0.88 = 0.46 | 46% default probability for a non-defaulter |
| 75 | 0 | 1.34 − 1.10 = 0.24 | none |
| 85 | 0 | 1.34 − 1.24 = 0.10 | none |
| 95 | 0 | 1.34 − 1.39 = −0.05 | < 0, impossible probability |
| 110 | 0 | 1.34 − 1.61 = −0.27 | < 0 |

The model outputs negative values for two of the eight samples (incomes of $95k and $110k). A probability cannot be negative. Extrapolating in the other direction, any income below about $23k would get a prediction above 1, which is equally meaningless.
Figure: the eight samples plotted by income ($k), defaulters at y=1 and non-defaulters at y=0, with the fitted regression line. Dashed lines at y=0 and y=1 mark the valid probability range. The line drops below y=0 for incomes above roughly $91.5k, which includes two of the samples, and would rise above y=1 if extended to incomes below roughly $23k.
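To check the fit independently of scikit-learn, the sketch below recomputes the least-squares line from the closed-form slope and intercept and flags every prediction that leaves [0, 1]. It is a minimal illustration using only NumPy; the variable names (`income`, `default`, `out_of_range`) are chosen here and do not come from the original code.

```python
import numpy as np

# Same eight samples as the anchor dataset.
income = np.array([25, 32, 45, 60, 75, 85, 95, 110], dtype=float)
default = np.array([1, 1, 1, 0, 0, 0, 0, 0], dtype=float)

# Closed-form simple linear regression: slope = cov(x, y) / var(x).
x_mean, y_mean = income.mean(), default.mean()
slope = ((income - x_mean) * (default - y_mean)).sum() / ((income - x_mean) ** 2).sum()
intercept = y_mean - slope * x_mean

y_hat = intercept + slope * income
out_of_range = (y_hat < 0) | (y_hat > 1)

for x, p, bad in zip(income, y_hat, out_of_range):
    flag = "outside [0, 1]" if bad else ""
    print(f"{x:6.0f}  {p:+.3f}  {flag}")

# Incomes at which the fitted line leaves the valid probability range.
income_at_one = (1 - intercept) / slope   # where y_hat = 1
income_at_zero = (0 - intercept) / slope  # where y_hat = 0
print(f"y_hat > 1 below ${income_at_one:.1f}k, y_hat < 0 above ${income_at_zero:.1f}k")
```

Because the slope is negative, the line exceeds 1 on the low-income side and goes negative on the high-income side; a straight line can only stay inside [0, 1] over a finite income interval.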
The Threshold Hack: Round ŷ to 0 or 1
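The obvious fix: apply a threshold. If $\hat{y} > 0.5$, predict 1; otherwise predict 0.

Decision boundary: $1.3392 - 0.014638 \cdot \text{income} = 0.5$ → $\text{income} \approx 57.3$

This means the model predicts default for income below about $57.3k, which falls in the gap between the highest-income defaulter ($45k) and the lowest-income non-defaulter ($60k).

y_pred_thresh = (model.predict(X) > 0.5).astype(int)
print(y_pred_thresh)
# Predictions match the labels for all eight samples:
# Confusion: TP=3 (defaulters called default), FP=0
# TN=5 (non-defaulters called non-default), FN=0
accuracy = (y_pred_thresh == y).mean()
print(f"Accuracy: {accuracy:.4f}")
print(f"Baseline (always predict 0): {(y==0).mean():.4f}")

[1 1 1 0 0 0 0 0]
Accuracy: 1.0000
Baseline (always predict 0): 0.6250

On this clean, separable dataset the thresholded regression classifies all eight samples correctly (100% against a 62.5% baseline). The result is fragile, though. The boundary sits wherever the least-squares line happens to cross 0.5, and that line is fitted to minimize squared error over every point, not to separate the classes. The nearest non-defaulter ($60k) already scores 0.46, so a small change in the fitted line is enough to move samples across the boundary.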
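The boundary calculation generalizes to any threshold. The following sketch assumes a NumPy-only refit via `np.polyfit` and uses two hypothetical helpers, `boundary_income` and `confusion`, introduced only for this illustration. It prints where the boundary lands and the resulting confusion counts for a few thresholds.

```python
import numpy as np

income = np.array([25, 32, 45, 60, 75, 85, 95, 110], dtype=float)
default = np.array([1, 1, 1, 0, 0, 0, 0, 0])

# Degree-1 least-squares fit; np.polyfit returns [slope, intercept].
slope, intercept = np.polyfit(income, default, 1)

def boundary_income(threshold):
    """Income ($k) at which the fitted line equals `threshold`."""
    return (threshold - intercept) / slope

def confusion(threshold):
    """Return (TP, FP, TN, FN) for the rule: predict default if y_hat > threshold."""
    pred = (intercept + slope * income > threshold).astype(int)
    tp = int(((pred == 1) & (default == 1)).sum())
    fp = int(((pred == 1) & (default == 0)).sum())
    tn = int(((pred == 0) & (default == 0)).sum())
    fn = int(((pred == 0) & (default == 1)).sum())
    return tp, fp, tn, fn

for t in (0.5, 0.7, 0.9):
    print(f"threshold={t:.1f}  boundary=${boundary_income(t):.1f}k  TP,FP,TN,FN={confusion(t)}")
```

Moving the threshold slides the boundary along the same fitted line, so the classifier is only as good as the position and slope of that line.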
The Outlier Sensitivity Problem
Add one outlier: income = $500k, no default. One wealthy customer should not change how we classify the $25k–$110k range.
X_out = np.vstack([X, [[500]]])
y_out = np.append(y, [0])
model_out = LinearRegression()
model_out.fit(X_out, y_out)
print(f"Original coef: {model.coef_[0]:.6f}")
print(f"New coef: {model_out.coef_[0]:.6f}")
# Boundary: solve intercept + coef * income = 0.5 for income
new_boundary = (model_out.intercept_ - 0.5) / (-model_out.coef_[0])
print(f"New decision boundary: ${new_boundary:.1f}k")

Original coef: -0.014638
New coef: -0.001381
New decision boundary: $-6.6k

The boundary moved from about $57.3k to −$6.6k, below any possible income. The flattened line never reaches 0.5 for a non-negative income (its value at income 0 is only 0.49), so every applicant is now predicted as a non-defaulter, including the three defaulters at $25k, $32k, and $45k. Adding one legitimate outlier broke the predictions for three correctly classified samples and left the model unable to flag any default at all.
Figure: left panel, the original fit, with the 0.5 boundary at about $57.3k separating defaulters from non-defaulters. Right panel, the fit with the $500k outlier: the regression line flattens and stays below 0.5 over the whole income range, so the three defaulters are misclassified.
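One way to quantify the damage is to score both fits on the original eight samples only. The sketch below does that with `np.polyfit`; `accuracy_on_original` is a hypothetical helper introduced for this illustration, not part of the original code.

```python
import numpy as np

income = np.array([25, 32, 45, 60, 75, 85, 95, 110], dtype=float)
default = np.array([1, 1, 1, 0, 0, 0, 0, 0])

def accuracy_on_original(x_train, y_train, threshold=0.5):
    """Fit a least-squares line on the training data, then score it on the original 8."""
    slope, intercept = np.polyfit(x_train, y_train, 1)
    pred = (intercept + slope * income > threshold).astype(int)
    return float((pred == default).mean()), pred

acc_clean, pred_clean = accuracy_on_original(income, default)

# One extra non-defaulter at $500k, as in the outlier experiment.
income_out = np.append(income, 500.0)
default_out = np.append(default, 0)
acc_out, pred_out = accuracy_on_original(income_out, default_out)

print("without outlier:", pred_clean, f"accuracy={acc_clean:.3f}")
print("with outlier:   ", pred_out, f"accuracy={acc_out:.3f}")
```

Scoring on the original samples isolates the effect of the outlier on the region we care about, since the outlier itself is trivially classified as a non-defaulter.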
What We Actually Need
The sigmoid function maps any real-valued score to a valid probability:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

For any $z \in \mathbb{R}$, $0 < \sigma(z) < 1$:
import numpy as np
for z in [-5, -2, 0, 2, 5]:
    s = 1 / (1 + np.exp(-z))
    print(f"σ({z:+d}) = {s:.4f}")

σ(-5) = 0.0067
σ(-2) = 0.1192
σ(+0) = 0.5000
σ(+2) = 0.8808
σ(+5) = 0.9933
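The model becomes $P(\text{default} \mid x) = \sigma(wx + b)$, where $x$ is income. The decision boundary is where $\sigma(wx + b) = 0.5$, which means $wx + b = 0$, which means $x = -b/w$: a well-defined linear equation regardless of the data range.

The outlier sensitivity is largely fixed because the sigmoid squashes extreme values: an income of $500k produces a very large negative score $z = wx + b$, and $\sigma(z) \approx 0$ regardless of exactly how negative. The outlier is already predicted correctly with near-zero loss, so adding it adjusts the weights slightly but does not destroy the boundary.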
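The saturation argument can be illustrated numerically. The sketch below compares how strongly the $500k sample pulls on each model: for logistic regression, the cross-entropy gradient with respect to the score is σ(z) − y, while for least squares the pull is proportional to the residual. The logistic weights `w` and `b` are hand-picked assumptions that place the boundary at $52.5k; they are not fitted values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical logistic model (illustrative values, not fitted):
# boundary at -b / w = 52.5, midway between the $45k and $60k samples.
w, b = -0.2, 10.5

outlier_income, outlier_label = 500.0, 0

# Cross-entropy gradient with respect to the score z is sigmoid(z) - y.
z = w * outlier_income + b
grad_logistic = sigmoid(z) - outlier_label

# Least-squares line fitted on the eight original samples.
income = np.array([25, 32, 45, 60, 75, 85, 95, 110], dtype=float)
default = np.array([1, 1, 1, 0, 0, 0, 0, 0])
slope, intercept = np.polyfit(income, default, 1)

# Squared-error gradient with respect to the prediction is proportional to the residual.
y_hat_outlier = intercept + slope * outlier_income
residual_linear = y_hat_outlier - outlier_label

print(f"logistic: z = {z:.1f}, gradient signal = {grad_logistic:.3e}")
print(f"linear:   y_hat = {y_hat_outlier:.2f}, residual = {residual_linear:.2f}")
```

Under these assumptions the logistic gradient from the outlier is vanishingly small, while the linear model sees a residual of several units and has to tilt the whole line to reduce it.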
The Three Fundamental Problems
- Range violation: Linear regression outputs can fall outside $[0, 1]$, so they are not interpretable as probabilities. Sigmoid fixes this by construction.
- Outlier sensitivity: One extreme sample shifts the regression line and displaces the decision boundary, misclassifying an arbitrary number of correctly-handled samples. The sigmoid's saturation at extreme values absorbs outliers gracefully.
- Wrong loss function: MSE treats the problem as predicting a continuous value. For a true label $y = 1$, predictions of 0.999 (correct, confident) and 0.5 (no confidence) have squared errors of 0.000001 and 0.25, and a confidently wrong 0.001 costs only about 0.998, because squared error is capped near 1 for outputs in $[0, 1]$. Binary cross-entropy, $-\ln \hat{p}$ when $y = 1$, assigns about 6.9 to that same wrong prediction and grows without bound as $\hat{p} \to 0$, so the gradient signal is strongest where the model most needs to correct.
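The loss comparison in the last item can be tabulated directly. This sketch assumes the true label is 1 and evaluates per-sample squared error and binary cross-entropy at a few predicted probabilities; the function names are illustrative.

```python
import numpy as np

def squared_error(y, p):
    return (y - p) ** 2

def cross_entropy(y, p):
    # Binary cross-entropy for a single sample; p must lie strictly in (0, 1).
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

y = 1  # true label: default
for p in (0.999, 0.5, 0.1, 0.001):
    print(f"p={p:<6}  squared error={squared_error(y, p):.6f}  "
          f"cross-entropy={cross_entropy(y, p):.4f}")
```

Squared error never exceeds 1 for predictions in [0, 1], whereas cross-entropy keeps growing as a wrong prediction becomes more confident.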
Linear vs Logistic Regression for Classification
| Aspect | Linear Regression | Logistic Regression |
|---|---|---|
| Output range | $(-\infty, \infty)$ | $(0, 1)$ |
| Interpretation | Not a probability | $P(y = 1 \mid x)$ |
| Decision boundary | Can be ill-positioned | Always at $wx + b = 0$ → linear in $x$ |
| Outlier sensitivity | High — one outlier shifts boundary | Low — sigmoid squashes extreme values |
| Loss function | MSE (ignores confidence) | Binary cross-entropy (penalizes wrong confidence) |
Test Your Understanding
- The 0.5 threshold puts the decision boundary at about $57.3k, inside the gap between the two classes. If you lowered the threshold to 0.3, what income would become the new boundary? Would accuracy improve?
- Adding one outlier at $500k pushed the boundary from about $57.3k to below zero, misclassifying the three defaulters. How would the boundary shift if the outlier had income = $5000k instead of $500k?
- $\sigma(0) = 0.5$ always, regardless of the weights. What does this mean about the decision boundary (the income where P(default) = 0.5) in logistic regression, and how does it differ from the linear regression boundary?
- The three problems are: range violation, outlier sensitivity, and wrong loss. If you replaced MSE with mean absolute error (MAE) for linear regression classification, which problems would remain?
- If the data were perfectly linearly separable (a clear income gap between all defaulters and non-defaulters), would linear regression with threshold 0.5 give correct predictions? What breaks the approach even in this ideal case?