Can Linear Regression Solve Classification?

Jun 26, 2026 · 7 min read · By Mohammed Vasim
Machine Learning · AI · Data Science

Before learning logistic regression, you should know exactly why linear regression fails at classification. Not "it's not designed for it" — that's circular. The concrete failure modes: predictions that violate probability bounds, a decision boundary that shifts when you add one outlier, and a loss function that cannot discriminate confidence from indecision.

Anchor dataset: Predict loan default (1 = default, 0 = no default) from income.

python
import numpy as np

# 8 samples: income ($k), y = 1 if default
X = np.array([25, 32, 45, 60, 75, 85, 95, 110]).reshape(-1, 1)
y = np.array([1,   1,  1,  0,  0,  0,  0,   0])
# Pattern: lower income → default

Naive Attempt: Linear Regression on a Binary Target

python
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X, y)
print(f"Intercept: {model.intercept_:.4f}")
print(f"Coef:      {model.coef_[0]:.6f}")
Intercept: 1.3392
Coef:      -0.014638

The equation is \(\hat{y} = 1.3392 - 0.014638 \cdot \text{income}\). Now compute predictions for each sample:

| Income ($k) | y | ŷ | Problem |
|---|---|---|---|
| 25 | 1 | 1.339 − 0.366 ≈ 0.973 | |
| 32 | 1 | 1.339 − 0.468 ≈ 0.871 | |
| 45 | 1 | 1.339 − 0.659 ≈ 0.681 | |
| 60 | 0 | 1.339 − 0.878 ≈ 0.461 | 46% for a non-defaulter |
| 75 | 0 | 1.339 − 1.098 ≈ 0.241 | |
| 85 | 0 | 1.339 − 1.244 ≈ 0.095 | |
| 95 | 0 | 1.339 − 1.391 ≈ −0.051 | < 0: impossible probability |
| 110 | 0 | 1.339 − 1.610 ≈ −0.271 | < 0 |

The model outputs negative values for two of the eight samples ($95k and $110k); the line crosses zero at about $91.5k. A probability cannot be negative. Extrapolating the other way, any income below about $23k would get a prediction above 1, which is equally meaningless.
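To check the fitted values by hand, here is a minimal sketch that recomputes the least-squares line in plain Python (closed-form slope and intercept, no scikit-learn) and flags every prediction outside [0, 1]. The helper name `fit_line` is illustrative only and not part of any library.

```python
# Closed-form simple linear regression on the loan data (plain Python).
incomes = [25, 32, 45, 60, 75, 85, 95, 110]   # income in $k
labels = [1, 1, 1, 0, 0, 0, 0, 0]             # 1 = default


def fit_line(xs, ys):
    """Return (intercept, slope) of the ordinary least-squares line."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


intercept, slope = fit_line(incomes, labels)
predictions = [intercept + slope * x for x in incomes]

# Print each fitted value and mark the ones that cannot be probabilities.
for x, y_hat in zip(incomes, predictions):
    flag = "" if 0.0 <= y_hat <= 1.0 else "  <-- outside [0, 1]"
    print(f"income={x:>3}k  y_hat={y_hat:+.3f}{flag}")

out_of_range = [x for x, y_hat in zip(incomes, predictions)
                if not 0.0 <= y_hat <= 1.0]
print("out of range:", out_of_range)
```

Because least-squares residuals sum to zero, the fitted values must add up to the number of defaulters (3 here), which is a quick sanity check on any hand-computed table.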

[Figure: loan data with the fitted regression line. x-axis: Income ($k); y-axis: ŷ; dashed red lines at y=0 and y=1.]

Red dots are defaulters (y=1), green dots are non-defaulters (y=0). The blue regression line leaves the valid probability range at both ends: it rises above y=1 at low incomes and falls below y=0 at high incomes. The red dashed lines mark that range.

The Threshold Hack: Round ŷ to 0 or 1

The obvious fix: apply a threshold. If \(\hat{y} > 0.5\), predict 1; otherwise predict 0.

Decision boundary: \(1.3392 - 0.014638\,x = 0.5 \Rightarrow x \approx 57.3\)

This means the model predicts default for income < $57.3k, which happens to fall in the gap between the last defaulter ($45k) and the first non-defaulter ($60k).

python
y_pred_thresh = (model.predict(X) > 0.5).astype(int)
print(y_pred_thresh)
# Confusion: TP=3 (defaulters called default), FP=0
# TN=5 (non-defaulters called non-default), FN=0
accuracy = (y_pred_thresh == y).mean()
print(f"Accuracy: {accuracy:.4f}")
print(f"Baseline (always predict 0): {(y==0).mean():.4f}")
[1 1 1 0 0 0 0 0]
Accuracy: 1.0000
Baseline (always predict 0): 0.6250

On this clean dataset the thresholded regression classifies all eight samples correctly, well above the 62.5% you get from always predicting "no default". That result is fragile. Least squares minimizes squared error, not classification error, so the boundary is only a by-product of where the line happens to sit, and nothing in the fit anchors it to the gap between the classes.

The Outlier Sensitivity Problem

Add one outlier: income = $500k, no default. One wealthy customer should not change how we classify the $25k–$110k range.

python
X_out = np.vstack([X, [[500]]])
y_out = np.append(y, [0])

model_out = LinearRegression()
model_out.fit(X_out, y_out)
print(f"Original coef: {model.coef_[0]:.6f}")
print(f"New coef:      {model_out.coef_[0]:.6f}")

# New boundary: solve intercept + coef * x = 0.5 for x
new_boundary = (model_out.intercept_ - 0.5) / (-model_out.coef_[0])
print(f"New decision boundary: ${new_boundary:.1f}k")
Original coef: -0.014638
New coef:      -0.001381
New decision boundary: $-6.6k

The boundary shifted from about $57.3k to about −$6.6k, below any possible income. Every fitted value in the $25k–$110k range is now under 0.5, so the model predicts "no default" for everyone and misses the defaulters at $25k, $32k, and $45k. Adding one legitimate outlier corrupted the predictions for three correctly classified samples.

[Figure: two panels, "Without outlier" and "With outlier (income=$500k)", each showing the data points, the fitted regression line, and the dashed 0.5 threshold.]

Left panel: without the outlier, the 0.5 crossing separates the two classes. Right panel: the outlier flattens the regression line and drags the 0.5 crossing away from the gap between the classes, so samples that were classified correctly are now misclassified.

What We Actually Need

The sigmoid function maps any real-valued score to a valid probability:

\[ \sigma(z) = \frac{1}{1 + e^{-z}} \]

For any \(z \in \mathbb{R}\), \(\sigma(z) \in (0, 1)\):

python
import numpy as np

for z in [-5, -2, 0, 2, 5]:
    s = 1 / (1 + np.exp(-z))
    print(f"σ({z:+d}) = {s:.4f}")
σ(-5) = 0.0067
σ(-2) = 0.1192
σ(+0) = 0.5000
σ(+2) = 0.8808
σ(+5) = 0.9933

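The model becomes \(P(y = 1 \mid x) = \sigma(wx + b)\). The decision boundary is where \(\sigma(wx + b) = 0.5\), which means \(wx + b = 0\), which means \(x = -b/w\): a well-defined linear equation regardless of the data range.

The outlier sensitivity is fixed because sigmoid squashes extreme values: an income of $500k produces a very large negative \(z = wx + b\), and \(\sigma(z) \approx 0\) regardless of exactly how negative. Such a point is already predicted correctly with near-total confidence, so it contributes almost nothing to the loss or its gradient. Adding one extreme outlier slightly adjusts the weights but doesn't destroy the boundary.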

The Three Fundamental Problems

  1. Range violation: Linear regression outputs can fall outside \([0, 1]\), so they are not interpretable as probabilities. Sigmoid fixes this by construction.

  2. Outlier sensitivity: One extreme sample shifts the regression line and displaces the decision boundary, misclassifying an arbitrary number of correctly-handled samples. The sigmoid's saturation at extreme values absorbs outliers gracefully.

  3. Wrong loss function: MSE treats the problem as predicting a continuous value. Predictions of 0.999 (correct, confident) and 0.5 (correct, no confidence) have squared-error losses of 0.000001 and 0.25 relative to \(y = 1\), and even a confidently wrong prediction of 0.001 costs only about 1. Binary cross-entropy assigns a large loss to confident wrong predictions and grows without bound as the prediction approaches the wrong extreme, so the gradient signal is strongest where the model most needs to correct.
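The difference between the two losses in item 3 is easy to tabulate. A minimal sketch for a single sample whose true label is 1, using per-sample squared error and binary cross-entropy with the natural logarithm; the function names are illustrative.

```python
import math

y_true = 1  # the sample really is a defaulter


def squared_error(p, y):
    """Per-sample squared error between prediction p and label y."""
    return (y - p) ** 2


def cross_entropy(p, y):
    """Binary cross-entropy for one sample; p must lie strictly in (0, 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


# From confidently right to confidently wrong.
for p in [0.999, 0.5, 0.1, 0.001]:
    print(f"p={p:<6} squared error={squared_error(p, y_true):.6f}  "
          f"cross-entropy={cross_entropy(p, y_true):.4f}")
```

For predictions inside (0, 1) the squared error can never exceed 1, while the cross-entropy keeps growing as `p` approaches 0 for a true label of 1.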

Linear vs Logistic Regression for Classification

| Aspect | Linear Regression | Logistic Regression |
|---|---|---|
| Output range | \((-\infty, +\infty)\) | \((0, 1)\) |
| Interpretation | Not a probability | \(P(y = 1 \mid x)\) |
| Decision boundary | Can be ill-positioned | Always at \(z = 0\) → linear in \(x\) |
| Outlier sensitivity | High: one outlier shifts boundary | Low: sigmoid squashes extreme values |
| Loss function | MSE (ignores confidence) | Binary cross-entropy (penalizes wrong confidence) |

Test Your Understanding

  1. With the threshold at 0.5, the decision boundary sits at about $57.3k and all eight samples are classified correctly. If you lowered the threshold to 0.3, what income would become the new boundary? How would accuracy change?

  2. Adding one outlier shifted the boundary from about $57.3k to about −$6.6k, misclassifying 3 samples. How would the boundary shift if the outlier had income = $5000k instead of $500k?

  3. \(\sigma(0) = 0.5\) always, regardless of the weights. What does this mean about the decision boundary (the income where P(default) = 0.5) in logistic regression, and how does it differ from the linear regression boundary?

  4. The three problems are: range violation, outlier sensitivity, and wrong loss. If you replaced MSE with mean absolute error (MAE) for linear regression classification, which problems would remain?

  5. If the data were perfectly linearly separable (a clear income gap between all defaulters and non-defaulters), would linear regression with threshold 0.5 give correct predictions? What breaks the approach even in this ideal case?
