Multiclass Logistic Regression: OvR (One vs Rest)
Binary logistic regression outputs $P(y = 1 \mid \mathbf{x})$, a single probability for one positive class. With 3 classes, a single binary model is not enough: you need a strategy. One-vs-Rest (OvR) decomposes the 3-class problem into 3 binary problems, one per class. Each classifier asks: "Is this sample in my class, or not?" The final prediction is the class with the highest confidence. The catch: the raw probability outputs from 3 independent classifiers don't sum to 1.
Anchor dataset: Iris flowers — petal length and petal width classify 3 species.
from sklearn.datasets import load_iris
import numpy as np
iris = load_iris()
# Classes: 0=Setosa, 1=Versicolor, 2=Virginica
# Features used: petal_length (col 2), petal_width (col 3)
# 6-sample hand-trace subset (2 per class)
X_trace = np.array([
[1.4, 0.2], # Setosa
[1.5, 0.4], # Setosa
[4.7, 1.4], # Versicolor
[4.5, 1.5], # Versicolor
[6.1, 2.3], # Virginica
[5.8, 1.8], # Virginica
])
y_trace = np.array([0, 0, 1, 1, 2, 2])

The Multiclass Problem
Logistic regression is binary by design: it models $P(y = 1 \mid \mathbf{x})$ versus $P(y = 0 \mid \mathbf{x})$. Two extension strategies:
- One-vs-Rest (OvR): train $K$ binary classifiers (one per class). Each is fit independently. At prediction time, run all $K$ and pick the class with highest confidence.
- Softmax (Multinomial): train one joint classifier that directly outputs probabilities summing to 1.
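OvR is simpler and works with any binary classifier. It was sklearn's default for logistic regression in older releases; since version 0.22 the default lbfgs solver fits the multinomial model on multiclass data unless OvR is requested explicitly.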
Training 3 Binary Classifiers
For each class, relabel the 6-sample anchor: the class of interest becomes 1, all others become 0.
Classifier 1 — Setosa vs {Versicolor, Virginica}:
| Sample | petal_l | petal_w | y |
|---|---|---|---|
| Setosa-1 | 1.4 | 0.2 | 1 |
| Setosa-2 | 1.5 | 0.4 | 1 |
| Versicolor-1 | 4.7 | 1.4 | 0 |
| Versicolor-2 | 4.5 | 1.5 | 0 |
| Virginica-1 | 6.1 | 2.3 | 0 |
| Virginica-2 | 5.8 | 1.8 | 0 |
Classifiers 2 and 3 use the same table with the y column relabeled: Versicolor=1 for Classifier 2, Virginica=1 for Classifier 3.
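The relabeling step can be sketched in a few lines of NumPy. This is a minimal illustration on the 6-sample trace labels; the names `class_names` and `ovr_targets` are chosen here for illustration and are not part of sklearn.

```python
import numpy as np

# Trace labels from the anchor subset: 0=Setosa, 1=Versicolor, 2=Virginica
y_trace = np.array([0, 0, 1, 1, 2, 2])
class_names = ["Setosa", "Versicolor", "Virginica"]  # illustrative names

# One binary target vector per class: 1 for the class of interest, 0 for the rest
ovr_targets = {
    name: (y_trace == k).astype(int)
    for k, name in enumerate(class_names)
}

for name, target in ovr_targets.items():
    print(f"{name:10s} vs rest: {target}")
```

In this subset every binary problem ends up with two positives and four negatives; the features never change, only the labels do.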
Approximate weights learned by sklearn (stated, not hand-derived):
- Classifier 1 (Setosa):
- Classifier 2 (Versicolor):
- Classifier 3 (Virginica):
Per-Class Probability Trace
Compute $z_k = \mathbf{w}_k^\top \mathbf{x} + b_k$ and $\sigma(z_k) = 1 / (1 + e^{-z_k})$ for each classifier on sample Versicolor-1 (petal_l=4.7, petal_w=1.4):
| Classifier | $z$ | $\sigma(z)$ |
|---|---|---|
| Setosa | −25.1 | ≈0.000 |
| Versicolor | 2.95 | 0.950 |
| Virginica | 8.45 | ≈1.000 |
Decision: argmax of [0.000, 0.950, 1.000] → Virginica (wrong — true class is Versicolor).
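To check the trace numerically, the sketch below applies the sigmoid to the three $z$ values and takes the argmax. The $z$ values are copied from the trace rather than recomputed from weights, and `sigmoid` is a small helper defined here.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

classes = ["Setosa", "Versicolor", "Virginica"]
# z values for Versicolor-1, copied from the trace (not recomputed)
z = np.array([-25.1, 2.95, 8.45])

raw_probs = sigmoid(z)                           # one independent probability per classifier
predicted = classes[int(np.argmax(raw_probs))]   # highest confidence wins

print(np.round(raw_probs, 3))
print(predicted)
```

The Virginica score is so close to 1 that a Versicolor score of 0.950, which would look confident on its own, still loses the argmax.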
OvR's Key Weakness: Probabilities Don't Sum to 1
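The three sigmoid values sum to $0.000 + 0.950 + 1.000 = 1.950$, not 1. Each classifier is trained independently without knowledge of the others, so there's no constraint enforcing that the probabilities are collectively coherent.
Sklearn normalizes by dividing each by the sum:
$$\hat{p}_k = \frac{\sigma(z_k)}{\sum_{j=1}^{3} \sigma(z_j)} \quad\Rightarrow\quad \frac{[0.000,\ 0.950,\ 1.000]}{1.950} \approx [0.000,\ 0.487,\ 0.513]$$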
Final prediction: Virginica (0.513 > 0.487). Even after normalization, Versicolor-1 is still misclassified — Classifier 3 (Virginica) assigns a score of 1.000 because the Virginica vs {Setosa, Versicolor} boundary places many Versicolor samples on the Virginica side.
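The normalization step is a single division. A minimal sketch using the rounded sigmoid outputs from the trace (so the results carry the same rounding):

```python
import numpy as np

# Rounded raw OvR outputs for Versicolor-1: [Setosa, Versicolor, Virginica]
raw = np.array([0.000, 0.950, 1.000])

total = raw.sum()            # sum of the raw outputs, not equal to 1
normalized = raw / total     # rescale so the three values sum to 1

print(round(float(total), 3))
print(np.round(normalized, 3))

# Dividing by a positive constant never changes which entry is largest
assert np.argmax(normalized) == np.argmax(raw)
```

Normalization repairs the sum but cannot repair the ranking: whichever classifier had the largest raw score still wins.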
Full 6-sample prediction table:
| Sample | Setosa | Versicolor | Virginica | Prediction | True Class |
|---|---|---|---|---|---|
| Setosa-1 (1.4, 0.2) | ≈1.000 | ≈0.001 | ≈0.000 | Setosa | 0 ✓ |
| Setosa-2 (1.5, 0.4) | ≈0.999 | ≈0.003 | ≈0.000 | Setosa | 0 ✓ |
| Versicolor-1 (4.7, 1.4) | ≈0.000 | 0.950 | ≈1.000 | Virginica | 1 ✗ |
| Versicolor-2 (4.5, 1.5) | ≈0.000 | 0.920 | ≈0.999 | Virginica | 1 ✗ |
| Virginica-1 (6.1, 2.3) | ≈0.000 | ≈0.010 | ≈1.000 | Virginica | 2 ✓ |
| Virginica-2 (5.8, 1.8) | ≈0.000 | ≈0.030 | ≈0.999 | Virginica | 2 ✓ |
Versicolor is the hard class — its petal dimensions overlap with Virginica. The 2D feature space (petal_l vs petal_w) doesn't fully separate these two species, and the Virginica classifier (trained to detect anything that's not Setosa or Versicolor) picks up large-petal Versicolor samples.
[Figure: three shaded decision regions labeled Setosa, Versicolor, and Virginica, with the six trace samples plotted; the two Versicolor samples are circled in red and marked ✗.]
The two Versicolor samples (circled in red) sit in the Virginica decision region because Classifier 3 draws a boundary that encloses large-petal samples regardless of species.
sklearn OvR Implementation
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data[:, 2:4], iris.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)
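# Note: the multi_class argument is deprecated as of scikit-learn 1.5;
# on newer releases, wrap the estimator in sklearn.multiclass.OneVsRestClassifier instead.
model = LogisticRegression(multi_class='ovr', solver='lbfgs', max_iter=1000)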
model.fit(X_train_sc, y_train)
print("Coefficients per class:")
for i, cls in enumerate(iris.target_names):
    print(f" {cls:12s}: {model.coef_[i].round(3)}")
print(f"\nTest accuracy: {model.score(X_test_sc, y_test):.4f}")

Coefficients per class:
setosa : [-1.432 -1.095]
versicolor : [ 0.582 -0.421]
virginica : [ 0.850 1.516]
Test accuracy: 0.9667
model.coef_ has shape (K, p) — one row per class. The Virginica classifier has a large positive coefficient for petal_width (1.516) because wider petals strongly predict Virginica.
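On recent scikit-learn releases the `multi_class` argument is deprecated, so a comparable setup can be sketched with `sklearn.multiclass.OneVsRestClassifier`. This is an alternative formulation, not the code that produced the output above; coefficients and accuracy may differ slightly, so no expected numbers are shown.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X, y = iris.data[:, 2:4], iris.target  # petal length, petal width
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The wrapper clones the base estimator and fits one copy per class
ovr = make_pipeline(
    StandardScaler(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
ovr.fit(X_train, y_train)

binary_models = ovr[-1].estimators_   # list of K fitted binary classifiers
proba = ovr.predict_proba(X_test)     # rows are normalized to sum to 1

for cls, est in zip(iris.target_names, binary_models):
    print(f"{cls:12s}: w={est.coef_[0].round(3)}, b={est.intercept_[0]:.3f}")
print(f"Test accuracy: {ovr.score(X_test, y_test):.4f}")
```

Here each binary model exposes its own `coef_` of shape (1, p), so the (K, p) matrix is recovered by stacking the rows from `estimators_`.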
OvR vs Softmax (Multinomial)
| Aspect | OvR | Softmax (Multinomial) |
|---|---|---|
| Number of classifiers | K (one per class) | 1 joint classifier |
| Probabilities sum to 1 | No (raw); yes after normalization | Always by construction |
| Training cost | K separate fits | 1 joint optimization |
| Works with any binary classifier | Yes | No — requires probability outputs |
| Better for imbalanced classes | Easier to adjust per-class | Harder |
| sklearn setting | multi_class='ovr' | multi_class='multinomial' |
The key architectural difference: OvR classifiers share no information during training. Classifier 1 doesn't know that Classifier 3 will claim the same region. Softmax solves a joint optimization where the sum constraint is enforced during training — better calibrated probabilities but requires that your model can output probabilities (logistic regression can; SVMs cannot without calibration).
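To make the "sum to 1 by construction" point concrete, the sketch below contrasts independent sigmoids with a softmax over the same scores. The scores are made-up illustrative values, not outputs of any model in this post, and a real multinomial model would learn different weights than the OvR classifiers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtracting the max improves numerical stability without changing the result
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Hypothetical per-class scores for one sample (illustrative only)
scores = np.array([-4.0, 1.5, 2.0])

independent = sigmoid(scores)   # each value lies in (0, 1); no joint constraint
joint = softmax(scores)         # coupled through the shared denominator

print(independent.sum())        # generally not 1
print(joint.sum())              # 1 up to floating-point error
```

Because every softmax output shares one denominator, raising one class's score necessarily lowers the others' probabilities, which is the coupling OvR lacks during training.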
OvR Prediction Rule
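- Train $K$ binary classifiers (one per class)
- For new sample $\mathbf{x}$: compute $p_k = \sigma(\mathbf{w}_k^\top \mathbf{x} + b_k)$ for $k = 1, \dots, K$
- Normalize: $\hat{p}_k = p_k / \sum_{j=1}^{K} p_j$
- Predict: $\hat{y} = \arg\max_k \hat{p}_k$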
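The steps above can be sketched as a single function. The weights `W` and intercepts `b` below are hypothetical placeholders, not the values sklearn learned; only the mechanics are the point.

```python
import numpy as np

def ovr_predict(W, b, x):
    """Return (predicted class index, normalized probabilities) for one sample.

    W: (K, p) array, one weight row per binary classifier
    b: (K,) array of intercepts
    x: (p,) feature vector
    """
    z = W @ x + b                      # one score per classifier
    p = 1.0 / (1.0 + np.exp(-z))       # independent sigmoid outputs
    p_hat = p / p.sum()                # normalize so the values sum to 1
    return int(np.argmax(p_hat)), p_hat

# Hypothetical parameters for K=3 classes and p=2 features (illustration only)
W = np.array([[-2.0, -1.0],
              [ 0.5, -0.5],
              [ 1.0,  1.0]])
b = np.array([5.0, 0.0, -6.0])

k, p_hat = ovr_predict(W, b, np.array([1.4, 0.2]))
print(k, np.round(p_hat, 3))
```

Since normalization divides every entry by the same positive number, the argmax could equally be taken over the raw sigmoid outputs; normalizing matters only when the probabilities themselves are reported.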
Test Your Understanding
- The sum of raw OvR probabilities for Versicolor-1 is 1.950. If you added a fourth class (Iris setosa hybrid) with a classifier outputting for this sample, would the final prediction (after normalization) still be Virginica?
- OvR trains K=3 binary classifiers on a dataset with n=150 samples. Each classifier trains on all 150 samples (just with relabeled y). How does the class imbalance differ across the 3 binary problems? Which classifier sees the most severe imbalance?
- The OvR test accuracy is 96.67%. The two Versicolor samples were misclassified in our 6-sample trace. Are these same samples likely misclassified on the full 150-sample model? Why or why not?
- You train OvR on a 10-class problem with 5,000 samples. How many total binary classifiers are trained, and what is the size of the training set (with labels) for each?
- Softmax guarantees probabilities sum to 1 by construction. If OvR's normalized probabilities are ≈[0.0, 0.487, 0.513] for a sample, what additional information would Softmax's training use that OvR ignores?