Multiclass Logistic Regression: OvR (One vs Rest)

Jun 26, 2026•6 min read•By Mohammed Vasim

Machine Learning · AI · Data Science

Binary logistic regression outputs $P(y = 1 \mid x)$. With 3 classes, a single binary model is not enough; you need a strategy. One-vs-Rest (OvR) decomposes the 3-class problem into 3 binary problems, one per class. Each classifier asks: "Is this sample in my class, or not?" The final prediction is the class whose classifier is most confident. The catch: the raw probability outputs from 3 independent classifiers don't sum to 1.

Anchor dataset: Iris flowers — petal length and petal width classify 3 species.

python

from sklearn.datasets import load_iris
import numpy as np

iris = load_iris()
# Classes: 0=Setosa, 1=Versicolor, 2=Virginica
# Features used: petal_length (col 2), petal_width (col 3)

# 6-sample hand-trace subset (2 per class)
X_trace = np.array([
    [1.4, 0.2],  # Setosa
    [1.5, 0.4],  # Setosa
    [4.7, 1.4],  # Versicolor
    [4.5, 1.5],  # Versicolor
    [6.1, 2.3],  # Virginica
    [5.8, 1.8],  # Virginica
])
y_trace = np.array([0, 0, 1, 1, 2, 2])

The Multiclass Problem

Logistic regression is binary by design: it models $P(y = 1 \mid x)$ versus $P(y = 0 \mid x)$. Two extension strategies:

One-vs-Rest (OvR): train $K$ binary classifiers (one per class). Each is fit independently. At prediction time, run all $K$ and pick the class with highest confidence.
Softmax (Multinomial): train one joint classifier that directly outputs $K$ probabilities summing to 1.

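OvR is simpler and works with any binary classifier. It was scikit-learn's default for logistic regression in older releases; current releases default to the multinomial formulation for solvers that support it (including `lbfgs`), so OvR has to be requested explicitly.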

Training 3 Binary Classifiers

For each class, relabel the 6-sample anchor: the class of interest becomes 1, all others become 0.

Classifier 1 — Setosa vs {Versicolor, Virginica}:

Sample	petal_l	petal_w	$y_{binary}$
Setosa-1	1.4	0.2	1
Setosa-2	1.5	0.4	1
Versicolor-1	4.7	1.4	0
Versicolor-2	4.5	1.5	0
Virginica-1	6.1	2.3	0
Virginica-2	5.8	1.8	0

Classifiers 2 and 3 use the same table with the y column relabeled: Versicolor=1 for Classifier 2, Virginica=1 for Classifier 3.
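The relabeling step can be sketched in a few lines of NumPy. This is a minimal illustration that reuses the `y_trace` labels from the anchor dataset; `class_names` and `binary_targets` are hypothetical helper names introduced here, not sklearn objects.

```python
import numpy as np

# Labels of the 6-sample trace subset: 0=Setosa, 1=Versicolor, 2=Virginica
y_trace = np.array([0, 0, 1, 1, 2, 2])
class_names = ["Setosa", "Versicolor", "Virginica"]

# One binary target vector per class: 1 for the class of interest, 0 for the rest.
# `binary_targets` is an illustrative name, not a library object.
binary_targets = {
    name: (y_trace == k).astype(int)
    for k, name in enumerate(class_names)
}

for name, y_bin in binary_targets.items():
    print(f"{name:10s} vs rest: {y_bin.tolist()}")
```

The features never change between the three problems; only the target column does, which is why the three classifiers can be fit independently.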

Approximate weights learned by sklearn (stated, not hand-derived):

Classifier 1 (Setosa): $w = [w_{0} = - 6.2, w_{1} = - 2.8, w_{2} = - 4.1]$
Classifier 2 (Versicolor): $w = [w_{0} = 0.4, w_{1} = 0.9, w_{2} = - 1.2]$
Classifier 3 (Virginica): $w = [w_{0} = - 5.1, w_{1} = 1.9, w_{2} = 3.3]$

Per-Class Probability Trace

Compute $z = w_{0} + w_{1} \times \text{petal\_l} + w_{2} \times \text{petal\_w}$ and $\sigma(z)$ for each classifier on sample Versicolor-1 (petal_l=4.7, petal_w=1.4):

Classifier	$z$ computation	$z$	$σ (z)$
Setosa	$- 6.2 + (- 2.8) (4.7) + (- 4.1) (1.4) = - 6.2 - 13.16 - 5.74$	−25.1	$\approx 0.000$
Versicolor	$0.4 + (0.9) (4.7) + (- 1.2) (1.4) = 0.4 + 4.23 - 1.68$	2.95	0.950
Virginica	$- 5.1 + (1.9) (4.7) + (3.3) (1.4) = - 5.1 + 8.93 + 4.62$	8.45	$\approx 1.000$

Decision: argmax of [0.000, 0.950, 1.000] → Virginica (wrong — true class is Versicolor).
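The hand trace can be checked with a short script. This is a sketch that plugs the approximate weights stated above into the sigmoid for sample Versicolor-1; the `weights` and `scores` dictionaries and the `sigmoid` helper are names chosen for illustration.

```python
import math

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Approximate weights [w0, w1, w2] as stated in the article
weights = {
    "Setosa":     [-6.2, -2.8, -4.1],
    "Versicolor": [0.4, 0.9, -1.2],
    "Virginica":  [-5.1, 1.9, 3.3],
}

petal_l, petal_w = 4.7, 1.4  # sample Versicolor-1

scores = {}
for name, (w0, w1, w2) in weights.items():
    z = w0 + w1 * petal_l + w2 * petal_w
    scores[name] = sigmoid(z)
    print(f"{name:10s} z = {z:7.2f}  sigma(z) = {scores[name]:.3f}")

# OvR decision: the class whose classifier is most confident
prediction = max(scores, key=scores.get)
print("Prediction:", prediction)
```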

OvR's Key Weakness: Probabilities Don't Sum to 1

The three sigmoid values sum to $0.000 + 0.950 + 1.000 = 1.950$ , not 1. Each classifier is trained independently without knowledge of the others — there's no constraint enforcing that the probabilities are collectively coherent.

Sklearn normalizes by dividing each by the sum:

$\hat{p}(\text{Setosa}) = 0/1.95 = 0.000$, $\hat{p}(\text{Versicolor}) = 0.95/1.95 = 0.487$, $\hat{p}(\text{Virginica}) = 1.00/1.95 = 0.513$

Final prediction: Virginica (0.513 > 0.487). Even after normalization, Versicolor-1 is still misclassified — Classifier 3 (Virginica) assigns a score of 1.000 because the Virginica vs {Setosa, Versicolor} boundary places many Versicolor samples on the Virginica side.
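The normalization arithmetic is small enough to reproduce directly. The sketch below starts from the rounded sigmoid outputs for Versicolor-1 and divides each by their sum; the dictionary names `raw` and `normalized` are illustrative.

```python
# Raw sigmoid outputs for Versicolor-1, rounded as in the trace
raw = {"Setosa": 0.000, "Versicolor": 0.950, "Virginica": 1.000}

total = sum(raw.values())  # the three independent scores add up to 1.950
normalized = {name: s / total for name, s in raw.items()}

print(f"sum of raw scores: {total:.3f}")
for name, p in normalized.items():
    print(f"{name:10s} {p:.3f}")
print(f"sum after normalization: {sum(normalized.values()):.3f}")
```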

Full 6-sample prediction table (sigmoid values are approximate and illustrative; only the Versicolor-1 row is hand-traced above):

Sample	Setosa $σ$	Versicolor $σ$	Virginica $σ$	Prediction	True Class
Setosa-1 (1.4, 0.2)	≈1.000	≈0.001	≈0.000	Setosa	0 ✓
Setosa-2 (1.5, 0.4)	≈0.999	≈0.003	≈0.000	Setosa	0 ✓
Versicolor-1 (4.7, 1.4)	≈0.000	0.950	≈1.000	Virginica	1 ✗
Versicolor-2 (4.5, 1.5)	≈0.000	0.920	≈0.999	Virginica	1 ✗
Virginica-1 (6.1, 2.3)	≈0.000	≈0.010	≈1.000	Virginica	2 ✓
Virginica-2 (5.8, 1.8)	≈0.000	≈0.030	≈0.999	Virginica	2 ✓

Versicolor is the hard class: its petal dimensions overlap with Virginica's. The 2D feature space (petal_l vs petal_w) doesn't fully separate these two species, and the Virginica classifier (trained to detect anything that's not Setosa or Versicolor) picks up large-petal Versicolor samples. Versicolor also sits between the other two species in petal size, so a single linear boundary has a hard time isolating it from both sides.

[Figure: three shaded decision regions labeled Setosa, Versicolor, and Virginica, with the six trace samples plotted as points; the two Versicolor points are circled in red and marked ✗.]

The two Versicolor samples (circled in red) sit in the Virginica decision region because Classifier 3 draws a boundary that encloses large-petal samples regardless of species.

sklearn OvR Implementation

python

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data[:, 2:4], iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc  = scaler.transform(X_test)

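# Note: `multi_class` is deprecated since scikit-learn 1.5; on newer releases,
# wrap the estimator in sklearn.multiclass.OneVsRestClassifier instead.
model = LogisticRegression(multi_class='ovr', solver='lbfgs', max_iter=1000)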
model.fit(X_train_sc, y_train)

print("Coefficients per class:")
for i, cls in enumerate(iris.target_names):
    print(f"  {cls:12s}: {model.coef_[i].round(3)}")

print(f"\nTest accuracy: {model.score(X_test_sc, y_test):.4f}")

Coefficients per class:
  setosa      : [-1.432 -1.095]
  versicolor  : [ 0.582 -0.421]
  virginica   : [ 0.850  1.516]

Test accuracy: 0.9667

model.coef_ has shape (K, p) — one row per class. The Virginica classifier has a large positive coefficient for petal_width (1.516) because wider petals strongly predict Virginica.
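On recent scikit-learn releases the `multi_class` argument is deprecated, and the documented alternative for OvR is to wrap the estimator in `sklearn.multiclass.OneVsRestClassifier`. The sketch below is one way to do that with the same split and scaling as the code above. The variable name `ovr` is illustrative, and the printed coefficients are not guaranteed to match the earlier output digit for digit.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X, y = iris.data[:, 2:4], iris.target  # petal length, petal width

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)

# Explicit OvR wrapper: fits one binary LogisticRegression per class
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X_train_sc, y_train)

# Each fitted binary model lives in ovr.estimators_, with coef_ of shape (1, p)
print("Coefficients per class:")
for cls, est in zip(iris.target_names, ovr.estimators_):
    print(f"  {cls:12s}: {est.coef_[0].round(3)}")

# predict_proba returns the per-class scores normalized to sum to 1 per row
proba = ovr.predict_proba(X_test_sc)
print("Row sums (first 3):", proba.sum(axis=1)[:3])
print(f"Test accuracy: {ovr.score(X_test_sc, y_test):.4f}")
```

With the wrapper, the per-class coefficients are read from the individual binary estimators rather than from a single `coef_` matrix.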

OvR vs Softmax (Multinomial)

Aspect	OvR	Softmax (Multinomial)
Number of classifiers	K (one per class)	1 joint classifier
Probabilities sum to 1	No (raw); yes after normalization	Always by construction
Training cost	K separate fits	1 joint optimization
Works with any binary classifier	Yes	No (needs a model with a joint multiclass objective)
Better for imbalanced classes	Easier to adjust per-class	Harder
sklearn setting	`multi_class='ovr'`	`multi_class='multinomial'`

The key architectural difference: OvR classifiers share no information during training. Classifier 1 doesn't know that Classifier 3 will claim the same region. Softmax solves a single joint optimization in which the $K$ probabilities sum to 1 by construction, so the classes compete for probability mass during training. This tends to give more coherent probabilities, but it only applies to models that define a joint objective over all classes (logistic regression does; a standard SVM does not, and it needs a separate calibration step even to produce probabilities).
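The sum-to-1 difference is easy to see numerically. The sketch below applies independent sigmoids and a softmax to the same set of scores. The `logits` values are hypothetical and chosen only for illustration; a real multinomial model would learn its own weights rather than reuse OvR scores.

```python
import math

def sigmoid(z):
    """Independent per-class squashing, as used by each OvR classifier."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    """Map K real-valued scores to K probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical per-class scores, chosen only for illustration
logits = [-1.0, 2.0, 2.5]

ovr_style = [sigmoid(z) for z in logits]  # each score ignores the others
joint = softmax(logits)                   # coupled through the shared denominator

print("sum of sigmoids:", round(sum(ovr_style), 3))
print("sum of softmax :", round(sum(joint), 3))
```

Because every softmax output shares one denominator, raising one class's score necessarily lowers the probability of the others, which is the coupling that independent sigmoids lack.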

OvR Prediction Rule

Train $K$ binary classifiers (one per class)
For new sample $x$: compute $\sigma_k = \sigma(w_k \cdot x)$ for $k = 1, \dots, K$
Normalize: $\hat{p}_k = \sigma_k / \sum_{i} \sigma_i$
Predict: $\hat{y} = \arg\max_k \hat{p}_k$

Test Your Understanding

The sum of raw OvR probabilities for Versicolor-1 is 1.950. If you added a fourth class (Iris setosa hybrid) with a classifier outputting $σ = 0.3$ for this sample, would the final prediction (after normalization) still be Virginica?
OvR trains K=3 binary classifiers on a dataset with n=150 samples. Each classifier trains on all 150 samples (just with relabeled y). How does the class imbalance differ across the 3 binary problems? Which classifier sees the most severe imbalance?
The OvR test accuracy is 96.67%. The two Versicolor samples were misclassified in our 6-sample trace. Are these same samples likely misclassified on the full 150-sample model? Why or why not?
You train OvR on a 10-class problem with 5,000 samples. How many total binary classifiers are trained, and what is the size of the training set (with labels) for each?
Softmax guarantees probabilities sum to 1 by construction. If OvR's normalized probabilities are ≈[0.0, 0.487, 0.513] for a sample, what additional information would Softmax's training use that OvR ignores?