Multiclass Logistic Regression: OvR (One vs Rest)

Jun 26, 2026 · 6 min read · By Mohammed Vasim
Machine Learning · AI · Data Science

Binary logistic regression outputs a single probability, $P(y = 1 \mid x)$. With 3 classes, one such model is not enough; you need a strategy. One-vs-Rest (OvR) decomposes the 3-class problem into 3 binary problems, one per class. Each classifier asks: "Is this sample in my class, or not?" The final prediction is the class with the highest confidence. The catch: the raw probability outputs from 3 independent classifiers don't sum to 1.

Anchor dataset: Iris flowers — petal length and petal width classify 3 species.

python
from sklearn.datasets import load_iris
import numpy as np

iris = load_iris()
# Classes: 0=Setosa, 1=Versicolor, 2=Virginica
# Features used: petal_length (col 2), petal_width (col 3)

# 6-sample hand-trace subset (2 per class)
X_trace = np.array([
    [1.4, 0.2],  # Setosa
    [1.5, 0.4],  # Setosa
    [4.7, 1.4],  # Versicolor
    [4.5, 1.5],  # Versicolor
    [6.1, 2.3],  # Virginica
    [5.8, 1.8],  # Virginica
])
y_trace = np.array([0, 0, 1, 1, 2, 2])

The Multiclass Problem

Logistic regression is binary by design: it models $P(y = 1 \mid x)$ versus $P(y = 0 \mid x)$. Two extension strategies:

  • One-vs-Rest (OvR): train $K$ binary classifiers (one per class). Each is fit independently. At prediction time, run all $K$ and pick the class with highest confidence.
  • Softmax (Multinomial): train one joint classifier that directly outputs probabilities summing to 1.

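OvR is simpler and works with any binary classifier. It was sklearn's default multiclass strategy for logistic regression before version 0.22; newer versions default to the multinomial formulation with the lbfgs solver, so OvR has to be requested explicitly.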

Training 3 Binary Classifiers

For each class, relabel the 6-sample anchor: the class of interest becomes 1, all others become 0.

Classifier 1 — Setosa vs {Versicolor, Virginica}:

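| Sample | petal_l | petal_w | y |
|---|---|---|---|
| Setosa-1 | 1.4 | 0.2 | 1 |
| Setosa-2 | 1.5 | 0.4 | 1 |
| Versicolor-1 | 4.7 | 1.4 | 0 |
| Versicolor-2 | 4.5 | 1.5 | 0 |
| Virginica-1 | 6.1 | 2.3 | 0 |
| Virginica-2 | 5.8 | 1.8 | 0 |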

Classifiers 2 and 3 use the same table with the y column relabeled: Versicolor=1 for Classifier 2, Virginica=1 for Classifier 3.
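The relabeling step can be sketched in a few lines of NumPy. This is a minimal illustration on the 6-sample trace; the `class_names` list and the `binary_targets` dictionary are names introduced here, not part of sklearn.

```python
import numpy as np

# Labels of the 6-sample trace subset (2 per class)
y_trace = np.array([0, 0, 1, 1, 2, 2])
class_names = ["Setosa", "Versicolor", "Virginica"]  # illustrative names

# One binary target vector per class: 1 for the class of interest, 0 for the rest
binary_targets = {
    name: (y_trace == k).astype(int)
    for k, name in enumerate(class_names)
}

for name, y_bin in binary_targets.items():
    print(f"{name:10s} vs rest: {y_bin}")
```

The features stay identical across the three problems; only the target vector changes.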

Approximate weights learned by sklearn (stated, not hand-derived):

  • Classifier 1 (Setosa):
  • Classifier 2 (Versicolor):
  • Classifier 3 (Virginica):

Per-Class Probability Trace

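Compute the logit $z$ and the probability $\sigma(z) = 1 / (1 + e^{-z})$ for each classifier on sample Versicolor-1 (petal_l=4.7, petal_w=1.4):

| Classifier | $z$ | $\sigma(z)$ |
|---|---|---|
| Setosa | −25.1 | ≈0.000 |
| Versicolor | 2.95 | 0.950 |
| Virginica | 8.45 | ≈1.000 |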
Decision: argmax of [0.000, 0.950, 1.000] → Virginica (wrong — true class is Versicolor).
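To check the trace numerically, the sketch below applies the sigmoid to the three logits quoted above and takes the argmax. It starts from the stated logits rather than from fitted weights, so it only verifies the last step of the computation.

```python
import numpy as np

# Logits for Versicolor-1 as stated in the trace: Setosa, Versicolor, Virginica
z = np.array([-25.1, 2.95, 8.45])

# Sigmoid turns each logit into an independent "my class vs rest" probability
sigma = 1.0 / (1.0 + np.exp(-z))
print(np.round(sigma, 3))

# OvR picks the classifier with the highest confidence
pred = int(np.argmax(sigma))
print("Predicted class index:", pred)  # 2 corresponds to Virginica
```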

OvR's Key Weakness: Probabilities Don't Sum to 1

The three sigmoid values sum to $0.000 + 0.950 + 1.000 = 1.950$, not 1. Each classifier is trained independently without knowledge of the others, so no constraint forces the probabilities to be collectively coherent.

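Sklearn normalizes by dividing each by the sum:

$$\hat{p}_k = \frac{\sigma(z_k)}{\sum_{j=1}^{3} \sigma(z_j)}, \qquad \hat{p} \approx \left[\frac{0.000}{1.950},\ \frac{0.950}{1.950},\ \frac{1.000}{1.950}\right] = [0.000,\ 0.487,\ 0.513]$$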

Final prediction: Virginica (0.513 > 0.487). Even after normalization, Versicolor-1 is still misclassified — Classifier 3 (Virginica) assigns a score of 1.000 because the Virginica vs {Setosa, Versicolor} boundary places many Versicolor samples on the Virginica side.
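The normalization itself is a one-line division. A minimal sketch using the rounded sigmoid values from the trace (the variable names are illustrative):

```python
import numpy as np

# Raw OvR outputs for Versicolor-1: Setosa, Versicolor, Virginica
raw = np.array([0.000, 0.950, 1.000])

# Divide by the sum so the three values form a probability distribution
normalized = raw / raw.sum()

print("Raw sum:       ", raw.sum())
print("Normalized:    ", np.round(normalized, 3))
print("Normalized sum:", normalized.sum())
```

Dividing by a positive constant never changes the ranking, so the argmax is the same before and after normalization.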

Full 6-sample prediction table:

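| Sample | Setosa $\sigma$ | Versicolor $\sigma$ | Virginica $\sigma$ | Prediction | True Class |
|---|---|---|---|---|---|
| Setosa-1 (1.4, 0.2) | ≈1.000 | ≈0.001 | ≈0.000 | Setosa | 0 ✓ |
| Setosa-2 (1.5, 0.4) | ≈0.999 | ≈0.003 | ≈0.000 | Setosa | 0 ✓ |
| Versicolor-1 (4.7, 1.4) | ≈0.000 | 0.950 | ≈1.000 | Virginica | 1 ✗ |
| Versicolor-2 (4.5, 1.5) | ≈0.000 | 0.920 | ≈0.999 | Virginica | 1 ✗ |
| Virginica-1 (6.1, 2.3) | ≈0.000 | ≈0.010 | ≈1.000 | Virginica | 2 ✓ |
| Virginica-2 (5.8, 1.8) | ≈0.000 | ≈0.030 | ≈0.999 | Virginica | 2 ✓ |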

Versicolor is the hard class: its petal dimensions overlap with Virginica. The 2D feature space (petal_l vs petal_w) doesn't fully separate these two species, and the Virginica classifier (trained to separate Virginica from Setosa and Versicolor combined) picks up large-petal Versicolor samples.

[Figure: petal length (x-axis) vs petal width (y-axis), with shaded Setosa, Versicolor, and Virginica regions and the six trace samples plotted; the two Versicolor samples are circled in red and marked ✗.]

The two Versicolor samples (circled in red) sit in the Virginica decision region because Classifier 3 draws a boundary that encloses large-petal samples regardless of species.

sklearn OvR Implementation

python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

iris = load_iris()
# Keep only petal_length (col 2) and petal_width (col 3)
X, y = iris.data[:, 2:4], iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Fit the scaler on the training split only, then reuse it on the test split
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc  = scaler.transform(X_test)

# multi_class='ovr' is deprecated since scikit-learn 1.5;
# OneVsRestClassifier fits the same K independent binary models
model = OneVsRestClassifier(LogisticRegression(solver='lbfgs', max_iter=1000))
model.fit(X_train_sc, y_train)

print("Coefficients per class:")
for cls, est in zip(iris.target_names, model.estimators_):
    print(f"  {cls:12s}: {est.coef_[0].round(3)}")

print(f"\nTest accuracy: {model.score(X_test_sc, y_test):.4f}")

Coefficients per class:
  setosa      : [-1.432 -1.095]
  versicolor  : [ 0.582 -0.421]
  virginica   : [ 0.850  1.516]

Test accuracy: 0.9667

Each fitted binary estimator in model.estimators_ exposes a coef_ of shape (1, p), so stacking them gives a (K, p) matrix with one row per class. The Virginica classifier has a large positive coefficient for petal_width (1.516) because wider petals strongly predict Virginica.

OvR vs Softmax (Multinomial)

| Aspect | OvR | Softmax (Multinomial) |
|---|---|---|
| Number of classifiers | K (one per class) | 1 joint classifier |
| Probabilities sum to 1 | No (raw); yes after normalization | Always by construction |
| Training cost | K separate fits | 1 joint optimization |
| Works with any binary classifier | Yes | No (requires probability outputs) |
| Better for imbalanced classes | Easier to adjust per-class | Harder |
| sklearn setting | multi_class='ovr' (deprecated since 1.5; use OneVsRestClassifier) | multi_class='multinomial' (the default behavior for multiclass data with lbfgs) |

The key architectural difference: OvR classifiers share no information during training. Classifier 1 doesn't know that Classifier 3 will claim the same region. Softmax solves a joint optimization where the sum constraint is enforced during training — better calibrated probabilities but requires that your model can output probabilities (logistic regression can; SVMs cannot without calibration).
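The difference in how the two approaches turn scores into probabilities can be seen on a single logit vector. The sketch below reuses the OvR logits from the Versicolor-1 trace purely to illustrate the mechanics; this is an assumption for demonstration, since a trained multinomial model would learn its own, different logits.

```python
import numpy as np

# Assumption: reuse the OvR logits from the trace as stand-in scores
z = np.array([-25.1, 2.95, 8.45])

# OvR: each score goes through its own sigmoid, with no coupling between classes
sigmoids = 1.0 / (1.0 + np.exp(-z))

# Softmax: scores are exponentiated and divided by a shared normalizer
shifted = z - z.max()  # subtract the max for numerical stability
softmax = np.exp(shifted) / np.exp(shifted).sum()

print("Independent sigmoids sum:", round(float(sigmoids.sum()), 3))
print("Softmax sum:             ", round(float(softmax.sum()), 3))
```

In the softmax case the shared denominator is part of the model, so the sum constraint shapes the gradients during training instead of being applied afterwards.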

OvR Prediction Rule

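  1. Train $K$ binary classifiers (one per class)
  2. For new sample $x$: compute $\sigma(z_k)$ for $k = 1, \dots, K$
  3. Normalize: $\hat{p}_k = \sigma(z_k) / \sum_{j=1}^{K} \sigma(z_j)$
  4. Predict: $\hat{y} = \arg\max_k \hat{p}_k$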
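Steps 2 to 4 of the rule above can be sketched as a small NumPy function. The weights `W` and intercepts `b` below are hypothetical values chosen for illustration, not the coefficients fitted in this post, and `ovr_predict` is a name introduced here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ovr_predict(X, W, b):
    """X: (n, p) samples, W: (K, p) weights, b: (K,) intercepts.

    Returns predicted class indices and normalized probabilities.
    """
    raw = sigmoid(X @ W.T + b)                    # step 2: one sigmoid per class
    probs = raw / raw.sum(axis=1, keepdims=True)  # step 3: normalize each row
    return probs.argmax(axis=1), probs            # step 4: most confident class

# Hypothetical parameters for 3 classes and 2 features (illustration only)
W = np.array([[-2.0, -1.0],
              [ 0.5, -0.5],
              [ 1.0,  2.0]])
b = np.array([5.0, -1.0, -8.0])

X_new = np.array([[1.4, 0.2],   # small petals
                  [6.1, 2.3]])  # large petals
labels, probs = ovr_predict(X_new, W, b)
print(labels)
print(np.round(probs, 3))
```

With these made-up parameters the small-petal sample lands in class 0 and the large-petal sample in class 2; real weights would come from the K fitted binary models.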

Test Your Understanding

  1. The sum of raw OvR probabilities for Versicolor-1 is 1.950. If you added a fourth class (Iris setosa hybrid) with a classifier outputting for this sample, would the final prediction (after normalization) still be Virginica?

  2. OvR trains K=3 binary classifiers on a dataset with n=150 samples. Each classifier trains on all 150 samples (just with relabeled y). How does the class imbalance differ across the 3 binary problems? Which classifier sees the most severe imbalance?

  3. The OvR test accuracy is 96.67%. The two Versicolor samples were misclassified in our 6-sample trace. Are these same samples likely misclassified on the full 150-sample model? Why or why not?

  4. You train OvR on a 10-class problem with 5,000 samples. How many total binary classifiers are trained, and what is the size of the training set (with labels) for each?

  5. Softmax guarantees probabilities sum to 1 by construction. If OvR's normalized probabilities are ≈[0.0, 0.487, 0.513] for a sample, what additional information would Softmax's training use that OvR ignores?
