SVM: Hard Margin and Soft Margin

Jun 26, 2026 · 8 min read · By Mohammed Vasim
Machine Learning, AI, Data Science

Logistic regression stops when the loss is low. Given a linearly separable dataset, infinitely many hyperplanes achieve zero training error, and logistic regression returns whichever one gradient descent has reached when training stops. SVM asks a harder question: among all separating hyperplanes, which one is the most confident? The answer is the hyperplane that maximizes the gap between the two classes: the maximum-margin classifier.

Anchor dataset: Binary loan default prediction from income and credit score.

python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

# 8 samples: [income ($k), credit_score_normalized]
X = np.array([
    [25, 0.3],  [30, 0.4],  [40, 0.5],  [55, 0.6],   # default (y=-1)
    [70, 0.7],  [80, 0.8],  [90, 0.8],  [100, 0.9],  # no default (y=+1)
])
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])  # SVM uses +1/-1 labels

The Motivation: Why Maximize the Margin?

Any separating hyperplane correctly classifies the training data, but some are far more fragile than others. A hyperplane that passes close to the nearest points leaves almost no room for new data that differs slightly from the training set: a small deviation in income or credit score could flip the prediction.

The maximum-margin hyperplane maximizes the distance to the nearest training points from each class. A wider margin means more tolerance for perturbation in new data, which is a common argument for why SVM can generalize better than logistic regression on small datasets.

[Figure: scatter plot of income ($k) against credit score. Two gray dashed candidate hyperplanes are labeled "too close to red" and "too close to blue". The solid green line labeled "SVM (centered)" sits between them, flanked by two dashed green margin boundaries, with the gap marked 2/‖w‖.]

The green solid line is the SVM solution, equidistant from the nearest points of both classes. The two green dashed lines are the margin boundaries ($w^\top x + b = \pm 1$). The gap labeled $2/\|w\|$ is the margin width.
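To make "too close to one class" concrete, here is a minimal sketch that measures the distance from a candidate hyperplane to its nearest training point. The helper `geometric_margin` and the three income thresholds are hypothetical choices made for easy arithmetic. They are restricted to vertical boundaries on unscaled income, so none of them is the SVM solution itself.

```python
import numpy as np

# Anchor dataset from the post: [income ($k), credit_score_normalized]
X = np.array([
    [25, 0.3], [30, 0.4], [40, 0.5], [55, 0.6],    # default (y=-1)
    [70, 0.7], [80, 0.8], [90, 0.8], [100, 0.9],   # no default (y=+1)
])
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])


def geometric_margin(w, b, X, y):
    """Smallest signed distance from any training point to the plane w.x + b = 0.

    The result is positive only if every point lies on its correct side.
    """
    w = np.asarray(w, dtype=float)
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)


# Hypothetical candidates: boundaries of the form "income = threshold",
# i.e. w = [1, 0] and b = -threshold.
candidates = {"near defaults": 57.0, "centered": 62.5, "near non-defaults": 68.0}
margins = {
    name: geometric_margin([1.0, 0.0], -threshold, X, y)
    for name, threshold in candidates.items()
}

for name, m in margins.items():
    print(f"{name:>18}: distance to nearest point = {m:.1f} ($k of income)")
```

All three candidates separate the classes, but the centered one keeps the nearest point of each class equally far away, which is the property the maximum-margin classifier optimizes over all directions rather than just this one.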

Hard Margin SVM — Math Setup

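The separating hyperplane is $w^\top x + b = 0$. Two parallel planes define the margin:

$$w^\top x + b = +1 \qquad \text{and} \qquad w^\top x + b = -1$$

From post 03 (distance of a point from a plane): the distance from the hyperplane $w^\top x + b = 0$ to the plane $w^\top x + b = +1$ is $\frac{1}{\|w\|}$. By symmetry, the distance to the plane $w^\top x + b = -1$ is also $\frac{1}{\|w\|}$. The total margin is:

$$\text{margin} = \frac{1}{\|w\|} + \frac{1}{\|w\|} = \frac{2}{\|w\|}$$

Maximizing $\frac{2}{\|w\|}$ is equivalent to minimizing $\frac{1}{2}\|w\|^2$ (a differentiable, convex objective).

Hard margin primal problem:

$$\min_{w,\,b} \; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i\,(w^\top x_i + b) \ge 1 \quad \text{for all } i$$

The combined constraint encodes both classes simultaneously: for $y_i = +1$, it requires $w^\top x_i + b \ge 1$; for $y_i = -1$, it requires $w^\top x_i + b \le -1$.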
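The constraint can be checked numerically. scikit-learn's `SVC` has no explicit hard-margin mode, so the sketch below assumes that a very large `C` (here `1e6`, an arbitrary choice) forbids violations on this separable dataset and therefore approximates the hard-margin solution. It then evaluates $y_i(w^\top x_i + b)$ for every sample on standardized features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Anchor dataset from the post: [income ($k), credit_score_normalized]
X = np.array([
    [25, 0.3], [30, 0.4], [40, 0.5], [55, 0.6],    # default (y=-1)
    [70, 0.7], [80, 0.8], [90, 0.8], [100, 0.9],   # no default (y=+1)
])
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])

X_sc = StandardScaler().fit_transform(X)

# Assumption: C=1e6 is large enough to behave like a hard margin here.
hard_svm = SVC(kernel="linear", C=1e6)
hard_svm.fit(X_sc, y)

w = hard_svm.coef_.ravel()
b = hard_svm.intercept_[0]

# Left-hand side of the hard-margin constraint, one value per sample.
functional_margins = y * (X_sc @ w + b)
margin_width = 2 / np.linalg.norm(w)

print("y_i (w.x_i + b):", np.round(functional_margins, 3))
# The solver stops at a small tolerance, so compare against 1 loosely.
print("all constraints satisfied:", bool(np.all(functional_margins >= 1 - 1e-2)))
print(f"margin width 2/||w|| = {margin_width:.4f}")
```

The smallest value should sit at about 1 (up to solver tolerance), and every other sample should be strictly larger.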

Identifying Support Vectors

Support vectors are the training points that lie exactly on the margin boundaries — the closest points to the hyperplane. They "support" the margin: if you removed any other point, the margin wouldn't change.

Using approximate (non-optimal) weights $w$ and bias $b$ for the anchor, the functional margin $y_i(w^\top x_i + b)$ of each sample is:

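| Income ($k) | Credit | $y_i$ | $y_i(w^\top x_i + b)$ | Status |
|---|---|---|---|---|
| 25 | 0.3 | −1 | 3.75 | Far outside margin |
| 30 | 0.4 | −1 | 3.10 | Outside margin |
| 40 | 0.5 | −1 | 2.05 | Outside margin |
| 55 | 0.6 | −1 | 0.60 | Inside margin |
| 70 | 0.7 | +1 | 0.85 | Inside margin |
| 80 | 0.8 | +1 | 1.70 | Outside margin |
| 90 | 0.8 | +1 | 2.70 | Far outside margin |
| 100 | 0.9 | +1 | 3.70 | Far outside margin |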

Samples at income=55 and income=70 have $y_i(w^\top x_i + b) < 1$, so they violate the hard margin constraint. This shows that these approximate weights aren't the true optimum; the exact QP solution adjusts the weights until the nearest samples from each class sit exactly at $y_i(w^\top x_i + b) = 1$.
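A fitted model exposes its support vectors directly, which gives a way to check both claims: that support vectors sit at $y_i(w^\top x_i + b) = 1$, and that removing any other point leaves the hyperplane alone. This is a sketch under the same assumption as before, namely that a very large `C` stands in for the hard margin. The helper `fit_linear` is a hypothetical convenience wrapper, and the dropped sample (the lowest-income one) is an arbitrary choice.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Anchor dataset from the post: [income ($k), credit_score_normalized]
X = np.array([
    [25, 0.3], [30, 0.4], [40, 0.5], [55, 0.6],    # default (y=-1)
    [70, 0.7], [80, 0.8], [90, 0.8], [100, 0.9],   # no default (y=+1)
])
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])

X_sc = StandardScaler().fit_transform(X)


def fit_linear(X_train, y_train, C=1e6):
    """Fit a linear SVM; the large default C is an assumed hard-margin stand-in."""
    return SVC(kernel="linear", C=C).fit(X_train, y_train)


full = fit_linear(X_sc, y)
sv_idx = full.support_                         # row indices of the support vectors
margins = y * full.decision_function(X_sc)     # y_i * (w.x_i + b)

print("support vector rows:", X[sv_idx].tolist())
print("their y_i * f(x_i): ", np.round(margins[sv_idx], 3))

# Drop one non-support vector and refit on the same scaled features.
keep = np.arange(len(y)) != 0
reduced = fit_linear(X_sc[keep], y[keep])

same_plane = bool(
    np.allclose(full.coef_, reduced.coef_, atol=2e-2)
    and np.allclose(full.intercept_, reduced.intercept_, atol=2e-2)
)
print("hyperplane unchanged after removing a non-support vector:", same_plane)
```

The scaler is fitted once on all eight samples so that both fits see identical coordinates; refitting the scaler after dropping a row would move every point and blur the comparison.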

Soft Margin SVM — Allowing Violations

Hard margin fails when:

  1. The data is not linearly separable (classes overlap)
  2. A single outlier prevents any hard-margin solution

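Soft margin introduces slack variables $\xi_i \ge 0$, one per sample, measuring how much the margin constraint is violated:

  • $\xi_i = 0$: point correctly classified and on or outside the margin ✓
  • $0 < \xi_i < 1$: point inside the margin but on the correct side
  • $\xi_i > 1$: point on the wrong side of the hyperplane (misclassified)

Soft margin primal problem:

$$\min_{w,\,b,\,\xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i} \xi_i \quad \text{subject to} \quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 \quad \text{for all } i$$

The hyperparameter $C > 0$ controls the tradeoff between margin width and constraint violations.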
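The slack values are not returned by scikit-learn, but they can be recovered from a fitted model: at the optimum each $\xi_i$ is as small as its constraint allows, so $\xi_i = \max(0,\, 1 - y_i(w^\top x_i + b))$. The sketch below assumes `C = 0.1` (an arbitrary value, small enough that some points land inside the margin) and uses a hypothetical helper `describe` to map each slack to the three cases listed above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Anchor dataset from the post: [income ($k), credit_score_normalized]
X = np.array([
    [25, 0.3], [30, 0.4], [40, 0.5], [55, 0.6],    # default (y=-1)
    [70, 0.7], [80, 0.8], [90, 0.8], [100, 0.9],   # no default (y=+1)
])
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])

X_sc = StandardScaler().fit_transform(X)

C = 0.1  # assumed value for illustration
soft_svm = SVC(kernel="linear", C=C).fit(X_sc, y)

w = soft_svm.coef_.ravel()
f = soft_svm.decision_function(X_sc)     # w.x_i + b for each sample

# Smallest slack that satisfies y_i * f(x_i) >= 1 - xi_i with xi_i >= 0.
xi = np.maximum(0.0, 1.0 - y * f)


def describe(slack, tol=1e-3):
    """Map a slack value to the cases in the list above (tol absorbs solver noise)."""
    if slack < tol:
        return "on or outside the margin"
    if slack < 1:
        return "inside the margin, correct side"
    return "on the hyperplane or misclassified"


for (income, credit), s in zip(X, xi):
    print(f"income={income:>5.0f}, credit={credit:.1f}: xi={s:.3f} -> {describe(s)}")

# Value of the soft-margin primal objective at the fitted solution.
objective = 0.5 * (w @ w) + C * xi.sum()
print(f"0.5*||w||^2 + C*sum(xi) = {objective:.4f}")
```

Rerunning with a larger `C` makes each unit of slack more expensive, so the solver trades margin width for smaller violations.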

The C Hyperparameter

python
scaler = StandardScaler()
X_sc = scaler.fit_transform(X)

C_values = [0.01, 0.1, 1, 10, 100]
for C in C_values:
    svm = SVC(kernel='linear', C=C)
    svm.fit(X_sc, y)
    n_sv = svm.support_vectors_.shape[0]
    margin = 2 / np.linalg.norm(svm.coef_)
    print(f"C={C:>6}: n_support_vectors={n_sv}, margin=2/‖w‖={margin:.4f}")
C=  0.01: n_support_vectors=7, margin=2/‖w‖=1.8421
C=   0.1: n_support_vectors=5, margin=2/‖w‖=0.9234
C=     1: n_support_vectors=3, margin=2/‖w‖=0.4512
C=    10: n_support_vectors=2, margin=2/‖w‖=0.2341
C=   100: n_support_vectors=2, margin=2/‖w‖=0.2018

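As $C$ increases: fewer support vectors (the boundary hugs the training data more tightly) and a narrower margin. As $C$ decreases: more support vectors and a wider margin. The exact numbers depend on the data, the feature scaling, and the solver, so read the output for its trend rather than its values.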

[Figure: three panels showing the fitted hyperplane and its margin boundaries for C=0.01 (wide margin), C=1 (balanced, 3 SVs circled), and C=100 (narrow margin, 2 SVs, narrow gap).]

Left: wide margin at $C=0.01$, where points are allowed inside the margin. Center: balanced margin at $C=1$ with 3 support vectors (orange circles). Right: narrow margin at $C=100$ with only 2 support vectors, so the boundary is tight.

Support Vectors Are All That Matter

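After training, prediction for a new point $x$ depends only on the support vectors:

$$f(x) = \sum_{i \in SV} \alpha_i\, y_i\, x_i^\top x + b, \qquad \hat{y} = \operatorname{sign}\big(f(x)\big)$$

This is the dual form of SVM. The $\alpha_i$ are Lagrange multipliers, non-zero only for support vectors (all other $\alpha_i = 0$). This sparsity is what makes SVM efficient at prediction time, and the fact that training points enter only through the inner products $x_i^\top x$ is what enables the kernel trick (post 03).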
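The dual form can be reproduced from the attributes of a fitted `SVC`. In scikit-learn, `dual_coef_` stores the products $\alpha_i y_i$ for the support vectors only, `support_vectors_` stores the corresponding $x_i$, and `intercept_` stores $b$. The sketch below rebuilds $f(x)$ by hand and compares it with `decision_function`. The three new applicants and `C = 1.0` are hypothetical values chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Anchor dataset from the post: [income ($k), credit_score_normalized]
X = np.array([
    [25, 0.3], [30, 0.4], [40, 0.5], [55, 0.6],    # default (y=-1)
    [70, 0.7], [80, 0.8], [90, 0.8], [100, 0.9],   # no default (y=+1)
])
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])

scaler = StandardScaler().fit(X)
X_sc = scaler.transform(X)

C = 1.0  # assumed value for illustration
svm = SVC(kernel="linear", C=C).fit(X_sc, y)

# Hypothetical new applicants (not from the post), scaled like the training data.
X_new = scaler.transform(np.array([[60, 0.65], [45, 0.55], [85, 0.75]]))

alpha_times_y = svm.dual_coef_.ravel()    # alpha_i * y_i, support vectors only
support_vectors = svm.support_vectors_    # the x_i with alpha_i > 0
b = svm.intercept_[0]

# f(x) = sum_i alpha_i * y_i * <x_i, x> + b, summed over support vectors only.
inner_products = X_new @ support_vectors.T          # shape (n_new, n_sv)
f_manual = inner_products @ alpha_times_y + b
f_sklearn = svm.decision_function(X_new)

print("support vectors used:", len(support_vectors), "of", len(X))
print("manual  f(x):", np.round(f_manual, 4))
print("sklearn f(x):", np.round(f_sklearn, 4))
print("predictions :", np.sign(f_manual).astype(int))
```

Swapping `inner_products` for a kernel matrix between `X_new` and the support vectors is the only change needed to move from the linear case to a kernelized one.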

Hard Margin vs Soft Margin

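| Property | Hard Margin | Soft Margin |
|---|---|---|
| Works when | Data linearly separable | Any dataset |
| Allows margin violations | No | Yes (penalized by $C$) |
| Objective | Minimize $\frac{1}{2}\|w\|^2$ | Minimize $\frac{1}{2}\|w\|^2 + C\sum_i \xi_i$ |
| Constraint | $y_i(w^\top x_i + b) \ge 1$ | $y_i(w^\top x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$ |
| Outlier sensitivity | High (one outlier breaks it) | Controlled by $C$ |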

Test Your Understanding

  1. The margin is $2/\|w\|$. If you scale all features by a constant factor $k$ (i.e., multiply every $x_i$ by $k$), what happens to $w$ and the margin? Does the decision boundary change?

  2. The hard margin constraint is $y_i(w^\top x_i + b) \ge 1$. If you rescale the constraint to $y_i(w^\top x_i + b) \ge c$ for some constant $c > 0$, does this change the problem? Is the resulting hyperplane different?

  3. At $C=0.01$, there are 7 support vectors (out of 8 training samples). At $C=100$, there are only 2. Qualitatively, what happens to the model's bias and variance as $C$ decreases from 100 to 0.01?

  4. The slack variable $\xi_i$ for income=55 is nonzero (the sample is inside the margin). If you removed that sample from the training set, would the optimal hyperplane change? How do you know?

  5. In the dual form $f(x) = \sum_i \alpha_i y_i\, x_i^\top x + b$, if two support vectors from the positive class ($y_i = +1$) are at and , compute the contribution of each to $f(x)$ (ignoring $\alpha_i$ and $b$).
