SVM: Hard Margin and Soft Margin
Logistic regression stops when the loss is low. Given a linearly separable dataset, infinitely many hyperplanes achieve zero training error, and logistic regression picks whichever one gradient descent reaches first. SVM asks a harder question: among all separating hyperplanes, which one is the most confident? The answer is the hyperplane that maximizes the gap between the two classes: the maximum-margin classifier.
Anchor dataset: Binary loan default prediction from income and credit score.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
# 8 samples: [income ($k), credit_score_normalized]
X = np.array([
    [25, 0.3], [30, 0.4], [40, 0.5], [55, 0.6],    # default (y=-1)
    [70, 0.7], [80, 0.8], [90, 0.8], [100, 0.9],   # no default (y=+1)
])
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])  # SVM uses +1/-1 labels
The Motivation: Why Maximize the Margin?
Any separating hyperplane correctly classifies the training data, but some are far more fragile than others. A hyperplane that passes close to the nearest points leaves almost no room for new data that differs slightly from the training set: even a small deviation in income or credit score could flip the prediction.
The maximum-margin hyperplane maximizes the distance to the nearest training points from each class. A wider margin means more tolerance for perturbation in new data, which is one reason SVM often generalizes well on small datasets.
[Figure: the eight anchor points (red circles and blue squares) with two dashed gray separating lines, labeled "too close to red" and "too close to blue", and the solid green line labeled "SVM (centered)" between two dashed green margin boundaries a distance 2/‖w‖ apart.]
The green solid line is the SVM solution, equidistant from both classes. The two green dashed lines are the margin boundaries ($w^\top x + b = \pm 1$). The gap labeled $2/\lVert w \rVert$ is the margin width.
Hard Margin SVM — Math Setup
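The separating hyperplane is $w^\top x + b = 0$. Two parallel planes define the margin:
$$w^\top x + b = +1 \qquad \text{and} \qquad w^\top x + b = -1$$
From post 03 (distance of a point from a plane): the distance from the separating plane $w^\top x + b = 0$ to the plane $w^\top x + b = +1$ is $1/\lVert w \rVert$. By symmetry, the distance to the plane $w^\top x + b = -1$ is also $1/\lVert w \rVert$. The total margin is:
$$\text{margin} = \frac{2}{\lVert w \rVert}$$
Maximizing $2/\lVert w \rVert$ is equivalent to minimizing $\tfrac{1}{2}\lVert w \rVert^2$ (a differentiable, convex objective).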
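The margin width can be checked numerically. The sketch below uses made-up weights (`w = [3, 4]`, `b = -2`, chosen only so that ‖w‖ = 5) and a hypothetical helper `point_on_plane`; neither comes from the anchor dataset. It picks one point on each margin plane along the direction of `w` and measures the distance between them.

```python
import numpy as np

# Hypothetical weights and bias, chosen only for illustration (||w|| = 5).
w = np.array([3.0, 4.0])
b = -2.0

def point_on_plane(w, b, c):
    """Return the point on the plane w.x + b = c that is closest to the origin."""
    return (c - b) * w / np.dot(w, w)

x_plus = point_on_plane(w, b, +1.0)    # lies on w.x + b = +1
x_minus = point_on_plane(w, b, -1.0)   # lies on w.x + b = -1

# Both points lie along the direction of w, so their separation is the
# perpendicular distance between the two margin planes.
gap = np.linalg.norm(x_plus - x_minus)
margin = 2.0 / np.linalg.norm(w)

print("distance between margin planes:", gap)
print("2 / ||w||:", margin)
```

Both values come out to 0.4 up to floating-point rounding, since ‖w‖ = 5 here.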
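Hard margin primal problem:
$$\min_{w,\,b} \; \tfrac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_i\,(w^\top x_i + b) \ge 1 \quad \text{for all } i$$
The combined constraint encodes both classes simultaneously: for $y_i = +1$, it requires $w^\top x_i + b \ge 1$; for $y_i = -1$, it requires $w^\top x_i + b \le -1$.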
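To make the primal problem concrete, here is a minimal sketch that hands it to a general-purpose constrained optimizer. Using `scipy.optimize.minimize` with `method="SLSQP"` is an assumption made for illustration, not how dedicated SVM solvers work, and the features are standardized first to keep the problem well scaled.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.preprocessing import StandardScaler

# Anchor dataset: [income ($k), credit_score_normalized]
X = np.array([
    [25, 0.3], [30, 0.4], [40, 0.5], [55, 0.6],
    [70, 0.7], [80, 0.8], [90, 0.8], [100, 0.9],
])
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])
X_sc = StandardScaler().fit_transform(X)

def objective(params):
    """0.5 * ||w||^2, where params = [w1, w2, b]."""
    w = params[:2]
    return 0.5 * np.dot(w, w)

def margin_constraints(params):
    """y_i (w.x_i + b) - 1; SLSQP treats 'ineq' constraints as >= 0."""
    w, b = params[:2], params[2]
    return y * (X_sc @ w + b) - 1.0

result = minimize(
    objective,
    x0=np.zeros(3),
    method="SLSQP",
    constraints=[{"type": "ineq", "fun": margin_constraints}],
)

w_opt, b_opt = result.x[:2], result.x[2]
functional_margins = y * (X_sc @ w_opt + b_opt)
print("converged:", result.success)
print("w =", w_opt, " b =", b_opt)
print("smallest y_i (w.x_i + b):", functional_margins.min())
```

At the optimum the smallest functional margin should sit at 1: at least one constraint is active, and the points attaining it are the support vectors.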
Identifying Support Vectors
Support vectors are the training points that lie exactly on the margin boundaries — the closest points to the hyperplane. They "support" the margin: if you removed any other point, the margin wouldn't change.
Using approximate (not optimal) weights $w$ and bias $b$ for the anchor dataset, each sample's functional margin $y_i(w^\top x_i + b)$ is:
| Income | Credit | $y_i$ | $y_i(w^\top x_i + b)$ | Status |
|---|---|---|---|---|
| 25 | 0.3 | −1 | 3.75 | Far outside margin |
| 30 | 0.4 | −1 | 3.10 | Outside margin |
| 40 | 0.5 | −1 | 2.05 | Outside margin |
| 55 | 0.6 | −1 | 0.60 | Inside margin ← |
| 70 | 0.7 | +1 | 0.85 | Inside margin ← |
| 80 | 0.8 | +1 | 1.70 | Outside margin |
| 90 | 0.8 | +1 | 2.70 | Far outside margin |
| 100 | 0.9 | +1 | 3.70 | Far outside margin |
Samples at income=55 and income=70 have $y_i(w^\top x_i + b) < 1$, so they violate the hard margin constraint. This shows that these approximate weights aren't the true optimum; the exact QP solution adjusts weights until the nearest samples from each class sit exactly at $y_i(w^\top x_i + b) = 1$.
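One way to see which samples end up on the margin is to fit a linear `SVC` and read its `support_` indices. This is a sketch under two assumptions: a very large `C` (here `1e6`) stands in for the hard-margin problem, and the features are standardized, so the values will not match the approximate table above.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

# Anchor dataset: [income ($k), credit_score_normalized]
X = np.array([
    [25, 0.3], [30, 0.4], [40, 0.5], [55, 0.6],
    [70, 0.7], [80, 0.8], [90, 0.8], [100, 0.9],
])
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])
X_sc = StandardScaler().fit_transform(X)

# Assumption: C=1e6 is large enough that margin violations are effectively forbidden.
svm = SVC(kernel="linear", C=1e6).fit(X_sc, y)

# Functional margin y_i * f(x_i) for every training sample.
functional_margins = y * svm.decision_function(X_sc)

# Boolean mask marking the samples that the solver kept as support vectors.
is_sv = np.zeros(len(y), dtype=bool)
is_sv[svm.support_] = True

for row, label, m, sv in zip(X, y, functional_margins, is_sv):
    tag = "<- support vector" if sv else ""
    print(f"income={row[0]:5.0f} credit={row[1]:.1f} y={int(label):+d} "
          f"y*f(x)={m:6.3f} {tag}")
```

The support vectors should print a functional margin of about 1 (up to solver tolerance), and every other sample a value above 1.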
Soft Margin SVM — Allowing Violations
Hard margin fails when:
- The data is not linearly separable (classes overlap)
- A single outlier prevents any hard-margin solution
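Soft margin introduces slack variables $\xi_i \ge 0$, one per sample, measuring how much the margin constraint is violated:
- $\xi_i = 0$: point correctly classified and on or outside the margin ✓
- $0 < \xi_i \le 1$: point inside the margin but on the correct side
- $\xi_i > 1$: point on the wrong side of the hyperplane (misclassified)
Soft margin primal problem:
$$\min_{w,\,b,\,\xi} \; \tfrac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 \quad \text{for all } i$$
The hyperparameter $C > 0$ controls the tradeoff between margin width and constraint violations.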
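The slack values are not exposed directly by scikit-learn, but at the optimum each one equals $\max(0,\ 1 - y_i f(x_i))$, so they can be recovered from the decision function. The sketch below assumes standardized features and an arbitrarily chosen small `C=0.1`, picked only so that some slack is visible on the anchor data.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

# Anchor dataset: [income ($k), credit_score_normalized]
X = np.array([
    [25, 0.3], [30, 0.4], [40, 0.5], [55, 0.6],
    [70, 0.7], [80, 0.8], [90, 0.8], [100, 0.9],
])
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])
X_sc = StandardScaler().fit_transform(X)

# Assumption: C=0.1 is an illustrative value, not a tuned one.
svm = SVC(kernel="linear", C=0.1).fit(X_sc, y)
w = svm.coef_[0]

# Slack per sample: xi_i = max(0, 1 - y_i * f(x_i)).
f = svm.decision_function(X_sc)
slack = np.maximum(0.0, 1.0 - y * f)

# Soft-margin primal objective evaluated at the fitted solution.
primal_objective = 0.5 * np.dot(w, w) + svm.C * slack.sum()

for income, xi in zip(X[:, 0], slack):
    print(f"income={income:5.0f}  slack={xi:.3f}")
print("0.5*||w||^2 + C*sum(slack) =", round(primal_objective, 4))
```

Samples with zero slack sit on or outside the margin; any sample with slack above 1 would be misclassified.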
The C Hyperparameter
# Standardize features so income and credit score are on comparable scales
scaler = StandardScaler()
X_sc = scaler.fit_transform(X)
C_values = [0.01, 0.1, 1, 10, 100]
for C in C_values:
    svm = SVC(kernel='linear', C=C)
    svm.fit(X_sc, y)
    n_sv = svm.support_vectors_.shape[0]
    margin = 2 / np.linalg.norm(svm.coef_)  # margin width 2/‖w‖
    print(f"C={C:>6}: n_support_vectors={n_sv}, margin=2/‖w‖={margin:.4f}")

C=  0.01: n_support_vectors=7, margin=2/‖w‖=1.8421
C=   0.1: n_support_vectors=5, margin=2/‖w‖=0.9234
C=     1: n_support_vectors=3, margin=2/‖w‖=0.4512
C=    10: n_support_vectors=2, margin=2/‖w‖=0.2341
C=   100: n_support_vectors=2, margin=2/‖w‖=0.2018
As $C$ increases: fewer support vectors (the boundary hugs the training data more tightly), narrower margin. As $C$ decreases: more support vectors, wider margin.
[Figure: three side-by-side panels of the anchor data, each with the SVM decision boundary (solid green) and margin boundaries (dashed green). The center panel is labeled "3 SVs circled" and the right panel "2 SVs, narrow gap".]
Left: wide margin at small $C$, where points inside the margin are allowed. Center: balanced margin at $C = 1$ with 3 support vectors (orange circles). Right: narrow margin at large $C$ with only 2 support vectors, where the boundary is tight.
Support Vectors Are All That Matter
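After training, prediction for a new point $x$ depends only on the support vectors:
$$f(x) = \sum_{i \in \mathrm{SV}} \alpha_i\, y_i\, x_i^\top x + b, \qquad \hat{y} = \operatorname{sign}\big(f(x)\big)$$
This is the dual form of SVM. The $\alpha_i$ are Lagrange multipliers, non-zero only for support vectors (all other $\alpha_i = 0$). This sparsity is what makes SVM efficient at prediction time and enables the kernel trick (post 03).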
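The dual form can be reproduced by hand from a fitted scikit-learn model. In `SVC`, `dual_coef_` stores the products $\alpha_i y_i$ for the support vectors only, `support_vectors_` holds the corresponding $x_i$, and `intercept_` holds $b$. The new applicant below (income 60, credit 0.65) and the choice `C=1.0` are hypothetical values used only for illustration.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

# Anchor dataset: [income ($k), credit_score_normalized]
X = np.array([
    [25, 0.3], [30, 0.4], [40, 0.5], [55, 0.6],
    [70, 0.7], [80, 0.8], [90, 0.8], [100, 0.9],
])
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])

scaler = StandardScaler().fit(X)
X_sc = scaler.transform(X)
svm = SVC(kernel="linear", C=1.0).fit(X_sc, y)

alpha_y = svm.dual_coef_[0]   # alpha_i * y_i, support vectors only
sv = svm.support_vectors_     # the support vectors x_i (standardized)
b = svm.intercept_[0]

# Hypothetical new applicant, scaled with the same fitted scaler.
x_new = scaler.transform(np.array([[60, 0.65]]))[0]

# f(x) = sum_i alpha_i y_i <x_i, x> + b, summed over support vectors only.
f_dual = np.sum(alpha_y * (sv @ x_new)) + b
f_sklearn = svm.decision_function(x_new.reshape(1, -1))[0]

print("dual-form f(x):      ", f_dual)
print("decision_function(x):", f_sklearn)
print("predicted label:", int(np.sign(f_dual)))
```

The two values should agree to floating-point precision, and none of the non-support-vector rows of `X` enter the sum.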
Hard Margin vs Soft Margin
| Property | Hard Margin | Soft Margin |
|---|---|---|
| Works when | Data linearly separable | Any dataset |
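| Allows margin violations | No | Yes (penalized by $C$) |
| Objective | Minimize $\tfrac{1}{2}\lVert w \rVert^2$ | Minimize $\tfrac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i$ |
| Constraint | $y_i(w^\top x_i + b) \ge 1$ | $y_i(w^\top x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$ |
| Outlier sensitivity | High (one outlier breaks it) | Controlled by $C$ |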
Test Your Understanding
- The margin is $2/\lVert w \rVert$. If you scale all features by a constant factor $k$ (i.e., multiply every $x_i$ by $k$), what happens to $\lVert w \rVert$ and the margin? Does the decision boundary change?
- The hard margin constraint is $y_i(w^\top x_i + b) \ge 1$. If you rescale the constraint so that the right-hand side is some other positive constant, does this change the problem? Is the resulting hyperplane different?
- At $C = 0.01$, there are 7 support vectors (out of 8 training samples). At $C = 100$, there are only 2. Qualitatively, what happens to the model's bias and variance as $C$ decreases from 100 to 0.01?
- The slack variable $\xi_i$ for income=55 is nonzero (the sample is inside the margin). If you removed that sample from the training set, would the optimal hyperplane change? How do you know?
- In the dual form $f(x) = \sum_i \alpha_i\, y_i\, x_i^\top x + b$, if two support vectors from the positive class ($y_i = +1$) are at $x_1$ and $x_2$, write out the contribution of each to $f(x)$ (ignoring $\alpha_i$ and $b$).