SVM: Hard Margin and Soft Margin

Jun 26, 2026 • 8 min read • By Mohammed Vasim

Machine Learning, AI, Data Science

Logistic regression is content with any boundary that drives the loss low. Given a linearly separable dataset, infinitely many hyperplanes achieve zero training error, and logistic regression settles on whichever one gradient descent happens to reach. SVM asks a harder question: among all separating hyperplanes, which one is the most confident? The answer is the hyperplane that maximizes the gap between the two classes: the maximum-margin classifier.

Anchor dataset: Binary loan default prediction from income and credit score.

python

import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

# 8 samples: [income ($k), credit_score_normalized]
X = np.array([
    [25, 0.3],  [30, 0.4],  [40, 0.5],  [55, 0.6],   # default (y=-1)
    [70, 0.7],  [80, 0.8],  [90, 0.8],  [100, 0.9],  # no default (y=+1)
])
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])  # SVM uses +1/-1 labels

The Motivation: Why Maximize the Margin?

Any separating hyperplane classifies the training data correctly, but some are far more fragile than others. A hyperplane that passes close to the nearest points leaves almost no room for new data that differs slightly from the training set: a small deviation in income or credit score can be enough to flip the prediction.

The maximum-margin hyperplane maximizes the distance to the nearest training points from each class. A wider margin means more tolerance for perturbation in new data, which is one reason SVM often generalizes well, even on small datasets.

[Figure: red circles (default) and blue squares (no default) with three candidate separating lines. Two gray dashed lines are labeled "too close to red" and "too close to blue"; the solid green line is labeled "SVM (centered)", with green dashed margin boundaries on either side and the gap between them marked 2/‖w‖.]

The green solid line is the SVM solution — equidistant from both classes. The two green dashed lines are the margin boundaries ( $w \cdot x + b = \pm 1$ ). The gap labeled $2/∥ w ∥$ is the margin width.

Hard Margin SVM — Math Setup

The separating hyperplane is $w \cdot x + b = 0$ . Two parallel planes define the margin:

$w \cdot x + b = +1 \quad \text{(positive margin boundary)}$

$w \cdot x + b = -1 \quad \text{(negative margin boundary)}$

From post 03 (distance of a point from a plane): the distance from the separating hyperplane $w \cdot x + b = 0$ to the $+1$ plane is $|+1 - 0| / ∥ w ∥ = 1/∥ w ∥$. By symmetry, the distance to the $-1$ plane is also $1/∥ w ∥$. The total margin is:

$\text{margin} = \frac{2}{∥ w ∥}$

Maximizing $2/∥ w ∥$ is equivalent to minimizing $∥ w ∥$, and therefore to minimizing $∥ w ∥^{2}/2$, which is a differentiable, convex objective.

Hard margin primal problem:

$\min_{w, b} \ \frac{1}{2} ∥ w ∥^{2} \quad \text{subject to} \quad y_{i} (w \cdot x_{i} + b) \geq 1 \ \text{ for all } i$

The combined constraint $y_{i} (w \cdot x_{i} + b) \geq 1$ encodes both classes simultaneously: for $y_{i} = + 1$ , it requires $w \cdot x_{i} + b \geq 1$ ; for $y_{i} = - 1$ , it requires $w \cdot x_{i} + b \leq - 1$ .
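The margin formula and the combined constraint can be checked numerically. The sketch below uses a made-up 2-D weight vector and two toy points (none of these values come from the anchor dataset), and the helper names `margin_width` and `satisfies_hard_margin` are illustrative only.

```python
import numpy as np

def margin_width(w):
    """Distance between the planes w.x + b = +1 and w.x + b = -1."""
    return 2.0 / np.linalg.norm(w)

def satisfies_hard_margin(w, b, X, y):
    """True if every sample meets y_i * (w.x_i + b) >= 1."""
    return bool(np.all(y * (X @ w + b) >= 1.0))

# Hypothetical toy values: w = [3, 4] has norm 5, so the margin should be 2/5.
w = np.array([3.0, 4.0])
b = -1.0
X_toy = np.array([[1.0, 1.0], [-1.0, 0.0]])
y_toy = np.array([1, -1])

width = margin_width(w)
feasible = satisfies_hard_margin(w, b, X_toy, y_toy)

# Geometric cross-check: take one point on each margin plane, both lying
# along the direction of w, and measure the distance between them.
p_plus = (1.0 - b) * w / (w @ w)    # satisfies w.p + b = +1
p_minus = (-1.0 - b) * w / (w @ w)  # satisfies w.p + b = -1
gap = np.linalg.norm(p_plus - p_minus)

print(f"2/||w|| = {width:.3f}, measured gap = {gap:.3f}, feasible = {feasible}")
```

The measured gap between the two planes agrees with $2/∥ w ∥$, and the single inequality $y_{i} (w \cdot x_{i} + b) \geq 1$ handles both classes without a case split.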

Identifying Support Vectors

Support vectors are the training points that lie exactly on the margin boundaries — the closest points to the hyperplane. They "support" the margin: if you removed any other point, the margin wouldn't change.

Using approximate weights $w = [0.08, 2.5]$ , $b = - 6.5$ for the anchor:

Income ($k)	Credit	$y_{i}$	$f_{i} = w \cdot x_{i} + b$	$y_{i} f_{i}$	Status
25	0.3	−1	$0.08 (25) + 2.5 (0.3) - 6.5 = - 3.75$	3.75	Far outside margin
30	0.4	−1	$0.08 (30) + 2.5 (0.4) - 6.5 = - 3.10$	3.10	Outside margin
40	0.5	−1	$0.08 (40) + 2.5 (0.5) - 6.5 = - 2.05$	2.05	Outside margin
55	0.6	−1	$0.08 (55) + 2.5 (0.6) - 6.5 = - 0.60$	0.60	Inside margin ←
70	0.7	+1	$0.08 (70) + 2.5 (0.7) - 6.5 = + 0.85$	0.85	Inside margin ←
80	0.8	+1	$0.08 (80) + 2.5 (0.8) - 6.5 = + 1.90$	1.90	Outside margin
90	0.8	+1	$0.08 (90) + 2.5 (0.8) - 6.5 = + 2.70$	2.70	Far outside margin
100	0.9	+1	$0.08 (100) + 2.5 (0.9) - 6.5 = + 3.75$	3.75	Far outside margin

Samples at income=55 and income=70 have $y_{i} f_{i} < 1$, so they violate the hard margin constraint. These approximate weights are therefore not even feasible for the hard-margin problem, let alone optimal; the exact QP solution adjusts the weights until the nearest samples from each class sit exactly at $y_{i} f_{i} = 1$.
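The per-sample arithmetic is easy to reproduce with NumPy. This is a minimal sketch that reuses the anchor data on its raw (unscaled) features together with the approximate $w$ and $b$ quoted above; the variable names `yf` and `violating` are illustrative.

```python
import numpy as np

# Anchor dataset: [income ($k), normalized credit score]
X = np.array([
    [25, 0.3], [30, 0.4], [40, 0.5], [55, 0.6],    # default (y = -1)
    [70, 0.7], [80, 0.8], [90, 0.8], [100, 0.9],   # no default (y = +1)
])
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])

# Approximate weights from the text (not the exact QP solution)
w = np.array([0.08, 2.5])
b = -6.5

f = X @ w + b    # decision values f_i = w.x_i + b
yf = y * f       # functional margins y_i * f_i

# Samples with y_i * f_i < 1 break the hard-margin constraint
violating = X[yf < 1.0]

for (income, credit), fi, m in zip(X, f, yf):
    status = "inside margin" if m < 1.0 else "on or outside margin"
    print(f"income={income:>5.0f}  credit={credit:.1f}  f={fi:+.2f}  y*f={m:.2f}  {status}")
```

Only the two samples nearest the boundary, income=55 and income=70, end up in `violating`.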

Soft Margin SVM — Allowing Violations

Hard margin fails when:

The data is not linearly separable (classes overlap)
A single outlier prevents any hard-margin solution

Soft margin introduces slack variables $ξ_{i} \geq 0$ — one per sample — measuring how much the margin constraint is violated:

$ξ_{i} = 0$ : point correctly classified and on or outside the margin
$0 < ξ_{i} < 1$ : point inside the margin but on the correct side of the hyperplane
$ξ_{i} \geq 1$ : point on the hyperplane ( $ξ_{i} = 1$ ) or on the wrong side of it ( $ξ_{i} > 1$ , misclassified)

Soft margin primal problem:

$\min_{w, b, ξ} \ \frac{1}{2} ∥ w ∥^{2} + C \sum_{i} ξ_{i}$

$\text{subject to} \quad y_{i} (w \cdot x_{i} + b) \geq 1 - ξ_{i}, \quad ξ_{i} \geq 0 \ \text{ for all } i$

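The hyperparameter $C$ controls the tradeoff between margin width and constraint violations: a large $C$ penalizes violations heavily, while a small $C$ tolerates them in exchange for a wider margin.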
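For fixed $w$ and $b$, the smallest slack that satisfies both constraints is $ξ_{i} = \max(0, 1 - y_{i} (w \cdot x_{i} + b))$. The sketch below evaluates the slacks and the soft-margin objective on the anchor data, again using the approximate weights from the earlier section rather than an optimized solution. The helper names `slack` and `soft_margin_objective` and the choice $C = 1$ are assumptions made for illustration.

```python
import numpy as np

def slack(w, b, X, y):
    """Smallest feasible slack per sample: max(0, 1 - y_i * (w.x_i + b))."""
    return np.maximum(0.0, 1.0 - y * (X @ w + b))

def soft_margin_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * sum of slacks."""
    return 0.5 * (w @ w) + C * slack(w, b, X, y).sum()

X = np.array([
    [25, 0.3], [30, 0.4], [40, 0.5], [55, 0.6],    # default (y = -1)
    [70, 0.7], [80, 0.8], [90, 0.8], [100, 0.9],   # no default (y = +1)
])
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])

w = np.array([0.08, 2.5])   # approximate weights from the text
b = -6.5

xi = slack(w, b, X, y)
objective = soft_margin_objective(w, b, X, y, C=1.0)

print("slacks:", np.round(xi, 2))
print(f"objective at C=1: {objective:.4f}")
```

Only the two samples inside the margin pick up nonzero slack (0.40 for income=55 and 0.15 for income=70), and both values fall in the range $0 < ξ_{i} < 1$ because neither point is misclassified.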

The C Hyperparameter

python

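# Standardize both features so that income (tens of $k) and the
# credit score (0 to 1) are on comparable scales before fitting.
scaler = StandardScaler()
X_sc = scaler.fit_transform(X)

C_values = [0.01, 0.1, 1, 10, 100]
for C in C_values:
    svm = SVC(kernel='linear', C=C)
    svm.fit(X_sc, y)
    n_sv = svm.support_vectors_.shape[0]      # number of support vectors
    margin = 2 / np.linalg.norm(svm.coef_)    # margin width 2/‖w‖ in scaled feature space
    print(f"C={C:>6}: n_support_vectors={n_sv}, margin=2/‖w‖={margin:.4f}")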

Illustrative output (rerun the snippet for exact values):

C=  0.01: n_support_vectors=7, margin=2/‖w‖=1.8421
C=   0.1: n_support_vectors=5, margin=2/‖w‖=0.9234
C=     1: n_support_vectors=3, margin=2/‖w‖=0.4512
C=    10: n_support_vectors=2, margin=2/‖w‖=0.2341
C=   100: n_support_vectors=2, margin=2/‖w‖=0.2018

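As $C$ increases, violations become more expensive: the margin narrows and fewer points end up as support vectors, so the boundary hugs the training data more tightly. As $C$ decreases, the margin widens and more points fall on or inside it, so more of them become support vectors.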

[Figure: three panels of the same data, each showing the SVM boundary (solid green) and its margin boundaries (dashed green). The center panel is labeled "3 SVs circled" and the right panel "2 SVs, narrow gap"; support vectors are circled in orange.]

Left: wide margin at $C = 0.01$ — points inside the margin are allowed. Center: balanced margin at $C = 1$ with 3 support vectors (orange circles). Right: narrow margin at $C = 100$ with only 2 support vectors — the boundary is tight.

Support Vectors Are All That Matter

After training, prediction for a new point $x_{\text{new}}$ depends only on the support vectors:

$f (x_{\text{new}}) = \sum_{i \in \text{SV}} α_{i} y_{i} (x_{i} \cdot x_{\text{new}}) + b$

This is the dual form of SVM, and the predicted class is the sign of $f (x_{\text{new}})$. The $α_{i}$ are Lagrange multipliers, non-zero only for support vectors (all other $α_{i} = 0$). This sparsity keeps prediction cheap, and because the training points enter only through dot products, the dual form is also what enables the kernel trick (post 03).
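To make the dual form concrete, here is a small NumPy sketch on the raw (unscaled) anchor data. It rests on a strong assumption, stated again in the code: that the closest opposite-class pair, income=55 and income=70, are the only support vectors. With exactly two support vectors the hard-margin multipliers have the closed form $α = 2 / ∥ x_{+} - x_{-} ∥^{2}$; in general the $α_{i}$ come from a QP solver. The function name `decision_dual` is illustrative.

```python
import numpy as np

X = np.array([
    [25, 0.3], [30, 0.4], [40, 0.5], [55, 0.6],    # default (y = -1)
    [70, 0.7], [80, 0.8], [90, 0.8], [100, 0.9],   # no default (y = +1)
])
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])

# ASSUMPTION: the closest opposite-class pair are the only support vectors.
x_neg = np.array([55.0, 0.6])   # y = -1
x_pos = np.array([70.0, 0.7])   # y = +1
sv = np.vstack([x_neg, x_pos])
sv_y = np.array([-1.0, 1.0])

# Two-support-vector closed form (hard margin): both multipliers are equal.
diff = x_pos - x_neg
alpha = np.full(2, 2.0 / (diff @ diff))

# Primal weights recovered from the dual: w = sum_i alpha_i * y_i * x_i.
w = (alpha * sv_y) @ sv
# b puts the boundary midway between the two support vectors.
b = -0.5 * (w @ (x_pos + x_neg))

def decision_dual(x_new):
    """f(x) = sum over support vectors of alpha_i * y_i * (x_i . x) + b."""
    return np.sum(alpha * sv_y * (sv @ x_new)) + b

f_all = np.array([decision_dual(x) for x in X])

# The assumption holds only if every sample satisfies y_i * f_i >= 1.
assumption_ok = bool(np.all(y * f_all >= 1.0 - 1e-9))
print("support vector values:", decision_dual(x_neg), decision_dual(x_pos))
print("all constraints satisfied:", assumption_ok)
```

The two support vectors land on $f = -1$ and $f = +1$, and every other sample satisfies the constraint with room to spare, which is what justifies the two-support-vector assumption here. Because the features are unscaled, income dominates $w$ in this sketch.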

Hard Margin vs Soft Margin

Property	Hard Margin	Soft Margin
Works when	Data linearly separable	Any dataset
Allows margin violations	No	Yes (penalized by $C$ )
Objective	Minimize $∥ w ∥^{2} /2$	Minimize $∥ w ∥^{2} /2 + C \sum ξ_{i}$
Constraint	$y_{i} (w \cdot x_{i} + b) \geq 1$	$y_{i} (w \cdot x_{i} + b) \geq 1 - ξ_{i}$
Outlier sensitivity	High (one outlier breaks it)	Controlled by $C$

Test Your Understanding

The margin is $2/∥ w ∥$ . If you scale all features by a constant factor $k$ (i.e., multiply $X$ by $k$ ), what happens to $∥ w ∥$ and the margin? Does the decision boundary change?
The hard margin constraint is $y_{i} (w \cdot x_{i} + b) \geq 1$ . If you rescale the constraint to $y_{i} (w \cdot x_{i} + b) \geq 2$ , does this change the problem? Is the resulting hyperplane different?
At $C = 0.01$ , there are 7 support vectors (out of 8 training samples). At $C = 100$ , there are only 2. Qualitatively, what happens to the model's bias and variance as $C$ decreases from 100 to 0.01?
The slack variable $ξ_{i}$ for income=55 is nonzero (the sample is inside the margin). If you removed that sample from the training set, would the optimal hyperplane change? How do you know?
In the dual form $f (x) = \sum_{i \in \text{SV}} α_{i} y_{i} (x_{i} \cdot x) + b$ , suppose the two support vectors are $x_{1} = [55, 0.6]$ (negative class, $y = - 1$ ) and $x_{2} = [70, 0.7]$ (positive class, $y = + 1$ ). Compute the contribution $y_{i} (x_{i} \cdot x)$ of each to $f ([60, 0.65])$ (ignoring $α_{i}$ and $b$ ).