~/blog

SVM Kernels

Jun 26, 2026•8 min read•By Mohammed Vasim

Machine LearningAIData Science

The linear SVM finds the maximum-margin hyperplane in the original feature space. When classes aren't linearly separable — like the XOR pattern — no hyperplane works, no matter how wide a margin you allow. The kernel trick solves this without ever explicitly computing a high-dimensional transformation.

Anchor dataset: XOR classification. Four points in the positive class, four in the negative class, arranged so no line can separate them.

python

import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

X = np.array([
    [ 1,  1], [ 2,  2], [-1, -1], [-2, -2],   # positive class: same sign
    [ 1, -1], [ 2, -2], [-1,  1], [-2,  2],   # negative class: opposite sign
])
y = np.array([1, 1, 1, 1, -1, -1, -1, -1])

svm_linear = SVC(kernel='linear', C=1.0)
svm_linear.fit(X, y)
train_acc = svm_linear.score(X, y)
print(f"Linear SVM train accuracy on XOR: {train_acc:.2f}")

Linear SVM train accuracy on XOR: 0.50

Linear SVM gets only 50% accuracy — equivalent to random guessing. The XOR pattern requires a nonlinear boundary.

Why Linear Fails on XOR

<text x="265" y="270" text-anchor="middle" font-size="11" fill="#334155">x₁</text>
<text x="25" y="140" text-anchor="middle" font-size="11" fill="#334155" transform="rotate(-90,25,140)">x₂</text>

<circle cx="308" cy="77" r="8" fill="#3b82f6" stroke="#2563eb" stroke-width="1.5"/>
<circle cx="366" cy="39" r="8" fill="#3b82f6" stroke="#2563eb" stroke-width="1.5"/>
<circle cx="192" cy="193" r="8" fill="#3b82f6" stroke="#2563eb" stroke-width="1.5"/>
<circle cx="134" cy="231" r="8" fill="#3b82f6" stroke="#2563eb" stroke-width="1.5"/>
<text x="380" y="30" font-size="9" fill="#3b82f6">y=+1</text>

<rect x="300" y="185" width="14" height="14" fill="#ef4444" stroke="#dc2626" stroke-width="1.5"/>
<rect x="358" y="222" width="14" height="14" fill="#ef4444" stroke="#dc2626" stroke-width="1.5"/>
<rect x="184" y="65" width="14" height="14" fill="#ef4444" stroke="#dc2626" stroke-width="1.5"/>
<rect x="126" y="27" width="14" height="14" fill="#ef4444" stroke="#dc2626" stroke-width="1.5"/>
<text x="375" y="220" font-size="9" fill="#ef4444">y=−1</text>

<text x="320" y="220" font-size="9" fill="#64748b">(1,−1)</text>
<text x="320" y="70" font-size="9" fill="#64748b">(1,1)</text>
<text x="140" y="220" font-size="9" fill="#64748b">(−1,1)</text>
<text x="140" y="70" font-size="9" fill="#64748b">(−1,−1)</text>

<text x="68" y="160" font-size="8" fill="#94a3b8">Positive class: same sign quadrants</text>
<text x="68" y="172" font-size="8" fill="#94a3b8">Negative class: opposite sign quadrants</text>

Positive class sits in Q1 and Q3 (same-sign quadrants); negative class in Q2 and Q4. Any line that separates Q1 from Q4 fails to separate Q3 from Q2 — the pattern wraps around.

The Feature Map Solution

Instead of drawing a line in 2D, map to a 3D space where the classes become linearly separable. The feature map $ϕ : R^{2} \to R^{3}$ :

$ϕ (x_{1}, x_{2}) = (x_{1}^{2}, x_{2}^{2}, x_{1} x_{2})$

The third component $z_{3} = x_{1} x_{2}$ captures the sign interaction — positive when $x_{1}, x_{2}$ have the same sign, negative otherwise.

Point	$(x_{1}, x_{2})$	$y$	$z_{1} = x_{1}^{2}$	$z_{2} = x_{2}^{2}$	$z_{3} = x_{1} x_{2}$
(1,1)	+1	+1	1	1	+1
(2,2)	+1	+1	4	4	+4
(−1,−1)	+1	+1	1	1	+1
(−2,−2)	+1	+1	4	4	+4
(1,−1)	−1	−1	1	1	−1
(2,−2)	−1	−1	4	4	−4
(−1,1)	−1	−1	1	1	−1
(−2,2)	−1	−1	4	4	−4

In the lifted space, all positive class points have $z_{3} > 0$ and all negative class points have $z_{3} < 0$ . The hyperplane $z_{3} = 0$ (i.e., $x_{1} x_{2} = 0$ ) separates them perfectly.

The Kernel Trick

Computing $ϕ (x_{i})$ explicitly then taking dot products is expensive — the mapped space can be millions of dimensions. The kernel trick shows that what appears in the dual objective is only the dot product $ϕ (x_{i}) \cdot ϕ (x_{j})$ , and this inner product can often be computed without ever computing $ϕ (x_{i})$ explicitly.

A kernel function is:

$K (x_{i}, x_{j}) = ϕ (x_{i}) \cdot ϕ (x_{j})$

computed directly in the original space. For the polynomial kernel of degree 2:

$K (x_{i}, x_{j}) = (x_{i} \cdot x_{j} + 1)^{2}$

Expanding: $(x_{i} \cdot x_{j} + 1)^{2} = (x_{i 1} x_{j 1} + x_{i 2} x_{j 2})^{2} + 2 (x_{i 1} x_{j 1} + x_{i 2} x_{j 2}) + 1$

This equals $ϕ^{'} (x_{i}) \cdot ϕ^{'} (x_{j})$ for the feature map $ϕ^{'} (x) = (x_{1}^{2}, x_{2}^{2}, 2 x_{1} x_{2}, 2 x_{1}, 2 x_{2}, 1)$ — without ever computing the 6-dimensional vector.

Polynomial Kernel Values on the Anchor

$K (x_{i}, x_{j}) = (x_{i} \cdot x_{j} + 1)^{2}$

Computing for representative pairs:

Pair	$x_{i} \cdot x_{j}$	$K (x_{i}, x_{j}) = (\cdot + 1)^{2}$	Same class?
(1,1) vs (2,2)	$1 (2) + 1 (2) = 4$	$(4 + 1)^{2} = 25$	Yes (+1)
(1,1) vs (1,−1)	$1 (1) + 1 (- 1) = 0$	$(0 + 1)^{2} = 1$	No
(1,1) vs (−1,−1)	$1 (- 1) + 1 (- 1) = - 2$	$(- 2 + 1)^{2} = 1$	Yes (+1)
(1,−1) vs (2,−2)	$1 (2) + (- 1) (- 2) = 4$	$(4 + 1)^{2} = 25$	Yes (−1)
(1,1) vs (−2,2)	$1 (- 2) + 1 (2) = 0$	$(0 + 1)^{2} = 1$	No

Same-class pairs share high kernel values (25) when features have the same magnitude. Cross-class pairs get $K = 1$ — the offset from the constant term only, no sign interaction.

The RBF Kernel

The Radial Basis Function (Gaussian) kernel is the most commonly used:

$K (x_{i}, x_{j}) = exp (- γ ∥ x_{i} - x_{j} ∥^{2})$

$γ$ controls how quickly similarity decays with distance
As $γ \to 0$ : all points are equally similar (underfitting)
As $γ \to \infty$ : only identical points are similar (overfitting)
RBF corresponds to an infinite-dimensional feature map — it can represent any smooth decision boundary

RBF kernel values for $γ = 0.5$ on the anchor:

Pair	$∥ x_{i} - x_{j} ∥^{2}$	$K = e^{- 0.5 \cdot d^{2}}$	Same class?
(1,1) vs (2,2)	$(1)^{2} + (1)^{2} = 2$	$e^{- 1.0} = 0.368$	Yes (+1)
(1,1) vs (1,−1)	$0 + (2)^{2} = 4$	$e^{- 2.0} = 0.135$	No
(1,1) vs (−1,−1)	$(2)^{2} + (2)^{2} = 8$	$e^{- 4.0} = 0.018$	Yes (+1)

The (1,1) vs (−1,−1) pair gets $K = 0.018$ — even though they're in the same class, they're far apart in distance. The kernel doesn't know class labels; it only measures similarity by proximity.

$γ$ Effect on the Decision Boundary

python

import matplotlib
matplotlib.use('Agg')
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

X = np.array([[1,1],[2,2],[-1,-1],[-2,-2],[1,-1],[2,-2],[-1,1],[-2,2]])
y = np.array([1,1,1,1,-1,-1,-1,-1])

gammas = [0.1, 1, 10, 100]
for gamma in gammas:
    svm = SVC(kernel='rbf', C=1.0, gamma=gamma)
    svm.fit(X, y)
    train_acc = svm.score(X, y)
    print(f"gamma={gamma:>5}: train_acc={train_acc:.2f}, n_SV={svm.support_vectors_.shape[0]}")

gamma=  0.1: train_acc=0.50, n_SV=8
gamma=    1: train_acc=1.00, n_SV=4
gamma=   10: train_acc=1.00, n_SV=8
gamma=  100: train_acc=1.00, n_SV=8

At $γ = 0.1$ : still fails — the RBF kernel is too smooth (each point's influence extends very far, averaging out the XOR structure). At $γ = 1$ : perfectly separates — four support vectors, tight boundary. At $γ = 10, 100$ : still 100% train accuracy but all 8 points become support vectors — the kernel is so local that each point only "sees" itself.

<rect x="10" y="16" width="130" height="145" fill="#fff0f0" rx="3"/>
<rect x="165" y="16" width="130" height="145" fill="#f0fff4" rx="3"/>
<rect x="325" y="16" width="130" height="145" fill="#f0fff4" rx="3"/>
<rect x="480" y="16" width="130" height="145" fill="#f0fff4" rx="3"/>

<line x1="10" y1="88" x2="140" y2="88" stroke="#cbd5e1" stroke-width="0.8"/>
<line x1="75" y1="16" x2="75" y2="161" stroke="#cbd5e1" stroke-width="0.8"/>
<line x1="165" y1="88" x2="295" y2="88" stroke="#cbd5e1" stroke-width="0.8"/>
<line x1="230" y1="16" x2="230" y2="161" stroke="#cbd5e1" stroke-width="0.8"/>
<line x1="325" y1="88" x2="455" y2="88" stroke="#cbd5e1" stroke-width="0.8"/>
<line x1="390" y1="16" x2="390" y2="161" stroke="#cbd5e1" stroke-width="0.8"/>
<line x1="480" y1="88" x2="610" y2="88" stroke="#cbd5e1" stroke-width="0.8"/>
<line x1="545" y1="16" x2="545" y2="161" stroke="#cbd5e1" stroke-width="0.8"/>

<line x1="10" y1="88" x2="140" y2="88" stroke="#ef4444" stroke-width="2"/>
<text x="75" y="130" text-anchor="middle" font-size="8" fill="#ef4444">straight line</text>
<text x="75" y="140" text-anchor="middle" font-size="8" fill="#ef4444">50% accuracy</text>

<path d="M 165,88 Q 210,30 230,88 Q 250,146 295,88" fill="none" stroke="#22c55e" stroke-width="2"/>
<path d="M 165,88 Q 210,146 230,88 Q 250,30 295,88" fill="none" stroke="#22c55e" stroke-width="2"/>
<text x="230" y="152" text-anchor="middle" font-size="8" fill="#22c55e">X-shaped boundary</text>
<text x="230" y="162" text-anchor="middle" font-size="8" fill="#22c55e">100% accuracy, 4 SV</text>

<ellipse cx="355" cy="65" rx="22" ry="18" fill="none" stroke="#22c55e" stroke-width="1.5"/>
<ellipse cx="425" cy="111" rx="22" ry="18" fill="none" stroke="#22c55e" stroke-width="1.5"/>
<ellipse cx="355" cy="111" rx="22" ry="18" fill="none" stroke="#22c55e" stroke-width="1.5"/>
<ellipse cx="425" cy="65" rx="22" ry="18" fill="none" stroke="#22c55e" stroke-width="1.5"/>
<text x="390" y="152" text-anchor="middle" font-size="8" fill="#22c55e">100% accuracy, 8 SV</text>
<text x="390" y="162" text-anchor="middle" font-size="8" fill="#22c55e">(overfit in prod)</text>

<ellipse cx="510" cy="58" rx="14" ry="12" fill="none" stroke="#f59e0b" stroke-width="1.5"/>
<ellipse cx="580" cy="118" rx="14" ry="12" fill="none" stroke="#f59e0b" stroke-width="1.5"/>
<ellipse cx="510" cy="118" rx="14" ry="12" fill="none" stroke="#f59e0b" stroke-width="1.5"/>
<ellipse cx="580" cy="58" rx="14" ry="12" fill="none" stroke="#f59e0b" stroke-width="1.5"/>
<text x="545" y="152" text-anchor="middle" font-size="8" fill="#f59e0b">100% train, but</text>
<text x="545" y="162" text-anchor="middle" font-size="8" fill="#f59e0b">circles each point</text>

<circle cx="100" cy="63" r="5" fill="#3b82f6"/>
<circle cx="120" cy="43" r="5" fill="#3b82f6"/>
<circle cx="50" cy="113" r="5" fill="#3b82f6"/>
<circle cx="30" cy="133" r="5" fill="#3b82f6"/>
<rect x="95" y="108" width="9" height="9" fill="#ef4444"/>
<rect x="115" y="128" width="9" height="9" fill="#ef4444"/>
<rect x="45" y="58" width="9" height="9" fill="#ef4444"/>
<rect x="25" y="38" width="9" height="9" fill="#ef4444"/>

Choosing a Kernel

Kernel	Formula	When to use
Linear	$x_{i} \cdot x_{j}$	High-dimensional data, text, when $d ≫ n$
Polynomial (deg $p$ )	$(x_{i} \cdot x_{j} + c)^{p}$	Image patches, polynomial relationships
RBF (Gaussian)	$exp (- γ ∥ x_{i} - x_{j} ∥^{2})$	Default choice; works when decision boundary is complex
Sigmoid	$tanh (κ x_{i} \cdot x_{j} + c)$	Rarely used; not always a valid kernel

RBF is the default for unknown structure. Linear wins when the data is already high-dimensional (text BOW, embeddings) because: (a) it's often linearly separable in high dimensions already, (b) the dual has $n$ variables but the primal has $d$ — using the primal (with LinearSVC) is faster.

Mercer's Theorem

A function $K (x, z)$ is a valid kernel (corresponds to some feature map $ϕ$ ) if and only if the kernel matrix $K_{ij} = K (x_{i}, x_{j})$ is positive semi-definite for any set of points. This is Mercer's condition. The sigmoid kernel violates this for some parameter choices — SVM convergence is not guaranteed when using non-PSD kernels.

Test Your Understanding

The feature map $ϕ (x_{1}, x_{2}) = (x_{1}^{2}, x_{2}^{2}, x_{1} x_{2})$ separates XOR. What hyperplane in the 3D lifted space corresponds to the decision boundary $x_{1} x_{2} = 0$ ? Express it in terms of $z_{1}, z_{2}, z_{3}$ .
The polynomial kernel $K (x_{i}, x_{j}) = (x_{i} \cdot x_{j} + 1)^{2}$ computes a 6-dimensional dot product without explicitly computing 6-dimensional vectors. What is the computational cost of computing $K (x_{i}, x_{j})$ in the original 2D space? What would it cost to explicitly compute $ϕ (x_{i}) \cdot ϕ (x_{j})$ in 6D?
At $γ = 0.1$ , the RBF kernel fails on XOR (50% accuracy). Explain geometrically: if the RBF kernel is very smooth (large influence radius), what does the learned decision function look like?
At $γ = 100$ , all 8 training samples become support vectors. This achieves 100% train accuracy but is likely to generalize poorly. Given the KKT conditions from post 02, what does it mean when every point is a support vector?
The linear kernel SVC fails on XOR but would succeed on the lifted XOR data (with $z_{3} = x_{1} x_{2}$ as a third feature). If you add $z_{3}$ as a feature and use a linear kernel, would the result be identical to using an RBF kernel with $γ = 1$ on the original data? Why or why not?

SVM Kernels

Why Linear Fails on XOR

The Feature Map Solution

The Kernel Trick

Polynomial Kernel Values on the Anchor

The RBF Kernel

$γ$ Effect on the Decision Boundary

Choosing a Kernel

Mercer's Theorem

Test Your Understanding

Comments (0)

Leave a comment

SVM Kernels

Why Linear Fails on XOR

The Feature Map Solution

The Kernel Trick

Polynomial Kernel Values on the Anchor

The RBF Kernel

γ Effect on the Decision Boundary

Choosing a Kernel

Mercer's Theorem

Test Your Understanding

Comments (0)

Leave a comment

$γ$ Effect on the Decision Boundary