The Kernel Trick

Jun 27, 2026•13 min read•By Mohammed Vasim

Machine Learning • AI • Data Science

The kernel trick lets you train an SVM in a space with millions — or infinitely many — dimensions without ever computing a single coordinate in that space. The mechanism is a one-line substitution in the dual objective. Every formula in the dual where $x_{i} \cdot x_{j}$ appears gets replaced by $K (x_{i}, x_{j})$ . The reason this works at all is that the dual depends on dot products between samples, never on the weight vector $w$ directly.

Anchor dataset: XOR classification — four points, no linear boundary possible.

python

import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]])
y = np.array([1, 1, -1, -1])
# Positive class (y=+1): same-sign quadrants (Q1, Q3)
# Negative class (y=-1): opposite-sign quadrants (Q2, Q4)

Phase 1 — Why the Dual Is the Gateway

The primal SVM solves:

$\min_{w, b} \frac{1}{2} ∥ w ∥^{2} \quad \text{subject to} \quad y_{i} (w \cdot x_{i} + b) \geq 1 \ \text{for all } i$

The weight vector $w$ lives in the same space as the input features. To use a nonlinear feature map $ϕ : R^{d} \to R^{D}$ , you would need to store and optimize a $D$ -dimensional $w$ — impossible when $D = \infty$ (as with the RBF kernel).

The dual objective is:

$\max_{α} \sum_{i} α_{i} - \frac{1}{2} \sum_{i} \sum_{j} α_{i} α_{j} y_{i} y_{j} \underbrace{x_{i} \cdot x_{j}}_{\text{only dot products}}$

subject to $\sum_{i} α_{i} y_{i} = 0$ , $α_{i} \geq 0$ .
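A short sketch of where this dual comes from, using the standard Lagrangian argument with one multiplier $α_{i} \geq 0$ per margin constraint:

$L (w, b, α) = \frac{1}{2} ∥ w ∥^{2} - \sum_{i} α_{i} [ y_{i} (w \cdot x_{i} + b) - 1 ]$

$\frac{\partial L}{\partial w} = 0 \Rightarrow w = \sum_{i} α_{i} y_{i} x_{i} , \qquad \frac{\partial L}{\partial b} = 0 \Rightarrow \sum_{i} α_{i} y_{i} = 0$

Substituting both conditions back into $L$ eliminates $w$ and $b$ : the term $\frac{1}{2} ∥ w ∥^{2}$ contributes $+ \frac{1}{2} \sum_{i} \sum_{j} α_{i} α_{j} y_{i} y_{j} x_{i} \cdot x_{j}$ , the term $- \sum_{i} α_{i} y_{i} w \cdot x_{i}$ contributes minus twice that amount, the $b$ term vanishes, and $\sum_{i} α_{i}$ remains. The result is the dual objective stated above.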

The weight vector drops out entirely. The dual depends only on pairwise inner products between training points. Substitute the feature map $ϕ$ :

$\max_{α} \sum_{i} α_{i} - \frac{1}{2} \sum_{i} \sum_{j} α_{i} α_{j} y_{i} y_{j} \underbrace{ϕ (x_{i}) \cdot ϕ (x_{j})}_{= K (x_{i}, x_{j})}$

A kernel function $K (x_{i}, x_{j}) = ϕ (x_{i}) \cdot ϕ (x_{j})$ computes this inner product in the lifted space directly from the original inputs — no explicit $ϕ$ computation needed. The prediction for a new point also collapses to kernel evaluations:

$f (x) = w^{*} \cdot ϕ (x) + b = \sum_{i \in SV} α_{i}^{*} y_{i} \underbrace{ϕ (x_{i}) \cdot ϕ (x)}_{K (x_{i}, x)} + b$

The primal needs $w \in R^{D}$ . The dual needs only the $n \times n$ kernel matrix. When $D ≫ n$ — or when $D = \infty$ — the dual is the only tractable option.

Phase 2 — The Gram Matrix

The full set of pairwise kernel values between all training points is the Gram matrix $K \in R^{n \times n}$ , where $K_{ij} = K (x_{i}, x_{j})$ . This is the only structure the SVM optimizer ever sees — not the raw features, not $ϕ$ .

The degree-2 homogeneous polynomial kernel $K (x_{i}, x_{j}) = (x_{i} \cdot x_{j})^{2}$ corresponds to the feature map $ϕ (x) = (x_{1}^{2}, \sqrt{2} x_{1} x_{2}, x_{2}^{2})$ , where $x_{1}$ and $x_{2}$ here denote the two coordinates of $x$ . The $\sqrt{2}$ factor is what makes $ϕ (x) \cdot ϕ (z) = (x \cdot z)^{2}$ hold exactly. The $\sqrt{2} x_{1} x_{2}$ component captures the sign interaction that separates XOR.

Verifying the feature map works: for $x = (1, 1)$ and $x^{'} = (- 1, - 1)$ (both positive class):

$ϕ (1, 1) = (1, \sqrt{2}, 1), \quad ϕ (- 1, - 1) = (1, \sqrt{2}, 1)$

They map to the same point, and $ϕ (x) \cdot ϕ (x^{'}) = 1 + 2 + 1 = 4 = (x \cdot x^{'})^{2}$ . For $x^{''} = (1, - 1)$ (negative class): $ϕ (1, - 1) = (1, - \sqrt{2}, 1)$ . The $\sqrt{2} x_{1} x_{2}$ component flips sign, which is exactly what separates the classes.
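The identity between the kernel and the lifted dot product can be checked numerically. The sketch below assumes the $\sqrt{2}$-scaled cross term for the feature map; the helper names `phi` and `poly2_kernel` are illustrative and not part of any library.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel, computed in the input space."""
    return float(np.dot(x, z)) ** 2

X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)

# Largest disagreement between phi(a).phi(b) and K(a, b) over all 16 pairs.
max_gap = max(
    abs(np.dot(phi(a), phi(b)) - poly2_kernel(a, b))
    for a in X for b in X
)

print("phi(1, 1)  =", phi(X[0]))
print("phi(1, -1) =", phi(X[2]))
print("max |phi.phi - K| =", max_gap)
```

The gap is at floating-point rounding level, so the kernel evaluation and the explicit lift agree on every pair of anchor points.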

Now compute the 4×4 Gram matrix. Every entry is $K_{ij} = (x_{i} \cdot x_{j})^{2}$ ; four representative pairs:

Pair $(i, j)$	$x_{i} \cdot x_{j}$	$(x_{i} \cdot x_{j})^{2} = K_{ij}$	$y_{i} y_{j}$	$y_{i} y_{j} K_{ij}$
$(x_{1}, x_{1})$	$1 \cdot 1 + 1 \cdot 1 = 2$	$4$	$+ 1$	$+ 4$
$(x_{1}, x_{2})$	$1 \cdot (- 1) + 1 \cdot (- 1) = - 2$	$4$	$+ 1$	$+ 4$
$(x_{1}, x_{3})$	$1 \cdot 1 + 1 \cdot (- 1) = 0$	$0$	$- 1$	$0$
$(x_{3}, x_{4})$	$1 \cdot (- 1) + (- 1) \cdot 1 = - 2$	$4$	$+ 1$	$+ 4$

The full matrix (using $x_{1} = (1, 1), x_{2} = (- 1, - 1), x_{3} = (1, - 1), x_{4} = (- 1, 1)$ ):

$K = \begin{pmatrix} 4 & 4 & 0 & 0 \\ 4 & 4 & 0 & 0 \\ 0 & 0 & 4 & 4 \\ 0 & 0 & 4 & 4 \end{pmatrix}$

Same-class pairs (rows 1–2 and rows 3–4) have $K_{ij} = 4$ . Cross-class pairs have $K_{ij} = 0$ . The block structure in the kernel matrix is exactly the class structure.
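A minimal NumPy sketch that builds this Gram matrix in one line and confirms the block structure. The variable `Q` (the label-weighted matrix $y_{i} y_{j} K_{ij}$ ) is a name introduced here for illustration.

```python
import numpy as np

X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]])
y = np.array([1, 1, -1, -1])

# K_ij = (x_i . x_j)^2 for every pair of training points.
K = (X @ X.T) ** 2

# Label-weighted matrix y_i * y_j * K_ij, the quantity inside the dual's double sum.
Q = np.outer(y, y) * K

# Eigenvalues of K; none should be negative for a valid kernel matrix.
eigenvalues = np.linalg.eigvalsh(K)

print(K)
print("sum of y_i y_j K_ij:", Q.sum())
print("eigenvalues:", np.round(eigenvalues, 6))
```

The total of the label-weighted entries is the coefficient that multiplies $α^{2}$ when all multipliers are set equal.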

The SVM optimizer solves the dual using only this matrix. The 3D feature space $(x_{1}^{2}, \sqrt{2} x_{1} x_{2}, x_{2}^{2})$ is never instantiated.
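One way to see that the optimizer needs nothing but the Gram matrix is scikit-learn's `kernel="precomputed"` mode, where `fit` receives the $n \times n$ matrix instead of the raw features. This is a sketch under assumptions: the large `C` stands in for a hard margin, and the two points in `X_new` are made up for illustration.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
y = np.array([1, 1, -1, -1])

# The model is trained on kernel values only; it never sees X or phi(X).
K_train = (X @ X.T) ** 2
clf = SVC(kernel="precomputed", C=1e3)
clf.fit(K_train, y)

# Prediction needs the kernel between new points and the training points,
# arranged with shape (n_new, n_train).
X_new = np.array([[0.5, 0.5], [2.0, -1.0]])
K_new = (X_new @ X.T) ** 2
pred = clf.predict(K_new)

print("train predictions:", clf.predict(K_train))
print("new predictions:  ", pred)
```

The first new point lies in a same-sign quadrant and the second in an opposite-sign quadrant, so the expected labels are +1 and -1 respectively.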

Phase 3 — Prediction via Kernel Evaluations

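The dual solution for this symmetric XOR problem: by symmetry we can take all four $α_{i}$ equal. (Because $ϕ (x_{1}) = ϕ (x_{2})$ and $ϕ (x_{3}) = ϕ (x_{4})$ , the optimal $α$ is not unique, but every optimum yields the same decision function.) Setting $α_{i} = α$ , the constraint $\sum_{i} α_{i} y_{i} = 0$ holds: $α + α - α - α = 0$ ✓.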

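The dual objective becomes (the double sum has eight nonzero same-class entries, each contributing $4 α^{2}$ , for a total of $32 α^{2}$ ):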

$4 α - \frac{1}{2} \cdot 32 α^{2} = 4 α - 16 α^{2}$

Setting the derivative to zero: $4 - 32 α = 0 \Rightarrow α^{*} = \frac{1}{8}$ for all four points.

The bias $b$ : substituting $x_{1}$ (a support vector) into $y_{1} f (x_{1}) = 1$ :

$f (x_{1}) = \frac{1}{8} (1) (4) + \frac{1}{8} (1) (4) + \frac{1}{8} (- 1) (0) + \frac{1}{8} (- 1) (0) + b = 1 + b$

Since $y_{1} f (x_{1}) = 1$ : $b = 0$ .
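The hand-derived solution can be checked numerically. The sketch below evaluates the dual objective along the symmetric family $α_{i} = a$ and confirms that every training point sits exactly on the margin. `dual_objective` is an illustrative helper, and the grid resolution is an arbitrary choice.

```python
import numpy as np

X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]])
y = np.array([1, 1, -1, -1])
K = (X @ X.T) ** 2
Q = np.outer(y, y) * K

def dual_objective(alpha_vec):
    """sum_i alpha_i - 1/2 * sum_ij alpha_i alpha_j y_i y_j K_ij."""
    return alpha_vec.sum() - 0.5 * alpha_vec @ Q @ alpha_vec

alpha = np.full(4, 1 / 8)
b = 0.0

# y_i * f(x_i) for each training point; support vectors satisfy this with equality 1.
margins = y * (K @ (alpha * y) + b)

# Scan alpha_i = a on a grid to locate the maximiser of 4a - 16a^2.
grid = np.linspace(0.0, 0.5, 501)
values = [dual_objective(np.full(4, a)) for a in grid]
best_a = grid[int(np.argmax(values))]

print("margins:", margins)
print("best a on grid:", best_a, "objective:", dual_objective(alpha))
```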

Now predict $x_{test} = (0.5, 0.5)$ (falls in Q1, should be class +1). Compute $K (x_{i}, x_{test}) = (x_{i} \cdot x_{test})^{2}$ for each support vector:

Support vector $x_{i}$	$x_{i} \cdot x_{test}$	$K (x_{i}, x_{test})$	$α_{i}^{*} y_{i} K$ (contribution)
$x_{1} = (1, 1)$	$0.5 + 0.5 = 1.0$	$1.0^{2} = 1.0$	$\frac{1}{8} (+ 1) (1.0) = + 0.125$
$x_{2} = (- 1, - 1)$	$- 0.5 - 0.5 = - 1.0$	$(- 1.0)^{2} = 1.0$	$\frac{1}{8} (+ 1) (1.0) = + 0.125$
$x_{3} = (1, - 1)$	$0.5 - 0.5 = 0$	$0^{2} = 0$	$\frac{1}{8} (- 1) (0) = 0$
$x_{4} = (- 1, 1)$	$- 0.5 + 0.5 = 0$	$0^{2} = 0$	$\frac{1}{8} (- 1) (0) = 0$

$f (x_{test}) = 0.125 + 0.125 + 0 + 0 + b = + 0.25 \quad (b = 0)$

Positive, so the point is correctly classified as +1. The votes from $x_{3}$ and $x_{4}$ are exactly zero because both support vectors are orthogonal to $x_{test}$ : $x_{3} \cdot x_{test} = x_{4} \cdot x_{test} = 0$ , so the kernel $(x_{i} \cdot x_{test})^{2}$ vanishes.
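The same prediction written as code, using the values $α_{i} = \frac{1}{8}$ and $b = 0$ derived above. The function name `decision` and the second query point are illustrative choices.

```python
import numpy as np

X_sv = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
y_sv = np.array([1, 1, -1, -1])
alpha = np.full(4, 1 / 8)
b = 0.0

def decision(x):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b with K(x_i, x) = (x_i . x)^2."""
    k = (X_sv @ x) ** 2
    return float(np.sum(alpha * y_sv * k) + b)

f_q1 = decision(np.array([0.5, 0.5]))    # same-sign quadrant
f_q4 = decision(np.array([0.5, -0.5]))   # opposite-sign quadrant

print(f_q1, f_q4)
```

Expanding the four kernel terms shows that this decision function reduces to the product of the two input coordinates, which is why its sign matches the XOR labelling in every quadrant.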

Phase 4 — Why RBF Is Infinite-Dimensional

For the polynomial kernel, the lifted space is finite: degree-2 maps $R^{2} \to R^{3}$ . The RBF kernel $K (x, z) = exp (- γ ∥ x - z ∥^{2})$ corresponds to an infinite-dimensional space.

For scalar inputs, expand via $e^{ab} = \sum_{n = 0}^{\infty} \frac{( ab ) ^{n}}{n !}$ :

$K (x, z) = e^{- γ (x - z)^{2}} = e^{- γ x^{2}} \cdot e^{- γ z^{2}} \cdot \underbrace{e^{2 γ x z}}_{\sum_{n = 0}^{\infty} \frac{( 2 γ ) ^{n}}{n !} x^{n} z^{n}}$

$= \sum_{n = 0}^{\infty} \frac{( 2 γ ) ^{n}}{n !} (e^{- γ x^{2}} x^{n}) (e^{- γ z^{2}} z^{n})$

This is a dot product $ϕ (x) \cdot ϕ (z)$ where:

$ϕ (x) = e^{- γ x^{2}} (1, \sqrt{\frac{2 γ}{1 !}} x, \sqrt{\frac{( 2 γ ) ^{2}}{2 !}} x^{2}, \sqrt{\frac{( 2 γ ) ^{3}}{3 !}} x^{3}, \dots)$

One component for every polynomial degree from 0 to $\infty$ . The feature map is infinite-dimensional. Yet computing $K (x, z) = exp (- γ ∥ x - z ∥^{2})$ costs $O (d)$ — the same as an ordinary dot product.
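A small sketch that truncates this infinite feature map after a finite number of terms and compares the resulting dot product with the exact kernel for scalar inputs. It assumes each component carries the square root of the series coefficient, so that the product of two components reproduces one term of the series. The values of `gamma`, `x`, and `z` are arbitrary.

```python
import math
import numpy as np

def rbf_features(x, gamma, n_terms):
    """First n_terms components of the scalar RBF feature map."""
    n = np.arange(n_terms)
    coeff = np.sqrt([(2 * gamma) ** k / math.factorial(k) for k in range(n_terms)])
    return np.exp(-gamma * x**2) * coeff * x**n

def rbf_kernel_scalar(x, z, gamma):
    """Exact RBF kernel for scalar inputs."""
    return math.exp(-gamma * (x - z) ** 2)

gamma, x, z = 0.5, 0.7, -0.3
exact = rbf_kernel_scalar(x, z, gamma)

# Absolute error of the truncated dot product for increasing truncation lengths.
errors = {
    N: abs(float(rbf_features(x, gamma, N) @ rbf_features(z, gamma, N)) - exact)
    for N in (1, 2, 5, 10, 20)
}

for N, err in errors.items():
    print(f"terms={N:>2}  error={err:.3e}")
```

The error shrinks rapidly with the number of terms, but no finite truncation is exact, whereas the closed-form kernel is exact at constant cost.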

RBF kernel values ( $γ = 0.5$ ) on representative anchor pairs:

Pair	$∥ x_{i} - x_{j} ∥^{2}$	$K = e^{- 0.5 ∥ x_{i} - x_{j} ∥^{2}}$	Relationship
$x_{1}$ vs $x_{1}$	$0$	$e^{0} = 1.000$	Same point
$x_{1} = (1, 1)$ vs $x_{3} = (1, - 1)$	$0 + 4 = 4$	$e^{- 2} = 0.135$	Cross-class, adjacent
$x_{1} = (1, 1)$ vs $x_{4} = (- 1, 1)$	$4 + 0 = 4$	$e^{- 2} = 0.135$	Cross-class, adjacent
$x_{1} = (1, 1)$ vs $x_{2} = (- 1, - 1)$	$4 + 4 = 8$	$e^{- 4} = 0.018$	Same class, diagonal

The same-class pair $(x_{1}, x_{2})$ is farther apart than the cross-class pairs $(x_{1}, x_{3})$ and $(x_{1}, x_{4})$ , so it has the smallest kernel value in the table. XOR puts same-class points in opposite quadrants. RBF solves it not by finding same-class proximity but by fitting a complex boundary around the Q1/Q3 region.
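The table values can be reproduced with a few lines of NumPy. This is a sketch that computes the full RBF Gram matrix for the anchor points via broadcasting, with $γ = 0.5$ as in the table.

```python
import numpy as np

X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
gamma = 0.5

# Pairwise squared Euclidean distances via broadcasting: shape (4, 4).
diff = X[:, None, :] - X[None, :, :]
sq_dist = np.sum(diff**2, axis=-1)

# K_ij = exp(-gamma * ||x_i - x_j||^2)
K_rbf = np.exp(-gamma * sq_dist)

print(np.round(K_rbf, 3))
print("smallest eigenvalue:", np.linalg.eigvalsh(K_rbf).min())
```

Unlike the polynomial Gram matrix, this one has no zero blocks: the same-class entry is the smallest off-diagonal value.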

γ Sensitivity

γ controls how quickly the RBF kernel decays with distance. Small γ: smooth kernel — each point's influence extends far. Large γ: sharp kernel — each point only sees itself.

python

from sklearn.svm import SVC
import numpy as np

X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]])
y = np.array([1, 1, -1, -1])

for gamma in [0.05, 0.5, 2, 20]:
    svm = SVC(kernel='rbf', C=10, gamma=gamma)
    svm.fit(X, y)
    print(f"gamma={gamma:>5}: acc={svm.score(X,y):.2f}, n_SV={len(svm.support_vectors_)}")

text

gamma= 0.05: acc=0.50, n_SV=4
gamma=  0.5: acc=1.00, n_SV=4
gamma=    2: acc=1.00, n_SV=4
gamma=   20: acc=1.00, n_SV=4

At $γ = 0.05$ the kernel is very smooth: adjacent pairs have $K = e^{- 0.2} \approx 0.82$ and the diagonal pair has $K = e^{- 0.4} \approx 0.67$ , so every point looks similar to every other and the Gram matrix carries little class contrast; the run above reports 50% accuracy. At $γ = 0.5$ the decay is fast enough to distinguish the XOR structure. At $γ = 20$ each point's kernel value drops to near zero within a tiny radius: the SVM effectively draws tight circles around each training point and is likely to be unreliable on any test point that is not close to one. The support-vector count does not distinguish these regimes here, since all four points are support vectors at every $γ$ in the run.
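To make the decay concrete, the sketch below tabulates the two off-diagonal kernel values of the XOR anchor for each $γ$ in the sweep. The squared distances 4 (adjacent, cross-class) and 8 (diagonal, same-class) come from the anchor geometry; the constant names are illustrative.

```python
import numpy as np

# Squared distances between the XOR anchor points:
# 4 for adjacent (cross-class) pairs, 8 for the diagonal (same-class) pair.
SQ_DIST_ADJACENT = 4.0
SQ_DIST_DIAGONAL = 8.0

kernel_values = {}
for gamma in [0.05, 0.5, 2, 20]:
    k_adj = float(np.exp(-gamma * SQ_DIST_ADJACENT))
    k_diag = float(np.exp(-gamma * SQ_DIST_DIAGONAL))
    kernel_values[gamma] = (k_adj, k_diag)
    print(f"gamma={gamma:>5}: K_adjacent={k_adj:.3e}, K_diagonal={k_diag:.3e}")
```

At the small end the two values are both large and close together; at the large end both are numerically indistinguishable from zero, so the Gram matrix is close to the identity.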

Computational Cost

Approach	Feature storage	Prediction cost	Feasibility
Explicit $ϕ$ (degree-2, $d = 2$ )	$\binom{d + 2}{2} = 6$ floats	$O (D) = O (6)$	Trivial
Explicit $ϕ$ (degree-10, $d = 2$ )	$\binom{d + 10}{10} = 66$ floats	$O (D) = O (66)$	Trivial at $d = 2$ ; about $4.7 \times 10^{13}$ floats at $d = 100$
Explicit $ϕ$ (RBF)	$\infty$ dimensions	Impossible	Never
Kernel $K (x_{i}, x_{j})$	$O (n^{2})$ Gram matrix	$O (n_{SV} \cdot d)$	Feasible when $n$ small

The kernel wins when $D ≫ d$ . The trap: a kernel SVM works with the $n \times n$ Gram matrix during training. At $n = 100{,}000$ , the full matrix has $10^{10}$ float64 values, roughly 80 GB. This is why kernel SVMs are typically replaced by approximations (Nyström method, random Fourier features) at scale, as a rule of thumb once $n$ exceeds roughly 50,000.
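The memory figure follows from simple arithmetic. A minimal helper, with a name chosen here for illustration, that reproduces it assuming a dense matrix and 8 bytes per float64 value:

```python
def gram_matrix_gigabytes(n, bytes_per_value=8):
    """Memory for a dense n x n Gram matrix, in gigabytes (1 GB = 1e9 bytes)."""
    return n * n * bytes_per_value / 1e9

for n in (1_000, 10_000, 50_000, 100_000):
    print(f"n={n:>7}: {gram_matrix_gigabytes(n):8.3f} GB")
```

Memory grows quadratically, so a tenfold increase in samples costs a hundredfold increase in storage.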

python

from sklearn.svm import SVC
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
import numpy as np

X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]])
y = np.array([1, 1, -1, -1])

exact = SVC(kernel='rbf', C=10, gamma=0.5)
approx = make_pipeline(
    RBFSampler(gamma=0.5, n_components=100, random_state=42),
    SGDClassifier(max_iter=1000, random_state=42)
)
exact.fit(X, y)
approx.fit(X, y)
print(f"Exact kernel SVM:   {exact.score(X, y):.2f}")
print(f"Approx (RFF, n=100): {approx.score(X, y):.2f}")

text

Exact kernel SVM:    1.00
Approx (RFF, n=100): 1.00

RBFSampler approximates the RBF kernel via random Fourier features: it maps inputs to a fixed $D$ -dimensional space (here $D = 100$ ), after which a linear model is trained on the mapped features. The accuracy matches on this tiny dataset, but the approach scales to millions of samples.
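To show what the random feature map is doing, here is a hand-rolled sketch of random Fourier features in NumPy. It is not scikit-learn's implementation; it assumes the standard construction in which frequencies are drawn from a Gaussian with variance $2 γ$ and phases uniformly from $[0, 2π)$ . The function name, seed, and feature count are illustrative.

```python
import numpy as np

def random_fourier_features(X, gamma, n_features, seed=0):
    """Map X to n_features dimensions so that z(x).z(y) approximates exp(-gamma*||x-y||^2)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Frequencies: Gaussian with standard deviation sqrt(2*gamma).
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, n_features))
    # Random phase offsets.
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
gamma = 0.5

Z = random_fourier_features(X, gamma, n_features=20000)
K_approx = Z @ Z.T

sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_exact = np.exp(-gamma * sq_dist)

max_err = np.abs(K_approx - K_exact).max()
print("max abs error:", max_err)
```

The approximation error shrinks roughly with the square root of the number of random features, which is the trade-off between accuracy and the width of the linear model.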

Related Concepts and Honest Limitations

This post depends entirely on the dual SVM formulation from post 02, specifically that the dual objective collapses $w$ out and retains only pairwise inner products $x_{i} \cdot x_{j}$ . Without the dual, there is no substitution point. From here, the Gram matrix structure generalizes directly: kernel PCA replaces the covariance matrix with $K$ , kernel ridge regression solves $(K + λ I)^{- 1} y$ instead of $(X^{⊤} X + λ I)^{- 1} X^{⊤} y$ , and Gaussian processes use $K$ as the prior covariance. Mercer's condition (positive semi-definiteness of $K$ ), covered in post 03, is the formal requirement that makes the substitution valid.
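The kernel ridge regression identity can be checked on the XOR anchor. The sketch below fits ridge regression once in the explicit degree-2 feature space and once through the Gram matrix, then compares fitted values. The regularization strength `lam` is an arbitrary illustrative choice, and the feature map assumes the $\sqrt{2}$-scaled cross term.

```python
import numpy as np

X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
y = np.array([1, 1, -1, -1], dtype=float)
lam = 0.1

# Explicit lifted features (x1^2, sqrt(2)*x1*x2, x2^2) and the matching Gram matrix.
Phi = np.column_stack([X[:, 0] ** 2, np.sqrt(2) * X[:, 0] * X[:, 1], X[:, 1] ** 2])
K = (X @ X.T) ** 2

# Primal ridge: w = (Phi^T Phi + lam I)^{-1} Phi^T y
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(3), Phi.T @ y)
pred_primal = Phi @ w

# Kernel ridge: alpha = (K + lam I)^{-1} y, fitted values K @ alpha
alpha = np.linalg.solve(K + lam * np.eye(4), y)
pred_dual = K @ alpha

print("primal:", np.round(pred_primal, 4))
print("dual:  ", np.round(pred_dual, 4))
```

Both routes give the same fitted values; the kernel route solves an $n \times n$ system and never forms the lifted features.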

The kernel trick is structurally tied to models whose training objective collapses to dot products: SVMs, kernel ridge regression, kernel PCA. Decision trees, random forests, and gradient boosting have no dual formulation of this form and cannot be kernelized. Logistic regression can be kernelized (via the dual of the regularized problem) but rarely is in practice. At $n > 50{,}000$ the $O (n^{2})$ Gram matrix becomes the bottleneck: 80 GB at $n = 100{,}000$ in float64, before even running an optimizer. The usual response is the Nyström approximation (subsample $m ≪ n$ points, approximate $K \approx K_{nm} K_{mm}^{- 1} K_{mn}$ ) or random Fourier features, both of which trade approximation error for tractable memory. Finally, choosing the wrong kernel is equivalent to choosing the wrong model family: a polynomial kernel of degree 3 on data with a degree-2 structure can overfit; a linear kernel on XOR will fail entirely. Kernel selection is a hyperparameter search problem, not a free choice.
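A minimal sketch of the Nyström formula on synthetic data, to show how the approximation error changes with the number of landmarks. Everything here is an illustrative assumption: the data are random Gaussian points, landmarks are sampled uniformly, and a pseudo-inverse with a cutoff replaces $K_{mm}^{- 1}$ for numerical stability.

```python
import numpy as np

def rbf_gram(A, B, gamma):
    """RBF kernel matrix between the rows of A and the rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def nystrom(X, landmark_idx, gamma):
    """Rank-m approximation K_nm @ pinv(K_mm) @ K_mn from m landmark rows of X."""
    K_nm = rbf_gram(X, X[landmark_idx], gamma)
    K_mm = K_nm[landmark_idx]
    return K_nm @ np.linalg.pinv(K_mm, rcond=1e-10) @ K_nm.T

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
gamma = 0.5
K_full = rbf_gram(X, X, gamma)

rel_err = {}
for m in (5, 20, 80):
    idx = rng.choice(len(X), size=m, replace=False)
    K_hat = nystrom(X, idx, gamma)
    rel_err[m] = np.linalg.norm(K_full - K_hat) / np.linalg.norm(K_full)
    print(f"m={m:>3}: relative Frobenius error = {rel_err[m]:.2e}")
```

Only the $n \times m$ block and the $m \times m$ block need to be stored, so memory scales with $n m$ rather than $n^{2}$ .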

Test Your Understanding

The dual objective contains $\sum_{i} \sum_{j} α_{i} α_{j} y_{i} y_{j} K_{ij}$ . In the primal, the analogous term is $∥ w ∥^{2}$ . Show algebraically that these are equal when $w^{*} = \sum_{i} α_{i}^{*} y_{i} ϕ (x_{i})$ — this is why the dual value equals the primal value at optimum.
The Gram matrix for the XOR anchor with kernel $K (x, z) = (x \cdot z)^{2}$ has a perfect $2 \times 2$ block structure. If you added a fifth point $x_{5} = (0.1, 0.1)$ (positive class), compute its row in the Gram matrix. Does the block structure hold?
At $γ = 0.05$ the RBF kernel fails on XOR (50% accuracy). Using the Taylor expansion, explain why small $γ$ causes the kernel to assign nearly equal similarity to all pairs — what does this do to the Gram matrix?
The Nyström approximation builds a rank- $m$ approximation of the $n \times n$ Gram matrix by selecting $m$ "landmark" points. If $m = 10$ and $n = 100{,}000$ , what are the storage requirements for the Nyström approximation vs the full Gram matrix?
The polynomial kernel $(x \cdot z)^{2}$ gives the XOR Gram matrix a clean block structure. With the kernel $(x \cdot z + 1)^{2}$ that structure disappears: every off-diagonal entry equals 1. Expand $(x \cdot z + 1)^{2}$ , identify which terms in the feature map are the same for all XOR points, and determine whether the SVM decision function still separates the classes.
