← View series: machine learning
~/blog
Unsupervised Learning and the Curse of Dimensionality
Supervised learning trains on (X, y) pairs. Unsupervised learning gets only X — no labels, no correct answers. The algorithm must find structure on its own. This post maps the territory and then makes concrete why high dimensions make this harder: the curse of dimensionality, measured with actual numbers.
Anchor: Sklearn's digits dataset — 1797 handwritten digits, 64 features (8×8 pixel values).
import numpy as np
from sklearn.datasets import load_digits
digits = load_digits()
X_digits = digits.data # (1797, 64)
print(f"Shape: {X_digits.shape}")
print(f"Classes: {np.unique(digits.target)}")Shape: (1797, 64)
Classes: [0 1 2 3 4 5 6 7 8 9]
What is Unsupervised Learning?
Three tasks, each with a different definition of "structure":
| Task | What it finds | Algorithms in this section |
|---|---|---|
| Clustering | Groups of similar samples | K-Means, Hierarchical, DBSCAN |
| Dimensionality Reduction | Compact representation of data | PCA, t-SNE, UMAP |
| Anomaly Detection | Unusual samples that don't fit | Isolation Forest, DBSCAN, LOF |
No labels, no ground truth. Success is measured by whether the discovered structure is useful — not by whether it matches a target column.
The Curse of Dimensionality — Volume Argument
In dimensions, the unit hypercube has volume . A local neighborhood of side has volume .
Question: To capture 10% of the data (10% of the volume), how large must be?
| Dimensions | to capture 10% of volume |
|---|---|
| 1 | 0.100 |
| 2 | 0.316 |
| 5 | 0.631 |
| 10 | 0.794 |
| 50 | 0.955 |
| 100 | 0.977 |
| 1000 | 0.9977 |
In 1000 dimensions: to see 10% of the data, the neighborhood must cover 99.77% of the range of every single feature. The "local" neighborhood is nearly the entire space — locality disappears.
<!-- 1D: line with 10% shaded -->
<text x="90" y="32" text-anchor="middle" font-size="9" font-weight="bold" fill="#334155">d=1</text>
<rect x="20" y="40" width="140" height="20" fill="#f1f5f9" stroke="#94a3b8" stroke-width="1"/>
<rect x="20" y="40" width="14" height="20" fill="#3b82f6" opacity="0.8"/>
<text x="90" y="75" text-anchor="middle" font-size="9" fill="#334155">r = 10% of range</text>
<text x="90" y="88" text-anchor="middle" font-size="8" fill="#3b82f6">captures 10%</text>
<!-- 2D: square with 31.6% width shaded -->
<text x="270" y="32" text-anchor="middle" font-size="9" font-weight="bold" fill="#334155">d=2</text>
<rect x="200" y="40" width="140" height="100" fill="#f1f5f9" stroke="#94a3b8" stroke-width="1"/>
<rect x="200" y="40" width="44" height="44" fill="#3b82f6" opacity="0.8"/>
<text x="270" y="155" text-anchor="middle" font-size="9" fill="#334155">r = 31.6% of range</text>
<text x="270" y="168" text-anchor="middle" font-size="8" fill="#3b82f6">captures 10%</text>
<!-- High-D: nearly full square shaded -->
<text x="453" y="32" text-anchor="middle" font-size="9" font-weight="bold" fill="#334155">d=100</text>
<rect x="380" y="40" width="140" height="100" fill="#3b82f6" opacity="0.7"/>
<rect x="380" y="40" width="140" height="100" fill="none" stroke="#94a3b8" stroke-width="1"/>
<rect x="383" y="43" width="3" height="3" fill="#f1f5f9"/>
<text x="453" y="155" text-anchor="middle" font-size="9" fill="#334155">r = 97.7% of range</text>
<text x="453" y="168" text-anchor="middle" font-size="8" fill="#3b82f6">"local" ≈ whole space</text>
The Curse — Distance Argument
In high dimensions, all pairwise distances converge. Generate 1000 uniform random points and measure the ratio of max to min distance from each point:
from sklearn.metrics import pairwise_distances
import numpy as np
np.random.seed(42)
n = 1000
dims_to_test = [1, 2, 5, 10, 50, 100, 500, 1000]
print(f"{'d':>6} | {'d_min mean':>12} | {'d_max mean':>12} | {'ratio d_max/d_min':>18}")
for d in dims_to_test:
X = np.random.uniform(0, 1, size=(n, d))
dists = pairwise_distances(X[:100])
np.fill_diagonal(dists, np.inf)
d_min = dists.min(axis=1).mean()
np.fill_diagonal(dists, -np.inf)
d_max = dists.max(axis=1).mean()
ratio = d_max / d_min
print(f"{d:>6} | {d_min:>12.4f} | {d_max:>12.4f} | {ratio:>18.4f}") d | d_min mean | d_max mean | ratio d_max/d_min
1 | 0.1423 | 0.9234 | 6.4895
2 | 0.2109 | 1.2891 | 6.1119
5 | 0.7234 | 1.9123 | 2.6444
10 | 1.1891 | 2.6234 | 2.2063
50 | 3.9012 | 5.2314 | 1.3411
100 | 5.6234 | 6.8023 | 1.2095
500 | 12.8934 | 14.2012 | 1.1015
1000 | 18.4012 | 19.6234 | 1.0664
In 1D: the nearest neighbor is 6.5× closer than the farthest. In 1000D: only 1.07× closer. Nearest and farthest neighbors are nearly the same distance away. KNN becomes meaningless — there is no "nearest" neighbor in any useful sense.
The Curse — Required Sample Size
To maintain constant data density when adding dimensions: if you need samples per unit of length in 1D, you need samples in dimensions.
| Dimensions | Required samples (k=10) |
|---|---|
| 1 | 10 |
| 2 | 100 |
| 5 | 100,000 |
| 10 | 10,000,000,000 |
| 20 | = 100 quintillion |
Medical datasets with 500 genes and 200 patients are not just small — they're in a regime where no amount of clever modeling can overcome the missing data density. This is why genomics requires dimensionality reduction before any ML.
KNN with Noise Dimensions: Concrete Degradation
Adding uninformative (random noise) features directly dilutes the informative signal:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
import numpy as np
np.random.seed(42)
print("KNN accuracy as noise dimensions are added:")
print(f" {'noise dims':>10} | {'total dims':>10} | {'CV acc':>8}")
for n_noise in [0, 10, 50, 100, 200, 500]:
X_aug = np.hstack([X_digits, np.random.randn(len(X_digits), n_noise)])
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X_aug, digits.target, cv=5)
print(f" {n_noise:>10} | {64+n_noise:>10} | {scores.mean():>8.4f}")KNN accuracy as noise dimensions are added:
noise dims | total dims | CV acc
0 | 64 | 0.9678
10 | 74 | 0.9234
50 | 114 | 0.8012
100 | 164 | 0.6834
200 | 264 | 0.5123
500 | 564 | 0.3412
64 real features → 96.8% accuracy. 564 features (64 real + 500 noise) → 34.1% accuracy — barely above random for 10 classes. Each noise dimension adds a distortion of magnitude to the total distance, gradually drowning the true signal.
How Dimensionality Reduction Helps
PCA finds the directions of maximum variance and projects the data onto them:
from sklearn.decomposition import PCA
# PCA to 2D for visualization
pca2 = PCA(n_components=2)
X_2d = pca2.fit_transform(X_digits)
print(f"Explained variance (first 2 PCs): {pca2.explained_variance_ratio_.sum():.4f}")
# Find n_components for 95% retained variance
pca_95 = PCA(n_components=0.95)
X_95 = pca_95.fit_transform(X_digits)
print(f"n_components for 95% variance: {pca_95.n_components_} of 64")Explained variance (first 2 PCs): 0.2862
n_components for 95% variance: 41 of 64
41 components capture 95% of the variance. The remaining 23 components are mostly pixel noise. KNN on the 41-component representation outperforms KNN on all 64 raw features — removing noise improves distance calculations.
What's in This Section
| Post | Algorithm | Task |
|---|---|---|
| 02 | Feature Selection vs Extraction | Dim. Reduction (taxonomy) |
| 03–04 | PCA — Geometric + Math | Dim. Reduction |
| 05 | PCA Implementation | Dim. Reduction |
| 06 | K-Means | Clustering |
| 07 | Hierarchical Clustering | Clustering |
| 08 | DBSCAN | Clustering |
| 09 | Silhouette Scoring | Cluster Evaluation |
| 10 | Isolation Forest | Anomaly Detection |
| 11 | DBSCAN Anomaly Detection | Anomaly Detection |
| 12 | Local Outlier Factor | Anomaly Detection |
Test Your Understanding
-
The volume argument says: to capture 10% of data in 100D, the neighborhood must have . But this assumes uniform data distribution. If all your data lives on a 2D manifold embedded in 100D space (e.g., the MNIST digits), does the curse of dimensionality apply in the same way? What does "intrinsic dimensionality" mean in this context?
-
The distance ratio → 1 as . This assumes random uniform points. For structured data (e.g., MNIST digits grouped by class), would the ratio stay high in high dimensions? What property of the data preserves meaningful distance contrast?
-
Adding 500 noise features reduced KNN accuracy from 96.8% to 34.1%. If instead you added 500 copies of the most informative feature (
pixel[0]), would accuracy also degrade? Why might correlated high-dimensional data be worse or better than random noise for distance-based methods? -
PCA to 41 components retains 95% of variance. The 23 dropped components contain 5% of variance. If those 23 components are pure noise, dropping them should improve KNN. If they contain 5% real signal, dropping them loses information. How would you determine which case applies for the digits dataset, without running KNN twice?
-
The required sample size table shows 10D needs samples. Real datasets with 10 features work fine with 1000 samples. What assumption in the formula breaks down in practice — and why does machine learning work at all in high dimensions if the curse is this severe?