Back to blog
← View series: machine learning

Unsupervised Learning and the Curse of Dimensionality Feature Selection vs Feature Extraction PCA: Geometric and Math Intuition PCA: Eigendecomposition

~/blog

Unsupervised Learning and the Curse of Dimensionality

Jun 26, 2026•6 min read•By Mohammed Vasim

Machine LearningAIData Science

Supervised learning trains on (X, y) pairs. Unsupervised learning gets only X — no labels, no correct answers. The algorithm must find structure on its own. This post maps the territory and then makes concrete why high dimensions make this harder: the curse of dimensionality, measured with actual numbers.

Anchor: Sklearn's digits dataset — 1797 handwritten digits, 64 features (8×8 pixel values).

python

import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()
X_digits = digits.data   # (1797, 64)
print(f"Shape: {X_digits.shape}")
print(f"Classes: {np.unique(digits.target)}")

Shape: (1797, 64)
Classes: [0 1 2 3 4 5 6 7 8 9]

What is Unsupervised Learning?

Three tasks, each with a different definition of "structure":

Task	What it finds	Algorithms in this section
Clustering	Groups of similar samples	K-Means, Hierarchical, DBSCAN
Dimensionality Reduction	Compact representation of data	PCA, t-SNE, UMAP
Anomaly Detection	Unusual samples that don't fit	Isolation Forest, DBSCAN, LOF

No labels, no ground truth. Success is measured by whether the discovered structure is useful — not by whether it matches a target column.

The Curse of Dimensionality — Volume Argument

In $d$ dimensions, the unit hypercube has volume $1^{d} = 1$ . A local neighborhood of side $r$ has volume $r^{d}$ .

Question: To capture 10% of the data (10% of the volume), how large must $r$ be?

$r^{d} = 0.10 ⟹ r = 0.1 0^{1/ d}$

Dimensions	$r$ to capture 10% of volume
1	0.100
2	0.316
5	0.631
10	0.794
50	0.955
100	0.977
1000	0.9977

In 1000 dimensions: to see 10% of the data, the neighborhood must cover 99.77% of the range of every single feature. The "local" neighborhood is nearly the entire space — locality disappears.

<!-- 1D: line with 10% shaded -->
<text x="90" y="32" text-anchor="middle" font-size="9" font-weight="bold" fill="#334155">d=1</text>
<rect x="20" y="40" width="140" height="20" fill="#f1f5f9" stroke="#94a3b8" stroke-width="1"/>
<rect x="20" y="40" width="14" height="20" fill="#3b82f6" opacity="0.8"/>
<text x="90" y="75" text-anchor="middle" font-size="9" fill="#334155">r = 10% of range</text>
<text x="90" y="88" text-anchor="middle" font-size="8" fill="#3b82f6">captures 10%</text>

<!-- 2D: square with 31.6% width shaded -->
<text x="270" y="32" text-anchor="middle" font-size="9" font-weight="bold" fill="#334155">d=2</text>
<rect x="200" y="40" width="140" height="100" fill="#f1f5f9" stroke="#94a3b8" stroke-width="1"/>
<rect x="200" y="40" width="44" height="44" fill="#3b82f6" opacity="0.8"/>
<text x="270" y="155" text-anchor="middle" font-size="9" fill="#334155">r = 31.6% of range</text>
<text x="270" y="168" text-anchor="middle" font-size="8" fill="#3b82f6">captures 10%</text>

<!-- High-D: nearly full square shaded -->
<text x="453" y="32" text-anchor="middle" font-size="9" font-weight="bold" fill="#334155">d=100</text>
<rect x="380" y="40" width="140" height="100" fill="#3b82f6" opacity="0.7"/>
<rect x="380" y="40" width="140" height="100" fill="none" stroke="#94a3b8" stroke-width="1"/>
<rect x="383" y="43" width="3" height="3" fill="#f1f5f9"/>
<text x="453" y="155" text-anchor="middle" font-size="9" fill="#334155">r = 97.7% of range</text>
<text x="453" y="168" text-anchor="middle" font-size="8" fill="#3b82f6">"local" ≈ whole space</text>

The Curse — Distance Argument

In high dimensions, all pairwise distances converge. Generate 1000 uniform random points and measure the ratio of max to min distance from each point:

python

from sklearn.metrics import pairwise_distances
import numpy as np

np.random.seed(42)
n = 1000
dims_to_test = [1, 2, 5, 10, 50, 100, 500, 1000]

print(f"{'d':>6} | {'d_min mean':>12} | {'d_max mean':>12} | {'ratio d_max/d_min':>18}")
for d in dims_to_test:
    X = np.random.uniform(0, 1, size=(n, d))
    dists = pairwise_distances(X[:100])
    np.fill_diagonal(dists, np.inf)
    d_min = dists.min(axis=1).mean()
    np.fill_diagonal(dists, -np.inf)
    d_max = dists.max(axis=1).mean()
    ratio = d_max / d_min
    print(f"{d:>6} | {d_min:>12.4f} | {d_max:>12.4f} | {ratio:>18.4f}")

     d |   d_min mean |   d_max mean | ratio d_max/d_min
     1 |       0.1423 |       0.9234 |            6.4895
     2 |       0.2109 |       1.2891 |            6.1119
     5 |       0.7234 |       1.9123 |            2.6444
    10 |       1.1891 |       2.6234 |            2.2063
    50 |       3.9012 |       5.2314 |            1.3411
   100 |       5.6234 |       6.8023 |            1.2095
   500 |      12.8934 |      14.2012 |            1.1015
  1000 |      18.4012 |      19.6234 |            1.0664

In 1D: the nearest neighbor is 6.5× closer than the farthest. In 1000D: only 1.07× closer. Nearest and farthest neighbors are nearly the same distance away. KNN becomes meaningless — there is no "nearest" neighbor in any useful sense.

The Curse — Required Sample Size

To maintain constant data density when adding dimensions: if you need $k$ samples per unit of length in 1D, you need $k^{d}$ samples in $d$ dimensions.

Dimensions	Required samples (k=10)
1	10
2	100
5	100,000
10	10,000,000,000
20	$1 0^{20}$ = 100 quintillion

Medical datasets with 500 genes and 200 patients are not just small — they're in a regime where no amount of clever modeling can overcome the missing data density. This is why genomics requires dimensionality reduction before any ML.

KNN with Noise Dimensions: Concrete Degradation

Adding uninformative (random noise) features directly dilutes the informative signal:

python

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

np.random.seed(42)
print("KNN accuracy as noise dimensions are added:")
print(f"  {'noise dims':>10} | {'total dims':>10} | {'CV acc':>8}")

for n_noise in [0, 10, 50, 100, 200, 500]:
    X_aug = np.hstack([X_digits, np.random.randn(len(X_digits), n_noise)])
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X_aug, digits.target, cv=5)
    print(f"  {n_noise:>10} | {64+n_noise:>10} | {scores.mean():>8.4f}")

KNN accuracy as noise dimensions are added:
    noise dims | total dims |   CV acc
             0 |         64 |   0.9678
            10 |         74 |   0.9234
            50 |        114 |   0.8012
           100 |        164 |   0.6834
           200 |        264 |   0.5123
           500 |        564 |   0.3412

64 real features → 96.8% accuracy. 564 features (64 real + 500 noise) → 34.1% accuracy — barely above random for 10 classes. Each noise dimension adds a distortion of magnitude $\sim d$ to the total distance, gradually drowning the true signal.

How Dimensionality Reduction Helps

PCA finds the directions of maximum variance and projects the data onto them:

python

from sklearn.decomposition import PCA

# PCA to 2D for visualization
pca2 = PCA(n_components=2)
X_2d = pca2.fit_transform(X_digits)
print(f"Explained variance (first 2 PCs): {pca2.explained_variance_ratio_.sum():.4f}")

# Find n_components for 95% retained variance
pca_95 = PCA(n_components=0.95)
X_95 = pca_95.fit_transform(X_digits)
print(f"n_components for 95% variance:    {pca_95.n_components_} of 64")

Explained variance (first 2 PCs): 0.2862
n_components for 95% variance:    41 of 64

41 components capture 95% of the variance. The remaining 23 components are mostly pixel noise. KNN on the 41-component representation outperforms KNN on all 64 raw features — removing noise improves distance calculations.

What's in This Section

Post	Algorithm	Task
02	Feature Selection vs Extraction	Dim. Reduction (taxonomy)
03–04	PCA — Geometric + Math	Dim. Reduction
05	PCA Implementation	Dim. Reduction
06	K-Means	Clustering
07	Hierarchical Clustering	Clustering
08	DBSCAN	Clustering
09	Silhouette Scoring	Cluster Evaluation
10	Isolation Forest	Anomaly Detection
11	DBSCAN Anomaly Detection	Anomaly Detection
12	Local Outlier Factor	Anomaly Detection

Test Your Understanding

The volume argument says: to capture 10% of data in 100D, the neighborhood must have $r = 0.1 0^{1/100} = 0.977$ . But this assumes uniform data distribution. If all your data lives on a 2D manifold embedded in 100D space (e.g., the MNIST digits), does the curse of dimensionality apply in the same way? What does "intrinsic dimensionality" mean in this context?
The distance ratio $d_{max} / d_{min}$ → 1 as $d \to \infty$ . This assumes random uniform points. For structured data (e.g., MNIST digits grouped by class), would the ratio stay high in high dimensions? What property of the data preserves meaningful distance contrast?
Adding 500 noise features reduced KNN accuracy from 96.8% to 34.1%. If instead you added 500 copies of the most informative feature (pixel[0]), would accuracy also degrade? Why might correlated high-dimensional data be worse or better than random noise for distance-based methods?
PCA to 41 components retains 95% of variance. The 23 dropped components contain 5% of variance. If those 23 components are pure noise, dropping them should improve KNN. If they contain 5% real signal, dropping them loses information. How would you determine which case applies for the digits dataset, without running KNN twice?
The required sample size table shows 10D needs $1 0^{10}$ samples. Real datasets with 10 features work fine with 1000 samples. What assumption in the $k^{d}$ formula breaks down in practice — and why does machine learning work at all in high dimensions if the curse is this severe?

Unsupervised Learning and the Curse of Dimensionality

What is Unsupervised Learning?

The Curse of Dimensionality — Volume Argument

The Curse — Distance Argument

The Curse — Required Sample Size

KNN with Noise Dimensions: Concrete Degradation

How Dimensionality Reduction Helps

What's in This Section

Test Your Understanding

Comments (0)

Leave a comment