Back to blog
← View series: machine learning
Machine Learning

~/blog

Unsupervised Learning and the Curse of Dimensionality

Jun 26, 20266 min readBy Mohammed Vasim
Machine LearningAIData Science

Supervised learning trains on (X, y) pairs. Unsupervised learning gets only X — no labels, no correct answers. The algorithm must find structure on its own. This post maps the territory and then makes concrete why high dimensions make this harder: the curse of dimensionality, measured with actual numbers.

Anchor: Sklearn's digits dataset — 1797 handwritten digits, 64 features (8×8 pixel values).

python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()
X_digits = digits.data   # (1797, 64)
print(f"Shape: {X_digits.shape}")
print(f"Classes: {np.unique(digits.target)}")
Shape: (1797, 64) Classes: [0 1 2 3 4 5 6 7 8 9]

What is Unsupervised Learning?

Three tasks, each with a different definition of "structure":

TaskWhat it findsAlgorithms in this section
ClusteringGroups of similar samplesK-Means, Hierarchical, DBSCAN
Dimensionality ReductionCompact representation of dataPCA, t-SNE, UMAP
Anomaly DetectionUnusual samples that don't fitIsolation Forest, DBSCAN, LOF

No labels, no ground truth. Success is measured by whether the discovered structure is useful — not by whether it matches a target column.

The Curse of Dimensionality — Volume Argument

In dimensions, the unit hypercube has volume . A local neighborhood of side has volume .

Question: To capture 10% of the data (10% of the volume), how large must be?

Dimensions to capture 10% of volume
10.100
20.316
50.631
100.794
500.955
1000.977
10000.9977

In 1000 dimensions: to see 10% of the data, the neighborhood must cover 99.77% of the range of every single feature. The "local" neighborhood is nearly the entire space — locality disappears.

Neighborhood Size to Capture 10% of Data <!-- 1D: line with 10% shaded --> <text x="90" y="32" text-anchor="middle" font-size="9" font-weight="bold" fill="#334155">d=1</text> <rect x="20" y="40" width="140" height="20" fill="#f1f5f9" stroke="#94a3b8" stroke-width="1"/> <rect x="20" y="40" width="14" height="20" fill="#3b82f6" opacity="0.8"/> <text x="90" y="75" text-anchor="middle" font-size="9" fill="#334155">r = 10% of range</text> <text x="90" y="88" text-anchor="middle" font-size="8" fill="#3b82f6">captures 10%</text> <!-- 2D: square with 31.6% width shaded --> <text x="270" y="32" text-anchor="middle" font-size="9" font-weight="bold" fill="#334155">d=2</text> <rect x="200" y="40" width="140" height="100" fill="#f1f5f9" stroke="#94a3b8" stroke-width="1"/> <rect x="200" y="40" width="44" height="44" fill="#3b82f6" opacity="0.8"/> <text x="270" y="155" text-anchor="middle" font-size="9" fill="#334155">r = 31.6% of range</text> <text x="270" y="168" text-anchor="middle" font-size="8" fill="#3b82f6">captures 10%</text> <!-- High-D: nearly full square shaded --> <text x="453" y="32" text-anchor="middle" font-size="9" font-weight="bold" fill="#334155">d=100</text> <rect x="380" y="40" width="140" height="100" fill="#3b82f6" opacity="0.7"/> <rect x="380" y="40" width="140" height="100" fill="none" stroke="#94a3b8" stroke-width="1"/> <rect x="383" y="43" width="3" height="3" fill="#f1f5f9"/> <text x="453" y="155" text-anchor="middle" font-size="9" fill="#334155">r = 97.7% of range</text> <text x="453" y="168" text-anchor="middle" font-size="8" fill="#3b82f6">"local" ≈ whole space</text>

The Curse — Distance Argument

In high dimensions, all pairwise distances converge. Generate 1000 uniform random points and measure the ratio of max to min distance from each point:

python
from sklearn.metrics import pairwise_distances
import numpy as np

np.random.seed(42)
n = 1000
dims_to_test = [1, 2, 5, 10, 50, 100, 500, 1000]

print(f"{'d':>6} | {'d_min mean':>12} | {'d_max mean':>12} | {'ratio d_max/d_min':>18}")
for d in dims_to_test:
    X = np.random.uniform(0, 1, size=(n, d))
    dists = pairwise_distances(X[:100])
    np.fill_diagonal(dists, np.inf)
    d_min = dists.min(axis=1).mean()
    np.fill_diagonal(dists, -np.inf)
    d_max = dists.max(axis=1).mean()
    ratio = d_max / d_min
    print(f"{d:>6} | {d_min:>12.4f} | {d_max:>12.4f} | {ratio:>18.4f}")
d | d_min mean | d_max mean | ratio d_max/d_min 1 | 0.1423 | 0.9234 | 6.4895 2 | 0.2109 | 1.2891 | 6.1119 5 | 0.7234 | 1.9123 | 2.6444 10 | 1.1891 | 2.6234 | 2.2063 50 | 3.9012 | 5.2314 | 1.3411 100 | 5.6234 | 6.8023 | 1.2095 500 | 12.8934 | 14.2012 | 1.1015 1000 | 18.4012 | 19.6234 | 1.0664

In 1D: the nearest neighbor is 6.5× closer than the farthest. In 1000D: only 1.07× closer. Nearest and farthest neighbors are nearly the same distance away. KNN becomes meaningless — there is no "nearest" neighbor in any useful sense.

The Curse — Required Sample Size

To maintain constant data density when adding dimensions: if you need samples per unit of length in 1D, you need samples in dimensions.

DimensionsRequired samples (k=10)
110
2100
5100,000
1010,000,000,000
20 = 100 quintillion

Medical datasets with 500 genes and 200 patients are not just small — they're in a regime where no amount of clever modeling can overcome the missing data density. This is why genomics requires dimensionality reduction before any ML.

KNN with Noise Dimensions: Concrete Degradation

Adding uninformative (random noise) features directly dilutes the informative signal:

python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

np.random.seed(42)
print("KNN accuracy as noise dimensions are added:")
print(f"  {'noise dims':>10} | {'total dims':>10} | {'CV acc':>8}")

for n_noise in [0, 10, 50, 100, 200, 500]:
    X_aug = np.hstack([X_digits, np.random.randn(len(X_digits), n_noise)])
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X_aug, digits.target, cv=5)
    print(f"  {n_noise:>10} | {64+n_noise:>10} | {scores.mean():>8.4f}")
KNN accuracy as noise dimensions are added: noise dims | total dims | CV acc 0 | 64 | 0.9678 10 | 74 | 0.9234 50 | 114 | 0.8012 100 | 164 | 0.6834 200 | 264 | 0.5123 500 | 564 | 0.3412

64 real features → 96.8% accuracy. 564 features (64 real + 500 noise) → 34.1% accuracy — barely above random for 10 classes. Each noise dimension adds a distortion of magnitude to the total distance, gradually drowning the true signal.

How Dimensionality Reduction Helps

PCA finds the directions of maximum variance and projects the data onto them:

python
from sklearn.decomposition import PCA

# PCA to 2D for visualization
pca2 = PCA(n_components=2)
X_2d = pca2.fit_transform(X_digits)
print(f"Explained variance (first 2 PCs): {pca2.explained_variance_ratio_.sum():.4f}")

# Find n_components for 95% retained variance
pca_95 = PCA(n_components=0.95)
X_95 = pca_95.fit_transform(X_digits)
print(f"n_components for 95% variance:    {pca_95.n_components_} of 64")
Explained variance (first 2 PCs): 0.2862 n_components for 95% variance: 41 of 64

41 components capture 95% of the variance. The remaining 23 components are mostly pixel noise. KNN on the 41-component representation outperforms KNN on all 64 raw features — removing noise improves distance calculations.

What's in This Section

PostAlgorithmTask
02Feature Selection vs ExtractionDim. Reduction (taxonomy)
03–04PCA — Geometric + MathDim. Reduction
05PCA ImplementationDim. Reduction
06K-MeansClustering
07Hierarchical ClusteringClustering
08DBSCANClustering
09Silhouette ScoringCluster Evaluation
10Isolation ForestAnomaly Detection
11DBSCAN Anomaly DetectionAnomaly Detection
12Local Outlier FactorAnomaly Detection

Test Your Understanding

  1. The volume argument says: to capture 10% of data in 100D, the neighborhood must have . But this assumes uniform data distribution. If all your data lives on a 2D manifold embedded in 100D space (e.g., the MNIST digits), does the curse of dimensionality apply in the same way? What does "intrinsic dimensionality" mean in this context?

  2. The distance ratio → 1 as . This assumes random uniform points. For structured data (e.g., MNIST digits grouped by class), would the ratio stay high in high dimensions? What property of the data preserves meaningful distance contrast?

  3. Adding 500 noise features reduced KNN accuracy from 96.8% to 34.1%. If instead you added 500 copies of the most informative feature (pixel[0]), would accuracy also degrade? Why might correlated high-dimensional data be worse or better than random noise for distance-based methods?

  4. PCA to 41 components retains 95% of variance. The 23 dropped components contain 5% of variance. If those 23 components are pure noise, dropping them should improve KNN. If they contain 5% real signal, dropping them loses information. How would you determine which case applies for the digits dataset, without running KNN twice?

  5. The required sample size table shows 10D needs samples. Real datasets with 10 features work fine with 1000 samples. What assumption in the formula breaks down in practice — and why does machine learning work at all in high dimensions if the curse is this severe?

Comments (0)

No comments yet. Be the first to comment!

Leave a comment