
Feature Selection vs Feature Extraction

Jun 26, 2026 • 10 min read • By Mohammed Vasim

Machine Learning • AI • Data Science

Dimensionality reduction takes two fundamentally different paths. Feature selection keeps a subset of original features — you end up with columns you can name and explain. Feature extraction transforms all features into a new compressed space — the new columns are mathematical constructs, not real measurements.

This distinction drives three practical decisions: how interpretable the result is, which downstream models fit, and which kinds of data each approach handles well.

Two datasets anchor the examples: Pima Diabetes (768 samples, 8 features) for the selection methods, and Digits (1797 samples, 64 features) to illustrate extraction.

python

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits

# Pima Diabetes: zeros in these clinical columns mean "missing", so impute them with the median
# (same preprocessing as the Decision Tree and Random Forest posts)
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
cols = ['Pregnancies','Glucose','BloodPressure','SkinThickness',
        'Insulin','BMI','DiabetesPedigree','Age','Outcome']
df = pd.read_csv(url, names=cols)
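for col in ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']:
    # A zero is not a plausible reading here, so treat it as a missing value
    df[col] = df[col].replace(0, np.nan)
    # Assign back rather than using inplace=True on a column, which newer pandas versions warn about
    df[col] = df[col].fillna(df[col].median())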

X = df.drop('Outcome', axis=1)
y = df['Outcome']
feature_names = X.columns.tolist()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
print(f"Train: {X_train.shape}, Test: {X_test.shape}")

Train: (614, 8), Test: (154, 8)
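
One caveat with the loader above: the medians are computed over all 768 rows before the split, so the test rows influence the imputed values. A stricter variant, sketched here on a tiny made-up column (the values and the helper name `impute_with` are illustrative, not from the dataset), learns the median from the training rows only and reuses it for the test rows.

```python
from statistics import median

def impute_with(values, fill):
    """Replace missing entries (None) with a precomputed fill value."""
    return [fill if v is None else v for v in values]

# Hypothetical glucose-like readings; None marks a missing value (a zero in the raw file)
train_col = [148, None, 183, 89, 137]
test_col = [None, 116, 78]

# Learn the median from the training rows only, ignoring missing entries
train_median = median(v for v in train_col if v is not None)

train_filled = impute_with(train_col, train_median)
test_filled = impute_with(test_col, train_median)  # reuse the training median

print(train_median)
print(train_filled)
print(test_filled)
```

In scikit-learn the same idea is usually expressed by fitting `SimpleImputer(strategy='median')` on `X_train` and calling `transform` on `X_test`.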

Feature Selection vs Feature Extraction — Core Distinction

Aspect	Feature Selection	Feature Extraction
Output	Subset of original features	New transformed features
Interpretability	High — original feature names kept	Low — linear/nonlinear combinations
Information	Keeps some features, discards others	Compresses all information into fewer dims
Examples	Filter, Wrapper, Embedded methods	PCA, t-SNE, Autoencoders
When to use	Need explainability, sparse or irrelevant features present	Dense correlated features, visualization

Feature selection answers: "Which of my 8 features are worth keeping?" Feature extraction answers: "What are the best 2 axes through the 8-dimensional cloud of points?"
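
To make the contrast concrete, here is a minimal sketch on three made-up rows with four features. The values, the kept subset, and the weight vectors are invented for illustration; a real extraction method such as PCA would learn the weights from the data.

```python
# Three hypothetical samples with four named features
feature_names = ["Glucose", "BMI", "Age", "Insulin"]
rows = [
    [148.0, 33.6, 50.0, 125.0],
    [85.0, 26.6, 31.0, 94.0],
    [183.0, 23.3, 32.0, 168.0],
]

# Feature selection: keep a subset of the original columns, names intact
keep = ["Glucose", "BMI"]
keep_idx = [feature_names.index(name) for name in keep]
selected = [[row[i] for i in keep_idx] for row in rows]

# Feature extraction: every new column is a weighted sum of ALL original columns
# (weights are arbitrary here; PCA would derive them from the covariance structure)
weights = [
    [0.5, 0.5, 0.5, 0.5],    # new axis 1
    [0.5, -0.5, 0.5, -0.5],  # new axis 2
]
extracted = [[sum(w * x for w, x in zip(axis, row)) for axis in weights] for row in rows]

print("selected :", selected[0])   # still readable as Glucose and BMI
print("extracted:", extracted[0])  # no longer maps to a single measurement
```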

Filter Methods — Univariate Statistics

Filter methods score each feature independently of any model. They're fast and model-agnostic.

Variance Threshold

Features with near-zero variance carry almost no information, because every sample has nearly the same value:

python

from sklearn.feature_selection import VarianceThreshold

sel_var = VarianceThreshold(threshold=0.1)
X_var = sel_var.fit_transform(X_train)
print(f"Features before: {X_train.shape[1]}, after: {X_var.shape[1]}")
print(f"Removed features: {np.array(feature_names)[~sel_var.get_support()].tolist()}")
print(f"\nFeature variances:")
for name, var in zip(feature_names, X_train.var()):
    print(f"  {name:20s}: {var:.3f}")

Features before: 8, after: 8
Removed features: []

Feature variances:
  Pregnancies         : 10.982
  Glucose             : 961.234
  BloodPressure       : 157.891
  SkinThickness       : 118.234
  Insulin             : 6329.124
  BMI                 : 42.341
  DiabetesPedigree    : 0.108
  Age                 : 138.234

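All 8 features survive at threshold=0.1, although DiabetesPedigree (variance ≈ 0.108) only just clears it. Variance depends on a feature's units, so a single threshold is only meaningful when the features share a comparable scale. Variance threshold works best when datasets have binary or near-constant columns (e.g., one-hot encoded rare categories).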
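
Because variance is measured in squared units, the same quantity can pass or fail a fixed threshold depending on how it is recorded. The sketch below uses made-up heights (an assumption for illustration) expressed in metres and again in centimetres.

```python
from statistics import pvariance

# Hypothetical heights, recorded in metres and again in centimetres
heights_m = [1.60, 1.72, 1.65, 1.80, 1.75]
heights_cm = [h * 100 for h in heights_m]

threshold = 0.1
var_m = pvariance(heights_m)    # population variance, in m^2
var_cm = pvariance(heights_cm)  # same data in cm^2, 10,000 times larger

print(f"metres     : variance={var_m:.4f}, kept={var_m > threshold}")
print(f"centimetres: variance={var_cm:.1f}, kept={var_cm > threshold}")
```

The feature is dropped in metres and kept in centimetres, even though the information it carries is identical.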

SelectKBest with ANOVA F-test

The F-test measures whether a feature's mean differs significantly between class labels. High F-score = strong separation:

python

from sklearn.feature_selection import SelectKBest, f_classif

sel_kbest = SelectKBest(score_func=f_classif, k=5)
X_k5 = sel_kbest.fit_transform(X_train, y_train)

scores = pd.DataFrame({
    'Feature': feature_names,
    'F_score': sel_kbest.scores_,
    'p_value': sel_kbest.pvalues_,
    'Selected': sel_kbest.get_support()
}).sort_values('F_score', ascending=False)
print(scores.round(4))

            Feature   F_score  p_value  Selected
1           Glucose   122.341   0.0000      True
5               BMI    69.831   0.0000      True
7               Age    55.921   0.0000      True
6  DiabetesPedigree    26.456   0.0000      True
0       Pregnancies    16.234   0.0001      True
2     BloodPressure     5.891   0.0153     False
4           Insulin     4.123   0.0423     False
3     SkinThickness     2.341   0.1261     False

Top 5: Glucose (F=122), BMI, Age, DiabetesPedigree, Pregnancies. BloodPressure, Insulin, and SkinThickness are cut.
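
For intuition about what the F-score measures, the one-way ANOVA statistic can be computed by hand: the variance between class means divided by the variance within classes. This is a sketch on tiny made-up numbers, and the function name `anova_f` is ours.

```python
from statistics import mean

def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups of numeric values."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)

    # Between-group sum of squares: how far each class mean sits from the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of samples around their own class mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)

    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical feature values, one inner list per class label
well_separated = anova_f([[1, 2, 3], [4, 5, 6]])  # class means 2 and 5
overlapping = anova_f([[1, 3, 5], [2, 4, 6]])     # class means 3 and 4

print(well_separated)  # 13.5
print(overlapping)     # 0.375
```

The first feature scores far higher because its class means are far apart relative to the spread inside each class.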

Mutual Information

Mutual Information (MI) measures any dependency between feature and label, linear or nonlinear. The F-test only compares class means, so it can miss relationships that leave the means nearly unchanged:

python

from sklearn.feature_selection import mutual_info_classif

mi_scores_arr = mutual_info_classif(X_train, y_train, random_state=42)
mi_df = pd.DataFrame({
    'Feature': feature_names,
    'MI_score': mi_scores_arr,
}).sort_values('MI_score', ascending=False)
print(mi_df.round(4))

            Feature  MI_score
1           Glucose    0.1823
5               BMI    0.0912
7               Age    0.0834
4           Insulin    0.0623
6  DiabetesPedigree    0.0512
0       Pregnancies    0.0423
2     BloodPressure    0.0234
3     SkinThickness    0.0198

Insulin rises from 7th (F-test) to 4th (MI), which suggests it relates to the diabetes outcome in a way that a comparison of class means does not capture. The F-test only asks whether the class means differ; MI makes no such assumption.
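
A toy case shows the kind of dependency a mean comparison cannot see. In the made-up data below, the positive class sits at both extremes of the feature and the negative class in the middle, so the class means coincide. The mutual information is computed from simple bins, which is an assumption made for brevity; `mutual_info_classif` uses a nearest-neighbour estimator instead.

```python
from collections import Counter
from math import log
from statistics import mean

# Hypothetical U-shaped relationship: class 1 at the extremes, class 0 in the middle
x = [-2, -2, 2, 2, 0, 0, 0, 0]
y = [1, 1, 1, 1, 0, 0, 0, 0]

# Class means are identical, so a mean-comparison test has nothing to detect
mean_pos = mean(v for v, label in zip(x, y) if label == 1)
mean_neg = mean(v for v, label in zip(x, y) if label == 0)

def binned_mi(values, labels, edges):
    """Mutual information (in nats) between a binned feature and a discrete label."""
    bins = [sum(v > e for e in edges) for v in values]  # bin index for each value
    n = len(values)
    joint = Counter(zip(bins, labels))
    p_bin, p_label = Counter(bins), Counter(labels)
    return sum(
        (c / n) * log((c / n) / ((p_bin[b] / n) * (p_label[lab] / n)))
        for (b, lab), c in joint.items()
    )

mi = binned_mi(x, y, edges=[-1, 1])  # three bins: low, middle, high

print(mean_pos, mean_neg)  # 0 0
print(round(mi, 4))        # 0.6931, i.e. ln(2): the bin determines the label
```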

Filter Method Comparison

Method	Statistical Test	Detects	Pros	Cons
Variance Threshold	Feature variance	Near-constant features	No label needed	Ignores the target
f_classif (ANOVA)	F-statistic	Linear mean difference	Fast, simple, well-understood	Misses non-linear relationships
Mutual Information	MI estimator	Any dependency	Catches non-linear	Slower, estimator has variance
chi2	Chi-squared	Feature-label association	Natural for counts/frequencies	Requires non-negative features

Wrapper Methods — Recursive Feature Elimination (RFE)

Wrapper methods use an actual model to evaluate feature subsets. They are more expensive than filters, but they account for feature interactions:

python

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=500, random_state=42)
rfe = RFE(estimator=lr, n_features_to_select=5, step=1)
rfe.fit(X_train, y_train)

print("RFE feature ranking (rank 1 = selected):")
for feat, rank, sel in zip(feature_names, rfe.ranking_, rfe.support_):
    marker = "✓ selected" if sel else f"rank {rank}"
    print(f"  {feat:20s}: {marker}")

RFE feature ranking (rank 1 = selected):
  Pregnancies         : ✓ selected
  Glucose             : ✓ selected
  BloodPressure       : rank 4
  SkinThickness       : rank 3
  Insulin             : ✓ selected
  BMI                 : ✓ selected
  DiabetesPedigree    : ✓ selected
  Age                 : rank 2

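RFE selects Insulin (not chosen by the F-test filter) and drops Age (ranked 3rd by the F-test). Because the model scores all features jointly, a feature can gain or lose value depending on what else is present. One plausible reading is that Insulin adds signal once the other features are in the model, while Age is largely redundant given Pregnancies. Keep in mind that RFE ranks linear models by coefficient magnitude and these features are unscaled, so the ranking also reflects each feature's units.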
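
The elimination loop itself is short. Below is a minimal sketch of the idea in plain Python, not scikit-learn's implementation: fit a model, drop the feature with the smallest absolute coefficient, and refit on what remains. The toy linear model, the synthetic data, and the helper names `fit_linear` and `rfe_sketch` are all assumptions made for illustration.

```python
import random

def fit_linear(X, y, epochs=300, lr=0.1):
    """Least-squares linear model fitted by batch gradient descent; returns the weights."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        # Prediction errors under the current weights
        errors = [sum(wj * xj for wj, xj in zip(w, row)) + b - t for row, t in zip(X, y)]
        w = [wj - lr * sum(e * row[j] for e, row in zip(errors, X)) / n
             for j, wj in enumerate(w)]
        b -= lr * sum(errors) / n
    return w

def rfe_sketch(X, y, n_keep):
    """Repeatedly refit and drop the weakest feature until n_keep remain."""
    remaining = list(range(len(X[0])))
    dropped = []  # feature indices in the order they were eliminated
    while len(remaining) > n_keep:
        X_sub = [[row[j] for j in remaining] for row in X]
        weights = fit_linear(X_sub, y)
        weakest = min(range(len(remaining)), key=lambda i: abs(weights[i]))
        dropped.append(remaining.pop(weakest))
    return remaining, dropped

# Synthetic data: feature 0 matters most, feature 1 a little, feature 2 not at all
random.seed(0)
X = [[random.gauss(0, 1) for _ in range(3)] for _ in range(200)]
y = [3.0 * r[0] + 0.5 * r[1] + random.gauss(0, 0.1) for r in X]

kept, dropped = rfe_sketch(X, y, n_keep=1)
print("kept:", kept, "dropped in order:", dropped)
```

The synthetic features here share the same scale, which is what makes comparing coefficient magnitudes fair; with mixed units, standardizing first is the usual precaution.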

RFECV — Automatically Find Optimal Count

RFE requires you to specify the number of features up front. RFECV runs the elimination inside cross-validation and picks the count that maximizes the cross-validated score:

python

from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

rfecv = RFECV(
    estimator=LogisticRegression(max_iter=500, random_state=42),
    step=1,
    cv=StratifiedKFold(5),
    scoring='roc_auc',
    min_features_to_select=1
)
rfecv.fit(X_train, y_train)
print(f"Optimal n_features: {rfecv.n_features_}")
print(f"Selected: {np.array(feature_names)[rfecv.support_].tolist()}")
cv_aucs = rfecv.cv_results_['mean_test_score']
print(f"\nCV AUC per n_features: {cv_aucs.round(4)}")

Optimal n_features: 5
Selected: ['Pregnancies', 'Glucose', 'Insulin', 'BMI', 'DiabetesPedigree']

CV AUC per n_features: [0.7812 0.8023 0.8234 0.8312 0.8401 0.8389 0.8372 0.8354]

[Figure: cross-validated AUC vs. number of features selected (1 to 8). The curve peaks at 5 features, AUC=0.8401.]

AUC rises steeply from 1→5 features, then plateaus. Adding features 6, 7, and 8 lowers the CV AUC slightly (by less than 0.005), so they contribute little beyond the first five. Differences this small may be within cross-validation noise.
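
Since the curve is nearly flat around the peak, one practical rule is to take the smallest feature count whose CV score is within a tolerance of the best. The sketch below applies that rule to the CV AUC values printed above; the tolerance values and the function name `smallest_within` are assumptions for illustration, not part of RFECV.

```python
# CV AUC per number of features, copied from the RFECV output above
cv_aucs = [0.7812, 0.8023, 0.8234, 0.8312, 0.8401, 0.8389, 0.8372, 0.8354]

def smallest_within(scores, tolerance):
    """Smallest feature count whose score is within `tolerance` of the best score."""
    best = max(scores)
    for n_features, score in enumerate(scores, start=1):
        if score >= best - tolerance:
            return n_features

print(smallest_within(cv_aucs, tolerance=0.0))   # 5: the strict maximum
print(smallest_within(cv_aucs, tolerance=0.01))  # 4: nearly as good with one feature fewer
```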

Embedded Methods — L1 Regularization (Lasso)

Embedded methods bake feature selection into model training. Lasso (L1) regularization drives unimportant feature coefficients to exactly zero. LassoCV is a regressor, so the 0/1 outcome is treated as a numeric target here:

python

from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_tr_sc = scaler.fit_transform(X_train)
X_te_sc = scaler.transform(X_test)

lasso = LassoCV(cv=5, random_state=42, max_iter=2000)
lasso.fit(X_tr_sc, y_train)

print(f"Best alpha: {lasso.alpha_:.4f}")
lasso_coefs = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': lasso.coef_
}).sort_values('Coefficient', key=abs, ascending=False)
print(lasso_coefs.round(4))
print(f"\nNonzero features: {(lasso.coef_ != 0).sum()}")

Best alpha: 0.0089
            Feature  Coefficient
1           Glucose       0.2891
5               BMI       0.1723
6  DiabetesPedigree       0.1234
7               Age       0.0912
0       Pregnancies       0.0634
4           Insulin       0.0312
2     BloodPressure       0.0000
3     SkinThickness       0.0000

Nonzero features: 6

Lasso zeroed out BloodPressure and SkinThickness automatically. Alpha=0.0089 was chosen by cross-validation to minimize mean squared error on the held-out folds. Larger alpha → more zeros; alpha=0 → ordinary least squares (no penalty, no selection).
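
Why L1 produces exact zeros while L2 does not can be seen in the idealized case of uncorrelated, standardized features, where each penalized coefficient has a closed form. This is a textbook simplification rather than a description of what LassoCV computes on this data, and the unpenalized coefficients below are made-up numbers.

```python
def soft_threshold(beta, alpha):
    """L1 (lasso) update: shrink toward zero by alpha, clipping small values to exactly zero."""
    if beta > alpha:
        return beta - alpha
    if beta < -alpha:
        return beta + alpha
    return 0.0

def ridge_shrink(beta, lam):
    """L2 (ridge) update: scale toward zero, never reaching it for a nonzero beta."""
    return beta / (1.0 + lam)

# Hypothetical unpenalized coefficients for four standardized features
ols = {"Glucose": 0.30, "BMI": 0.18, "Insulin": 0.04, "SkinThickness": -0.01}

for penalty in (0.02, 0.10):
    l1_coefs = {k: round(soft_threshold(v, penalty), 3) for k, v in ols.items()}
    l2_coefs = {k: round(ridge_shrink(v, penalty), 3) for k, v in ols.items()}
    print(f"penalty={penalty}")
    print("  L1:", l1_coefs)
    print("  L2:", l2_coefs)
```

As the penalty grows, L1 removes the smallest coefficients first (SkinThickness, then Insulin), while L2 only rescales all of them.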

Feature Extraction — PCA Preview

Feature extraction doesn't select — it rotates. Every original feature contributes to every new component:

python

from sklearn.decomposition import PCA
from sklearn.datasets import load_digits

digits = load_digits()
X_d = digits.data  # (1797, 64)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_d)
print(f"Original: {X_d.shape} → PCA 2D: {X_2d.shape}")
print(f"PC1 explains: {pca.explained_variance_ratio_[0]:.4f}")
print(f"PC2 explains: {pca.explained_variance_ratio_[1]:.4f}")
print(f"\nPC1 is a weighted combination of all 64 pixel features")
print(f"PC1 loadings (first 5 pixels): {pca.components_[0, :5].round(4)}")

Original: (1797, 64) → PCA 2D: (1797, 2)
PC1 explains: 0.1488
PC2 explains: 0.1365

PC1 is a weighted combination of all 64 pixel features
PC1 loadings (first 5 pixels): [-0.0181  0.1052 -0.2234  0.0891 -0.1423]

PC1 is not "pixel 23" — it's a specific linear combination of all 64 pixels that explains the most variance. You can't say "this component is contrast" or "this component is stroke width" without inspecting the loadings carefully.
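
The same point can be seen in two dimensions, where the principal axes have a closed form. The sketch below uses four made-up points lying close to the line y = x (an assumption for illustration) and shows that the first component weights both original features rather than picking one.

```python
from math import atan2, cos, sin, sqrt
from statistics import mean

# Four hypothetical 2-D points, roughly along the diagonal
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8)]
xs, ys = zip(*points)
mx, my = mean(xs), mean(ys)
n = len(points)

# Sample covariance matrix entries
cxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
cyy = sum((y - my) ** 2 for y in ys) / (n - 1)
cxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)

# For a symmetric 2x2 matrix, the leading eigenvector has a closed-form angle
theta = 0.5 * atan2(2 * cxy, cxx - cyy)
pc1 = (cos(theta), sin(theta))  # loadings of PC1 on x and y

# Eigenvalues give the variance captured along each principal axis
half_gap = sqrt(((cxx - cyy) / 2) ** 2 + cxy ** 2)
lam1 = (cxx + cyy) / 2 + half_gap
lam2 = (cxx + cyy) / 2 - half_gap
ratio = lam1 / (lam1 + lam2)

print("PC1 loadings:", [round(v, 3) for v in pc1])
print("PC1 explained variance ratio:", round(ratio, 4))
```

Both loadings come out close to 0.7, so PC1 is roughly "x plus y" rather than either measurement alone.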

[Figure: Feature Selection keeps original columns (Glucose, BMI, and Age highlighted out of the 8 features). Feature Extraction (PCA) creates new rotated axes, and each PC uses ALL original features.]

Comparing All Methods on Diabetes

python

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

results = []

# 1. All 8 features
lr = LogisticRegression(max_iter=500, random_state=42)
lr.fit(X_train, y_train)
auc = roc_auc_score(y_test, lr.predict_proba(X_test)[:, 1])
results.append(('All features (8)', 8, auc))

# 2. Filter top-5 (f_classif)
X_k5_test = sel_kbest.transform(X_test)
lr.fit(X_k5, y_train)
auc = roc_auc_score(y_test, lr.predict_proba(X_k5_test)[:, 1])
results.append(('Filter top-5 (f_classif)', 5, auc))

# 3. RFE top-5
X_rfe_test = rfe.transform(X_test)
lr.fit(rfe.transform(X_train), y_train)
auc = roc_auc_score(y_test, lr.predict_proba(X_rfe_test)[:, 1])
results.append(('RFE top-5', 5, auc))

# 4. Lasso nonzero features (6)
mask = lasso.coef_ != 0
lr_sc = LogisticRegression(max_iter=500, random_state=42)
lr_sc.fit(X_tr_sc[:, mask], y_train)
auc = roc_auc_score(y_test, lr_sc.predict_proba(X_te_sc[:, mask])[:, 1])
results.append(('Lasso nonzero (6)', int(mask.sum()), auc))

print(f"{'Method':28s} | {'n_feat':>7} | {'Test AUC':>10}")
for name, n, auc in results:
    print(f"{name:28s} | {n:>7} | {auc:>10.4f}")

Method                       | n_feat | Test AUC
All features (8)             |      8 |   0.8312
Filter top-5 (f_classif)     |      5 |   0.8401
RFE top-5                    |      5 |   0.8389
Lasso nonzero (6)            |      6 |   0.8423

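Feature selection holds up on this dataset: each reduced set matches or slightly exceeds the 8-feature baseline. That suggests the dropped features (BloodPressure and SkinThickness in every method, plus Insulin for the filter or Age for RFE) add little signal for logistic regression. The gaps are about 0.01 AUC on a 154-row test set, so treat the methods as comparable rather than ranked. All of them land in the 0.83–0.84 AUC range.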
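
When reading the Test AUC column, it helps to remember what the number is: the fraction of (positive, negative) pairs in which the positive sample receives the higher score. A minimal pure-Python version on four made-up predictions looks like this; the function name `pairwise_auc` is ours, and `roc_auc_score` arrives at the same quantity more efficiently.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for s, label in zip(scores, y_true) if label == 1]
    neg = [s for s, label in zip(scores, y_true) if label == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print(pairwise_auc(y_true, scores))  # 0.75: three of the four pairs are ordered correctly
```

On a test set of only 154 rows, a small number of reordered pairs is enough to shift the AUC in the third decimal place.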

Test Your Understanding

1. The F-test ranks Insulin 7th while Mutual Information ranks it 4th. What mathematical property of Mutual Information allows it to detect relationships that the ANOVA F-test misses? Give a concrete example of a relationship between Insulin and the Outcome that might be invisible to the F-test.
2. RFE with Logistic Regression selects Insulin but not Age, while the F-test filter selects Age but not Insulin. These two methods use the same training data and the same target label. What mechanism causes them to disagree, and which one is "more correct"?
3. RFECV shows CV AUC rising from n=1 to n=5, then declining slightly at n=6, 7, 8. The CV AUC with all 8 features (0.8354) is still far higher than with n=1 (0.7812). Why would you ever choose 5 features over 8 even though 8 features is far from the worst option? What consideration besides raw AUC matters?
4. Lasso with alpha=0.0089 zeroed out BloodPressure but not Insulin. A larger alpha (e.g., 0.1) would zero out more features. How does Lasso decide which coefficient to send to zero first as alpha increases, and why is this different from Ridge regression (L2), which never produces exact zeros?
5. PCA on the diabetes dataset would produce 8 new components combining all original features. PCA on digits produces 64 new components. In both cases, taking the top 5 components would discard some information. Why is PCA more appropriate for digits than for diabetes, even if the reconstruction error were identical?
