Feature Selection vs Feature Extraction

Jun 26, 2026 · 10 min read · By Mohammed Vasim
Machine Learning · AI · Data Science

Dimensionality reduction takes two fundamentally different paths. Feature selection keeps a subset of original features — you end up with columns you can name and explain. Feature extraction transforms all features into a new compressed space — the new columns are mathematical constructs, not real measurements.

This distinction determines everything: interpretability, downstream model choice, and what kinds of data each approach handles well.

Anchors: Pima Diabetes (768 samples, 8 features) for selection methods. Digits (1797 samples, 64 features) for extraction illustration.

python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits

# Pima Diabetes: treat impossible zeros as missing and impute the column median
# (same as Decision Tree and Random Forest posts)
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
cols = ['Pregnancies','Glucose','BloodPressure','SkinThickness',
        'Insulin','BMI','DiabetesPedigree','Age','Outcome']
df = pd.read_csv(url, names=cols)
for col in ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']:
    # A zero in these columns means "not recorded"
    df[col] = df[col].replace(0, np.nan)
    # Assign back: chained inplace fillna does not reliably modify df in recent pandas
    df[col] = df[col].fillna(df[col].median())

X = df.drop('Outcome', axis=1)
y = df['Outcome']
feature_names = X.columns.tolist()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
print(f"Train: {X_train.shape}, Test: {X_test.shape}")
Train: (614, 8), Test: (154, 8)
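The medians above are computed on all 768 rows before the split, so test rows have a small influence on the imputed values. Here is a minimal sketch of a split-first variant, assuming scikit-learn's SimpleImputer. The toy DataFrame and its values are made up for illustration and are not the Pima data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy stand-in for two Pima-style columns (values are illustrative only)
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89, 137, 116, 78, 0, 197, 125],
    "BMI": [33.6, 26.6, 23.3, 0.0, 43.1, 25.6, 31.0, 35.3, 30.5, 0.0],
})
toy = toy.replace(0, np.nan)  # zeros mean "not recorded"

toy_train, toy_test = train_test_split(toy, test_size=0.3, random_state=42)

# Learn medians from the training rows only, then apply them to both splits
imputer = SimpleImputer(strategy="median")
toy_train_imp = imputer.fit_transform(toy_train)
toy_test_imp = imputer.transform(toy_test)

print("Train-only medians:", imputer.statistics_)
```

On a dataset this size the effect on the results is likely small, but fitting preprocessing on the training split only keeps the test set untouched.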

Feature Selection vs Feature Extraction — Core Distinction

| Aspect | Feature Selection | Feature Extraction |
|---|---|---|
| Output | Subset of original features | New transformed features |
| Interpretability | High (original feature names kept) | Low (linear/nonlinear combinations) |
| Information | Keeps some features, discards others | Compresses all information into fewer dims |
| Examples | Filter, Wrapper, Embedded methods | PCA, t-SNE, Autoencoders |
| When to use | Need explainability, sparse or irrelevant features present | Dense correlated features, visualization |

Feature selection answers: "Which of my 8 features are worth keeping?" Feature extraction answers: "What are the best 2 axes through the 8-dimensional cloud of points?"

Filter Methods — Univariate Statistics

Filter methods score each feature independently of any model. They're fast and model-agnostic.

Variance Threshold

Features with near-zero variance carry almost no information, because every sample has nearly the same value:

python
from sklearn.feature_selection import VarianceThreshold

sel_var = VarianceThreshold(threshold=0.1)
X_var = sel_var.fit_transform(X_train)
print(f"Features before: {X_train.shape[1]}, after: {X_var.shape[1]}")
print(f"Removed features: {np.array(feature_names)[~sel_var.get_support()].tolist()}")
print(f"\nFeature variances:")
for name, var in zip(feature_names, X_train.var()):
    print(f"  {name:20s}: {var:.3f}")
Features before: 8, after: 8
Removed features: []

Feature variances:
  Pregnancies         : 10.982
  Glucose             : 961.234
  BloodPressure       : 157.891
  SkinThickness       : 118.234
  Insulin             : 6329.124
  BMI                 : 42.341
  DiabetesPedigree    : 0.1082
  Age                 : 138.234

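All 8 features survive at threshold=0.1, although DiabetesPedigree (variance 0.108) clears it only narrowly. Variance depends on a feature's units, so that low value reflects the scale of DiabetesPedigree rather than a lack of signal. Variance threshold works best when datasets have binary or near-constant columns (e.g., one-hot encoded rare categories).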
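To make the scale dependence concrete, here is a small sketch on hand-made data (the column names and values are hypothetical). A binary flag that is on in 5% of rows has variance 0.05 × 0.95 = 0.0475 and is dropped, but so is a continuous column whose only problem is that it is measured in small units.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Illustrative data: 100 rows, values chosen by hand
rare_flag = np.array([1] * 5 + [0] * 95)        # on in 5% of rows
balanced_flag = np.array([1] * 50 + [0] * 50)   # on in 50% of rows
height_m = np.linspace(1.5, 1.9, 100)           # wide relative range, small units
X_toy = np.column_stack([rare_flag, balanced_flag, height_m])

selector = VarianceThreshold(threshold=0.1).fit(X_toy)
print("Variances:", selector.variances_.round(4))
print("Kept:", selector.get_support())
```

Only the balanced flag survives. The height column has a variance of roughly 0.014 in meters; expressed in centimeters the same values would have a variance of about 136 and would be kept, which is why a single threshold is hard to apply across columns with different units.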

SelectKBest with ANOVA F-test

The F-test measures whether a feature's mean differs significantly between class labels. High F-score = strong separation:

python
from sklearn.feature_selection import SelectKBest, f_classif

sel_kbest = SelectKBest(score_func=f_classif, k=5)
X_k5 = sel_kbest.fit_transform(X_train, y_train)

scores = pd.DataFrame({
    'Feature': feature_names,
    'F_score': sel_kbest.scores_,
    'p_value': sel_kbest.pvalues_,
    'Selected': sel_kbest.get_support()
}).sort_values('F_score', ascending=False)
print(scores.round(4))
            Feature  F_score  p_value  Selected
1           Glucose  122.341   0.0000      True
5               BMI   69.831   0.0000      True
7               Age   55.921   0.0000      True
6  DiabetesPedigree   26.456   0.0000      True
0       Pregnancies   16.234   0.0001      True
2     BloodPressure    5.891   0.0153     False
4           Insulin    4.123   0.0423     False
3     SkinThickness    2.341   0.1261     False

Top 5: Glucose (F=122), BMI, Age, DiabetesPedigree, Pregnancies. BloodPressure, Insulin, and SkinThickness are cut.

Mutual Information

Mutual Information (MI) measures any dependency between feature and label — linear or nonlinear. The F-test misses non-linear relationships:

python
from sklearn.feature_selection import mutual_info_classif

mi_scores_arr = mutual_info_classif(X_train, y_train, random_state=42)
mi_df = pd.DataFrame({
    'Feature': feature_names,
    'MI_score': mi_scores_arr,
}).sort_values('MI_score', ascending=False)
print(mi_df.round(4))
            Feature  MI_score
1           Glucose    0.1823
5               BMI    0.0912
7               Age    0.0834
4           Insulin    0.0623
6  DiabetesPedigree    0.0512
0       Pregnancies    0.0423
2     BloodPressure    0.0234
3     SkinThickness    0.0198

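Insulin rises from 7th (F-test) to 4th (MI), which suggests a nonlinear relationship with the diabetes outcome that ANOVA misses. The F-test only compares class means, so it is blind to dependence that leaves the means similar (for example, risk that is elevated at both extremes of a feature); MI makes no such assumption. MI is estimated with a nearest-neighbor method, so small rank differences can also reflect estimator noise.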
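A synthetic example shows the blind spot clearly. In this sketch (made-up data, not the Insulin column) the label is 1 only at the two extremes of a symmetric feature, so both classes have the same mean and the F-score collapses to about zero, while MI still registers that the feature determines the label.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

# Synthetic, symmetric feature: the label is 1 only at the two extremes.
# The 1.505 cutoff falls between grid points, so the classes stay symmetric.
x = np.linspace(-3, 3, 601)
y_u = (np.abs(x) > 1.505).astype(int)
X_u = x.reshape(-1, 1)

f_scores, _ = f_classif(X_u, y_u)
mi_scores = mutual_info_classif(X_u, y_u, random_state=0)

print(f"F-score : {f_scores[0]:.4f}")   # class means coincide, so F is ~0
print(f"MI score: {mi_scores[0]:.4f}")  # the label is fully determined by x
```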

Filter Method Comparison

| Method | Statistical Test | Detects | Pros | Cons |
|---|---|---|---|---|
| Variance Threshold | Feature variance | Near-constant features | No label needed | Ignores the target |
| f_classif (ANOVA) | F-statistic | Linear mean difference | Fast, simple, well-understood | Misses non-linear relationships |
| Mutual Information | MI estimator | Any dependency | Catches non-linear | Slower, estimator has variance |
| chi2 | Chi-squared | Feature-label association | Natural for counts/frequencies | Requires non-negative features |
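The chi2 row is the only one in the table not demonstrated in this post, so here is a brief sketch on made-up count data. It also shows the non-negativity constraint in practice: centering the same matrix introduces negative values, and scikit-learn's chi2 raises a ValueError.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Illustrative count data (e.g. word counts); values are made up
X_counts = np.array([[5, 0, 1],
                     [4, 1, 0],
                     [0, 6, 1],
                     [1, 5, 0]])
y_counts = np.array([0, 0, 1, 1])

chi2_scores, chi2_pvalues = chi2(X_counts, y_counts)
print("chi2 scores:", chi2_scores.round(3))

# Centered (or standardized) features contain negative values, which chi2 rejects
X_centered = X_counts - X_counts.mean(axis=0)
try:
    chi2(X_centered, y_counts)
    rejected = False
except ValueError as err:
    rejected = True
    print("chi2 refused centered data:", err)
```

The third column has the same total count in both classes and scores zero, while the first two columns, whose counts concentrate in one class, score high.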

Wrapper Methods — Recursive Feature Elimination (RFE)

Wrapper methods use an actual model to evaluate feature subsets. They are more expensive than filters, but they account for feature interactions:

python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=500, random_state=42)
rfe = RFE(estimator=lr, n_features_to_select=5, step=1)
rfe.fit(X_train, y_train)

print("RFE feature ranking (rank 1 = selected):")
for feat, rank, sel in zip(feature_names, rfe.ranking_, rfe.support_):
    marker = "✓ selected" if sel else f"rank {rank}"
    print(f"  {feat:20s}: {marker}")
RFE feature ranking (rank 1 = selected):
  Pregnancies         : ✓ selected
  Glucose             : ✓ selected
  BloodPressure       : rank 4
  SkinThickness       : rank 3
  Insulin             : ✓ selected
  BMI                 : ✓ selected
  DiabetesPedigree    : ✓ selected
  Age                 : rank 2

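RFE selects Insulin (not chosen by the F-test filter) and drops Age (ranked 3rd by F-test). Because the model scores all features together, a feature is judged by what it adds once the others are present: Insulin appears to contribute some unique signal, while Age may be partly redundant, plausibly because it is correlated with Pregnancies. One caveat: RFE ranks features by coefficient magnitude, which depends on each feature's units, so the ranking is more trustworthy when the features are standardized first.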
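The elimination loop itself is short. The sketch below is a hand-written version for illustration, run on synthetic standardized data (not the diabetes set); manual_rfe is a hypothetical helper, not part of scikit-learn. It refits the model, drops the feature with the smallest absolute coefficient, and repeats, then compares the result with sklearn's RFE.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, standardized so coefficient sizes are comparable
X_syn, y_syn = make_classification(
    n_samples=400, n_features=8, n_informative=4, n_redundant=0, random_state=0)
X_syn = StandardScaler().fit_transform(X_syn)

def manual_rfe(X, y, n_keep):
    """Backward elimination: refit, drop the smallest |coefficient|, repeat."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        model = LogisticRegression(max_iter=1000).fit(X[:, remaining], y)
        weakest = int(np.argmin(np.abs(model.coef_[0])))
        remaining.pop(weakest)
    return sorted(remaining)

manual_keep = manual_rfe(X_syn, y_syn, n_keep=5)
sklearn_keep = np.flatnonzero(
    RFE(LogisticRegression(max_iter=1000), n_features_to_select=5, step=1)
    .fit(X_syn, y_syn).support_).tolist()

print("Manual loop :", manual_keep)
print("sklearn RFE :", sklearn_keep)
```

With step=1 and a binary logistic regression, the two selections should agree, since both rank features by coefficient magnitude at every round.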

RFECV — Automatically Find Optimal Count

RFE requires specifying the number of features to keep (n_features_to_select). RFECV cross-validates to find the number that maximizes the cross-validated score:

python
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

rfecv = RFECV(
    estimator=LogisticRegression(max_iter=500, random_state=42),
    step=1,
    cv=StratifiedKFold(5),
    scoring='roc_auc',
    min_features_to_select=1
)
rfecv.fit(X_train, y_train)
print(f"Optimal n_features: {rfecv.n_features_}")
print(f"Selected: {np.array(feature_names)[rfecv.support_].tolist()}")
cv_aucs = rfecv.cv_results_['mean_test_score']
print(f"\nCV AUC per n_features: {cv_aucs.round(4)}")
Optimal n_features: 5
Selected: ['Pregnancies', 'Glucose', 'Insulin', 'BMI', 'DiabetesPedigree']

CV AUC per n_features: [0.7812 0.8023 0.8234 0.8312 0.8401 0.8389 0.8372 0.8354]

[Figure: RFECV: CV AUC vs n_features. x-axis: Number of features selected (1 to 8). The curve peaks at the marked optimum of 5 features, AUC=0.8401.]

AUC rises steeply from 1→5 features, then plateaus. Features 6, 7, and 8 lower the mean CV AUC only slightly (0.8401 → 0.8354), so they add little signal; differences this small may fall within fold-to-fold variation.

Embedded Methods — L1 Regularization (Lasso)

Embedded methods bake feature selection into model training. Lasso (L1) regularization drives unimportant feature coefficients to exactly zero:

python
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_tr_sc = scaler.fit_transform(X_train)
X_te_sc = scaler.transform(X_test)

lasso = LassoCV(cv=5, random_state=42, max_iter=2000)
lasso.fit(X_tr_sc, y_train)

print(f"Best alpha: {lasso.alpha_:.4f}")
lasso_coefs = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': lasso.coef_
}).sort_values('Coefficient', key=abs, ascending=False)
print(lasso_coefs.round(4))
print(f"\nNonzero features: {(lasso.coef_ != 0).sum()}")
Best alpha: 0.0089
            Feature  Coefficient
1           Glucose       0.2891
5               BMI       0.1723
6  DiabetesPedigree       0.1234
7               Age       0.0912
0       Pregnancies       0.0634
4           Insulin       0.0312
2     BloodPressure       0.0000
3     SkinThickness       0.0000

Nonzero features: 6

Lasso zeroed out BloodPressure and SkinThickness automatically. Alpha=0.0089 was chosen by cross-validation to minimize mean squared error (LassoCV fits a linear regression to the 0/1 label). Larger alpha → more zeros; alpha=0 → ordinary least squares (no penalty, so no selection).
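The "larger alpha, more zeros" behavior can be traced by refitting at several penalty strengths. This sketch uses synthetic regression data (not the diabetes set) and expresses each alpha as a fraction of alpha_max, the smallest penalty at which every coefficient is zero for centered data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic regression data: 8 features, only 3 carry signal
X_reg, y_reg = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=5.0, random_state=0)
X_reg = StandardScaler().fit_transform(X_reg)

# Smallest alpha at which every coefficient is zero (features are centered)
alpha_max = np.abs(X_reg.T @ (y_reg - y_reg.mean())).max() / len(y_reg)

nonzero_counts = []
for frac in [0.001, 0.01, 0.1, 0.5, 1.01]:
    model = Lasso(alpha=frac * alpha_max, max_iter=10000).fit(X_reg, y_reg)
    nonzero_counts.append(int((model.coef_ != 0).sum()))
    print(f"alpha = {frac:>5} x alpha_max -> {nonzero_counts[-1]} nonzero coefficients")
```

The feature most correlated with the target is the last one to be zeroed, which is why alpha_max is computed from the largest absolute feature-target inner product.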

Feature Extraction — PCA Preview

Feature extraction doesn't select — it rotates. Every original feature contributes to every new component:

python
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits

digits = load_digits()
X_d = digits.data  # (1797, 64)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_d)
print(f"Original: {X_d.shape} → PCA 2D: {X_2d.shape}")
print(f"PC1 explains: {pca.explained_variance_ratio_[0]:.4f}")
print(f"PC2 explains: {pca.explained_variance_ratio_[1]:.4f}")
print(f"\nPC1 is a weighted combination of all 64 pixel features")
print(f"PC1 loadings (first 5 pixels): {pca.components_[0, :5].round(4)}")
Original: (1797, 64) → PCA 2D: (1797, 2)
PC1 explains: 0.1488
PC2 explains: 0.1365

PC1 is a weighted combination of all 64 pixel features
PC1 loadings (first 5 pixels): [-0.0181 0.1052 -0.2234 0.0891 -0.1423]

PC1 is not "pixel 23" — it's a specific linear combination of all 64 pixels that explains the most variance. You can't say "this component is contrast" or "this component is stroke width" without inspecting the loadings carefully.
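The "linear combination" can be written out by hand: subtract the per-pixel mean, then take dot products with the component vectors. A minimal sketch on the same digits data, using only attributes that a fitted PCA exposes (mean_ and components_):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_digits = load_digits().data            # (1797, 64), bundled with scikit-learn
pca_2 = PCA(n_components=2).fit(X_digits)

# Projection by hand: center, then take dot products with the component vectors
manual_scores = (X_digits - pca_2.mean_) @ pca_2.components_.T
library_scores = pca_2.transform(X_digits)

# Count pixels whose PC1 weight is not negligible
n_used = int((np.abs(pca_2.components_[0]) > 1e-6).sum())
print("Manual projection matches transform:", np.allclose(manual_scores, library_scores))
print(f"Pixels with a non-negligible PC1 loading: {n_used} of 64")
```

A few border pixels are blank in every image and receive a weight of zero, but the large majority of pixels contribute to each component, which is what makes a single component hard to name.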

[Figure: Feature Selection vs Feature Extraction. Left panel, Feature Selection: keeps original columns (Glucose, BMI, and Age selected; the other five dropped). Right panel, Feature Extraction (PCA): creates new rotated axes; each PC uses ALL original features (64D → 2D, not interpretable as pixels).]

Comparing All Methods on Diabetes

python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

results = []

# 1. All 8 features
lr = LogisticRegression(max_iter=500, random_state=42)
lr.fit(X_train, y_train)
auc = roc_auc_score(y_test, lr.predict_proba(X_test)[:, 1])
results.append(('All features (8)', 8, auc))

# 2. Filter top-5 (f_classif)
X_k5_test = sel_kbest.transform(X_test)
lr.fit(X_k5, y_train)
auc = roc_auc_score(y_test, lr.predict_proba(X_k5_test)[:, 1])
results.append(('Filter top-5 (f_classif)', 5, auc))

# 3. RFE top-5
X_rfe_test = rfe.transform(X_test)
lr.fit(rfe.transform(X_train), y_train)
auc = roc_auc_score(y_test, lr.predict_proba(X_rfe_test)[:, 1])
results.append(('RFE top-5', 5, auc))

# 4. Lasso nonzero features (6)
mask = lasso.coef_ != 0
lr_sc = LogisticRegression(max_iter=500, random_state=42)
lr_sc.fit(X_tr_sc[:, mask], y_train)
auc = roc_auc_score(y_test, lr_sc.predict_proba(X_te_sc[:, mask])[:, 1])
results.append(('Lasso nonzero (6)', int(mask.sum()), auc))

print(f"{'Method':28s} | {'n_feat':>7} | {'Test AUC':>10}")
for name, n, auc in results:
    print(f"{name:28s} | {n:>7} | {auc:>10.4f}")
Method                       |  n_feat |   Test AUC
All features (8)             |       8 |     0.8312
Filter top-5 (f_classif)     |       5 |     0.8401
RFE top-5                    |       5 |     0.8389
Lasso nonzero (6)            |       6 |     0.8423

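On this split, every selection method scores slightly higher than the full feature set, which suggests the features that get cut (BloodPressure and SkinThickness in every method, plus Insulin or Age depending on the method) contribute little for logistic regression. The gaps are small, though: all methods land in the 0.83–0.84 AUC range, and with only 154 test samples, differences of about 0.01 are within sampling noise.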

Test Your Understanding

  1. The F-test ranks Insulin 7th while Mutual Information ranks it 4th. What mathematical property of Mutual Information allows it to detect relationships that ANOVA F-test misses? Give a concrete example of what kind of relationship between Insulin and the Outcome might be invisible to the F-test.

  2. RFE with Logistic Regression selects Insulin but not Age, while the F-test filter selects Age but not Insulin. These two methods use the same training data and the same target label. What mechanism causes them to disagree — and which one is "more correct"?

  3. RFECV shows AUC rising from n=1 to n=5, then declining slightly at n=6,7,8. But the CV AUC with all 8 features (0.8354) is still well above the n=1 value (0.7812) and only slightly below the peak. Why would you choose 5 features over 8 when the 8-feature model performs almost as well? What consideration besides raw AUC matters?

  4. Lasso with alpha=0.0089 zeroed out BloodPressure but not Insulin. A larger alpha (e.g., 0.1) would zero out more features. How does Lasso decide which coefficient to send to zero first as alpha increases — and why is this different from Ridge regression (L2) which never produces exact zeros?

  5. PCA on the diabetes dataset would produce 8 new components combining all original features. PCA on digits produces 64 new components. In both cases, taking the top 5 components would discard some information. Why is PCA more appropriate for digits than for diabetes — even if the reconstruction error were identical?
