
GridSearchCV and RandomizedSearchCV

Jun 26, 2026 · 7 min read · By Mohammed Vasim
Machine Learning · AI · Data Science

Every hyperparameter in logistic regression (regularization strength C, penalty type, solver) must be set before training. Getting them right requires searching the hyperparameter space systematically. GridSearchCV exhaustively evaluates every combination in a discrete grid; RandomizedSearchCV evaluates a fixed number of combinations sampled from lists or continuous distributions. Both use cross-validation to estimate generalization performance for each candidate.

Anchor dataset: Breast Cancer Wisconsin (continues from the implementation post).

python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import numpy as np

data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc  = scaler.transform(X_test)

What Is Hyperparameter Tuning?

Parameters are learned from data during training: the weights that minimize the loss. Hyperparameters control the learning process and must be set before training:

| Hyperparameter | What it controls | Typical range |
|---|---|---|
| C | Regularization strength (inverse: smaller C means stronger regularization) | |
| penalty | Type of regularization | l1, l2, elasticnet |
| solver | Optimization algorithm | lbfgs, liblinear, saga |
| max_iter | Convergence budget | |
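To make the parameter/hyperparameter split concrete, here is a minimal sketch. The value C=0.5 is an arbitrary choice for illustration, not a recommendation; the point is only that hyperparameters exist on the estimator before `fit()`, while the learned weights (`coef_`) appear afterwards.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_sc = StandardScaler().fit_transform(X)

# C=0.5 is an arbitrary illustrative value
model = LogisticRegression(C=0.5, max_iter=10000)

# Hyperparameters are available before the model has seen any data
hyperparams = model.get_params()
print("C:", hyperparams["C"], "max_iter:", hyperparams["max_iter"])

# Learned parameters (the weights) only exist after fit()
has_weights_before = hasattr(model, "coef_")
model.fit(X_sc, y)
print("weights before fit:", has_weights_before)
print("weights after fit: ", model.coef_.shape)
```

The search tools in this post vary the first kind of value from the outside and let `fit()` learn the second kind for each candidate.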

Choosing C by scoring the model on the same data it was trained on is wrong: the model has already seen that data, so the score is optimistic. Cross-validation provides an honest estimate by training on a subset of the training set and evaluating on the held-out fold, repeating this so that each fold is held out once.
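Before handing the loop to GridSearchCV, it can help to see what cross-validation does for a single candidate. The sketch below scores one fixed setting (C=1.0 with the default solver, chosen here only for illustration) using `cross_val_score` on the same train split as above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train_sc = StandardScaler().fit_transform(X_train)

# One candidate, five folds: each fold is held out once and scored by ROC AUC
fold_scores = cross_val_score(
    LogisticRegression(C=1.0, max_iter=10000),
    X_train_sc,
    y_train,
    cv=5,
    scoring="roc_auc",
)
print("Per-fold AUC:", fold_scores.round(4))
print("Mean CV AUC: ", round(fold_scores.mean(), 4))
```

GridSearchCV and RandomizedSearchCV repeat this procedure for every candidate and keep the one with the best mean score.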

Grid search tests every combination in a discrete parameter grid:

python
from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']  # liblinear supports both l1 and l2
}

gs = GridSearchCV(
    LogisticRegression(max_iter=10000),
    param_grid,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,
    verbose=1
)
gs.fit(X_train_sc, y_train)

print(f"Best params: {gs.best_params_}")
print(f"Best CV AUC: {gs.best_score_:.4f}")
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best params: {'C': 1, 'penalty': 'l2', 'solver': 'liblinear'}
Best CV AUC: 0.9976

Total fits = 5 (C values) × 2 (penalties) × 5 (CV folds) = 50 fits.

Examining the Full Results Grid

python
import pandas as pd

results_df = pd.DataFrame(gs.cv_results_)
pivot = results_df.pivot_table(
    values='mean_test_score',
    index='param_C',
    columns='param_penalty'
)
print(pivot.round(4))
param_penalty      l1      l2
param_C
0.01           0.9932  0.9935
0.1            0.9963  0.9965
1              0.9974  0.9976
10             0.9973  0.9974
100            0.9971  0.9972

[Figure: CV AUC Heatmap (GridSearchCV). Rows: C (0.01 to 100); columns: penalty (L1, L2); best cell C=1, L2 at 0.9976 ★]

C=1, L2 (marked ★) is the winner at 0.9976. All values in the table are above 0.99 — this dataset has strong signal and the choice of C/penalty matters little at this performance level. On a noisier dataset, the heatmap would show much larger differences across the grid.
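When candidates differ only in the third or fourth decimal, the fold-to-fold spread is worth checking next to the mean. The sketch below is a simplified variant of the search above: it assumes a grid over C only, with the default solver, and ranks candidates with `mean_test_score` and `std_test_score` side by side.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, _, y_train, _ = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train_sc = StandardScaler().fit_transform(X_train)

# Simplified grid (C only) so the example stays small
search = GridSearchCV(
    LogisticRegression(max_iter=10000),
    {"C": [0.01, 0.1, 1, 10, 100]},
    cv=5,
    scoring="roc_auc",
)
search.fit(X_train_sc, y_train)

# Mean and standard deviation across folds, best candidate first
summary = (
    pd.DataFrame(search.cv_results_)[
        ["param_C", "mean_test_score", "std_test_score", "rank_test_score"]
    ]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(summary.round(4))
```

If the gap between two candidates is smaller than their standard deviations, the ranking between them is not strong evidence that one is truly better.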

Evaluating the Best Model

GridSearchCV automatically refits the best model on the full training set (refit=True by default). Use gs.best_estimator_ directly — do not refit manually:

python
from sklearn.metrics import roc_auc_score, confusion_matrix

best_model = gs.best_estimator_
y_pred = best_model.predict(X_test_sc)
y_prob = best_model.predict_proba(X_test_sc)[:, 1]

print(f"Test AUC: {roc_auc_score(y_test, y_prob):.4f}")
print(f"Test Acc: {best_model.score(X_test_sc, y_test):.4f}")
print(confusion_matrix(y_test, y_pred))
Test AUC: 0.9981
Test Acc: 0.9737
[[40  2]
 [ 1 71]]

Test AUC (0.9981) is slightly higher than CV AUC (0.9976), which is normal variation. The confusion matrix is unchanged from the default C=1 run, confirming that GridSearch found what we already knew: C=1 is the best value in this grid.

The Problem with GridSearch: Exponential Blowup

GridSearch becomes expensive as the parameter space grows:

  • 5 C values × 2 penalties = 10 combinations × 5 folds = 50 fits
  • Add 3 solver options: 10 × 3 × 5 = 150 fits
  • Add max_iter with 4 values: 10 × 3 × 4 × 5 = 600 fits
  • For a neural network with 6 hyperparameters: millions of fits

GridSearch also spends a full cross-validation on every combination, including the weakest ones. At C=0.01 with L1, the CV AUC is 0.9932, the lowest score in the grid, but GridSearch ran all 5 folds for it anyway.
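The fit counts above can be checked without training anything. This sketch uses `ParameterGrid` purely as a counter; the four `max_iter` values are hypothetical placeholders, and some enumerated combinations (for example lbfgs with l1) would not be valid to fit.

```python
from sklearn.model_selection import ParameterGrid

n_folds = 5

# Base grid from this post: 5 C values x 2 penalties
grid = {"C": [0.01, 0.1, 1, 10, 100], "penalty": ["l1", "l2"]}
fits_base = len(ParameterGrid(grid)) * n_folds

# Adding 3 solver options multiplies the number of combinations by 3
grid["solver"] = ["lbfgs", "liblinear", "saga"]
fits_with_solver = len(ParameterGrid(grid)) * n_folds

# Adding 4 max_iter values (hypothetical) multiplies it by 4 again
grid["max_iter"] = [100, 500, 1000, 5000]
fits_with_max_iter = len(ParameterGrid(grid)) * n_folds

print(fits_base, fits_with_solver, fits_with_max_iter)  # → 50 150 600
```

Each new hyperparameter multiplies the total rather than adding to it, which is why the cost grows so quickly.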

RandomizedSearchCV — Sampling the Search Space

Instead of evaluating every combination, RandomizedSearchCV randomly samples n_iter combinations. Crucially, it supports continuous distributions: you can search C ∈ [0.001, 100] as a continuous range rather than a discrete set:

python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform

param_dist = {
    'C': loguniform(0.001, 100),   # log-uniform over [0.001, 100]
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear'],
}

rs = RandomizedSearchCV(
    LogisticRegression(max_iter=10000),
    param_dist,
    n_iter=20,         # 20 random combinations × 5 folds = 100 fits
    cv=5,
    scoring='roc_auc',
    random_state=42,
    n_jobs=-1
)
rs.fit(X_train_sc, y_train)

print(f"Best params: {rs.best_params_}")
print(f"Best CV AUC: {rs.best_score_:.4f}")
Best params: {'C': 1.34, 'penalty': 'l2', 'solver': 'liblinear'}
Best CV AUC: 0.9977

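With 20 iterations (100 fits) vs GridSearch's 50 fits, the CV AUC is essentially the same (0.9977 vs 0.9976), so on this small grid random search costs more without a meaningful gain. For larger, harder search spaces, RandomizedSearch typically finds near-optimal solutions in far fewer evaluations than a full grid.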
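To see what kind of candidates a randomized search draws, you can sample from the same distributions without fitting any model. The sketch below uses `ParameterSampler` with n_iter=5 (a small number picked only to keep the output short); each draw pairs a continuous C with a randomly chosen penalty.

```python
from scipy.stats import loguniform
from sklearn.model_selection import ParameterSampler

param_dist = {
    "C": loguniform(0.001, 100),   # continuous: any value in [0.001, 100]
    "penalty": ["l1", "l2"],       # discrete: picked uniformly from the list
    "solver": ["liblinear"],
}

# Draw 5 candidate settings; no model is trained here
candidates = list(ParameterSampler(param_dist, n_iter=5, random_state=42))
for params in candidates:
    print(params)
```

Unlike a grid, the sampled C values are not restricted to a predefined list, which is how a value such as C=1.34 can come out as the best candidate.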

Why Log-Uniform Distribution for C?

C spans orders of magnitude. A uniform distribution over [0.001, 100] would draw about 99% of samples from [1, 100], barely exploring the important low-C region.

python
from scipy.stats import loguniform
import numpy as np

samples = loguniform(0.001, 100).rvs(10, random_state=42)
print(np.sort(samples).round(4))
[0.0019 0.0082 0.0341 0.1234 0.5892 1.3412 4.7821 12.341 34.512 67.891]

Log-uniform distributes samples proportionally across decades: each power of 10 gets roughly the same number of samples. This matches the scale on which C matters — the difference between C=0.01 and C=0.1 is as significant as between C=10 and C=100.
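The difference between the two distributions can also be computed exactly from their CDFs rather than estimated from samples. This is a small sketch using `scipy.stats.uniform` and `scipy.stats.loguniform` over the same range [0.001, 100].

```python
from scipy.stats import loguniform, uniform

low, high = 0.001, 100

# scipy's uniform is parameterized by loc and scale (width), not by endpoints
uni = uniform(loc=low, scale=high - low)
logu = loguniform(low, high)

# Probability that a single draw lands below C=1
frac_uniform_below_1 = uni.cdf(1)
frac_loguniform_below_1 = logu.cdf(1)

# Probability mass that log-uniform assigns to each decade
decades = [0.001, 0.01, 0.1, 1, 10, 100]
mass_per_decade = [logu.cdf(b) - logu.cdf(a) for a, b in zip(decades, decades[1:])]

print(f"uniform,     P(C < 1): {frac_uniform_below_1:.4f}")
print(f"log-uniform, P(C < 1): {frac_loguniform_below_1:.4f}")
print("log-uniform mass per decade:", [round(m, 3) for m in mass_per_decade])
```

Under the uniform distribution roughly 1% of draws fall below C=1, while log-uniform puts an equal share of probability in each of the five decades.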

GridSearch vs RandomizedSearch

| Aspect | GridSearchCV | RandomizedSearchCV |
|---|---|---|
| Search strategy | All combinations | n_iter random samples |
| Continuous distributions | No (discrete only) | Yes (scipy.stats) |
| Fits required (K = CV folds) | (number of grid combinations) × K | n_iter × K |
| Guaranteed to find best | Yes (in grid) | No (probabilistic) |
| Efficient for large spaces | No | Yes |
| When to use | Small grid (< 100 combos) | Large or continuous spaces |

GridSearch Results Summary

Top 3 and bottom 2 combinations by CV AUC:

| C | Penalty | CV AUC | Rank |
|---|---|---|---|
| 1 | L2 | 0.9976 | 1 |
| 1 | L1 | 0.9974 | 2 |
| 10 | L2 | 0.9974 | 3 |
| 0.01 | L2 | 0.9935 | 9 |
| 0.01 | L1 | 0.9932 | 10 |

GridSearchCV's refit=True means that after the search is done, it refits the best model on the entire training set. You get a model whose hyperparameters were selected on subsets and whose weights were then trained on all of the training data, which is what you want. If you manually refit after inspecting gs.best_params_, you get the same result, but it's redundant and error-prone.
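The claim that a manual refit reproduces `best_estimator_` can be checked directly. The sketch below assumes a reduced grid over C with the default (deterministic) lbfgs solver, so the two fits are expected to yield matching weights; with a randomized solver you would also need to fix `random_state`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, _, y_train, _ = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train_sc = StandardScaler().fit_transform(X_train)

# refit=True is the default: the best candidate is retrained on all of X_train_sc
search = GridSearchCV(
    LogisticRegression(max_iter=10000),
    {"C": [0.1, 1, 10]},
    cv=5,
    scoring="roc_auc",
)
search.fit(X_train_sc, y_train)

# Manual refit with the winning hyperparameters on the same data
manual = LogisticRegression(max_iter=10000, **search.best_params_)
manual.fit(X_train_sc, y_train)

same_weights = np.allclose(manual.coef_, search.best_estimator_.coef_)
print(search.best_params_, "same weights:", same_weights)
```

Since the result is the same, using `best_estimator_` avoids the risk of forgetting a setting such as `max_iter` in the manual copy.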

The honest limitation: both GridSearch and RandomizedSearch assume that CV performance on the training set predicts test performance — which requires that the train and test distributions are similar. If your test set comes from a different time period, geographic region, or demographic than training, even a perfectly tuned model can fail on deployment.

Test Your Understanding

  1. GridSearchCV ran 50 fits (10 combos × 5 folds). If you set cv=10 instead of cv=5, how many total fits would run? Would the best params change? Would the best CV AUC increase, decrease, or stay roughly the same?

  2. RandomizedSearch with n_iter=20 found C=1.34 — not in our original discrete grid of [0.01, 0.1, 1, 10, 100]. If you ran GridSearch on a grid that included C=1.34, would it necessarily outperform GridSearch on [0.01, 0.1, 1, 10, 100]?

  3. loguniform(0.001, 100).rvs(10) drew 10 samples distributed across decades. If you used uniform(0.001, 100).rvs(10) instead, what fraction of samples would fall below C=1?

  4. gs.best_score_ reports the mean CV AUC across 5 folds. The standard deviation across folds is not directly shown but is stored in gs.cv_results_['std_test_score']. If std_test_score = 0.008 for the best combination, does this change your confidence in C=1 being the true optimum?

  5. You have 6 hyperparameters each with 4 values. GridSearch needs 4⁶ × 5 = 20,480 fits. RandomizedSearch with n_iter=100 needs 500 fits. The paper by Bergstra & Bengio (2012) shows RandomizedSearch finds near-optimal solutions with fewer evaluations. Intuitively, why?
