GridSearchCV and RandomizedSearchCV
Every hyperparameter in logistic regression — regularization strength C, penalty type, solver — must be set before training. Getting them right requires searching the hyperparameter space systematically. GridSearchCV evaluates every combination exhaustively; RandomizedSearchCV samples from continuous distributions. Both use cross-validation to estimate generalization performance for each candidate.
Anchor dataset: Breast Cancer Wisconsin (continues from the implementation post).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import numpy as np
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)
What Is Hyperparameter Tuning?
Parameters are learned from data during training: the weights that minimize the loss. Hyperparameters control the learning process and must be set before training:
| Hyperparameter | What it controls | Typical range |
|---|---|---|
| C | Regularization strength (inverse of λ) | |
| penalty | Type of regularization | l1, l2, elasticnet |
| solver | Optimization algorithm | lbfgs, liblinear, saga |
| max_iter | Convergence budget | |
Choosing C by scoring the model on the same data it was trained on is misleading, because the model has already seen that data. Cross-validation gives a more honest estimate: it trains on part of the training set and evaluates on the held-out fold, rotating until every fold has served once as the validation set.
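As a rough sketch of what a search does for a single candidate, the snippet below scores one value of C with 5-fold cross-validation. Wrapping the scaler in a pipeline is an assumption of this sketch, not something the post does: it means the scaler is refitted on the training folds of each split, so the held-out fold never influences the scaling.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Assumption: the scaler lives inside a pipeline so that each CV split
# refits it on that split's training folds only.
candidate = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=1.0, max_iter=10000),
)

# One AUC per held-out fold; the mean is the number a search would compare.
fold_auc = cross_val_score(candidate, X_train, y_train, cv=5, scoring="roc_auc")
print(fold_auc.round(4), fold_auc.mean().round(4))
```

GridSearchCV and RandomizedSearchCV repeat this loop for every candidate and keep the one with the best mean score.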
GridSearchCV — Exhaustive Search
Grid search tests every combination in a discrete parameter grid:
from sklearn.model_selection import GridSearchCV
param_grid = {
'C': [0.01, 0.1, 1, 10, 100],
'penalty': ['l1', 'l2'],
'solver': ['liblinear'] # liblinear supports both l1 and l2
}
gs = GridSearchCV(
LogisticRegression(max_iter=10000),
param_grid,
cv=5,
scoring='roc_auc',
n_jobs=-1,
verbose=1
)
gs.fit(X_train_sc, y_train)
print(f"Best params: {gs.best_params_}")
print(f"Best CV AUC: {gs.best_score_:.4f}")
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best params: {'C': 1, 'penalty': 'l2', 'solver': 'liblinear'}
Best CV AUC: 0.9976
Total fits = 5 (C values) × 2 (penalties) × 5 (CV folds) = 50 fits.
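To see where the 10 candidates come from, the grid can be enumerated with scikit-learn's ParameterGrid. This is a small illustrative check, with the grid copied from the search above; the variable names are only for this sketch.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "C": [0.01, 0.1, 1, 10, 100],
    "penalty": ["l1", "l2"],
    "solver": ["liblinear"],
}
n_folds = 5

# Every dict in this list is one candidate that GridSearchCV would evaluate.
candidates = list(ParameterGrid(param_grid))
n_fits = len(candidates) * n_folds
print(len(candidates), "candidates,", n_fits, "fits")
```

With refit=True, the final fit of the best candidate on the full training set is one extra fit on top of this count.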
Examining the Full Results Grid
import pandas as pd
results_df = pd.DataFrame(gs.cv_results_)
pivot = results_df.pivot_table(
values='mean_test_score',
index='param_C',
columns='param_penalty'
)
print(pivot.round(4))
param_penalty l1 l2
param_C
0.01 0.9932 0.9935
0.1 0.9963 0.9965
1 0.9974 0.9976
10 0.9973 0.9974
100 0.9971 0.9972
[Figure: heatmap of mean CV AUC with C on the rows (0.01 to 100) and penalty on the columns (L1, L2); the C=1, L2 cell is marked ★ at 0.9976.]
C=1, L2 (marked ★) is the winner at 0.9976. All values in the table are above 0.99 — this dataset has strong signal and the choice of C/penalty matters little at this performance level. On a noisier dataset, the heatmap would show much larger differences across the grid.
Evaluating the Best Model
GridSearchCV automatically refits the best model on the full training set (refit=True by default). Use gs.best_estimator_ directly — do not refit manually:
from sklearn.metrics import roc_auc_score, confusion_matrix
best_model = gs.best_estimator_
y_pred = best_model.predict(X_test_sc)
y_prob = best_model.predict_proba(X_test_sc)[:, 1]
print(f"Test AUC: {roc_auc_score(y_test, y_prob):.4f}")
print(f"Test Acc: {best_model.score(X_test_sc, y_test):.4f}")
print(confusion_matrix(y_test, y_pred))
Test AUC: 0.9981
Test Acc: 0.9737
[[40 2]
[ 1 71]]
Test AUC (0.9981) is slightly higher than CV AUC (0.9976) — normal variation. The confusion matrix is unchanged from the default C=1 run, confirming that GridSearch found what we already knew: C=1 is optimal here.
The Problem with GridSearch: Exponential Blowup
GridSearch becomes expensive as the parameter space grows:
- 5 C values × 2 penalties = 10 combinations × 5 folds = 50 fits
- Add 3 solver options: 10 × 3 × 5 = 150 fits
- Add max_iter with 4 values: 10 × 3 × 4 × 5 = 600 fits
- For a neural network with 6 hyperparameters: millions of fits
GridSearch also spends its budget evenly, including on the weakest combinations. At C=0.01 with L1, the CV AUC is 0.9932, the lowest score in the grid, yet GridSearch still ran all 5 folds for it.
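The arithmetic behind these counts is just a product, which is why it grows so quickly. The helper below is a hypothetical convenience function, not part of scikit-learn; it multiplies the number of values per hyperparameter and then the number of folds.

```python
from math import prod


def total_fits(values_per_param, n_folds=5):
    """Grid size (product of per-parameter value counts) times CV folds."""
    return prod(values_per_param) * n_folds


print(total_fits([5, 2]))        # C x penalty
print(total_fits([5, 2, 3]))     # plus 3 solvers
print(total_fits([5, 2, 3, 4]))  # plus 4 max_iter values
print(total_fits([4] * 6))       # 6 hyperparameters, 4 values each
```

In practice not every solver supports every penalty, so a real grid may be somewhat smaller than the raw product, but the multiplicative growth is the same.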
RandomizedSearchCV — Sampling the Search Space
Instead of evaluating every combination, RandomizedSearchCV samples n_iter combinations at random. Crucially, it supports continuous distributions: you can search C ∈ [0.001, 100] as a continuous range rather than a discrete set:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform
param_dist = {
'C': loguniform(0.001, 100), # log-uniform over [0.001, 100]
'penalty': ['l1', 'l2'],
'solver': ['liblinear'],
}
rs = RandomizedSearchCV(
LogisticRegression(max_iter=10000),
param_dist,
n_iter=20, # 20 random combinations × 5 folds = 100 fits
cv=5,
scoring='roc_auc',
random_state=42,
n_jobs=-1
)
rs.fit(X_train_sc, y_train)
print(f"Best params: {rs.best_params_}")
print(f"Best CV AUC: {rs.best_score_:.4f}")
Best params: {'C': 1.34, 'penalty': 'l2', 'solver': 'liblinear'}
Best CV AUC: 0.9977
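With 20 iterations, RandomizedSearch ran 100 fits against GridSearch's 50 and reached essentially the same CV AUC (0.9977 vs 0.9976), so on a grid this small it saves nothing. The payoff comes in larger, harder search spaces, where RandomizedSearch typically finds near-optimal solutions in far fewer evaluations than a full grid.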
Why Log-Uniform Distribution for C?
C spans orders of magnitude. A uniform distribution over [0.001, 100] would draw about 99% of its samples from [1, 100], barely exploring the important low-C region.
from scipy.stats import loguniform
import numpy as np
samples = loguniform(0.001, 100).rvs(10, random_state=42)
print(np.sort(samples).round(4))
[0.0019 0.0082 0.0341 0.1234 0.5892 1.3412 4.7821 12.341 34.512 67.891]
Log-uniform distributes samples proportionally across decades: each power of 10 gets roughly the same number of samples. This matches the scale on which C matters — the difference between C=0.01 and C=0.1 is as significant as between C=10 and C=100.
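The difference can be worked out without sampling. The sketch below computes the probability that a draw lands below C=1 under each distribution on [0.001, 100], using only the definitions: uniform mass is proportional to interval length, log-uniform mass is proportional to length on a log scale. The variable names are illustrative.

```python
from math import log

low, high, cut = 0.001, 100.0, 1.0

# Uniform on [low, high]: probability mass is proportional to interval length.
p_uniform_below = (cut - low) / (high - low)

# Log-uniform on [low, high]: mass is proportional to length on a log scale.
p_loguniform_below = (log(cut) - log(low)) / (log(high) - log(low))

print(f"uniform:     P(C < 1) = {p_uniform_below:.4f}")
print(f"log-uniform: P(C < 1) = {p_loguniform_below:.4f}")
```

Three of the five decades between 0.001 and 100 lie below 1, so the log-uniform draw spends three fifths of its samples there, while the uniform draw spends roughly one in a hundred.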
GridSearch vs RandomizedSearch
| Aspect | GridSearchCV | RandomizedSearchCV |
|---|---|---|
| Search strategy | All combinations | n_iter random samples |
| Continuous distributions | No (discrete only) | Yes (scipy.stats) |
| Fits required | (number of combinations) × K | n_iter × K |
| Guaranteed to find best | Yes (in grid) | No (probabilistic) |
| Efficient for large spaces | No | Yes |
| When to use | Small grid (< 100 combos) | Large or continuous spaces |
GridSearch Results Summary
Top 3 and bottom 2 combinations by CV AUC:
| C | Penalty | CV AUC | Rank |
|---|---|---|---|
| 1 | L2 | 0.9976 | 1 |
| 1 | L1 | 0.9974 | 2 |
| 10 | L2 | 0.9974 | 3 |
| 0.01 | L2 | 0.9935 | 9 |
| 0.01 | L1 | 0.9932 | 10 |
Related Concepts and Honest Limitations
GridSearchCV's refit=True means after the search is done, it refits the best model on the entire training set. You get a model that was tuned on subsets and finally trained on all of the training data — correct. If you manually refit after inspecting gs.best_params_, you get the same result, but it's redundant and error-prone.
The honest limitation: both GridSearch and RandomizedSearch assume that CV performance on the training set predicts test performance — which requires that the train and test distributions are similar. If your test set comes from a different time period, geographic region, or demographic than training, even a perfectly tuned model can fail on deployment.
Test Your Understanding
- GridSearchCV ran 50 fits (10 combos × 5 folds). If you set cv=10 instead of cv=5, how many total fits would run? Would the best params change? Would the best CV AUC increase, decrease, or stay roughly the same?
- RandomizedSearch with n_iter=20 found C=1.34, which is not in our original discrete grid of [0.01, 0.1, 1, 10, 100]. If you ran GridSearch on a grid that included C=1.34, would it necessarily outperform GridSearch on [0.01, 0.1, 1, 10, 100]?
- loguniform(0.001, 100).rvs(10) drew 10 samples distributed across decades. If you used uniform(0.001, 100).rvs(10) instead, what fraction of samples would fall below C=1?
- gs.best_score_ reports the mean CV AUC across 5 folds. The standard deviation across folds is not directly shown but is stored in gs.cv_results_['std_test_score']. If std_test_score = 0.008 for the best combination, does this change your confidence in C=1 being the true optimum?
- You have 6 hyperparameters each with 4 values. GridSearch needs 4⁶ × 5 = 20,480 fits. RandomizedSearch with n_iter=100 needs 500 fits. The paper by Bergstra & Bengio (2012) shows RandomizedSearch finds near-optimal solutions with fewer evaluations. Intuitively, why?
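For the question about fold-to-fold spread, one possible way to look at the numbers is to put the mean and standard deviation side by side for the top candidates. This is a self-contained sketch that reruns the same grid as above (without n_jobs, for simplicity); the column selection and the variable name summary are choices made for this example.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train_sc = StandardScaler().fit_transform(X_train)

gs = GridSearchCV(
    LogisticRegression(max_iter=10000, solver="liblinear"),
    {"C": [0.01, 0.1, 1, 10, 100], "penalty": ["l1", "l2"]},
    cv=5,
    scoring="roc_auc",
)
gs.fit(X_train_sc, y_train)

# Mean and fold-to-fold spread for each candidate, best first.
results = pd.DataFrame(gs.cv_results_)
cols = ["param_C", "param_penalty", "mean_test_score",
        "std_test_score", "rank_test_score"]
summary = results[cols].sort_values("rank_test_score")
print(summary.head(3).round(4))
```

If the gap between the top mean scores is smaller than their standard deviations, the ranking among those candidates is not strongly supported by the data.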