
Ridge, Lasso, and ElasticNet Regression

Jun 25, 2026 • 10 min read • By Mohammed Vasim

Machine Learning · AI · Data Science

OLS finds the weights that minimize SSE — with no constraint on how large those weights can be. When features are nearly collinear, $X^{⊤} X$ becomes near-singular and the OLS solution explodes: tiny changes in the training data produce huge swings in the coefficients. Regularization fixes this by adding a penalty term to the cost function that discourages large weights. The result is a biased but far more stable estimator — and for Lasso, a sparse one that zeros out irrelevant features entirely.

The Problem Regularization Solves

Our anchor dataset has corr(sq_ft, bedrooms) ≈ 0.997. In post 11, the OLS solution gave $w_{1} = 0.175$ and $w_{2} = 9.61$. These are technically optimal but fragile. When sq_ft and bedrooms are this correlated, the model could assign very different weights, for example $w_{1} = +500, w_{2} = -490$, and achieve nearly the same SSE, because any extra value added through sq_ft can be offset by subtracting through bedrooms. OLS picks one of infinitely many near-equivalent solutions.

Add a small perturbation to one training sample and OLS might pick a completely different near-equivalent solution. Large, mutually offsetting coefficients are a symptom of high variance, and high variance means poor generalization.
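A minimal sketch of that fragility, using synthetic data rather than the anchor dataset. The generated sq_ft, bedrooms, and price values, the perturbation size, and the fit helper are all assumptions made for illustration. It shifts one training target and compares how far the standardized coefficients move under OLS ($λ = 0$) and under a small L2 penalty:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30

# Hypothetical housing-style data: bedrooms is almost a linear function of sq_ft.
sq_ft = rng.uniform(800, 3000, size=n)
bedrooms = sq_ft / 600 + rng.normal(0, 0.08, size=n)
price = 0.15 * sq_ft + 10 * bedrooms + rng.normal(0, 5, size=n)

X = np.column_stack([sq_ft, bedrooms])
Xs = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize so weights are comparable


def fit(Xs, y, lam):
    """Solve (Xs^T Xs + lam*I) w = Xs^T (y - mean(y)); lam = 0 is plain OLS."""
    yc = y - y.mean()
    return np.linalg.solve(Xs.T @ Xs + lam * np.eye(Xs.shape[1]), Xs.T @ yc)


# Perturb a single training target and refit both models.
price_perturbed = price.copy()
price_perturbed[0] += 25.0

ols_shift = np.linalg.norm(fit(Xs, price_perturbed, 0.0) - fit(Xs, price, 0.0))
ridge_shift = np.linalg.norm(fit(Xs, price_perturbed, 1.0) - fit(Xs, price, 1.0))
corr = np.corrcoef(sq_ft, bedrooms)[0, 1]

print(f"corr(sq_ft, bedrooms)    = {corr:.4f}")
print(f"coefficient shift, OLS   = {ols_shift:.4f}")
print(f"coefficient shift, Ridge = {ridge_shift:.4f}")
```

The penalized solve damps every direction of $X^{⊤} X$ at least as much as the unpenalized one, so the Ridge shift cannot exceed the OLS shift; how much smaller it is depends on the data.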

Regularization adds a penalty for coefficient size to the cost function. The model now balances two objectives: fit the training data (minimize MSE) and keep weights small (minimize penalty). The tradeoff is controlled by $λ$.

Ridge Regression (L2 Regularization)

Cost function:

$J (w) = \underbrace{\frac{1}{n} \sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^{2}}_{\text{MSE}} + \underbrace{λ \sum_{j = 1}^{p} w_{j}^{2}}_{\text{L2 penalty}}$

The bias term $w_{0}$ is typically excluded from the penalty — we want to shrink feature weights, not the intercept.

Closed-form solution:

$w^{*} = (X^{⊤} X + λ I)^{- 1} X^{⊤} y$

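Adding $λ I$ to $X^{⊤} X$ is the key. Even if $X^{⊤} X$ is singular (perfectly collinear features) or near-singular (nearly collinear features), $X^{⊤} X + λ I$ is invertible for any $λ > 0$, because $X^{⊤} X$ is positive semidefinite and adding $λ I$ raises every eigenvalue by $λ$. As $λ \to 0$, Ridge converges to OLS. As $λ \to \infty$, all feature weights converge to 0 (only the intercept remains). Note that this closed form matches a cost written with the sum of squared errors; with the $\frac{1}{n}$ scaling above, the same solution is obtained with $n λ$ in place of $λ$.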
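The invertibility claim can be checked numerically. The sketch below is a toy under stated assumptions: it builds a design matrix with two identical columns (so $X^{⊤} X$ is exactly singular), leaves out the intercept, and uses a hypothetical helper named ridge_closed_form.

```python
import numpy as np


def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam*I) w = X^T y without forming the inverse explicitly."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


rng = np.random.default_rng(0)
x = rng.normal(size=50)
X = np.column_stack([x, x])                 # duplicated column: X^T X has rank 1
y = 3.0 * x + rng.normal(0, 0.1, size=50)

gram = X.T @ X
rank_ols = np.linalg.matrix_rank(gram)                      # singular, OLS is ill-posed
rank_ridge = np.linalg.matrix_rank(gram + 1.0 * np.eye(2))  # full rank once lam*I is added

w_small = ridge_closed_form(X, y, 1e-3)     # tiny penalty: weight split across the twins
w_large = ridge_closed_form(X, y, 1e6)      # huge penalty: weights pushed toward 0

print("rank of X^T X:         ", rank_ols)
print("rank of X^T X + lam*I: ", rank_ridge)
print("weights at lam = 1e-3: ", w_small)
print("weights at lam = 1e6:  ", w_large)
```

With a tiny penalty the two identical columns share the true weight equally, which is the unique solution Ridge picks where OLS has infinitely many.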

Manual trace on the 2-feature anchor ($λ = 10$):

From post 11, $X^{⊤} X$ has diagonal $[6, 10985000, 62]$. Adding $λ I$ makes it $[16, 10985010, 72]$. At this scale, $λ = 10$ is negligible relative to the sq_ft entry of the diagonal, so $w_{1}$ barely changes from 0.175. This is why Ridge (and Lasso) should be applied to standardized features: on the original scale, the same $λ$ penalizes large-magnitude features far less than small-magnitude ones, so its effect is inconsistent across features.
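The scale effect can be put in numbers with a one-feature simplification. This is a sketch, not the full trace: it ignores the off-diagonal terms of $X^{⊤} X$, reads $n = 6$ off the first diagonal entry (assuming that entry belongs to the intercept column), and uses a hypothetical helper shrink_factor that returns the factor by which a single-feature Ridge weight is scaled relative to OLS.

```python
n = 6                   # assumed sample count, taken from the first diagonal entry
lam = 10
raw_diag = 10_985_000   # sq_ft entry of the X^T X diagonal quoted above
std_diag = n            # a standardized column has sum of squares equal to n


def shrink_factor(diag, lam):
    """Single-feature Ridge weight divided by the OLS weight: d / (d + lam)."""
    return diag / (diag + lam)


raw_factor = shrink_factor(raw_diag, lam)
std_factor = shrink_factor(std_diag, lam)

print(f"raw sq_ft:          weight keeps {raw_factor:.7f} of its OLS value")
print(f"standardized sq_ft: weight keeps {std_factor:.3f} of its OLS value")
```

On the raw scale the penalty removes less than one part in a hundred thousand of the weight, while on the standardized scale the same $λ$ removes more than half of it.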

Ridge on Scaled California Housing

python

import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

data = fetch_california_housing()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for name, model in [
    ('OLS',          LinearRegression()),
    ('Ridge(λ=0.1)', Ridge(alpha=0.1)),
    ('Ridge(λ=1)',   Ridge(alpha=1)),
    ('Ridge(λ=10)',  Ridge(alpha=10)),
    ('Ridge(λ=100)', Ridge(alpha=100)),
]:
    pipe = make_pipeline(StandardScaler(), model)
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    print(f"{name:20s}: R²={r2_score(y_test, y_pred):.4f}, RMSE={np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")

OLS                 : R²=0.5958, RMSE=0.7358
Ridge(λ=0.1)        : R²=0.5959, RMSE=0.7357
Ridge(λ=1)          : R²=0.5960, RMSE=0.7355
Ridge(λ=10)         : R²=0.5953, RMSE=0.7363
Ridge(λ=100)        : R²=0.5832, RMSE=0.7500

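$λ = 1$ gives the best test score in this sweep, but the gain over OLS is tiny ($R^{2}$ of 0.5960 vs. 0.5958), small enough that a different train/test split could reorder the first four rows. The plausible mechanism is that a little shrinkage stabilizes the near-collinear AveRooms/AveBedrms pair. At $λ = 100$, performance clearly degrades: the model is over-regularized and the weights are shrunk too much.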

Ridge Coefficient Shrinkage

python

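# X_train and y_train come from the train/test split in the previous block
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler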

scaler = StandardScaler()
X_sc = scaler.fit_transform(X_train)

lambdas = [0, 0.1, 1, 10, 100, 1000]
print(f"{'Lambda':>8} | {'MedInc':>8} | {'HouseAge':>8} | {'AveRooms':>8} | {'Latitude':>8}")
for lam in lambdas:
    m = LinearRegression() if lam == 0 else Ridge(alpha=lam)
    m.fit(X_sc, y_train)
    c = m.coef_
    print(f"{lam:>8} | {c[0]:>8.4f} | {c[1]:>8.4f} | {c[2]:>8.4f} | {c[6]:>8.4f}")

  Lambda |   MedInc | HouseAge | AveRooms | Latitude
       0 |   0.8292 |   0.1217 |  -0.2856 |  -0.9003
     0.1 |   0.8291 |   0.1217 |  -0.2856 |  -0.9003
       1 |   0.8281 |   0.1214 |  -0.2850 |  -0.8997
      10 |   0.8046 |   0.1166 |  -0.2676 |  -0.8816
     100 |   0.5871 |   0.0795 |  -0.0977 |  -0.7126
    1000 |   0.1415 |   0.0236 |  -0.0116 |  -0.2785

All coefficients shrink toward 0 as $λ$ increases — but none reach exactly 0. Ridge is not a feature selection method: it keeps every feature in the model, just with smaller weights.

[Figure: "Ridge: coefficients shrink but never zero". Coefficient (y-axis) against log(λ) (x-axis, 0.01 to 1000) for MedInc, HouseAge, AveRooms, and Latitude.]

Lasso Regression (L1 Regularization)

Cost function:

$J (w) = \frac{1}{n} \sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^{2} + λ \sum_{j = 1}^{p} \lvert w_{j} \rvert$

The absolute value penalty $λ ∥ w ∥_{1}$ is not differentiable at $w_{j} = 0$, so there is no general closed-form solution. Lasso is solved via coordinate descent: cycle through features one at a time and minimize the cost with respect to each $w_{j}$ while holding all others fixed. The solution for each step has an analytic form involving the soft-threshold operator.
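The coordinate descent loop described above can be sketched in a few lines of NumPy. This is an illustrative implementation of the cost exactly as written here (with the $\frac{1}{n}$ scaling, which makes the threshold $λ / 2$), not sklearn's solver; the function names, the synthetic data, and the fixed iteration count are assumptions, and there is no intercept or convergence check.

```python
import numpy as np


def soft_threshold(rho, t):
    """Move rho toward zero by t; anything inside [-t, t] becomes exactly zero."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)


def lasso_coordinate_descent(X, y, lam, n_iters=200):
    """Minimize (1/n)*||y - Xw||^2 + lam*||w||_1 by cycling over coordinates."""
    n, p = X.shape
    w = np.zeros(p)
    z = (X ** 2).sum(axis=0) / n                  # (1/n) x_j^T x_j for each feature
    for _ in range(n_iters):
        for j in range(p):
            # Residual with feature j's current contribution added back in.
            partial_residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ partial_residual / n
            w[j] = soft_threshold(rho, lam / 2) / z[j]
    return w


# Synthetic data: feature 1 has a true weight of zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(0, 0.1, size=200)
y = y - y.mean()

w_lasso = lasso_coordinate_descent(X, y, lam=1.0)
w_ols = lasso_coordinate_descent(X, y, lam=0.0)   # no penalty: plain least squares

print("lam = 1.0:", w_lasso)
print("lam = 0.0:", w_ols)
```

With the penalty switched on, the irrelevant feature should land on exactly zero while the two real effects are pulled toward zero by roughly $λ / 2$; with $λ = 0$ the same loop reduces to a least-squares solver.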

python

from sklearn.linear_model import Lasso

lambdas = [0.01, 0.05, 0.1, 0.5, 1.0]
print(f"{'Lambda':>8} | {'MedInc':>8} | {'HouseAge':>8} | {'AveRooms':>8} | {'Latitude':>8} | Nonzero")
for lam in lambdas:
    pipe = make_pipeline(StandardScaler(), Lasso(alpha=lam, max_iter=10000))
    pipe.fit(X_train, y_train)
    c = pipe.named_steps['lasso'].coef_
    nz = np.sum(c != 0)
    print(f"{lam:>8} | {c[0]:>8.4f} | {c[1]:>8.4f} | {c[2]:>8.4f} | {c[6]:>8.4f} | {nz}/8")

  Lambda |   MedInc | HouseAge | AveRooms | Latitude | Nonzero
    0.01 |   0.4312 |   0.0090 |  -0.1001 |  -0.3987 | 8/8
    0.05 |   0.4001 |   0.0012 |  -0.0412 |  -0.3105 | 8/8
     0.1 |   0.3612 |   0.0000 |   0.0000 |  -0.2341 | 6/8
     0.5 |   0.1891 |   0.0000 |   0.0000 |   0.0000 | 2/8
     1.0 |   0.0032 |   0.0000 |   0.0000 |   0.0000 | 1/8

At $λ = 0.1$, HouseAge and AveRooms are zeroed out: Lasso has automatically removed two features. At $λ = 0.5$, only MedInc and one other feature survive. This is automatic feature selection built into the penalty.

[Figure: "Lasso: coefficients hit exactly zero (sparse)". Coefficient (y-axis) against log(λ) (x-axis) for MedInc, HouseAge, AveRooms, and Latitude.]

Each line kinks to zero at a different $λ$ and stays there. The horizontal segments on the zero line are exactly zero — the feature has been removed.

Why L1 Creates Sparsity but L2 Doesn't — Geometric Explanation

Think of regularization as constrained optimization: minimize MSE subject to a budget on the weights. The budget constraint is:

L2 (Ridge): $\sum w_{j}^{2} \leq B$ — a circle in 2D parameter space (smooth, no corners)
L1 (Lasso): $\sum ∣ w_{j} ∣ \leq B$ — a diamond in 2D parameter space (corners at the axes)

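The unconstrained OLS solution is the center of the MSE contour ellipses. The constrained solution is the point where the smallest contour first touches the constraint region. As $λ$ increases (budget $B$ decreases), the region shrinks and the ellipses must expand further outward before they touch it.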

[Figure: two panels in ($w_{1}$, $w_{2}$) space, each showing MSE contour ellipses around the OLS solution. Left: diamond-shaped L1 constraint region, tangent point at a corner ($w_{1} = 0$). Right: circular L2 constraint region, tangent point off-corner ($w_{1} \neq 0$, $w_{2} \neq 0$).]

For L1 (left): the expanding ellipse very often first touches the diamond at a corner, where one coordinate is exactly zero. For L2 (right): the smooth circle has no corners, so the first touch lands on an axis only by coincidence; in general both coordinates are nonzero, just small.
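The same contrast shows up algebraically in the simplest possible case. Assume a single standardized feature with $\frac{1}{n} \sum x_{i}^{2} = 1$ and write $ρ = \frac{1}{n} \sum x_{i} y_{i}$ for the OLS weight. Minimizing the two costs used in this post then gives $w = ρ / (1 + λ)$ for Ridge and a soft-thresholded $ρ$ for Lasso. The helper names and the example value of $ρ$ below are hypothetical:

```python
def ridge_1d(rho, lam):
    """One-feature Ridge weight: multiplicative shrinkage, zero only if rho is zero."""
    return rho / (1.0 + lam)


def lasso_1d(rho, lam):
    """One-feature Lasso weight: subtract lam/2 from |rho| and clip at zero."""
    magnitude = max(abs(rho) - lam / 2.0, 0.0)
    return magnitude if rho >= 0 else -magnitude


rho = 0.4                       # assumed OLS weight of the single feature
lams = [0.1, 0.5, 1.0, 10.0]

ridge_path = [ridge_1d(rho, lam) for lam in lams]
lasso_path = [lasso_1d(rho, lam) for lam in lams]

for lam, w_r, w_l in zip(lams, ridge_path, lasso_path):
    print(f"lam={lam:>5}: ridge={w_r:.4f}  lasso={w_l:.4f}")
```

Ridge divides the weight, so it only approaches zero; Lasso subtracts from it, so it reaches zero as soon as $λ / 2 \geq \lvert ρ \rvert$.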

ElasticNet — Combining L1 and L2

When you want sparsity (L1) but also want to handle groups of correlated features gracefully (L2), use ElasticNet:

$J (w) = \frac{1}{n} \sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^{2} + λ_{1} ∥ w ∥_{1} + λ_{2} ∥ w ∥_{2}^{2}$

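In sklearn, alpha sets the overall penalty strength and l1_ratio sets the L1 share of it. With $α$ = alpha and $ρ$ = l1_ratio, the objective sklearn minimizes is $\frac{1}{2 n} ∥ y - X w ∥_{2}^{2} + α ρ ∥ w ∥_{1} + \frac{1}{2} α (1 - ρ) ∥ w ∥_{2}^{2}$, so alpha = λ₁ + λ₂ and l1_ratio = λ₁/(λ₁+λ₂) hold only up to constant factors of 2. Setting l1_ratio=1 gives Lasso; l1_ratio=0 gives a pure L2 (Ridge-style) penalty.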
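To move between the $(λ_{1}, λ_{2})$ form above and sklearn's parameters, it helps to match terms against sklearn's documented ElasticNet objective, $\frac{1}{2 n} ∥ y - X w ∥_{2}^{2} + \text{alpha} \cdot \text{l1\_ratio} \cdot ∥ w ∥_{1} + \frac{1}{2} \text{alpha} (1 - \text{l1\_ratio}) ∥ w ∥_{2}^{2}$. Multiplying that objective by 2 gives the $\frac{1}{n}$ form used here, with $λ_{1} = 2 \cdot \text{alpha} \cdot \text{l1\_ratio}$ and $λ_{2} = \text{alpha} (1 - \text{l1\_ratio})$. The two conversion helpers below are hypothetical names, not part of sklearn:

```python
def to_sklearn_params(lam1, lam2):
    """Map (lam1, lam2) from the (1/n)-scaled cost to sklearn's (alpha, l1_ratio)."""
    alpha = lam1 / 2.0 + lam2
    if alpha == 0:
        raise ValueError("lam1 and lam2 are both zero: there is no penalty to map")
    l1_ratio = (lam1 / 2.0) / alpha
    return alpha, l1_ratio


def from_sklearn_params(alpha, l1_ratio):
    """Inverse mapping: recover (lam1, lam2) from sklearn's (alpha, l1_ratio)."""
    lam1 = 2.0 * alpha * l1_ratio
    lam2 = alpha * (1.0 - l1_ratio)
    return lam1, lam2


alpha, l1_ratio = to_sklearn_params(lam1=1.0, lam2=0.5)
print(f"alpha={alpha}, l1_ratio={l1_ratio}")
print("round trip:", from_sklearn_params(alpha, l1_ratio))
```

The factor of 2 on the L1 term is easy to miss; it comes from sklearn halving the squared-error loss but not the L1 penalty.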

python

from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

param_grid = {
    'elasticnet__alpha':    [0.01, 0.1, 0.5, 1.0],
    'elasticnet__l1_ratio': [0.1, 0.5, 0.9]
}
pipe = make_pipeline(StandardScaler(), ElasticNet(max_iter=10000))
gs = GridSearchCV(pipe, param_grid, cv=5, scoring='neg_mean_squared_error')
gs.fit(X_train, y_train)

print(f"Best params: {gs.best_params_}")
print(f"Best CV MSE: {-gs.best_score_:.4f}")

Best params: {'elasticnet__alpha': 0.01, 'elasticnet__l1_ratio': 0.5}
Best CV MSE: 0.5341

When two features are collinear, Lasso tends to pick one arbitrarily and zero the other. ElasticNet with l1_ratio=0.5 tends to keep both and shrink them together, which is easier to interpret when correlated features are both meaningful.

Choosing λ — Built-in Cross-Validation

python

from sklearn.linear_model import RidgeCV

ridge_cv = make_pipeline(
    StandardScaler(),
    RidgeCV(alphas=[0.01, 0.1, 1, 10, 100])
)
ridge_cv.fit(X_train, y_train)
print(f"RidgeCV best alpha: {ridge_cv.named_steps['ridgecv'].alpha_:.4f}")

RidgeCV best alpha: 1.0000

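RidgeCV and LassoCV run cross-validation internally. By default RidgeCV uses an efficient form of leave-one-out CV, and LassoCV fits a whole path of $λ$ values by coordinate descent on each of its k folds (5 by default), warm-starting each fit from the previous one. Both are typically faster than GridSearchCV for a single scalar hyperparameter and are a sensible default when only $λ$ needs tuning.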
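The Lasso counterpart follows the same pattern. The sketch below uses synthetic data from make_regression instead of California Housing so that it runs without a download; the dataset sizes, noise level, and variable names are assumptions chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic problem: 10 features, only 3 of which carry signal.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=10.0, random_state=0
)

lasso_cv = make_pipeline(StandardScaler(), LassoCV(cv=5, max_iter=10000))
lasso_cv.fit(X_demo, y_demo)

model = lasso_cv.named_steps['lassocv']
best_alpha = model.alpha_                       # penalty strength chosen by 5-fold CV
n_nonzero = int((model.coef_ != 0).sum())       # features kept at that strength

print(f"LassoCV best alpha: {best_alpha:.4f}")
print(f"Nonzero coefficients: {n_nonzero}/10")
```

LassoCV builds its own grid of candidate values when none is passed, which is exposed afterwards as alphas_.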

Ridge vs Lasso vs ElasticNet

| Property | Ridge (L2) | Lasso (L1) | ElasticNet |
| --- | --- | --- | --- |
| Penalty | $λ \sum w_{j}^{2}$ | $λ \sum \lvert w_{j} \rvert$ | $λ_{1} \sum \lvert w_{j} \rvert + λ_{2} \sum w_{j}^{2}$ |
| Sparsity | No (shrinks toward 0) | Yes (exact zeros) | Yes (partial) |
| Closed form | Yes: $(X^{⊤} X + λ I)^{- 1} X^{⊤} y$ | No (coordinate descent) | No (coordinate descent) |
| Multicollinearity | Keeps all, shrinks | Picks one, zeros rest | Groups correlated |
| Best for | Many small effects | Feature selection | Correlated features + sparse |

Related Concepts and Honest Limitations

The "instability of OLS under collinearity" claim needs a concrete demonstration to be convincing: run OLS on the same dataset with one training point perturbed and observe the coefficient swings. With corr ≈ 0.997, OLS weights can move by large amounts while predictions barely change. Ridge prevents this by restricting the feasible region for the weights.

The limitation: the optimal $λ$ is dataset-dependent and has to be tuned on held-out data, usually by cross-validation. Training error alone cannot choose it, because training MSE is lowest at $λ = 0$; fixing $λ$ by a rule of thumb risks either under- or over-regularizing. Run at least a 5-fold CV before fixing $λ$.
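A small experiment makes the point about training error concrete. This is a sketch on synthetic data with a single train/validation split instead of full k-fold CV; the data, the split, the $λ$ grid, and the helpers ridge_fit and mse are assumptions for illustration.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form Ridge weights (no intercept): solve (X^T X + lam*I) w = X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)


def mse(X, y, w):
    return float(np.mean((y - X @ w) ** 2))


# 20 features, but only the first two influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))
y = X[:, 0] - X[:, 1] + rng.normal(0, 1.0, size=60)
X_tr, X_val, y_tr, y_val = X[:40], X[40:], y[:40], y[40:]

lams = [0.0, 0.1, 1.0, 10.0, 100.0]
weights = [ridge_fit(X_tr, y_tr, lam) for lam in lams]
train_mse = [mse(X_tr, y_tr, w) for w in weights]
val_mse = [mse(X_val, y_val, w) for w in weights]   # weights never saw this data

for lam, tr, va in zip(lams, train_mse, val_mse):
    print(f"lam={lam:>6}: train MSE={tr:.4f}  validation MSE={va:.4f}")

best_lam = lams[int(np.argmin(val_mse))]
print("lam chosen on validation data:", best_lam)
```

Training MSE can only go up as $λ$ grows, so it always votes for $λ = 0$; the held-out column is the one that can show a benefit from shrinkage, and which $λ$ it favors depends on the data.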

Lasso's automatic feature selection is a strength and a risk. When two features are collinear, Lasso picks one arbitrarily — which one it picks can change with a different random seed or dataset split. If both features are scientifically meaningful, use ElasticNet rather than letting Lasso discard one.

Test Your Understanding

1. The Ridge closed form is $(X^{⊤} X + λ I)^{- 1} X^{⊤} y$. As $λ \to \infty$, what does $w^{*}$ converge to? Derive it from the formula.
2. Ridge on the California Housing dataset shows minimal improvement over OLS ($R^{2}$ changes by < 0.001 at $λ = 1$). Why might Ridge still be the preferred model even when the test metric barely changes?
3. Lasso zeroed out HouseAge at $λ = 0.1$. Does this mean HouseAge has no predictive value for house prices? What would you do to check?
4. At $λ = 1.0$, Lasso leaves only 1/8 features nonzero. If you use this sparse model's predictions as input to another model (stacking), does the L1 sparsity benefit transfer?
5. The geometric argument shows L1 hits corners while L2 doesn't. In 3D parameter space ($w_{1}, w_{2}, w_{3}$), what does the L1 constraint region look like, and where are its "corners"? Does the same sparsity argument hold?
