L1 and L2 Regularization
The Overfitting Setup
A network trained without any constraint on weight magnitudes will find a solution that fits the training data as closely as possible, including noise. Given enough parameters, it can memorize every training example, driving training loss to near zero while validation loss climbs. The weights in such a model tend to be large in magnitude: large weights give the model the flexibility to produce sharp decision boundaries that pass through every training point.
Regularization adds a penalty on weight magnitudes to the loss function. Small weights can't produce sharp, noise-fitting decision boundaries. The two standard choices differ in what they penalize and what they produce: L2 penalizes squared weights and shrinks all weights toward zero proportionally; L1 penalizes absolute weights and drives small weights exactly to zero, producing a sparse model.
Anchor: three weights w = [2.5, −0.8, 0.1]. Base loss L = 1.2. λ = 0.01. lr = 0.1. Hypothetical gradients: ∂L/∂w = [−0.2, 0.1, −0.05].
L2 Regularization
L_total = L + λ Σ wⱼ²
The penalty grows with the square of each weight. The gradient of the penalty term:
∂(λΣwⱼ²)/∂wⱼ = 2λwⱼ
So the full gradient and weight update are:
∂L_total/∂wⱼ = ∂L/∂wⱼ + 2λwⱼ
wⱼ ← wⱼ − lr · (∂L/∂wⱼ + 2λwⱼ)
The effect: every update subtracts 2λlr·wⱼ from each weight — proportional to its current magnitude. Large weights shrink faster than small weights. No weight ever reaches exactly zero (unless the gradient also pushes it there), but all weights get pulled toward zero continuously.
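The shrinkage reading of the update can be checked numerically. The following is a minimal sketch using the anchor values; the variable names (`decay`, `w_grad_form`, `w_decay_form`, `shrink`) are illustrative and not part of the original walkthrough. It rewrites the update as wⱼ ← (1 − 2λ·lr)·wⱼ − lr·∂L/∂wⱼ and confirms that both forms give the same result.

```python
import numpy as np

# Anchor values from the text.
w = np.array([2.5, -0.8, 0.1])
grad_L = np.array([-0.2, 0.1, -0.05])
lam, lr = 0.01, 0.1

# Form 1: add the penalty gradient 2*lam*w to the task gradient.
w_grad_form = w - lr * (grad_L + 2 * lam * w)

# Form 2: shrink each weight by a constant factor, then take the plain step.
decay = 1 - 2 * lam * lr
w_decay_form = decay * w - lr * grad_L

# Amount removed by the penalty alone: proportional to each weight.
shrink = 2 * lam * lr * w

print(w_grad_form)
print(w_decay_form)
print(shrink)
```

Seen this way, L2 behaves like a per-step multiplicative decay, which is why the largest weight loses the most in absolute terms.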
Regularized loss: L_total = 1.2 + 0.01 × (2.5² + 0.8² + 0.1²) = 1.2 + 0.01 × (6.25 + 0.64 + 0.01) = 1.2 + 0.069 = 1.269
Updated weights:
- w₀: ∂L_total/∂w₀ = −0.2 + 2×0.01×2.5 = −0.2 + 0.05 = −0.15 → w₀ = 2.5 − 0.1×(−0.15) = 2.515
- w₁: ∂L_total/∂w₁ = 0.1 + 2×0.01×(−0.8) = 0.1 − 0.016 = 0.084 → w₁ = −0.8 − 0.1×0.084 = −0.8084
- w₂: ∂L_total/∂w₂ = −0.05 + 2×0.01×0.1 = −0.05 + 0.002 = −0.048 → w₂ = 0.1 − 0.1×(−0.048) = 0.1048
L1 Regularization
L_total = L + λ Σ |wⱼ|
The gradient of |wⱼ| is sign(wⱼ) — +1 for positive weights, −1 for negative weights, undefined (use subgradient = 0) at exactly zero.
∂L_total/∂wⱼ = ∂L/∂wⱼ + λ·sign(wⱼ)
sign(w) for anchor: sign([2.5, −0.8, 0.1]) = [+1, −1, +1]
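A quick check of the sign convention, assuming NumPy's `np.sign`, which returns 0 for an input of exactly 0 and so matches the subgradient choice described above. The weight vector here is hypothetical: it swaps the last anchor weight for an exact zero.

```python
import numpy as np

# Hypothetical weights, including one that is exactly zero.
w = np.array([2.5, -0.8, 0.0])
lam = 0.01

# np.sign gives +1, -1, and 0 for positive, negative, and zero entries,
# so a weight sitting at zero receives no push from the L1 penalty.
penalty_grad = lam * np.sign(w)
print(penalty_grad)
```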
Regularized loss: L_total = 1.2 + 0.01 × (|2.5| + |−0.8| + |0.1|) = 1.2 + 0.01 × 3.4 = 1.234
Updated weights:
- w₀: ∂ = −0.2 + 0.01×1 = −0.19 → w₀ = 2.5 − 0.1×(−0.19) = 2.519
- w₁: ∂ = 0.1 + 0.01×(−1) = 0.09 → w₁ = −0.8 − 0.1×0.09 = −0.809
- w₂: ∂ = −0.05 + 0.01×1 = −0.04 → w₂ = 0.1 − 0.1×(−0.04) = 0.104
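The L1 penalty gradient on w₂ = 0.1 is λ×sign(0.1) = 0.01, the same size as for w₀ = 2.5: unlike L2, the pull toward zero does not weaken as the weight gets small. From the penalty alone, each step moves the weight lr×λ = 0.001 toward zero. Here the loss gradient (−0.05) pushes w₂ up and outweighs the penalty, so w₂ grows. A weight whose loss gradient is smaller in magnitude than λ is driven toward zero instead, although plain gradient steps tend to overshoot and hover around zero rather than land on it exactly.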
Side-by-Side Trace Table
| Quantity | w₀ (L2) | w₀ (L1) | w₁ (L2) | w₁ (L1) | w₂ (L2) | w₂ (L1) |
|---|---|---|---|---|---|---|
| Penalty gradient term | 2×0.01×2.5=0.05 | 0.01×1=0.01 | 2×0.01×(−0.8)=−0.016 | 0.01×(−1)=−0.01 | 2×0.01×0.1=0.002 | 0.01×1=0.01 |
| Full gradient | −0.15 | −0.19 | 0.084 | 0.09 | −0.048 | −0.04 |
| Update (−lr×grad) | +0.015 | +0.019 | −0.0084 | −0.009 | +0.0048 | +0.004 |
| New weight | 2.515 | 2.519 | −0.8084 | −0.809 | 0.1048 | 0.104 |
Geometric Intuition
Constrained optimization perspective: minimizing L subject to Σwⱼ² ≤ t (L2) confines the solution to a ball, and when the constraint is active the solution sits on its surface, a sphere. Loss contours (ellipses) rarely touch the sphere exactly on a coordinate axis, so weights are rarely zero. Minimizing L subject to Σ|wⱼ| ≤ t (L1) confines the solution to a diamond with sharp corners on the coordinate axes. Loss contours frequently touch a corner (or, in higher dimensions, a low-dimensional edge) first, and at those points one or more weights are exactly zero.
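The same contrast shows up in one dimension, where both penalized problems have closed-form solutions. The following is a minimal sketch, assuming a quadratic loss ½(w − a)² with unregularized optimum `a` (the anchor weights are reused for `a`) and a deliberately large λ = 0.5 so the difference is visible. The function names are illustrative. L2 rescales `a` by a constant factor, while L1 subtracts λ and clips at zero (soft-thresholding).

```python
import numpy as np

def l2_solution(a, lam):
    # argmin over w of 0.5*(w - a)**2 + lam * w**2
    return a / (1 + 2 * lam)

def l1_solution(a, lam):
    # argmin over w of 0.5*(w - a)**2 + lam * |w|  (soft-thresholding)
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

a = np.array([2.5, -0.8, 0.1])   # unregularized optima (anchor weights reused)
lam = 0.5                        # hypothetical, larger than the anchor lambda

w_l2 = l2_solution(a, lam)
w_l1 = l1_solution(a, lam)
print("L2:", w_l2)   # every entry halved, none exactly zero
print("L1:", w_l1)   # the entry with |a| <= lam is exactly zero
```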
Elastic Net
L_total = L + λ₁Σ|wⱼ| + λ₂Σwⱼ²
Combines L1 sparsity with L2 stability. L2 handles correlated features that L1 struggles with (L1 arbitrarily picks one; L2 spreads the weight). L1 still drives small weights to zero.
With λ₁ = λ₂ = 0.01 on anchor:
Elastic net penalty = 0.01×3.4 + 0.01×6.9 = 0.034 + 0.069 = 0.103
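The elastic net penalty and its gradient follow directly from the two formulas above. The following is a minimal sketch on the anchor values; the names `lam1`, `lam2`, `grad_en`, and `w_new_en` are illustrative, and the update step is not worked out in the text.

```python
import numpy as np

# Anchor values from the text, with lam1 = lam2 = 0.01.
w = np.array([2.5, -0.8, 0.1])
grad_L = np.array([-0.2, 0.1, -0.05])
lam1, lam2, lr = 0.01, 0.01, 0.1

l1_term = lam1 * np.sum(np.abs(w))   # 0.01 * 3.4
l2_term = lam2 * np.sum(w ** 2)      # 0.01 * 6.9
penalty = l1_term + l2_term

# Gradient: task gradient plus both penalty gradients.
grad_en = grad_L + lam1 * np.sign(w) + 2 * lam2 * w
w_new_en = w - lr * grad_en

print(f"penalty: {penalty:.3f}")
print(f"gradient: {grad_en.round(4)}")
print(f"updated weights: {w_new_en.round(4)}")
```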
Hyperparameter Sensitivity
```python
import numpy as np

# Anchor values from the text
w = np.array([2.5, -0.8, 0.1])
grad_L = np.array([-0.2, 0.1, -0.05])
lam = 0.01
lr = 0.1

# L2: penalty lam * sum(w^2), penalty gradient 2 * lam * w
reg_loss_l2 = 1.2 + lam * np.sum(w**2)
grad_l2 = grad_L + 2 * lam * w
w_new_l2 = w - lr * grad_l2

# L1: penalty lam * sum(|w|), penalty gradient lam * sign(w)
reg_loss_l1 = 1.2 + lam * np.sum(np.abs(w))
grad_l1 = grad_L + lam * np.sign(w)
w_new_l1 = w - lr * grad_l1

print("L2 Regularization:")
print(f" Reg loss: {reg_loss_l2:.4f}")
print(f" Gradients (with L2): {grad_l2.round(4)}")
print(f" Updated weights: {w_new_l2.round(4)}")

print("\nL1 Regularization:")
print(f" Reg loss: {reg_loss_l1:.4f}")
print(f" sign(w): {np.sign(w)}")
print(f" Gradients (with L1): {grad_l1.round(4)}")
print(f" Updated weights: {w_new_l1.round(4)}")

print("\nLambda sensitivity (L2, 10 steps):")
for lam_test in [0.0, 0.01, 0.1, 1.0]:
    w_test = w.copy()
    for _ in range(10):
        # grad_L is held fixed across steps: a toy setup, not a real training loop
        w_test = w_test - lr * (grad_L + 2 * lam_test * w_test)
    print(f" λ={lam_test:.2f}: {w_test.round(4)}")
```

```
L2 Regularization:
 Reg loss: 1.2690
 Gradients (with L2): [-0.15   0.084 -0.048]
 Updated weights: [ 2.515  -0.8084  0.1048]

L1 Regularization:
 Reg loss: 1.2340
 sign(w): [ 1. -1.  1.]
 Gradients (with L1): [-0.19  0.09 -0.04]
 Updated weights: [ 2.519 -0.809  0.104]

Lambda sensitivity (L2, 10 steps):
 λ=0.00: [ 2.7  -0.9   0.15]
 λ=0.01: [ 2.6487 -0.8832  0.1476]
 λ=0.10: [ 2.2256 -0.7451  0.1274]
 λ=1.00: [ 0.3577 -0.1305  0.0331]
```

At λ=0: pure gradient descent; the weights drift in the direction the gradients push them. At λ=0.01: mild shrinkage; the weights stay close to the unregularized result. At λ=0.1: noticeable shrinkage, mostly on the large w₀ (2.2256 versus 2.7 without regularization). At λ=1.0: the decay term dominates. Every weight collapses toward the point where the decay and the task gradient cancel, −(∂L/∂wⱼ)/(2λ) = [0.1, −0.05, 0.025], and w₀ falls from 2.5 to 0.3577 in 10 steps. A model pinned this close to zero is severely underfit.
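Because the task gradient is held fixed in this toy loop, the recurrence is linear and has a closed form: w_n = w* + (1 − 2·lr·λ)ⁿ·(w₀ − w*), with fixed point w* = −(∂L/∂w)/(2λ). The following is a minimal sketch under that assumption; `iterate` and `closed_form` are illustrative names. It cross-checks the loop against the formula.

```python
import numpy as np

w0 = np.array([2.5, -0.8, 0.1])
grad_L = np.array([-0.2, 0.1, -0.05])   # held constant, as in the loop above
lr, n_steps = 0.1, 10

def iterate(lam):
    # Same update as the sensitivity loop: n_steps of L2-regularized descent.
    w = w0.copy()
    for _ in range(n_steps):
        w = w - lr * (grad_L + 2 * lam * w)
    return w

def closed_form(lam):
    # Fixed point where the task gradient and the L2 pull cancel.
    w_star = -grad_L / (2 * lam)
    shrink = (1 - 2 * lr * lam) ** n_steps
    return w_star + shrink * (w0 - w_star)

for lam in [0.01, 0.1, 1.0]:            # lam = 0 has no finite fixed point
    print(lam, iterate(lam).round(4), closed_form(lam).round(4))
```

For these settings the factor 1 − 2·lr·λ lies strictly between 0 and 1, so each weight moves monotonically toward its fixed point without overshooting or changing sign.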
Related Concepts
L1 and L2 regularization add penalties to the loss before backpropagation (04-backpropagation.md) computes gradients. The gradient computation itself is unchanged; the penalty just modifies what gets differentiated. The connection between L2 regularization and AdamW (05-optimizers/08-adamw.md) is important: with Adam, an L2 penalty in the loss is rescaled by the adaptive per-parameter step along with the task gradient, so the effective decay differs from weight to weight. AdamW decouples the decay from the gradient update. Dropout (03-dropout.md) is a different approach to regularization: it doesn't penalize weights directly, but forces the network to learn redundant representations.
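The Adam interaction can be illustrated with a single hand-rolled step. This is a simplified sketch, not the full optimizers: it takes only the first Adam step from zero-initialized moments, and `adam_first_step` is a hypothetical helper. The decoupled decay coefficient is written as 2λ to match the L2 formulas in this post, which is an assumption of the sketch rather than any library's parameterization.

```python
import numpy as np

def adam_first_step(grad, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    # First Adam step from zero-initialized moments, with bias correction.
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1)
    v_hat = v / (1 - beta2)
    return lr * m_hat / (np.sqrt(v_hat) + eps)

w = np.array([2.5, -0.8, 0.1])
grad_L = np.array([-0.2, 0.1, -0.05])
lam, lr = 0.01, 0.1

# Adam + L2: the penalty gradient is folded in and then rescaled by Adam.
w_adam_l2 = w - adam_first_step(grad_L + 2 * lam * w, lr=lr)

# Decoupled decay (AdamW-style): Adam sees only the task gradient,
# and the shrinkage lr * (2 * lam) * w is applied separately.
w_adamw = w - adam_first_step(grad_L, lr=lr) - lr * 2 * lam * w

print("Adam + L2:", w_adam_l2.round(4))
print("Decoupled:", w_adamw.round(4))
```

On this first step the normalization reduces each Adam update to roughly lr·sign(gradient), so the folded-in penalty leaves no trace in the step size, while the decoupled version still shrinks each weight in proportion to its magnitude.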
Honest Limitations
The L1 subgradient at w=0 is taken to be zero by convention, but in practice weights rarely land on exactly zero through gradient updates alone: they overshoot and hover around it. Exact sparsity requires a proximal gradient method (soft-thresholding), not standard SGD with an L1 penalty. If feature selection via sparsity is the goal and a linear model is acceptable, use sklearn's Lasso with coordinate descent rather than neural network training with L1.
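The difference between the two update rules can be seen on a single weight. The following is a minimal sketch with hypothetical values (λ = 0.3, lr = 0.1, and a task gradient assumed to be zero) chosen so the effect shows up within ten steps. `soft_threshold` is an illustrative helper implementing the proximal operator of the L1 penalty.

```python
import numpy as np

def soft_threshold(x, t):
    # Proximal operator of t * |x|: shrink toward zero by t, clip at zero.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

lam, lr, n_steps = 0.3, 0.1, 10   # hypothetical values, not the anchor settings
grad_task = 0.0                   # assume the task gradient on this weight is zero

w_sgd = 0.1    # plain subgradient descent on the L1-penalized loss
w_prox = 0.1   # proximal gradient descent

for _ in range(n_steps):
    w_sgd = w_sgd - lr * (grad_task + lam * np.sign(w_sgd))
    w_prox = soft_threshold(w_prox - lr * grad_task, lr * lam)

print("plain SGD:", w_sgd)   # small but nonzero, flips sign between steps
print("proximal: ", w_prox)  # exactly 0.0
```

The plain update always moves by a full lr·λ, so once the weight is closer to zero than that it jumps to the other side. The proximal step clips at zero instead.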
L2 never drives weights exactly to zero. If you need a sparse model (for example, to prune a neural network for deployment on edge devices), L2 alone won't produce it — you need L1, structured pruning, or magnitude-based pruning after training.
A single global λ penalizes every weight equally. Input-layer weights connected to irrelevant features should be penalized aggressively; output-layer weights might need less regularization. Using a single λ is a simplification that can suppress important weights. Separate λ values per layer, or per-layer weight norm bounds, are alternatives but significantly increase hyperparameter tuning complexity.
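A per-layer penalty is a small change in code, even if tuning it is not. The following is a minimal sketch, assuming a hypothetical two-layer network whose weights are stored in a dict; the layer names, shapes, and λ values are made up for illustration.

```python
import numpy as np

# Hypothetical two-layer weights and per-layer penalty strengths.
weights = {
    "input": np.array([[0.5, -1.0], [2.0, 0.1]]),
    "output": np.array([[1.5], [-0.5]]),
}
lams = {"input": 0.1, "output": 0.01}

# Per-layer penalty: each layer's squared norm is scaled by its own lambda.
penalty = sum(lams[name] * np.sum(W ** 2) for name, W in weights.items())

# Per-layer penalty gradient, to be added to each layer's task gradient.
penalty_grads = {name: 2 * lams[name] * W for name, W in weights.items()}

print(f"penalty: {penalty:.4f}")
```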
Test Your Understanding
- Compute the L2 regularized loss and the gradient for weight w₁ = −0.8 with λ = 0.05 and ∂L/∂w₁ = 0.1. Is the regularization pushing w₁ toward or away from zero?
- With L1 at λ = 0.01 and lr = 0.1, weight w₂ = 0.1 has gradient ∂L/∂w₂ = −0.05. Compute the updated weight. Now compute for λ = 0.2. What happens and why?
- Running the lambda sensitivity loop at λ=1.0 shrinks w₀ from 2.5 to about 0.36 in 10 steps, even though the task gradient on w₀ is −0.2 (pushing w₀ up). Explain the mechanism, and work out the value w₀ would settle at if the loop kept running.
- Elastic net combines L1 and L2. For two highly correlated input features with weights w_a and w_b, why does L1 tend to zero out one of them while L2 spreads the weight across both? Which is better in a multicollinearity scenario?
- You train a 10-layer network with L2 regularization and observe that weights in early layers shrink more than weights in later layers, even with the same λ. Propose an explanation involving gradient magnitudes across layers.