L1 and L2 Regularization

Jul 3, 2026 · 7 min read · By Mohammed Vasim
Tags: deep-learning, neural-networks, machine-learning, representation-learning

A network trained without any constraint on weight magnitudes will find a solution that fits the training data as closely as possible — including noise. Given enough parameters, it can memorize every training example, driving training loss to near zero while validation loss climbs. The weights in such a model tend to be large in magnitude: large weights give the model the flexibility to produce sharp decision boundaries that pass through every training point.

Regularization adds a penalty on weight magnitudes to the loss function. Small weights can't produce sharp, noise-fitting decision boundaries. The two standard choices differ in what they penalize and what they produce: L2 penalizes squared weights and shrinks all weights toward zero proportionally; L1 penalizes absolute weights and drives small weights exactly to zero, producing a sparse model.

Anchor: three weights w = [2.5, −0.8, 0.1]. Base loss L = 1.2. λ = 0.01. lr = 0.1. Hypothetical gradients: ∂L/∂w = [−0.2, 0.1, −0.05].


The Overfitting Setup

[Figure: Without Regularization (Overfitting). Loss (low to high) plotted against epochs for the train and val curves, with the point where the two curves diverge marked.]

L2 Regularization

L_total = L + λ Σ wⱼ²

The penalty grows with the square of each weight. The gradient of the penalty term:

∂(λΣwⱼ²)/∂wⱼ = 2λwⱼ

So the full gradient and weight update are:

∂L_total/∂wⱼ = ∂L/∂wⱼ + 2λwⱼ

wⱼ ← wⱼ − lr · (∂L/∂wⱼ + 2λwⱼ)

The effect: every update subtracts 2λlr·wⱼ from each weight — proportional to its current magnitude. Large weights shrink faster than small weights. No weight ever reaches exactly zero (unless the gradient also pushes it there), but all weights get pulled toward zero continuously.
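
The "subtracts 2λ·lr·wⱼ" reading can be checked numerically. The sketch below reuses the anchor values and compares two equivalent ways of writing the same update: adding the penalty gradient to the loss gradient, and first multiplying each weight by (1 − 2λ·lr) before taking the plain gradient step. The variable names are illustrative, not part of the post's code.

```python
import numpy as np

# Anchor values from the text
w = np.array([2.5, -0.8, 0.1])
grad_L = np.array([-0.2, 0.1, -0.05])
lam, lr = 0.01, 0.1

# Form 1: add the penalty gradient 2*lam*w to the loss gradient
w_penalty = w - lr * (grad_L + 2 * lam * w)

# Form 2: shrink each weight by (1 - 2*lam*lr), then take the plain gradient step
decay = 1 - 2 * lam * lr
w_decay = decay * w - lr * grad_L

# Amount the penalty alone removes from each weight on this step
shrink = 2 * lam * lr * w

print("penalty form:", w_penalty.round(4))
print("decay form:  ", w_decay.round(4))
print("shrink:      ", shrink.round(4))
```

Both forms give [2.515, −0.8084, 0.1048]. The penalty alone removes [0.005, −0.0016, 0.0002], so the largest weight loses the most in absolute terms while every weight loses the same 0.2% of its value.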

Regularized loss: L_total = 1.2 + 0.01 × (2.5² + 0.8² + 0.1²) = 1.2 + 0.01 × (6.25 + 0.64 + 0.01) = 1.2 + 0.069 = 1.269

Updated weights:

  • w₀: ∂L_total/∂w₀ = −0.2 + 2×0.01×2.5 = −0.2 + 0.05 = −0.15 → w₀ = 2.5 − 0.1×(−0.15) = 2.515
  • w₁: ∂L_total/∂w₁ = 0.1 + 2×0.01×(−0.8) = 0.1 − 0.016 = 0.084 → w₁ = −0.8 − 0.1×0.084 = −0.8084
  • w₂: ∂L_total/∂w₂ = −0.05 + 2×0.01×0.1 = −0.05 + 0.002 = −0.048 → w₂ = 0.1 − 0.1×(−0.048) = 0.1048

L1 Regularization

L_total = L + λ Σ |wⱼ|

The gradient of |wⱼ| is sign(wⱼ) — +1 for positive weights, −1 for negative weights, undefined (use subgradient = 0) at exactly zero.

∂L_total/∂wⱼ = ∂L/∂wⱼ + λ·sign(wⱼ)

sign(w) for anchor: sign([2.5, −0.8, 0.1]) = [+1, −1, +1]

Regularized loss: L_total = 1.2 + 0.01 × (|2.5| + |−0.8| + |0.1|) = 1.2 + 0.01 × 3.4 = 1.234

Updated weights:

  • w₀: ∂ = −0.2 + 0.01×1 = −0.19 → w₀ = 2.5 − 0.1×(−0.19) = 2.519
  • w₁: ∂ = 0.1 + 0.01×(−1) = 0.09 → w₁ = −0.8 − 0.1×0.09 = −0.809
  • w₂: ∂ = −0.05 + 0.01×1 = −0.04 → w₂ = 0.1 − 0.1×(−0.04) = 0.104

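The L1 penalty gradient on w₂ = 0.1 is λ×sign(0.1) = 0.01, which moves the weight toward zero by lr×λ = 0.001 per step. That is the same pull applied to w₀ = 2.5: unlike the L2 term 2λw, it does not fade as the weight gets small, so L1 can carry a small weight all the way down to zero when the loss gradient does not oppose it. In this anchor the loss gradient on w₂ (−0.05) is larger than the penalty and points the other way, so on this step w₂ actually grows to 0.104.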


Side-by-Side Trace Table

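| Quantity | w₀ (L2) | w₀ (L1) | w₁ (L2) | w₁ (L1) | w₂ (L2) | w₂ (L1) |
|---|---|---|---|---|---|---|
| Reg penalty term | 2×0.01×2.5 = 0.05 | 0.01×1 = 0.01 | 2×0.01×(−0.8) = −0.016 | 0.01×(−1) = −0.01 | 2×0.01×0.1 = 0.002 | 0.01×1 = 0.01 |
| Full gradient | −0.15 | −0.19 | 0.084 | 0.09 | −0.048 | −0.04 |
| Update (−lr×grad) | +0.015 | +0.019 | −0.0084 | −0.009 | +0.0048 | +0.004 |
| New weight | 2.515 | 2.519 | −0.8084 | −0.809 | 0.1048 | 0.104 |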

Geometric Intuition

[Figure: two panels. L1 Constraint (Diamond): the optimum lands on a corner of the diamond, so w₂ = 0 (sparse). L2 Constraint (Circle): the optimum lands on a smooth edge of the circle, so both weights are nonzero.]

Constrained optimization perspective: minimizing L subject to Σwⱼ² ≤ t (L2) confines the solution to a ball, and when the constraint is active the solution sits on its surface, a sphere. Loss contours (ellipses) almost never first touch the sphere exactly on a coordinate axis, so weights are rarely exactly zero. Minimizing L subject to Σ|wⱼ| ≤ t (L1) confines the solution to a diamond with sharp corners on the coordinate axes. Loss contours frequently touch the diamond first at a corner, and at a corner one or more weights are exactly zero (in the two-weight picture, exactly one).
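
The same contrast shows up without a picture in a one-dimensional toy problem. This problem is an assumption introduced here, not part of the anchor: minimize ½(w − a)² plus a penalty, where a is the unpenalized optimum. With the L2 penalty λw² the minimizer is a/(1 + 2λ), which is nonzero whenever a is. With the L1 penalty λ|w| the minimizer is the soft-threshold sign(a)·max(|a| − λ, 0), which is exactly zero whenever |a| ≤ λ. The function names and λ = 0.2 below are illustrative choices.

```python
import numpy as np

def l2_minimizer(a, lam):
    """Closed-form argmin over w of 0.5*(w - a)**2 + lam * w**2."""
    return a / (1.0 + 2.0 * lam)

def l1_minimizer(a, lam):
    """Closed-form argmin over w of 0.5*(w - a)**2 + lam * |w| (soft-thresholding)."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

a = np.array([2.5, -0.8, 0.1])  # unpenalized optima, reusing the anchor weights
lam = 0.2                       # hypothetical penalty strength for this toy problem

w_l2 = l2_minimizer(a, lam)
w_l1 = l1_minimizer(a, lam)

print("L2:", w_l2.round(4))
print("L1:", w_l1.round(4))
```

Under L2 every entry is scaled by 1/1.4 and the smallest ends near 0.071. Under L1 the two large entries move by 0.2 toward zero and the smallest becomes exactly 0.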


Elastic Net

L_total = L + λ₁Σ|wⱼ| + λ₂Σwⱼ²

Combines L1 sparsity with L2 stability. L2 handles correlated features that L1 struggles with (L1 arbitrarily picks one; L2 spreads the weight). L1 still drives small weights to zero.

With λ₁ = λ₂ = 0.01 on anchor:

Elastic net penalty = 0.01×3.4 + 0.01×6.9 = 0.034 + 0.069 = 0.103
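
For completeness, here is a sketch of one elastic net update on the anchor. It assumes the gradient is simply the sum of the two penalty gradients derived earlier, λ₁·sign(wⱼ) + 2λ₂wⱼ, added to the loss gradient; the variable names are illustrative.

```python
import numpy as np

# Anchor values from the text
w = np.array([2.5, -0.8, 0.1])
grad_L = np.array([-0.2, 0.1, -0.05])
lam1, lam2, lr = 0.01, 0.01, 0.1

# Penalty: L1 part plus L2 part
penalty = lam1 * np.sum(np.abs(w)) + lam2 * np.sum(w**2)

# Gradient: loss gradient + L1 subgradient + L2 gradient
grad_en = grad_L + lam1 * np.sign(w) + 2 * lam2 * w
w_new_en = w - lr * grad_en

print(f"Elastic net penalty: {penalty:.4f}")
print("Gradients:", grad_en.round(4))
print("Updated weights:", w_new_en.round(4))
```

The penalty matches the 0.103 computed by hand, the gradients come out to [−0.14, 0.074, −0.038], and the updated weights are [2.514, −0.8074, 0.1038], which sit between the L1-only and L2-only results for this step.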


Hyperparameter Sensitivity

```python
import numpy as np

# Anchor values from the top of the post
w = np.array([2.5, -0.8, 0.1])
grad_L = np.array([-0.2, 0.1, -0.05])
base_loss = 1.2
lam = 0.01
lr = 0.1

# L2: penalty lam * sum(w^2), penalty gradient 2 * lam * w
reg_loss_l2 = base_loss + lam * np.sum(w**2)
grad_l2 = grad_L + 2 * lam * w
w_new_l2 = w - lr * grad_l2

# L1: penalty lam * sum(|w|), penalty (sub)gradient lam * sign(w)
reg_loss_l1 = base_loss + lam * np.sum(np.abs(w))
grad_l1 = grad_L + lam * np.sign(w)
w_new_l1 = w - lr * grad_l1

print("L2 Regularization:")
print(f"  Reg loss: {reg_loss_l2:.4f}")
print(f"  Gradients (with L2): {grad_l2.round(4)}")
print(f"  Updated weights: {w_new_l2.round(4)}")

print("\nL1 Regularization:")
print(f"  Reg loss: {reg_loss_l1:.4f}")
print(f"  sign(w): {np.sign(w)}")
print(f"  Gradients (with L1): {grad_l1.round(4)}")
print(f"  Updated weights: {w_new_l1.round(4)}")

# Ten L2 steps per lambda, with the task gradient held constant for simplicity
print("\nLambda sensitivity (L2, 10 steps):")
for lam_test in [0.0, 0.01, 0.1, 1.0]:
    w_test = w.copy()
    for _ in range(10):
        w_test = w_test - lr * (grad_L + 2 * lam_test * w_test)
    print(f"  λ={lam_test:.2f}: {w_test.round(4)}")
```

```text
L2 Regularization:
  Reg loss: 1.2690
  Gradients (with L2): [-0.15   0.084 -0.048]
  Updated weights: [ 2.515  -0.8084  0.1048]

L1 Regularization:
  Reg loss: 1.2340
  sign(w): [ 1. -1.  1.]
  Gradients (with L1): [-0.19  0.09 -0.04]
  Updated weights: [ 2.519 -0.809  0.104]

Lambda sensitivity (L2, 10 steps):
  λ=0.00: [ 2.7  -0.9   0.15]
  λ=0.01: [ 2.6487 -0.8832  0.1476]
  λ=0.10: [ 2.2256 -0.7451  0.1274]
  λ=1.00: [ 0.3577 -0.1305  0.0331]
```

At λ=0: pure gradient descent; each weight drifts by −lr·∂L/∂wⱼ per step and ends at [2.7, −0.9, 0.15]. At λ=0.01: mild shrinkage; weights stay close to the unregularized result. At λ=0.1: noticeable shrinkage, largest in absolute terms for the large w₀ (2.23 instead of 2.7). At λ=1.0: the decay term dominates and every weight collapses toward the point where the two gradients cancel, −(∂L/∂wⱼ)/(2λ) = [0.1, −0.05, 0.025]. w₀ falls from 2.5 to 0.36 in ten steps without ever changing sign, and the model is severely underfit. The task gradient is held constant in this loop, which a real training run would not do.
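
Because the sensitivity loop holds the task gradient g constant, its result has a closed form, which makes the role of λ explicit. Each step is w ← a·w − lr·g with a = 1 − 2λ·lr, so the weights move geometrically toward the fixed point w* = −g/(2λ), where the loss gradient and the penalty gradient cancel. The sketch below checks that closed form against the loop for λ > 0; the function names are illustrative.

```python
import numpy as np

w0 = np.array([2.5, -0.8, 0.1])
grad_L = np.array([-0.2, 0.1, -0.05])  # held constant, as in the sensitivity loop
lr, steps = 0.1, 10

def run_loop(lam):
    """Ten explicit L2-regularized gradient steps."""
    w = w0.copy()
    for _ in range(steps):
        w = w - lr * (grad_L + 2 * lam * w)
    return w

def closed_form(lam):
    """Same result from w_n = w* + a**n * (w0 - w*); valid only for lam > 0."""
    a = 1 - 2 * lam * lr          # per-step shrink factor
    w_star = -grad_L / (2 * lam)  # fixed point where the two gradients cancel
    return w_star + a**steps * (w0 - w_star)

for lam in [0.01, 0.1, 1.0]:
    print(f"λ={lam:.2f}  loop={run_loop(lam).round(4)}  closed form={closed_form(lam).round(4)}")
```

At λ = 1.0 the fixed point is [0.1, −0.05, 0.025] and the shrink factor is 0.8, so after ten steps only about 11% of the initial distance to that point remains.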


L1 and L2 regularization add penalties to the loss before backpropagation (04-backpropagation.md) computes gradients — the gradient computation itself is unchanged; the penalty just modifies what gets differentiated. The connection between L2 regularization and AdamW (05-optimizers/08-adamw.md) is important: Adam+L2 does not actually regularize consistently because the adaptive step scales the decay; AdamW decouples the decay from the gradient update. Dropout (03-dropout.md) is a different approach to regularization — it doesn't penalize weights directly, but forces the network to learn redundant representations.
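
The Adam+L2 versus AdamW distinction can be seen on a single step. The sketch below is a deliberate simplification, not either optimizer's full implementation: it models only the first Adam step, where bias correction makes the moment estimates equal to g and g², so the update is roughly lr·sign(g). The helper name `adam_first_step_direction` and the choice `weight_decay = 2 * lam` are assumptions made for illustration.

```python
import numpy as np

def adam_first_step_direction(g, eps=1e-8):
    """Direction of the first Adam step: m_hat = g, v_hat = g**2 after bias correction."""
    return g / (np.sqrt(g**2) + eps)

# Anchor values from the text
w = np.array([2.5, -0.8, 0.1])
grad_L = np.array([-0.2, 0.1, -0.05])
lam, lr = 0.01, 0.1

# Adam without any penalty, for reference
w_adam = w - lr * adam_first_step_direction(grad_L)

# Adam + L2: the penalty gradient is added BEFORE the adaptive rescaling
w_adam_l2 = w - lr * adam_first_step_direction(grad_L + 2 * lam * w)

# AdamW-style: rescale only the loss gradient, then decay the weights separately
weight_decay = 2 * lam  # hypothetical choice so the decay mirrors the 2*lam*w term
w_adamw = w - lr * adam_first_step_direction(grad_L) - lr * weight_decay * w

print("Adam (no penalty):", w_adam.round(4))
print("Adam + L2:        ", w_adam_l2.round(4))
print("AdamW-style:      ", w_adamw.round(4))
```

On this step the L2 term changes nothing under Adam, because the rescaling keeps only the sign of each gradient component and the penalty is too small to flip any sign. The decoupled version still shrinks each weight by lr·weight_decay·w, giving [2.595, −0.8984, 0.1998].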

Honest Limitations

L1's subgradient at w=0 is zero by convention, but in practice weights rarely reach exactly zero through gradient updates alone. Exact sparsity requires a proximal gradient method (soft-thresholding), not standard SGD with L1 penalty. If feature selection via sparsity is the goal, use sklearn's Lasso with coordinate descent rather than neural network training with L1.
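
A small experiment makes the difference concrete. The sketch below is a toy setup under stated assumptions: a single weight, a loss gradient fixed at zero, and a hypothetical starting value of 0.0105. It compares the plain subgradient step with a proximal step, which takes the loss-gradient step and then soft-thresholds by lr·λ.

```python
import numpy as np

def soft_threshold(x, t):
    """Shrink x toward zero by t, clamping to exactly 0 inside [-t, t]."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

lam, lr = 0.01, 0.1
grad_L = 0.0     # hypothetical: the loss is indifferent to this weight
w_sgd = 0.0105   # hypothetical small starting weight
w_prox = 0.0105

for _ in range(20):
    # Plain SGD on loss + L1 penalty: steps of lr*lam overshoot zero and bounce back
    w_sgd = w_sgd - lr * (grad_L + lam * np.sign(w_sgd))
    # Proximal gradient: loss step first, then soft-threshold by lr*lam
    w_prox = soft_threshold(w_prox - lr * grad_L, lr * lam)

print("subgradient SGD:", w_sgd)
print("proximal step:  ", w_prox)
```

The subgradient version ends up hopping between roughly +0.0005 and −0.0005, never landing on zero, while the proximal version reaches exactly 0.0 on the eleventh step and stays there.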

L2 never drives weights exactly to zero. If you need a sparse model (for example, to prune a neural network for deployment on edge devices), L2 alone won't produce it — you need L1, structured pruning, or magnitude-based pruning after training.
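
As an illustration of the last option, magnitude-based pruning after training can be as simple as zeroing every weight below a cutoff. This is a minimal sketch: the function name and the 0.5 threshold are hypothetical, and real pipelines usually fine-tune the model after pruning.

```python
import numpy as np

def magnitude_prune(w, threshold):
    """Return a copy of w with every entry whose |w| < threshold set to 0."""
    pruned = w.copy()
    pruned[np.abs(pruned) < threshold] = 0.0
    return pruned

w_l2 = np.array([2.515, -0.8084, 0.1048])        # anchor weights after the L2 step
w_pruned = magnitude_prune(w_l2, threshold=0.5)  # threshold is an illustrative choice
sparsity = np.mean(w_pruned == 0.0)              # fraction of weights removed

print("pruned weights:", w_pruned)
print(f"sparsity: {sparsity:.2f}")
```

Only the smallest weight falls below the cutoff, so one of the three weights is removed and the original array is left untouched.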

A single global λ penalizes every weight equally. Input-layer weights connected to irrelevant features should be penalized aggressively; output-layer weights might need less regularization. Using a single λ is a simplification that can suppress important weights. Separate λ values per layer, or per-layer weight norm bounds, are alternatives but significantly increase hyperparameter tuning complexity.
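
A per-layer λ is easy to express once the weights are grouped by layer. The sketch below assumes a hypothetical two-layer model stored as a dictionary; the layer names, shapes, and λ values are made up for illustration.

```python
import numpy as np

# Hypothetical two-layer weights and per-layer penalty strengths
layers = {
    "input": np.array([[2.5, -0.8], [0.1, 0.3]]),
    "output": np.array([[1.0], [-0.5]]),
}
lams = {"input": 0.1, "output": 0.001}

def l2_penalty(layers, lams):
    """Total L2 penalty: sum over layers of lam_layer * sum(w**2)."""
    return sum(lams[name] * np.sum(w**2) for name, w in layers.items())

def l2_penalty_grads(layers, lams):
    """Per-layer penalty gradients: 2 * lam_layer * w."""
    return {name: 2 * lams[name] * w for name, w in layers.items()}

total_penalty = l2_penalty(layers, lams)
penalty_grads = l2_penalty_grads(layers, lams)

print(f"total penalty: {total_penalty:.5f}")
for name, g in penalty_grads.items():
    print(name, g.round(4).tolist())
```

Here the input layer contributes 0.699 of the 0.70025 total penalty, so nearly all of the shrinkage pressure falls on it. Each extra λ is another hyperparameter to tune.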


Test Your Understanding

  1. Compute the L2 regularized loss and the gradient for weight w₁ = −0.8 with λ = 0.05 and ∂L/∂w₁ = 0.1. Is the regularization pushing w₁ toward or away from zero?

  2. With L1 at λ = 0.01 and lr = 0.1, weight w₂ = 0.1 has gradient ∂L/∂w₂ = −0.05. Compute the updated weight. Now compute for λ = 0.2. What happens and why?

  3. In the lambda sensitivity loop at λ=1.0, w₀ falls from 2.5 to about 0.36 in ten steps even though the task gradient is −0.2 (pushing w₀ up). Explain the mechanism, then show that w₀ settles at 0.1 and never becomes negative.

  4. Elastic net combines L1 and L2. For two highly correlated input features with weights w_a and w_b, why does L1 tend to zero out one of them while L2 spreads the weight across both? Which is better in a multicollinearity scenario?

  5. You train a 10-layer network with L2 regularization and observe that weights in early layers shrink more than weights in later layers, even with the same λ. Propose an explanation involving gradient magnitudes across layers.
