
Leaky ReLU and Parametric ReLU

Jul 1, 2026 · 9 min read · By Mohammed Vasim
Tags: deep-learning, neural-networks, machine-learning, representation-learning

ReLU kills neurons. When z ≤ 0 for every input a neuron encounters, its output is zero and its gradient is zero — the weight never moves again. The fix is simple: instead of zero for negative z, allow a small nonzero slope. The neuron might not contribute much to the forward pass, but it still receives a gradient signal and can recover if the input distribution shifts.

Leaky ReLU applies a fixed small slope α = 0.01 for negative z. Parametric ReLU (PReLU) goes further: it learns the slope α during training, one value per channel.

Anchor z values: {−3, −1, 0, 0.5, 1, 3}.


Leaky ReLU

LeakyReLU(z) = z if z > 0 else αz, with α = 0.01 by default.

The gradient is:

LeakyReLU'(z) = 1 if z > 0 else α

For z = −1: gradient = 0.01. Not 0. The neuron is not dead — it receives a small update each step.

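| z | ReLU(z) | LeakyReLU(z, α=0.01) | LeakyReLU'(z) |
|---|---|---|---|
| −3 | 0.0000 | −0.0300 | 0.01 |
| −1 | 0.0000 | −0.0100 | 0.01 |
| 0 | 0.0000 | 0.0000 | 0.01* |
| 0.5 | 0.5000 | 0.5000 | 1.00 |
| 1 | 1.0000 | 1.0000 | 1.00 |
| 3 | 3.0000 | 3.0000 | 1.00 |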

*Set to α at boundary in practice.
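
The gradient column can be sanity-checked numerically. The sketch below is not part of the original post: it compares the analytic derivative with a central finite difference at the anchor values, and the helper name `central_difference` and the step size `eps` are assumptions chosen for illustration. At z = 0 the two one-sided slopes differ, so the central difference lands halfway between them, which is why an explicit convention is needed there.

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    # Analytic derivative; z = 0 is assigned the negative-side slope alpha
    return np.where(z > 0, 1.0, alpha)

def central_difference(f, z, eps=1e-6):
    # Hypothetical helper: symmetric finite-difference estimate of f'(z)
    return (f(z + eps) - f(z - eps)) / (2 * eps)

# Anchor values, leaving out z = 0 where the derivative is not defined
z_vals = np.array([-3.0, -1.0, 0.5, 1.0, 3.0])
analytic = leaky_relu_grad(z_vals)
numeric = central_difference(leaky_relu, z_vals)
max_abs_diff = float(np.max(np.abs(analytic - numeric)))

# At the kink the estimate averages the two slopes: (1 + alpha) / 2
slope_at_zero = float(central_difference(leaky_relu, np.array(0.0)))

print("max |analytic - numeric| away from 0:", max_abs_diff)
print("central difference at z = 0:", slope_at_zero)
```

Away from zero the two derivatives agree up to floating-point error. At z = 0 the estimate is close to 0.505, neither α nor 1, so frameworks simply pick one side.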

[Figure: Leaky ReLU (α = 0.01), small nonzero slope for z < 0. Marked points: (−3, −0.03), (−1, −0.01), (0, 0), (0.5, 0.5). Slope = α = 0.01 for z < 0, slope = 1.0 for z > 0.]

Gradient Recovery in the 5-Layer Network

Using the same 5-layer anchor from the vanishing gradient post, but with all neurons having z < 0 (worst case for ReLU) and Leaky ReLU with α = 0.01, w = 0.1:

ReLU at each dead layer: δ₁ = 1.0 × 0 × 0 × 0 × 0 = 0 — completely dead.

Leaky ReLU: Each layer contributes 0.1 × 0.01 = 0.001:

δ₃ = 1.0 × 0.001 = 0.001

δ₂ = 0.001 × 0.001 = 1.0 × 10⁻⁶

δ₁ = 1.0 × 10⁻⁶ × 0.001 = 1.0 × 10⁻⁹

Still very small, but not zero: the gradient is 1.0 × 10⁻⁹ instead of 0. Because some gradient reaches the neuron's own weights, they keep moving, and the neuron can wake up once z becomes positive for some inputs (for example, after the input distribution shifts). A dead ReLU neuron gets no gradient through its own weights, so it cannot pull itself back.
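
The chain of multiplications above can be reproduced in a few lines. This is a minimal sketch, assuming the same simplification as the worked numbers: one weight w = 0.1 per layer, every pre-activation negative, and an upstream gradient of 1.0. The function name `backprop_chain` is made up for illustration.

```python
def backprop_chain(upstream, weight, local_grad, steps):
    """Multiply an upstream gradient by weight * local_grad once per layer."""
    deltas = []
    delta = upstream
    for _ in range(steps):
        delta = delta * weight * local_grad
        deltas.append(delta)
    return deltas

w = 0.1
# ReLU: local gradient is 0 for z < 0, so everything downstream is exactly 0
relu_deltas = backprop_chain(1.0, w, local_grad=0.0, steps=3)
# Leaky ReLU: local gradient is alpha = 0.01 for z < 0
leaky_deltas = backprop_chain(1.0, w, local_grad=0.01, steps=3)

for name, deltas in [("ReLU", relu_deltas), ("Leaky ReLU", leaky_deltas)]:
    print(name, ["%.1e" % d for d in deltas])
```

The Leaky ReLU values shrink by a factor of 0.001 per layer (about 1e-3, 1e-6, 1e-9), while the ReLU values are zero from the first step onward.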


Parametric ReLU (PReLU)

PReLU uses the same formula but treats α as a learnable parameter:

PReLU(z) = z if z > 0 else αz, with α initialized to 0.25 and updated by backpropagation.

The gradient with respect to z is the same as Leaky ReLU: 1 for z > 0, α for z ≤ 0.

The gradient with respect to α is new. For a sample where z ≤ 0:

∂L/∂α = ∂L/∂a × ∂a/∂α = δ × z

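where δ is the gradient arriving from the next layer and z is the pre-activation for this neuron. For a sample where z > 0 the output does not depend on α, so that sample contributes nothing to ∂L/∂α. When one α is shared across a channel, the contributions from every sample and position in that channel are summed. α is then updated by gradient descent: α ← α − η × δ × z.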

This means the network learns how much gradient to allow through negative activations — different channels can learn different α values depending on what works best for that feature.
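
A minimal NumPy sketch of the PReLU forward and backward pass, assuming a (batch, channels) layout with one α per channel. The function names, the all-ones upstream gradient `delta`, and the learning rate `eta` are assumptions for illustration, not values from the post.

```python
import numpy as np

def prelu_forward(z, alpha):
    # z: (batch, channels); alpha: (channels,), broadcast across the batch
    return np.where(z > 0, z, alpha * z)

def prelu_backward(z, alpha, delta):
    # delta is dL/da with the same shape as z
    dz = delta * np.where(z > 0, 1.0, alpha)                   # dL/dz
    dalpha = np.sum(delta * np.where(z > 0, 0.0, z), axis=0)   # dL/dalpha, summed over the batch
    return dz, dalpha

z = np.array([[-3.0, 0.5],
              [-1.0, 3.0]])
alpha = np.array([0.25, 0.25])   # initial value used in the text
delta = np.ones_like(z)          # hypothetical upstream gradient
eta = 0.1                        # hypothetical learning rate

a = prelu_forward(z, alpha)
dz, dalpha = prelu_backward(z, alpha, delta)
alpha_new = alpha - eta * dalpha

print("a =", a)
print("dalpha =", dalpha)
print("alpha_new =", alpha_new)
```

In this toy batch, channel 0 sees only negative z, so its α receives a gradient of −4 and moves from 0.25 to 0.65. Channel 1 sees only positive z, so its α gets no gradient and stays at 0.25.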

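| z | LeakyReLU(α=0.01) | PReLU(α=0.25) | PReLU(α=0.5) |
|---|---|---|---|
| −3 | −0.0300 | −0.750 | −1.500 |
| −1 | −0.0100 | −0.250 | −0.500 |
| 0 | 0.0000 | 0.000 | 0.000 |
| 0.5 | 0.5000 | 0.500 | 0.500 |
| 1 | 1.0000 | 1.000 | 1.000 |
| 3 | 3.0000 | 3.000 | 3.000 |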

For positive z, all three are identical. The difference is entirely in how negative activations are handled.

[Figure: Leaky ReLU vs PReLU, negative slope comparison. Curves: ReLU, α = 0.01, α = 0.25, α = 0.50. Both axes run from −3 to 3.]

PReLU tends to help most on large datasets, where there is enough signal to fit the extra parameters. He et al. (2015) reported that PReLU reduced top-1 error on ImageNet by about 1.2 percentage points relative to ReLU on a 14-layer convolutional model (the experiment predates ResNet). On small datasets, PReLU can overfit, and the learned α may be little more than noise.


Hyperparameter Sensitivity: Leaky ReLU α

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    # Derivative w.r.t. z; z = 0 is assigned the negative-side slope alpha
    return np.where(z > 0, 1.0, alpha)

# Churn network anchor
X = np.array([[0.5, 0.1], [0.9, 0.8], [0.2, 0.3], [0.7, 0.6], [0.4, 0.2]])
y = np.array([0, 1, 0, 1, 0], dtype=float)

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def train_network(alpha, epochs=200, lr=0.1, seed=42):
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0, 0.1, (2, 2))
    b1 = np.zeros(2)
    W2 = rng.normal(0, 0.1, (2, 1))
    b2 = np.zeros(1)
    losses = []
    for _ in range(epochs):
        # Forward pass: 2 inputs -> 2 hidden (Leaky ReLU) -> 1 output (sigmoid)
        Z1 = X @ W1 + b1
        A1 = leaky_relu(Z1, alpha)
        Z2 = A1 @ W2 + b2
        A2 = sigmoid(Z2).flatten()
        # Binary cross-entropy, with a small epsilon to avoid log(0)
        loss = -np.mean(y * np.log(A2 + 1e-8) + (1 - y) * np.log(1 - A2 + 1e-8))
        losses.append(loss)
        # Backward pass: sigmoid + cross-entropy gives dL/dZ2 = (A2 - y) / n
        dZ2 = ((A2 - y) / len(y)).reshape(-1, 1)
        dW2 = A1.T @ dZ2
        db2 = dZ2.sum()
        dA1 = dZ2 @ W2.T
        dZ1 = dA1 * leaky_relu_grad(Z1, alpha)
        dW1 = X.T @ dZ1
        db1 = dZ1.sum(axis=0)
        # Gradient descent step
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2
    return losses

print(f"{'α':>8} | {'Loss@10':>8} | {'Loss@50':>8} | {'Loss@200':>9} | Notes")
print("-" * 65)
for alpha in [0.001, 0.01, 0.1, 0.3]:
    losses = train_network(alpha)
    # Notes are fixed labels chosen by α range, not computed from the losses
    if alpha < 0.005:
        note = "still nearly dead for large neg z"
    elif alpha > 0.2:
        note = "nearly linear — loses non-linearity"
    else:
        note = "stable"
    print(f"{alpha:>8.3f} | {losses[9]:>8.4f} | {losses[49]:>8.4f} | {losses[199]:>9.4f} | {note}")
```

```text
       α |  Loss@10 |  Loss@50 |  Loss@200 | Notes
-----------------------------------------------------------------
   0.001 |   0.7021 |   0.6847 |    0.6732 | still nearly dead for large neg z
   0.010 |   0.6894 |   0.5821 |    0.4103 | stable
   0.100 |   0.6831 |   0.5634 |    0.3897 | stable
   0.300 |   0.6745 |   0.5501 |    0.4218 | nearly linear — loses non-linearity
```

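α = 0.001 barely helps: the gradient for any negative z is only 0.001, so the neuron learns very slowly and the loss hardly moves (0.6732 after 200 epochs). α = 0.3 trains, but it ends slightly worse than the mid-range values (0.4218) and gives up some of the non-linear representational capacity that makes ReLU useful in the first place. In this run α = 0.1 reaches the lowest final loss (0.3897), with the standard default α = 0.01 close behind (0.4103).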


Side-by-Side Comparison

| Property | ReLU | Leaky ReLU | PReLU |
|---|---|---|---|
| Dead neurons | Yes | No | No |
| Extra parameters | 0 | 0 | 1 per channel |
| Gradient at z < 0 | 0 | α = 0.01 (fixed) | α (learned) |
| Compute cost | Fastest | +1 multiply | + backprop through α |
| When to use | Default start | Dead ReLU observed | Large datasets, can tune |

Code: ReLU vs Leaky ReLU Forward Values

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)

z_vals = np.array([-3.0, -1.0, 0.0, 0.5, 1.0, 3.0])

# Header for the derivative column, built outside the f-string to avoid quote escaping
grad_header = "Leaky'"
print(f"{'z':>5} | {'ReLU':>6} | {'Leaky(0.01)':>12} | {grad_header:>8}")
print("-" * 42)
for z in z_vals:
    relu_v = max(0.0, z)
    lrelu_v = float(leaky_relu(z))       # np.where returns a 0-d array for scalar input
    lrelu_g = float(leaky_relu_grad(z))
    print(f"{z:>5.1f} | {relu_v:>6.4f} | {lrelu_v:>12.4f} | {lrelu_g:>8.2f}")
```

```text
    z |   ReLU |  Leaky(0.01) |   Leaky'
------------------------------------------
 -3.0 | 0.0000 |      -0.0300 |     0.01
 -1.0 | 0.0000 |      -0.0100 |     0.01
  0.0 | 0.0000 |       0.0000 |     0.01
  0.5 | 0.5000 |       0.5000 |     1.00
  1.0 | 1.0000 |       1.0000 |     1.00
  3.0 | 3.0000 |       3.0000 |     1.00
```

Where this builds from: ReLU's dead neuron problem is the direct motivation. When z ≤ 0 permanently, ReLU gradient = 0 and the neuron never recovers. Leaky ReLU is the minimal fix — change that 0 to α.

Where this leads: ELU is the smooth alternative — instead of a kink at z = 0 and a linear negative region, ELU uses an exponential curve that asymptotes to −α. This gives better optimization behavior at the cost of computing exp(). He initialization is also designed to work with these ReLU variants.


Honest Limitations

Leaky ReLU still has a kink at z = 0. The derivative is discontinuous at the boundary: it jumps from α to 1. The function is therefore not continuously differentiable, a smoothness property that some convergence analyses assume. ELU with α = 1 removes the kink.
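
The jump in the derivative can be measured directly. The sketch below is an illustration rather than anything from the original post: it estimates the left and right slopes at z = 0 for Leaky ReLU and for ELU, using the standard ELU definition α(exp(z) − 1) for z ≤ 0 with α = 1. The helper name `one_sided_slopes` and the step size are assumptions.

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def elu(z, alpha=1.0):
    # expm1(z) = exp(z) - 1; the minimum keeps exp from overflowing on the unused positive branch
    return np.where(z > 0, z, alpha * np.expm1(np.minimum(z, 0.0)))

def one_sided_slopes(f, eps=1e-6):
    # Hypothetical helper: finite-difference slopes just left and right of z = 0
    left = (f(0.0) - f(-eps)) / eps
    right = (f(eps) - f(0.0)) / eps
    return float(left), float(right)

leaky_left, leaky_right = one_sided_slopes(leaky_relu)
elu_left, elu_right = one_sided_slopes(elu)

print(f"Leaky ReLU: left slope {leaky_left:.4f}, right slope {leaky_right:.4f}")
print(f"ELU (α=1):  left slope {elu_left:.4f}, right slope {elu_right:.4f}")
```

Leaky ReLU shows slopes of roughly 0.01 and 1 on the two sides, while ELU with α = 1 gives roughly 1 on both. With other values of α, ELU's left slope at zero is α, so the kink returns.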

PReLU adds parameters that can overfit. On a dataset with 1,000 examples and a network with 512 channels, PReLU adds 512 extra parameters — one per channel. Each one needs enough gradient signal to learn a meaningful value. On small datasets, α converges to noise.

Large α loses non-linearity. As α approaches 1.0, Leaky ReLU becomes approximately linear (output ≈ z for all z). A network whose activations are all linear is equivalent to one with no hidden layers at all, because the composition of linear functions is linear. Keep α small.
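
The collapse to a single linear map is easy to check numerically. This is a small sketch under stated assumptions: random inputs and weights with a fixed seed, no biases, and a hypothetical two-layer network. With α = 1 the network output should match the single matrix product X(W1W2) up to floating-point error, while α = 0.01 should not.

```python
import numpy as np

def leaky_relu(z, alpha):
    return np.where(z > 0, z, alpha * z)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))    # hypothetical inputs
W1 = rng.normal(size=(3, 4))   # hypothetical weights, biases omitted
W2 = rng.normal(size=(4, 2))

def two_layer(X, alpha):
    # Hidden layer with Leaky ReLU, then a linear output layer
    return leaky_relu(X @ W1, alpha) @ W2

# What a network with no hidden non-linearity computes
single_linear = X @ (W1 @ W2)

gap_at_one = float(np.max(np.abs(two_layer(X, 1.0) - single_linear)))
gap_at_small = float(np.max(np.abs(two_layer(X, 0.01) - single_linear)))

print("max gap with α = 1.0: ", gap_at_one)
print("max gap with α = 0.01:", gap_at_small)
```

The gap at α = 1 is at the level of rounding error, which is the "no hidden layers" equivalence in practice. The gap at α = 0.01 is what the non-linearity buys.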


Test Your Understanding

  1. A Leaky ReLU neuron with α = 0.01 has z = −5 for a given input. Its incoming gradient from the next layer is δ = 0.8. Compute the gradient that flows backward through this neuron and the weight update Δw for a weight w connected to input x = 0.3, with learning rate 0.01.

  2. PReLU learns α per channel. In layer 2 of a CNN with 64 channels, PReLU adds 64 extra parameters. After training, you inspect the learned α values and find that 50 of the 64 channels have α ≈ 0.01 while 14 have α ≈ 0.4. What does this tell you about the data distribution those 14 channels are responding to?

  3. Leaky ReLU with α = 0.5 and standard ReLU both process the same 4-layer network with w = 0.5 at every connection. Compute the gradient at layer 1 for a sample where every neuron has z < 0. For ReLU it is 0; for Leaky ReLU at α = 0.5, what is it?

  4. A team trains a network with Leaky ReLU and observes that after 50 epochs, 80% of neurons in layer 3 have z < 0 for all training inputs. Is this a problem with Leaky ReLU, or is something else wrong? What would you investigate?

  5. Consider a Leaky ReLU with α → 1. The activation approaches linear: output = z for all z. A 3-hidden-layer network with all linear activations is equivalent to a single linear layer (composition of linear functions = linear). At what value of α does this become a practical concern, and how would you detect it in practice without inspecting α directly?
