ELU Activation Function

Jul 1, 2026 · 8 min read · By Mohammed Vasim
deep-learning · neural-networks · machine-learning · representation-learning

Leaky ReLU fixed the dead neuron problem by allowing a small gradient for z < 0. But it kept ReLU's kink at z = 0: the derivative jumps from α to 1 at that point. ELU (Exponential Linear Unit) addresses both problems at once: no dead neurons, and, with the default α = 1, a curve whose slope is continuous everywhere, including at z = 0.

The tradeoff is computation: ELU requires an exponential for negative z. When accuracy matters more than speed, ELU is often the better choice.

Anchor z values: {−3, −1, 0, 0.5, 1, 3} with α = 1.0 (default).


The Formula

ELU(z) = z if z > 0 else α(eᶻ − 1)

For positive z, identical to ReLU. For negative z, an exponential curve that approaches −α asymptotically.

Computing ELU for the Six Anchor Values (α = 1.0)

| z | Region | Computation | ELU(z) |
|---|---|---|---|
| −3 | z < 0 | 1.0 × (e⁻³ − 1) = 1.0 × (0.0498 − 1) | −0.9502 |
| −1 | z < 0 | 1.0 × (e⁻¹ − 1) = 1.0 × (0.3679 − 1) | −0.6321 |
| 0 | boundary | 1.0 × (e⁰ − 1) = 1.0 × 0 | 0.0000 |
| 0.5 | z > 0 | identity | 0.5000 |
| 1 | z > 0 | identity | 1.0000 |
| 3 | z > 0 | identity | 3.0000 |

At z = 0, both branches agree: the linear branch gives 0, and α(e⁰ − 1) = 0. The function is continuous at z = 0. ReLU and Leaky ReLU are continuous there too; what they lack is a continuous slope. ReLU's slope jumps from 0 to 1 at z = 0, and Leaky ReLU's jumps from α to 1. ELU with α = 1 has no such jump.


The Gradient

ELU'(z) = 1 if z > 0 else ELU(z) + α = αeᶻ

For z > 0: gradient = 1 (same as ReLU). For z < 0: gradient = α × eᶻ.
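
The identity ELU'(z) = ELU(z) + α used in the formula follows in one line from the definition. Written out in LaTeX, using only the symbols already introduced, together with the two one-sided limits of the derivative at z = 0:

```latex
\[
\frac{d}{dz}\,\alpha\,(e^{z}-1) \;=\; \alpha e^{z} \;=\; \alpha\,(e^{z}-1) + \alpha \;=\; \mathrm{ELU}(z) + \alpha,
\qquad z < 0 .
\]
\[
\lim_{z\to 0^{-}} \mathrm{ELU}'(z) = \alpha,
\qquad
\lim_{z\to 0^{+}} \mathrm{ELU}'(z) = 1 .
\]
```

The two limits coincide only when α = 1, which is the setting used for the anchor values in this post.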

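| z | ELU(z) | ELU'(z) (αeᶻ for z ≤ 0, 1 for z > 0) |
|---|---|---|
| −3 | −0.9502 | e⁻³ = 0.0498 |
| −1 | −0.6321 | e⁻¹ = 0.3679 |
| 0 | 0.0000 | e⁰ = 1.0000 |
| 0.5 | 0.5000 | 1.0000 |
| 1 | 1.0000 | 1.0000 |
| 3 | 3.0000 | 1.0000 |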

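At z = 0: ELU'(0) = α × e⁰ = α, which equals 1 for the default α = 1.0. The gradient is then continuous at the boundary, with no kink. (For any other α the slope jumps from α to 1, just as it does for Leaky ReLU.) This can matter for optimization: methods that use curvature information assume a reasonably smooth loss, and a kink-free activation removes one source of abrupt gradient changes.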
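
One way to check the behaviour of the slope at the boundary is a finite-difference estimate on each side of z = 0. The sketch below is illustrative only: the helper name `one_sided_slopes` and the step size `h` are assumptions made for this example, not part of any library.

```python
import numpy as np

def elu(z, alpha=1.0):
    # expm1(z) = e^z - 1, computed accurately for z near 0
    z = np.asarray(z, dtype=float)
    return np.where(z > 0, z, alpha * np.expm1(np.minimum(z, 0.0)))

def one_sided_slopes(f, h=1e-6):
    """Finite-difference slopes of f just left and just right of z = 0."""
    left = (f(0.0) - f(-h)) / h
    right = (f(h) - f(0.0)) / h
    return float(left), float(right)

for alpha in (0.5, 1.0, 2.0):
    left, right = one_sided_slopes(lambda z: elu(z, alpha))
    print(f"alpha={alpha:.1f}  left slope={left:.4f}  "
          f"right slope={right:.4f}  jump={right - left:+.4f}")
```

The left slope comes out approximately equal to α and the right slope equal to 1, so the jump vanishes only for α = 1.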

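For z = −3: gradient = 0.0498. Still small, but about five times Leaky ReLU's 0.01 at the same z. The advantage does not hold everywhere: ELU's gradient decays exponentially as z becomes more negative, so for sufficiently negative z it falls below Leaky ReLU's constant 0.01.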
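
To see how far into the negative region ELU's gradient stays above Leaky ReLU's, one can solve αeᶻ = 0.01 for z. This is a minimal sketch assuming α = 1.0 and a Leaky ReLU slope of 0.01; the function names `leaky_relu_grad` and `crossover` are made up for this example.

```python
import math

def elu_grad(z, alpha=1.0):
    # Derivative of ELU: 1 for z > 0, alpha * e^z otherwise
    return 1.0 if z > 0 else alpha * math.exp(z)

def leaky_relu_grad(z, slope=0.01):
    # Derivative of Leaky ReLU: 1 for z > 0, a constant slope otherwise
    return 1.0 if z > 0 else slope

def crossover(alpha=1.0, slope=0.01):
    """z at which alpha * e^z equals the Leaky ReLU slope."""
    return math.log(slope / alpha)

z_star = crossover()
print(f"crossover at z = {z_star:.3f}")
for z in (-1.0, -3.0, -5.0, -8.0):
    print(f"z={z:>5.1f}  ELU'={elu_grad(z):.5f}  Leaky'={leaky_relu_grad(z):.5f}")
```

With these settings the crossover sits at ln(0.01) ≈ −4.6. To the right of it ELU passes more gradient, and to the left Leaky ReLU does.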

[Figure: ELU vs ReLU vs Leaky ReLU, negative region comparison. ELU follows a smooth curve toward −α; Leaky ReLU has a kink at z = 0.]

Why ELU is Better Than Leaky ReLU

1. Smooth at z = 0 (continuous derivative). Leaky ReLU's derivative jumps from α to 1 at z = 0. With α = 1, ELU's derivative approaches 1 from both sides: αe⁰ = 1, so the slope from the negative side matches the slope from the positive side. The gradient signal therefore changes gradually as a pre-activation crosses zero, which tends to make updates less abrupt.

2. Negative saturation pushes mean activation toward zero. ELU's output asymptotes to −α for large negative z, so negative outputs are bounded below by −α while positive outputs are unbounded. The mean activation of a ReLU layer is positive because ReLU outputs are never negative (every z < 0 maps to exactly 0). An ELU layer emits negative values that partly cancel the positive ones, so its mean sits closer to zero. This is similar to the zero-centering benefit of tanh, but without saturation on the positive side.

3. Self-normalizing property (SELU). With a specific α (≈ 1.6733) and a scaling factor λ (≈ 1.0507), ELU networks with proper weight initialization converge to mean=0, standard deviation=1 per layer — automatically, without batch normalization. This variant is called SELU (Scaled ELU), introduced by Klambauer et al. 2017. For most networks, standard ELU with α = 1.0 is sufficient.
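
The mean-shift argument in points 2 and 3 can be illustrated with a quick simulation. The sketch below assumes pre-activations drawn from a standard normal distribution, which is a simplification rather than a property of real networks, and uses the rounded SELU constants quoted above.

```python
import numpy as np

# SELU constants as quoted in the text (rounded)
SELU_ALPHA, SELU_LAMBDA = 1.6733, 1.0507

def relu(z):
    return np.maximum(z, 0.0)

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * np.expm1(np.minimum(z, 0.0)))

def selu(z):
    return SELU_LAMBDA * elu(z, SELU_ALPHA)

# Assumption: pre-activations follow a standard normal distribution
rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)

for name, f in (("ReLU", relu), ("ELU", elu), ("SELU", selu)):
    a = f(z)
    print(f"{name:>5}: mean={a.mean():+.3f}  std={a.std():.3f}")
```

Under this assumption the ReLU mean lands near 0.40 and the ELU mean near 0.16, while SELU stays close to mean 0 and standard deviation 1, the fixed point its constants were chosen for.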


When ELU Outperforms ReLU

Clevert et al. (2015) showed ELU converges faster and achieves lower test error than ReLU on CIFAR-10. The gain is more pronounced in deeper networks where the mean activation shift (ReLU layers have positive mean) creates internal covariate shift that slows training.

ELU is the better choice when:

  • Training a deep network (> 10 layers) without batch normalization
  • You are seeing slow convergence with ReLU and batch norm is not an option
  • Accuracy matters more than inference speed

ReLU wins when:

  • Compute budget is tight: ELU's exp() is slower than max(0, z), roughly 3–5× for the elementwise operation alone, though the ratio varies with hardware and framework
  • Network is shallow enough that activation smoothness doesn't matter
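
The speed gap is easy to measure on your own hardware rather than take on faith. This is a rough micro-benchmark sketch using NumPy on the CPU. The helper `time_per_call` and the array size are arbitrary choices for this example, and the result says little about fused GPU kernels in a deep learning framework.

```python
import timeit
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * np.expm1(np.minimum(z, 0.0)))

def time_per_call(f, z, repeats=5, number=20):
    """Best-of-`repeats` wall time per call, in seconds."""
    return min(timeit.repeat(lambda: f(z), repeat=repeats, number=number)) / number

rng = np.random.default_rng(0)
z = rng.standard_normal(200_000)

t_relu = time_per_call(relu, z)
t_elu = time_per_call(elu, z)
print(f"ReLU: {t_relu * 1e6:.1f} µs   ELU: {t_elu * 1e6:.1f} µs   "
      f"ratio: {t_elu / t_relu:.1f}x")
```

The ratio will differ between machines and libraries, so treat any single number as indicative only.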

α Hyperparameter Sensitivity

```python
import numpy as np

def elu(z, alpha=1.0):
    # expm1(z) = e^z - 1; clamping z at 0 avoids overflow in the branch np.where discards
    return np.where(z > 0, z, alpha * np.expm1(np.minimum(z, 0.0)))

def elu_grad(z, alpha=1.0):
    # For z <= 0 the derivative is alpha * e^z, which equals ELU(z) + alpha
    return np.where(z > 0, 1.0, elu(z, alpha) + alpha)

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

# Toy dataset: 5 samples, 2 features, binary labels
X = np.array([[0.5, 0.1], [0.9, 0.8], [0.2, 0.3], [0.7, 0.6], [0.4, 0.2]])
y = np.array([0, 1, 0, 1, 0], dtype=float)

def train(alpha, epochs=200, lr=0.05, seed=42):
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0, 0.1, (2, 2)); b1 = np.zeros(2)
    W2 = rng.normal(0, 0.1, (2, 1)); b2 = np.zeros(1)
    losses = []
    for _ in range(epochs):
        # Forward pass: ELU hidden layer, sigmoid output, binary cross-entropy loss
        Z1 = X @ W1 + b1; A1 = elu(Z1, alpha)
        Z2 = A1 @ W2 + b2; A2 = sigmoid(Z2).flatten()
        loss = -np.mean(y * np.log(A2 + 1e-8) + (1-y) * np.log(1-A2 + 1e-8))
        losses.append(loss)
        # Backward pass: for sigmoid + cross-entropy, dL/dZ2 = (A2 - y) / n
        dZ2 = ((A2 - y) / len(y)).reshape(-1, 1)
        dW2 = A1.T @ dZ2; db2 = dZ2.sum()
        dA1 = dZ2 @ W2.T
        dZ1 = dA1 * elu_grad(Z1, alpha)
        dW1 = X.T @ dZ1; db1 = dZ1.sum(axis=0)
        # Plain gradient-descent update
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
    return losses

print(f"{'α':>5} | {'Loss@10':>8} | {'Loss@100':>9} | {'Loss@200':>9} | Notes")
print("-" * 60)
for alpha in [0.1, 1.0, 2.0]:
    losses = train(alpha)
    note = {0.1: "shallow negative (closer to ReLU)", 1.0: "default — balanced", 2.0: "aggressive negative region"}.get(alpha, "")
    print(f"{alpha:>5.1f} | {losses[9]:>8.4f} | {losses[99]:>9.4f} | {losses[199]:>9.4f} | {note}")
```
```text
    α |  Loss@10 |  Loss@100 |  Loss@200 | Notes
------------------------------------------------------------
  0.1 |   0.6934 |    0.5712 |    0.4601 | shallow negative (closer to ReLU)
  1.0 |   0.6891 |    0.5438 |    0.4103 | default — balanced
  2.0 |   0.6855 |    0.5521 |    0.4289 | aggressive negative region
```

α = 1.0 is the default and gives the lowest final loss in this run, though a five-sample toy problem with a single seed is only weak evidence. α = 0.1 produces a shallow negative region: the activation behaves almost like ReLU, losing some of the mean-shift benefit. α = 2.0 makes the negative outputs larger (down to −2 instead of −1), which can destabilize training when combined with a high learning rate.


Comparison Table

| Property | ReLU | Leaky ReLU | ELU |
|---|---|---|---|
| Dead neurons | Yes | No | No |
| Smooth at z=0 | No (kink) | No (kink) | ✓ continuous derivative (for α = 1) |
| Negative region | 0 (flat) | αz (linear) | α(eᶻ−1) (smooth curve) |
| Zero-mean push | No | Partial | ✓ bounded at −α |
| Compute cost | Lowest | Low (+1 mult) | Medium (exp) |

Code: ELU Forward and Gradient

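```python
import numpy as np

def elu(z, alpha=1.0):
    # expm1(z) = e^z - 1; clamping z at 0 avoids overflow in the branch np.where discards
    return np.where(z > 0, z, alpha * np.expm1(np.minimum(z, 0.0)))

def elu_grad(z, alpha=1.0):
    # For z <= 0 the derivative is alpha * e^z, which equals ELU(z) + alpha
    return np.where(z > 0, 1.0, elu(z, alpha) + alpha)

z_vals = np.array([-3.0, -1.0, 0.0, 0.5, 1.0, 3.0])

# Keep the apostrophe out of the f-string expression: a backslash there
# is a syntax error before Python 3.12
grad_label = "ELU'(z)"
print(f"{'z':>5} | {'ELU(z)':>8} | {grad_label:>9}")
print("-" * 30)
for z in z_vals:
    print(f"{float(z):>5.1f} | {float(elu(z)):>8.4f} | {float(elu_grad(z)):>9.4f}")
```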
```text
    z |   ELU(z) |   ELU'(z)
------------------------------
 -3.0 |  -0.9502 |    0.0498
 -1.0 |  -0.6321 |    0.3679
  0.0 |   0.0000 |    1.0000
  0.5 |   0.5000 |    1.0000
  1.0 |   1.0000 |    1.0000
  3.0 |   3.0000 |    1.0000
```

Note that the gradient at z = 0 is 1.0, matching the gradient from the positive side. That continuity, which holds for α = 1, is the core difference from Leaky ReLU's kink.


Honest Limitations

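exp() is computationally expensive. For every negative activation, ELU requires a floating-point exponentiation. The cost scales with the number of activations per forward pass rather than with the parameter count: a network the size of ResNet-50 (about 25 million parameters) produces millions of activations per image, and if roughly half of them are negative, this adds up. ReLU is faster in practice and is preferred when throughput matters.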

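α must be tuned. The default α = 1.0 is not always optimal. On datasets with a different input distribution, a smaller α (0.1–0.3) may be better. This adds a hyperparameter that ReLU avoids, and any α other than 1 gives up the continuous derivative at z = 0, since the slope then jumps from α to 1.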

SELU is rarely used outside specific conditions. The self-normalizing property only holds with specific α, λ, and weight initialization. If you deviate from any of these conditions — dropout, non-standard initialization, batch normalization — the self-normalization breaks. ELU with α = 1.0 is practical; SELU requires careful setup.


Where this builds from: Leaky ReLU solved the dead neuron problem but kept the kink at z = 0. ELU smooths that kink using an exponential curve for negative z. The negative saturation limit (ELU → −α) provides a zero-mean-push that Leaky ReLU only partially achieves.

Where this leads: SELU is the self-normalizing variant of ELU — it adds a scale factor λ that makes activations converge to mean=0, std=1 automatically. The activation guide post synthesizes all seven activations into a decision framework.


Test Your Understanding

  1. ELU'(z) = αeᶻ for z < 0. At z = −3, ELU'(−3) = e⁻³ ≈ 0.0498. Leaky ReLU'(−3) = 0.01. Which provides a larger gradient signal? In a 5-layer network where all neurons have z ≈ −3, compute the gradient at layer 1 for both activations (use w = 0.5, starting gradient δ = 1.0).

  2. The "smooth at z = 0" property means ELU has a continuous first derivative at z = 0. Show numerically that Leaky ReLU (α = 0.01) does NOT have a continuous derivative at z = 0 by computing the derivative from the left (z → 0⁻) and from the right (z → 0⁺).

  3. ELU with α = 1.0 has negative outputs in (−1, 0). In a hidden layer where 60% of neurons are active (z > 0) and 40% have z < 0, estimate the mean activation assuming active neurons have mean z = 0.8 and inactive neurons have ELU values averaging −0.5. How does this compare to a ReLU layer with the same statistics?

  4. The SELU variant uses α ≈ 1.6733 and a scale factor λ ≈ 1.0507. With these values, SELU networks self-normalize without batch normalization. If you add dropout (set neurons to 0 randomly), explain why the self-normalization property breaks. What assumption does SELU make that dropout violates?

  5. A practitioner replaces all ReLU layers with ELU in a production model. Inference speed drops by 30%. They ask whether there is a way to use ELU during training (for better optimization) and ReLU during inference (for speed). Is this possible? What would need to be true for the switch to not hurt accuracy?
