GELU Activation Function
ReLU kills negative inputs completely: a neuron with z=-0.1 gets exactly zero output and zero gradient, as if that signal never arrived. GELU replaces that hard gate with a soft one. Instead of cutting the signal off at zero, it scales the input by Φ(z), the probability that a standard Gaussian sample falls below the input. A small negative value like z=-0.1 passes through at a bit under half strength rather than being erased entirely. This soft gating is why GPT-2, BERT, and many later Transformers use GELU as their default hidden activation.
Anchor: the same hidden layer from the churn network — six pre-activation values z ∈ {-3, -1, 0, 0.5, 1, 3}.
The Soft Gate Idea
ReLU is a hard gate: max(0, z). If z < 0, the gate is shut. If z > 0, the gate is open. No in-between.
GELU makes the gate probabilistic. The output is the input scaled by the probability that a standard Gaussian sample would be less than z:
GELU(z) = z · Φ(z)
where Φ(z) is the standard normal CDF — the probability that a random variable X ~ N(0,1) satisfies X ≤ z.
At z = 0: Φ(0) = 0.5, so GELU(0) = 0 · 0.5 = 0. The gate is half-open, but there is nothing to pass. At z = -1: Φ(-1) ≈ 0.159, so GELU(-1) ≈ -1 · 0.159 = -0.159. Not zero: the negative signal passes through, but reduced. At z = 3: Φ(3) ≈ 0.999, so GELU(3) ≈ 3. The gate is nearly fully open.
Compare with ReLU at z = -1: output = 0. That neuron is dead for this input. GELU keeps it alive, just quieter.
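The gate values above can be reproduced with the standard library alone. The following is a minimal sketch, assuming Φ(z) is computed from the error function as Φ(z) = ½(1 + erf(z/√2)); the helper names `phi_cdf`, `relu`, and `gelu` are illustrative and not taken from any particular library.

```python
import math

def phi_cdf(z: float) -> float:
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def relu(z: float) -> float:
    """Hard gate: fully shut for z < 0, fully open for z > 0."""
    return max(0.0, z)

def gelu(z: float) -> float:
    """Soft gate: the input scaled by the gate value Phi(z)."""
    return z * phi_cdf(z)

# The six anchor pre-activations used throughout this post.
for z in [-3.0, -1.0, 0.0, 0.5, 1.0, 3.0]:
    print(f"z={z:5.1f}  gate={phi_cdf(z):.4f}  relu={relu(z):.4f}  gelu={gelu(z):.4f}")
```

The `gate` column is the fraction of the input that GELU lets through; ReLU's equivalent would be a column of exact 0s and 1s.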
Exact and Approximate Forms
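Computing Φ(z) requires the error function, since Φ(z) = ½(1 + erf(z/√2)), and erf can be relatively expensive. A fast tanh approximation is therefore common:
GELU(z) ≈ 0.5z · (1 + tanh(√(2/π) · (z + 0.044715z³)))
PyTorch and TensorFlow implement both forms. The exact erf version is the default, and the tanh version is selected with approximate='tanh' in torch.nn.functional.gelu or approximate=True in tf.nn.gelu. The exact form matters for research comparisons; the approximate form matters where speed is the priority or where a pretrained checkpoint was trained with it.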
| z | Φ(z) | GELU exact | GELU approx |
|---|---|---|---|
| −3 | 0.0013 | −0.0040 | −0.0036 |
| −1 | 0.1587 | −0.1587 | −0.1588 |
| 0 | 0.5000 | 0.0000 | 0.0000 |
| 0.5 | 0.6915 | 0.3457 | 0.3457 |
| 1 | 0.8413 | 0.8413 | 0.8412 |
| 3 | 0.9987 | 2.9960 | 2.9964 |
The approximate form agrees with the exact form to about three decimal places on these values. The gap is smallest near z = 0 and grows to roughly 4 × 10⁻⁴ at |z| = 3, which only matters under tight precision requirements.
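One way to check how closely the two forms agree is to scan a grid of inputs and record the largest gap. This is a rough sketch using only the standard library; the grid range of −6 to 6 and the step of 0.01 are arbitrary choices made for illustration, and `gelu_exact` / `gelu_tanh` are hypothetical helper names.

```python
import math

def gelu_exact(z: float) -> float:
    """Exact GELU via the error function."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu_tanh(z: float) -> float:
    """Tanh approximation with the 0.044715 cubic coefficient."""
    inner = math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)
    return 0.5 * z * (1.0 + math.tanh(inner))

# Scan z from -6 to 6 in steps of 0.01 (an arbitrary illustrative grid).
grid = [i / 100.0 for i in range(-600, 601)]
errors = [abs(gelu_exact(z) - gelu_tanh(z)) for z in grid]

max_err = max(errors)
z_at_max = grid[errors.index(max_err)]
print(f"largest absolute gap on the grid: {max_err:.2e} at z = {z_at_max:.2f}")
```

On this grid the gap stays below 1e-3 everywhere, and the two forms coincide exactly at z = 0.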
Why the Gradient Matters
The derivative of GELU is:
GELU'(z) = Φ(z) + z · φ(z)
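where φ(z) is the standard normal PDF. The derivative is positive for z > −0.75 or so, crosses zero once at the minimum of GELU near z ≈ −0.75, and is slightly negative below that, fading to 0 as z → −∞. It vanishes at a single point rather than over the whole negative half-line, so there is no dead zone like ReLU's.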
Compare at z = −0.5:
- ReLU'(−0.5) = 0. No gradient. That weight never updates for this input.
- GELU'(−0.5) = Φ(−0.5) + (−0.5)·φ(−0.5) = 0.3085 + (−0.5)(0.3521) ≈ 0.133. The gradient is small but non-zero.
| z | GELU exact | GELU approx | GELU'(z) | ReLU | ReLU' |
|---|---|---|---|---|---|
| −3 | −0.0040 | −0.0036 | −0.0119 | 0 | 0 |
| −1 | −0.1587 | −0.1588 | −0.0833 | 0 | 0 |
| 0 | 0.0000 | 0.0000 | 0.5000 | 0 | 0 |
| 0.5 | 0.3457 | 0.3457 | 0.8675 | 0.5 | 1 |
| 1 | 0.8413 | 0.8412 | 1.0833 | 1 | 1 |
| 3 | 2.9960 | 2.9964 | 1.0119 | 3 | 1 |
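The derivative formula can be sanity-checked against a central finite difference, and the point where GELU' changes sign can be located by bisection. The sketch below assumes a step size of 1e-5 and a bracketing interval of [−1, −0.5]; both are illustrative choices, and the helper names are hypothetical.

```python
import math

def phi_pdf(z: float) -> float:
    """Standard normal PDF."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def phi_cdf(z: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu(z: float) -> float:
    return z * phi_cdf(z)

def gelu_grad(z: float) -> float:
    """Analytic derivative: Phi(z) + z * phi(z)."""
    return phi_cdf(z) + z * phi_pdf(z)

# Compare the analytic derivative with a central finite difference.
h = 1e-5
max_gap = 0.0
for z in [-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0]:
    numeric = (gelu(z + h) - gelu(z - h)) / (2.0 * h)
    max_gap = max(max_gap, abs(numeric - gelu_grad(z)))
    print(f"z={z:5.1f}  analytic={gelu_grad(z):8.4f}  numeric={numeric:8.4f}")

# Bisection: gelu_grad is negative at -1 and positive at -0.5.
lo, hi = -1.0, -0.5
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if gelu_grad(mid) < 0.0:
        lo = mid
    else:
        hi = mid
z_star = 0.5 * (lo + hi)
print(f"GELU' changes sign near z = {z_star:.4f}, where GELU = {gelu(z_star):.4f}")
```

The sign change marks the minimum of GELU, which is also the most negative output the activation can produce.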
Transformer FFN Block
Every Transformer layer has a feed-forward sublayer with this structure:
FFN(x) = W₂ · GELU(W₁x + b₁) + b₂
Two linear projections with GELU between them. In BERT-base: d_model=768, d_ff=3072. Every token embedding passes through this block at each layer. The GELU sits between the 768→3072 and 3072→768 projections and is applied 12 times per forward pass, once in each of BERT-base's 12 layers.
The case for GELU over ReLU is usually framed in terms of the dead-neuron problem at scale. With 3072 hidden units in each of 12 layers, ReLU can leave a fraction of neurons permanently inactive during training. GELU reduces this risk by keeping gradients non-zero for almost all negative pre-activations.
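The FFN structure above can be sketched in a few lines of NumPy. This is an illustration of the shapes only, not BERT's actual implementation: the weights are random, the initialisation scale of 0.02 and the batch of 4 tokens are assumptions made for the example, and the code uses the row-vector convention (x @ W1) rather than the column form W₁x written in the formula. It uses the tanh form of GELU to avoid a SciPy dependency.

```python
import numpy as np

def gelu_tanh(x: np.ndarray) -> np.ndarray:
    """Tanh approximation of GELU, applied elementwise."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward block; x has shape (tokens, d_model)."""
    hidden = gelu_tanh(x @ W1 + b1)  # expand: (tokens, d_ff)
    return hidden @ W2 + b2          # project back: (tokens, d_model)

d_model, d_ff, n_tokens = 768, 3072, 4  # BERT-base widths; token count is arbitrary

rng = np.random.default_rng(0)
W1 = rng.normal(0.0, 0.02, size=(d_model, d_ff))  # random stand-in weights
b1 = np.zeros(d_ff)
W2 = rng.normal(0.0, 0.02, size=(d_ff, d_model))
b2 = np.zeros(d_model)

x = rng.normal(size=(n_tokens, d_model))  # stand-in token representations
out = ffn(x, W1, b1, W2, b2)
print(out.shape)  # (4, 768)
```

Each token is processed independently, which is why the same function works for any number of rows in `x`.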
Comparison: ReLU vs ELU vs GELU
| z | ReLU | ELU (α=1) | GELU |
|---|---|---|---|
| −3 | 0 | −0.9502 | −0.0040 |
| −1 | 0 | −0.6321 | −0.1587 |
| 0 | 0 | 0.0000 | 0.0000 |
| 0.5 | 0.5 | 0.5000 | 0.3457 |
| 1 | 1 | 1.0000 | 0.8413 |
| 3 | 3 | 3.0000 | 2.9960 |
ELU saturates negative inputs to −α as z → −∞. GELU drives negative inputs toward 0 as z → −∞. The choice between them depends on whether you want a sizeable negative output bounded at −α (ELU, useful for centering) or a near-zero negative output (GELU, matches Transformer design goals).
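To see the two tail behaviours side by side, the three activations can be evaluated on the anchor values and on a few strongly negative inputs. This is a minimal standard-library sketch; α = 1 matches the table, while the extra inputs −5, −10, and −20 are chosen here purely for illustration.

```python
import math

def relu(z: float) -> float:
    return max(0.0, z)

def elu(z: float, alpha: float = 1.0) -> float:
    """ELU: identity for z > 0, alpha * (exp(z) - 1) otherwise."""
    return z if z > 0.0 else alpha * math.expm1(z)

def gelu(z: float) -> float:
    """Exact GELU via the error function."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

print("   z     ReLU      ELU     GELU")
for z in [-3.0, -1.0, 0.0, 0.5, 1.0, 3.0]:
    print(f"{z:5.1f} {relu(z):8.4f} {elu(z):8.4f} {gelu(z):8.4f}")

# Far into the negative tail: ELU approaches -alpha, GELU approaches 0.
for z in [-5.0, -10.0, -20.0]:
    print(f"z={z:6.1f}  elu={elu(z):.6f}  gelu={gelu(z):.6f}")
```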
Code
```python
import numpy as np
from scipy.stats import norm

def gelu_exact(z):
    # Exact GELU: z * Phi(z), with Phi the standard normal CDF.
    return z * norm.cdf(z)

def gelu_approx(z):
    # Tanh approximation of GELU.
    return 0.5 * z * (1 + np.tanh(np.sqrt(2/np.pi) * (z + 0.044715 * z**3)))

def gelu_grad(z):
    # Derivative of exact GELU: Phi(z) + z * phi(z).
    return norm.cdf(z) + z * norm.pdf(z)

z_vals = np.array([-3., -1., 0., 0.5, 1., 3.])
print(f"{'z':>5} | {'GELU_exact':>10} | {'GELU_approx':>11} | {'GELU_grad':>9}")
for z in z_vals:
    print(f"{z:5.1f} | {gelu_exact(z):10.4f} | {gelu_approx(z):11.4f} | {gelu_grad(z):9.4f}")
```

```
    z | GELU_exact | GELU_approx | GELU_grad
 -3.0 |    -0.0040 |     -0.0036 |   -0.0119
 -1.0 |    -0.1587 |     -0.1588 |   -0.0833
  0.0 |     0.0000 |      0.0000 |    0.5000
  0.5 |     0.3457 |      0.3457 |    0.8675
  1.0 |     0.8413 |      0.8412 |    1.0833
  3.0 |     2.9960 |      2.9964 |    1.0119
```

Notice the gradient column: at z=−3, GELU' ≈ −0.012, small but non-zero. ReLU's gradient there is exactly 0. Across millions of parameters in a deep Transformer, that difference in gradient flow adds up.
Related Concepts
GELU builds on ReLU (04-relu.md) by replacing the hard threshold with a probabilistic gate, and on tanh (03-tanh.md) which appears in the fast approximation. ELU (06-elu.md) solves the dead-neuron problem differently: it uses an exponential for negative inputs rather than a Gaussian gate. Understanding GELU unlocks SwiGLU (11-swiglu.md), which extends the gating idea by using a learned second projection to control how much of the Swish-gated signal passes through. SwiGLU is what LLaMA and Mistral use in place of the standard GELU FFN block; Gemma uses GeGLU, the same gated design with GELU as the gate activation.
Honest Limitations
GELU is more expensive than ReLU — computing Φ(z) requires the error function (or the tanh approximation), adding roughly 3–4× the FLOPs of a max(0,z) call. At inference on edge hardware this matters; use the approximate form or consider ReLU if latency is critical.
The approximate form introduces a small error on the order of 1e-4. The error is essentially zero near z=0 and largest at moderate |z|, reaching about 4e-4 at |z|=3. For most training this is irrelevant, but if you are comparing activation functions in a controlled experiment or need exact reproducibility, use the exact form and accept the compute cost.
Like ReLU, GELU outputs are skewed positive: outputs are never less than about −0.17 regardless of how negative the input is. This means GELU features are not zero-centered, which can bias weight updates in the following layer in the same way ReLU does. Batch normalization or layer normalization after the activation layer reduces this in practice.
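The positive skew can be quantified under one simplifying assumption: that pre-activations follow a standard normal distribution. Real pre-activations will differ, so treat this as a rough sketch. It integrates GELU(z)·φ(z) numerically with the trapezoidal rule over [−8, 8] (an illustrative range) and also reports the lowest output on the grid.

```python
import math

def phi_pdf(z: float) -> float:
    """Standard normal PDF."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def gelu(z: float) -> float:
    """Exact GELU via the error function."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

# Trapezoidal rule for E[GELU(Z)] with Z ~ N(0, 1) (an assumed input distribution).
step = 0.001
grid = [-8.0 + i * step for i in range(16001)]
weighted = [gelu(z) * phi_pdf(z) for z in grid]
mean_out = step * (sum(weighted) - 0.5 * (weighted[0] + weighted[-1]))

min_out = min(gelu(z) for z in grid)

print(f"mean GELU output under N(0,1) inputs: {mean_out:.4f}")
print(f"lowest GELU output on the grid:       {min_out:.4f}")
```

Under that assumption the mean has the closed form 1/(2√π), which follows from E[Z·Φ(Z)] = E[φ(Z)] for a standard normal Z, so the average output sits clearly above zero even though the inputs are centered.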
Test Your Understanding
- GELU(0) = 0 even though Φ(0) = 0.5. Why? And what does GELU'(0) equal?
- A neuron in BERT's FFN block receives pre-activation z = −2.5 at a particular position. Compute GELU(−2.5) using the exact formula. What gradient does it produce? Would ReLU produce a useful gradient here?
- You replace GELU with ReLU in all 12 FFN blocks of BERT-base and fine-tune on a classification task. After 3 epochs, validation loss is higher than with GELU. What is the most likely cause, and how would you verify it?
- The approximate GELU formula uses 0.044715z³. What happens to this term when z is small (say z=0.1)? What does the approximation reduce to in that limit, and which activation does it resemble?
- A team argues that since GELU'(z) is non-zero almost everywhere (no dead neurons), training will always converge faster than with ReLU. Give a specific scenario where GELU could still slow training compared to ReLU despite having non-zero gradients.