GELU Activation Function

Jul 3, 2026 · 9 min read · By Mohammed Vasim
Tags: deep-learning, neural-networks, machine-learning, representation-learning

ReLU zeroes out negative inputs completely: a neuron with z=-0.1 gets exactly zero output and zero gradient, as if that signal never arrived. GELU replaces that hard gate with a soft one. Instead of cutting the signal off at zero, it scales the input by Φ(z), the probability that a standard Gaussian sample falls below z. A small negative value like z=-0.1 passes through damped (to about −0.046) rather than erased entirely. This is why GPT-2, BERT, and many later Transformers use GELU as their default hidden activation.

Anchor: the same hidden layer from the churn network — six pre-activation values z ∈ {-3, -1, 0, 0.5, 1, 3}.


The Soft Gate Idea

ReLU is a hard gate: max(0, z). If z < 0, the gate is shut. If z > 0, the gate is open. No in-between.

GELU makes the gate probabilistic. The output is the input scaled by the probability that a standard Gaussian sample would be less than z:

GELU(z) = z · Φ(z)

where Φ(z) is the standard normal CDF — the probability that a random variable X ~ N(0,1) satisfies X ≤ z.

At z = 0: Φ(0) = 0.5, so GELU(0) = 0 · 0.5 = 0. The gate is half-open. At z = -1: Φ(-1) ≈ 0.159, so GELU(-1) = -1 · 0.159 = -0.159. Not zero: the negative signal passes through, but reduced. At z = 3: Φ(3) ≈ 0.999, so GELU(3) ≈ 3 · 0.999 ≈ 2.996. The gate is nearly fully open.

Compare with ReLU at z = -1: output = 0. That neuron is dead for this input. GELU keeps it alive, just quieter.
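The worked values above can be reproduced with nothing but the standard library. This is a minimal sketch: the helper names `phi_cdf`, `gelu`, and `relu` are illustrative choices, not functions from any library.

```python
import math

def phi_cdf(z: float) -> float:
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu(z: float) -> float:
    """Exact GELU: the input scaled by the soft gate Phi(z)."""
    return z * phi_cdf(z)

def relu(z: float) -> float:
    """Hard gate: passes z unchanged when positive, otherwise outputs 0."""
    return max(0.0, z)

# Print the gate opening and both activations at the three worked inputs.
for z in (-1.0, 0.0, 3.0):
    print(f"z={z:+.1f}  gate={phi_cdf(z):.3f}  GELU={gelu(z):+.3f}  ReLU={relu(z):+.3f}")
```

At z = −1 the gate prints as 0.159 and the GELU output as −0.159, while ReLU prints 0.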

[Figure: Hard Gate (ReLU) vs Soft Gate (GELU) at z = −1. ReLU: gate max(0, z) is fully closed, output 0, gradient 0, no learning signal. GELU: gate z · Φ(z) is 15.9% open, output −0.159, gradient ≈ −0.083, signal passes through.]

Exact and Approximate Forms

Computing Φ(z) requires the error function, which is expensive. In practice, a fast approximation is used:

GELU(z) ≈ 0.5z · (1 + tanh(√(2/π) · (z + 0.044715z³)))

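PyTorch and TensorFlow both offer this form as an option (approximate='tanh' in torch.nn.functional.gelu, approximate=True in tf.nn.gelu), although both default to the exact erf-based form. The exact form matters for research comparisons; the approximate form matters when throughput is the priority.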

z    | Φ(z)   | GELU exact | GELU approx
−3   | 0.0013 | −0.0040    | −0.0036
−1   | 0.1587 | −0.1587    | −0.1588
0    | 0.5000 | 0.0000     | 0.0000
0.5  | 0.6915 | 0.3457     | 0.3457
1    | 0.8413 | 0.8413     | 0.8412
3    | 0.9987 | 2.9960     | 2.9964

The approximate form agrees with the exact one to about three decimal places across the range. The gap is largest for |z| between roughly 2 and 3 (about 0.0004 at z = ±3) and only matters under strict precision requirements.
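One way to check how closely the two forms agree is to scan a grid of inputs and record the largest gap. The sketch below uses only the standard library; the function names, the grid range of −6 to 6, and the step of 0.01 are illustrative assumptions.

```python
import math

def gelu_exact(z: float) -> float:
    """Exact GELU, with Phi(z) expressed through the error function."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu_tanh(z: float) -> float:
    """Tanh approximation of GELU with the 0.044715 cubic coefficient."""
    inner = math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)
    return 0.5 * z * (1.0 + math.tanh(inner))

# Scan z from -6 to 6 in steps of 0.01 and keep the largest absolute gap.
grid = [i / 100.0 for i in range(-600, 601)]
max_err, worst_z = max((abs(gelu_exact(z) - gelu_tanh(z)), z) for z in grid)
print(f"max |exact - approx| = {max_err:.1e} near z = {worst_z:+.2f}")
```

On this grid the largest gap is a few times 1e-4 and occurs for |z| between 2 and 3. Close to z = 0 the two forms are nearly indistinguishable.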

[Figure: GELU vs ReLU with the six anchor points labeled. GELU values at z = −3, −1, 0, 0.5, 1, 3: −0.004, −0.159, 0.000, 0.346, 0.841, 2.996. For negative z, GELU ≠ 0 while ReLU = 0.]

Why the Gradient Matters

The derivative of GELU is:

GELU'(z) = Φ(z) + z · φ(z)

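where φ(z) is the standard normal PDF. The derivative is positive for z > −0.75 or so, zero at that single point, and slightly negative below it, so GELU is not monotonic. Unlike ReLU, it has no interval where the gradient is exactly zero: there is no dead zone.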

Compare at z = −0.5:

  • ReLU'(−0.5) = 0. No gradient. That weight never updates for this input.
  • GELU'(−0.5) = Φ(−0.5) + (−0.5)·φ(−0.5) = 0.3085 + (−0.5)(0.3521) ≈ 0.133. The gradient is small but non-zero.
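
z    | GELU exact | GELU approx | GELU'(z) | ReLU | ReLU'
−3   | −0.0040    | −0.0036     | −0.0119  | 0    | 0
−1   | −0.1587    | −0.1588     | −0.0833  | 0    | 0
0    | 0.0000     | 0.0000      | 0.5000   | 0    | 0
0.5  | 0.3457     | 0.3457      | 0.8675   | 0.5  | 1
1    | 0.8413     | 0.8412      | 1.0833   | 1    | 1
3    | 2.9960     | 2.9964      | 1.0119   | 3    | 1
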
[Figure: Gradient of GELU'(z) vs ReLU'(z) for z from −3 to 3, with the ReLU dead zone (gradient = 0) marked for negative z.]
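The analytic derivative can be sanity-checked against a central finite difference. The following is a small sketch using only the standard library; the helper names and the step size h = 1e-5 are assumptions made for illustration.

```python
import math

def phi_pdf(z: float) -> float:
    """Standard normal PDF."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def phi_cdf(z: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu(z: float) -> float:
    return z * phi_cdf(z)

def gelu_grad(z: float) -> float:
    """Analytic derivative: Phi(z) + z * phi(z)."""
    return phi_cdf(z) + z * phi_pdf(z)

def numeric_grad(f, z: float, h: float = 1e-5) -> float:
    """Central finite difference, used only as a cross-check."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

print(f"{'z':>5} | {'analytic':>9} | {'numeric':>9} | {'ReLU grad':>9}")
for z in (-3.0, -1.0, -0.5, 0.0, 1.0, 3.0):
    relu_grad = 1.0 if z > 0 else 0.0
    print(f"{z:5.1f} | {gelu_grad(z):9.4f} | {numeric_grad(gelu, z):9.4f} | {relu_grad:9.1f}")
```

The two GELU columns should agree to the printed precision. The sign matters as well as the size: for inputs below roughly −0.75 the GELU gradient is negative, whereas ReLU's is exactly zero for every negative input.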

Transformer FFN Block

Every Transformer layer has a feed-forward sublayer with this structure:

FFN(x) = W₂ · GELU(W₁x + b₁) + b₂

Two linear projections with GELU between them. In BERT-base: d_model=768, d_ff=3072. Every token embedding passes through this block at each layer. The GELU sits between the 768→3072 and 3072→768 projections and is applied 12 times per forward pass in BERT-base, once per layer.

One common motivation for choosing GELU over ReLU is the dead-neuron problem at scale. With 3072 hidden units in each of 12 layers, ReLU can leave a fraction of neurons permanently inactive during training. GELU mitigates this by keeping gradients non-zero for almost all negative pre-activations.
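To make the shapes concrete, here is a minimal NumPy sketch of a single FFN block with BERT-base dimensions. Everything beyond the formula is an assumption for illustration: the weights are random stand-ins (standard deviation 0.02), the token count is arbitrary, the tanh form of GELU is used so the example needs only NumPy, and residual connections, layer normalization, and dropout are left out.

```python
import numpy as np

def gelu_tanh(x: np.ndarray) -> np.ndarray:
    """Tanh approximation of GELU, applied elementwise."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = W2 · GELU(W1 x + b1) + b2, applied to each row (token) of x."""
    hidden = gelu_tanh(x @ W1.T + b1)  # shape (tokens, d_ff)
    return hidden @ W2.T + b2          # shape (tokens, d_model)

d_model, d_ff, n_tokens = 768, 3072, 4
rng = np.random.default_rng(0)

# Random placeholder parameters, not trained BERT weights.
W1 = rng.normal(0.0, 0.02, size=(d_ff, d_model))
b1 = np.zeros(d_ff)
W2 = rng.normal(0.0, 0.02, size=(d_model, d_ff))
b2 = np.zeros(d_model)

x = rng.normal(size=(n_tokens, d_model))  # four hypothetical token embeddings
y = ffn(x, W1, b1, W2, b2)
print(x.shape, "->", y.shape)
```

The block expands each token from 768 to 3072 dimensions, applies GELU elementwise, and projects back to 768, so the output shape matches the input shape.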


Comparison: ReLU vs ELU vs GELU

z    | ReLU | ELU (α=1) | GELU
−3   | 0    | −0.9502   | −0.0040
−1   | 0    | −0.6321   | −0.1587
0    | 0    | 0.0000    | 0.0000
0.5  | 0.5  | 0.5000    | 0.3457
1    | 1    | 1.0000    | 0.8413
3    | 3    | 3.0000    | 2.9960

ELU saturates negative inputs to −α as z → −∞. GELU drives negative inputs toward 0 as z → −∞. The choice between them depends on whether you want a substantial negative output that is bounded at −α (ELU, useful for centering) or a near-zero negative output (GELU, matches Transformer design goals).

[Figure: ReLU vs ELU vs GELU with the z < 0 region highlighted. ReLU: flat at 0. ELU: saturates toward −1. GELU: near 0, not flat.]
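The behaviour of the three activations on the negative side can be checked numerically. This is a small standard-library sketch; the function names are illustrative, and z = −10 is added to the anchor values as an assumed stand-in for "very negative".

```python
import math

def relu(z: float) -> float:
    return max(0.0, z)

def elu(z: float, alpha: float = 1.0) -> float:
    # Identity for positive inputs, alpha * (e^z - 1) otherwise.
    return z if z > 0 else alpha * (math.exp(z) - 1.0)

def gelu(z: float) -> float:
    # Exact GELU via the error function.
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

print(f"{'z':>6} | {'ReLU':>7} | {'ELU':>7} | {'GELU':>7}")
for z in (-10.0, -3.0, -1.0, 0.0, 0.5, 1.0, 3.0):
    print(f"{z:6.1f} | {relu(z):7.4f} | {elu(z):7.4f} | {gelu(z):7.4f}")
```

At z = −10 the ELU output sits just above −1 (its floor for α = 1), while the GELU output is indistinguishable from 0 at four decimals.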

Code

python
import numpy as np
from scipy.stats import norm  # norm.cdf is Φ, norm.pdf is φ

def gelu_exact(z):
    """Exact GELU: z * Φ(z)."""
    return z * norm.cdf(z)

def gelu_approx(z):
    """Tanh approximation of GELU."""
    return 0.5 * z * (1 + np.tanh(np.sqrt(2/np.pi) * (z + 0.044715 * z**3)))

def gelu_grad(z):
    """Derivative of exact GELU: Φ(z) + z * φ(z)."""
    return norm.cdf(z) + z * norm.pdf(z)

# Evaluate all three at the six anchor pre-activations.
z_vals = np.array([-3., -1., 0., 0.5, 1., 3.])
print(f"{'z':>5} | {'GELU_exact':>10} | {'GELU_approx':>11} | {'GELU_grad':>9}")
for z in z_vals:
    print(f"{z:5.1f} | {gelu_exact(z):10.4f} | {gelu_approx(z):11.4f} | {gelu_grad(z):9.4f}")
text
    z | GELU_exact | GELU_approx | GELU_grad
 -3.0 |    -0.0040 |     -0.0036 |   -0.0119
 -1.0 |    -0.1587 |     -0.1588 |   -0.0833
  0.0 |     0.0000 |      0.0000 |    0.5000
  0.5 |     0.3457 |      0.3457 |    0.8675
  1.0 |     0.8413 |      0.8412 |    1.0833
  3.0 |     2.9960 |      2.9964 |    1.0119

Notice the gradient column: at z=−3, GELU' ≈ −0.012, which is small and negative but non-zero. ReLU's gradient there is exactly 0. Across millions of parameters in a deep Transformer, that difference in gradient flow adds up.


GELU builds on ReLU (04-relu.md) by replacing the hard threshold with a probabilistic gate, and on tanh (03-tanh.md), which appears in the fast approximation. ELU (06-elu.md) solves the dead-neuron problem differently: it uses an exponential for negative inputs rather than a Gaussian gate. Understanding GELU unlocks SwiGLU (11-swiglu.md), which extends the gating idea by using a learned second projection to control how much of the Swish-gated signal passes through. LLaMA and Mistral use SwiGLU in place of the standard GELU FFN block; Gemma uses GeGLU, the same gated design with GELU as the gate activation.

Honest Limitations

GELU is more expensive than ReLU: computing Φ(z) requires the error function (or the tanh approximation), which costs roughly 3–4× the FLOPs of a max(0,z) call, with the exact ratio depending on the hardware and on how erf or tanh is implemented. At inference on edge hardware this matters; use the approximate form or consider ReLU if latency is critical.

The approximate form introduces a small error that is negligible near z=0 and peaks at a few times 1e-4 for |z| between roughly 2 and 3. For most training this is irrelevant, but if you are comparing activation functions in a controlled experiment or need exact reproducibility, use the exact form and accept the compute cost.

Like ReLU, GELU outputs are skewed positive — outputs are never less than about −0.17 regardless of how negative the input is. This means GELU features are not zero-centered, which can cause the same upstream weight-update bias as ReLU. Batch normalization or layer normalization after the activation layer corrects for this in practice.
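Both claims, the floor near −0.17 and the positive skew, can be checked with a short grid computation. This is a sketch under stated assumptions: the grid of −5 to 5 with step 0.001 is an arbitrary choice, and the mean is computed for pre-activations drawn from a standard normal, which real networks only approximate.

```python
import math

def gelu(z: float) -> float:
    """Exact GELU via the error function."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi_pdf(z: float) -> float:
    """Standard normal PDF, used as the weighting for the mean."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

step = 0.001
grid = [i * step for i in range(-5000, 5001)]

# Locate the lowest point of GELU on the grid.
z_min = min(grid, key=gelu)
print(f"minimum GELU = {gelu(z_min):.4f} at z = {z_min:.3f}")

# Riemann-sum estimate of E[GELU(z)] for z ~ N(0, 1).
mean_out = sum(gelu(z) * phi_pdf(z) for z in grid) * step
print(f"mean GELU output for standard-normal input = {mean_out:.4f}")
```

The minimum lands at about −0.17 near z = −0.75, and the mean output comes out clearly positive (about 0.28) even though the inputs are centered at zero.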


Test Your Understanding

  1. GELU(0) = 0 even though Φ(0) = 0.5. Why? And what does GELU'(0) equal?

  2. A neuron in BERT's FFN block receives pre-activation z = −2.5 at a particular position. Compute GELU(−2.5) using the exact formula. What gradient does it produce? Would ReLU produce a useful gradient here?

  3. You replace GELU with ReLU in all 12 FFN blocks of BERT-base and fine-tune on a classification task. After 3 epochs, validation loss is higher than with GELU. What is the most likely cause, and how would you verify it?

  4. The approximate GELU formula uses 0.044715z³. What happens to this term when z is small (say z=0.1)? What does the approximation reduce to in that limit, and which activation does it resemble?

  5. A team argues that since GELU'(z) is non-zero for almost every z (no dead neurons), training will always converge faster than with ReLU. Give a specific scenario where GELU could still slow training compared to ReLU despite having non-zero gradients.
