GELU Activation Function

Jul 3, 2026 • 9 min read • By Mohammed Vasim

deep-learning, neural-networks, machine-learning, representation-learning

ReLU kills negative inputs completely: a neuron with z=-0.1 gets exactly zero output and zero gradient, as if that signal never arrived. GELU replaces that hard gate with a soft one. Instead of cutting the signal off at zero, it scales the input by Φ(z), the probability that a standard Gaussian sample falls at or below z. A small negative value like z=-0.1 passes through damped (to about -0.046) rather than erased entirely. GPT-2 and BERT use GELU as their hidden activation, and it remains a common default in Transformer feed-forward blocks.

Anchor: the same hidden layer from the churn network — six pre-activation values z ∈ {-3, -1, 0, 0.5, 1, 3}.

The Soft Gate Idea

ReLU is a hard gate: max(0, z). If z < 0, the gate is shut. If z > 0, the gate is open. No in-between.

GELU makes the gate probabilistic. The output is the input scaled by the probability that a standard Gaussian sample would be less than z:

GELU(z) = z · Φ(z)

where Φ(z) is the standard normal CDF — the probability that a random variable X ~ N(0,1) satisfies X ≤ z.

At z = 0: Φ(0) = 0.5, so GELU(0) = 0 · 0.5 = 0. The gate is half-open, but there is no signal to pass.
At z = -1: Φ(-1) ≈ 0.159, so GELU(-1) = -1 · 0.159 = -0.159. Not zero: the negative signal passes through, but reduced.
At z = 3: Φ(3) ≈ 0.999, so GELU(3) ≈ 3. The gate is nearly fully open.

Compare with ReLU at z = -1: output = 0. That neuron is dead for this input. GELU keeps it alive, just quieter.
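As a minimal sketch of the gate, the snippet below evaluates Φ(z) with the standard library's math.erf and compares GELU against ReLU on the six anchor values. The helper names (phi_cdf, gelu, relu) are illustrative and not taken from any library.

```python
import math

def phi_cdf(z: float) -> float:
    """Standard normal CDF, written via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu(z: float) -> float:
    """Exact GELU: the input scaled by the Gaussian gate Phi(z)."""
    return z * phi_cdf(z)

def relu(z: float) -> float:
    """Hard gate: pass z through only when it is positive."""
    return max(0.0, z)

# The six anchor pre-activations used throughout this post.
for z in [-3.0, -1.0, 0.0, 0.5, 1.0, 3.0]:
    print(f"z={z:5.1f}  gate={phi_cdf(z):.4f}  GELU={gelu(z):8.4f}  ReLU={relu(z):4.1f}")
```

The gate column is the fraction of the input that survives. ReLU's equivalent would be a column of zeros and ones.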

Exact and Approximate Forms

Computing Φ(z) exactly requires the error function. Where evaluating erf is slow or unavailable, a tanh-based approximation is used instead:

GELU(z) ≈ 0.5z · (1 + tanh(√(2/π) · (z + 0.044715z³)))

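PyTorch (torch.nn.functional.gelu) and TensorFlow (tf.nn.gelu) both compute the exact erf form by default and expose the tanh form as an option (approximate='tanh' and approximate=True respectively). The original GPT-2 code uses the tanh form. The distinction matters for research comparisons and for reproducing a trained model: use whichever form the model was trained with.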

z	Φ(z)	GELU exact	GELU approx
−3	0.0013	−0.0040	−0.0036
−1	0.1587	−0.1587	−0.1588
0	0.5000	0.0000	0.0000
0.5	0.6915	0.3457	0.3457
1	0.8413	0.8413	0.8412
3	0.9987	2.9960	2.9964

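The approximate form agrees with the exact form to about three decimal places at these points. The gap is largest at z = ±3, where the approximation gives −0.0036 and 2.9964 against the exact −0.0040 and 2.9960, a difference of about 4e-4. It shrinks toward zero near z = 0.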
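One way to check how close the two forms are is to scan a grid of inputs and record the largest absolute gap. The sketch below does that with the standard library only. The grid range and step are arbitrary choices, and the function names are illustrative.

```python
import math

def gelu_exact(z: float) -> float:
    """Exact GELU via the error function."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu_tanh(z: float) -> float:
    """Tanh approximation with the 0.044715 cubic term."""
    inner = math.sqrt(2.0 / math.pi) * (z + 0.044715 * z**3)
    return 0.5 * z * (1.0 + math.tanh(inner))

# Scan z in [-6, 6] in steps of 0.01 and find where the two forms differ most.
grid = [i / 100.0 for i in range(-600, 601)]
worst_z = max(grid, key=lambda z: abs(gelu_exact(z) - gelu_tanh(z)))
worst_gap = abs(gelu_exact(worst_z) - gelu_tanh(worst_z))
print(f"largest gap on the grid: {worst_gap:.2e} at z = {worst_z:+.2f}")
```

The gap is symmetric in z, so the scan reports the first of two mirror-image locations.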

Why the Gradient Matters

The derivative of GELU is:

GELU'(z) = Φ(z) + z · φ(z)

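where φ(z) is the standard normal PDF. The derivative is positive for z above roughly −0.75, exactly zero at that single point (the minimum of GELU), and slightly negative below it, shrinking toward 0 as z → −∞. There is no interval where the gradient is identically zero, so there is no dead zone.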

Compare at z = −0.5:

ReLU'(−0.5) = 0. No gradient. That weight never updates for this input.
GELU'(−0.5) = Φ(−0.5) + (−0.5)·φ(−0.5) = 0.3085 + (−0.5)(0.3521) ≈ 0.133. The gradient is small but non-zero.

z	GELU exact	GELU approx	GELU'(z)	ReLU	ReLU'
−3	−0.0040	−0.0036	−0.0119	0	0
−1	−0.1587	−0.1588	−0.0833	0	0
0	0.0000	0.0000	0.5000	0	0
0.5	0.3457	0.3457	0.8675	0.5	1
1	0.8413	0.8412	1.0833	1	1
3	2.9960	2.9964	1.0119	3	1
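A quick way to sanity-check the derivative formula Φ(z) + z·φ(z) is to compare it against a central finite difference of the exact GELU. The sketch below uses only the standard library. The step size h and the helper names are assumptions made for illustration.

```python
import math

def gelu(z: float) -> float:
    """Exact GELU via the error function."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu_grad(z: float) -> float:
    """Analytic derivative: Phi(z) + z * phi(z)."""
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return cdf + z * pdf

def central_diff(f, z: float, h: float = 1e-5) -> float:
    """Numerical derivative, used only as an independent check."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

for z in [-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0]:
    print(f"z={z:5.1f}  analytic={gelu_grad(z):8.4f}  numeric={central_diff(gelu, z):8.4f}")
```

The two columns should agree to the printed precision. The sign of the analytic value at z = −1 and z = −3 shows that the gradient is negative, not merely small, to the left of the minimum.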

Transformer FFN Block

Every Transformer layer has a feed-forward sublayer with this structure:

FFN(x) = W₂ · GELU(W₁x + b₁) + b₂

Two linear projections with GELU between them. In BERT-base: d_model=768, d_ff=3072. Every token embedding passes through this block at each layer. The GELU sits between the 768→3072 and 3072→768 projections, once in each of BERT-base's 12 layers, so it is applied 12 times per forward pass.

A common argument for GELU over ReLU is the dead-neuron problem at scale. With 3072 hidden units in each of 12 layers, ReLU can leave a fraction of neurons with zero gradient for most inputs during training. GELU softens this by keeping gradients non-zero for negative pre-activations, although they become very small for strongly negative z.
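The FFN formula above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not BERT's actual implementation: the weights are random stand-ins for trained parameters, n_tokens is arbitrary, and the code uses the row-vector convention (x @ W) rather than the column-vector W·x written in the formula.

```python
import numpy as np

def gelu_tanh(x: np.ndarray) -> np.ndarray:
    """Tanh approximation of GELU, applied elementwise."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, w1, b1, w2, b2):
    """Position-wise feed-forward block: expand, apply GELU, project back."""
    hidden = gelu_tanh(x @ w1 + b1)   # shape (tokens, d_ff)
    return hidden @ w2 + b2           # shape (tokens, d_model)

d_model, d_ff, n_tokens = 768, 3072, 4   # BERT-base widths; n_tokens is arbitrary
rng = np.random.default_rng(0)           # random weights stand in for trained ones
w1 = rng.normal(0.0, 0.02, size=(d_model, d_ff))
b1 = np.zeros(d_ff)
w2 = rng.normal(0.0, 0.02, size=(d_ff, d_model))
b2 = np.zeros(d_model)

x = rng.normal(size=(n_tokens, d_model))
out = ffn(x, w1, b1, w2, b2)
print(x.shape, "->", out.shape)   # (4, 768) -> (4, 768)
```

Each token row is processed independently, which is why the block is called position-wise: the only mixing across tokens happens in the attention sublayer, not here.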

Comparison: ReLU vs ELU vs GELU

z	ReLU	ELU (α=1)	GELU
−3	0	−0.9502	−0.0040
−1	0	−0.6321	−0.1587
0	0	0.0000	0.0000
0.5	0.5	0.5000	0.3457
1	1	1.0000	0.8413
3	3	3.0000	2.9960

ELU saturates toward −α as z → −∞, so strongly negative inputs keep a sizeable negative output. GELU drives negative inputs toward 0 as z → −∞. The choice between them depends on whether you want a bounded but substantial negative output (ELU, useful for centering activations around zero) or a near-zero negative output (GELU, which matches Transformer design goals).
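To see the two tails side by side, the sketch below pushes z further negative than the table does. It assumes α = 1 for ELU, as in the table, and writes Φ(z) with math.erfc so that very small gate values are not lost to rounding. The function names are illustrative.

```python
import math

def relu(z: float) -> float:
    return max(0.0, z)

def elu(z: float, alpha: float = 1.0) -> float:
    """ELU: identity for positive z, alpha * (exp(z) - 1) otherwise."""
    return z if z > 0 else alpha * (math.exp(z) - 1.0)

def gelu(z: float) -> float:
    """Exact GELU, with Phi(z) written as 0.5 * erfc(-z / sqrt(2))."""
    return 0.5 * z * math.erfc(-z / math.sqrt(2.0))

# Increasingly negative inputs: ELU settles at -alpha, GELU returns to 0.
for z in [-1.0, -3.0, -6.0, -10.0]:
    print(f"z={z:6.1f}  ReLU={relu(z):4.1f}  ELU={elu(z):8.4f}  GELU={gelu(z):8.4f}")
```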

Code

python

import numpy as np
from scipy.stats import norm

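def gelu_exact(z):
    # Exact GELU: z times the standard normal CDF.
    return z * norm.cdf(z)

def gelu_approx(z):
    # Tanh approximation with the 0.044715 cubic correction.
    return 0.5 * z * (1 + np.tanh(np.sqrt(2/np.pi) * (z + 0.044715 * z**3)))

def gelu_grad(z):
    # Derivative of the exact form: Phi(z) + z * phi(z).
    return norm.cdf(z) + z * norm.pdf(z)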

z_vals = np.array([-3., -1., 0., 0.5, 1., 3.])
print(f"{'z':>5} | {'GELU_exact':>10} | {'GELU_approx':>11} | {'GELU_grad':>9}")
for z in z_vals:
    print(f"{z:5.1f} | {gelu_exact(z):10.4f} | {gelu_approx(z):11.4f} | {gelu_grad(z):9.4f}")

text

    z | GELU_exact | GELU_approx | GELU_grad
 -3.0 |    -0.0040 |     -0.0036 |   -0.0119
 -1.0 |    -0.1587 |     -0.1588 |   -0.0833
  0.0 |     0.0000 |      0.0000 |    0.5000
  0.5 |     0.3457 |      0.3457 |    0.8675
  1.0 |     0.8413 |      0.8412 |    1.0833
  3.0 |     2.9960 |      2.9964 |    1.0119

Notice the gradient column: at z=−3, GELU' ≈ −0.012. It is small, and negative because GELU is climbing back toward zero from its shallow dip, but it is not zero. ReLU's gradient there is exactly 0. Across millions of parameters in a deep Transformer, that difference in gradient flow adds up.

Related Concepts

GELU builds on ReLU (04-relu.md) by replacing the hard threshold with a probabilistic gate, and on tanh (03-tanh.md), which appears in the fast approximation. ELU (06-elu.md) solves the dead-neuron problem differently: it uses an exponential for negative inputs rather than a Gaussian gate. Understanding GELU unlocks SwiGLU (11-swiglu.md), which extends the gating idea by using a learned second projection to control how much of the Swish-gated signal passes through. LLaMA and Mistral use SwiGLU in place of the standard GELU FFN block. Gemma uses the closely related GeGLU, which gates with GELU instead of Swish.

Honest Limitations

GELU is more expensive than ReLU: computing Φ(z) requires the error function (or the tanh approximation), which costs on the order of a few times the arithmetic of a max(0,z) call. At inference on edge hardware this can matter; use the approximate form or consider ReLU if latency is critical.

The approximate form introduces a small error on the order of 1e-4. The error vanishes near z=0 and is largest at moderate |z| (about 4e-4 at z=±3). For most training this is irrelevant, but if you are comparing activation functions in a controlled experiment or need exact reproducibility, use the exact form and accept the compute cost.

Like ReLU, GELU outputs are skewed positive: the output never drops below about −0.17 (reached near z ≈ −0.75), however negative the input is. GELU features are therefore not zero-centered, which can bias weight updates in the next layer in the same way as ReLU. Batch normalization or layer normalization after the activation layer largely corrects for this in practice.
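The lower bound on GELU's output can be located numerically with a simple grid search. The sketch below is a rough check rather than a proper optimizer. The grid range and resolution are arbitrary assumptions.

```python
import math

def gelu(z: float) -> float:
    """Exact GELU, with Phi(z) written as 0.5 * erfc(-z / sqrt(2))."""
    return 0.5 * z * math.erfc(-z / math.sqrt(2.0))

# Search z from -5.000 to 0.000 in steps of 0.001 for the smallest output.
grid = [i / 1000.0 for i in range(-5000, 1)]
z_min = min(grid, key=gelu)
print(f"min GELU on grid: {gelu(z_min):.4f} at z = {z_min:.3f}")
```

The minimum sits where the derivative Φ(z) + z·φ(z) crosses zero, which is the same point that separates the positive-gradient and negative-gradient regions.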

Test Your Understanding

GELU(0) = 0 even though Φ(0) = 0.5. Why? And what does GELU'(0) equal?
A neuron in BERT's FFN block receives pre-activation z = −2.5 at a particular position. Compute GELU(−2.5) using the exact formula. What gradient does it produce? Would ReLU produce a useful gradient here?
You replace GELU with ReLU in all 12 FFN blocks of BERT-base and fine-tune on a classification task. After 3 epochs, validation loss is higher than with GELU. What is the most likely cause, and how would you verify it?
The approximate GELU formula uses 0.044715z³. What happens to this term when z is small (say z=0.1)? What does the approximation reduce to in that limit, and which activation does it resemble?
A team argues that since GELU has no region of exactly zero gradient (no dead neurons), training will always converge faster than with ReLU. Give a specific scenario where GELU could still slow training compared to ReLU despite having non-zero gradients.
