GRU — In-depth Intuition
The previous post previewed GRU as one of several LSTM variants. It deserves its own deep dive because it isn't a minor tweak: it's a genuine simplification of the whole cell. Cho et al. (2014) introduced it, and follow-up comparisons such as Chung et al. (2014) found that it matches LSTM's performance on many tasks with noticeably fewer parameters. The core move: stop tracking cell state and hidden state as two separate vectors, and fold their jobs into one.
Anchor: continuing the stock price sequence, h_dim=4, x_dim=1. h₀=[0,0,0,0], x₁=0.5 (a normalized price).
GRU vs LSTM: The Core Simplification
LSTM tracks 2 states (hₜ, Cₜ) and computes 4 learned components to manage the relationship between them: three gates (fₜ, iₜ, oₜ) and a candidate (C̃ₜ). GRU merges Cₜ and hₜ into a single state hₜ and needs only 3 components: a reset gate rₜ, an update gate zₜ, and a candidate ñₜ. Fewer moving parts, fewer weight matrices, and often comparable accuracy in published comparisons.
GRU Equations
rₜ = σ(Wr·[hₜ₋₁, xₜ] + br) # reset gate
zₜ = σ(Wz·[hₜ₋₁, xₜ] + bz) # update gate
ñₜ = tanh(Wn·[rₜ⊙hₜ₋₁, xₜ] + bn) # candidate hidden state
hₜ = (1−zₜ)⊙hₜ₋₁ + zₜ⊙ñₜ # hidden state update
Note where rₜ appears: it scales hₜ₋₁ before the candidate hidden state is computed, not after. This is different from every LSTM gate, which all scale things after their respective transformations.
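One way to see what this placement does is to split Wn into its hidden-state columns and its input columns. The sketch below is illustrative only: the names Wn_h and Wn_x, the random weights, and the example values for h_prev and r are assumptions, not part of the post's anchor. It checks that the concatenated form of the candidate equals the split form, and that the reset gate only ever touches the hidden-state term.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, chosen only for reproducibility
h_dim, x_dim = 4, 1

Wn = rng.normal(scale=0.1, size=(h_dim, h_dim + x_dim))
bn = np.zeros(h_dim)
Wn_h, Wn_x = Wn[:, :h_dim], Wn[:, h_dim:]  # hypothetical names for the two column blocks

h_prev = np.array([0.2, -0.1, 0.4, 0.3])   # made-up previous hidden state
x = np.array([0.5])
r = np.array([0.0, 1.0, 0.5, 0.5])         # made-up reset gate values

# Concatenated form, as written in the candidate equation
n_concat = np.tanh(Wn @ np.concatenate([r * h_prev, x]) + bn)

# Split form: the reset gate scales h_prev before the hidden-state weights see it
n_split = np.tanh(Wn_h @ (r * h_prev) + Wn_x @ x + bn)

# With r = 0 in every dimension, the candidate depends on x alone
n_fresh = np.tanh(Wn @ np.concatenate([np.zeros(h_dim), x]) + bn)
n_input_only = np.tanh(Wn_x @ x + bn)

print(np.allclose(n_concat, n_split))
print(np.allclose(n_fresh, n_input_only))
```

Both checks hold for any weights, since the concatenation is just a compact way of writing the two matrix products.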
Gate Intuitions
Reset gate rₜ controls how much of the previous hidden state gets used when forming the new candidate:
- rₜ ≈ 0: the candidate ñₜ is computed almost entirely from xₜ alone — effectively "start fresh," ignoring history for this step's proposal.
- rₜ ≈ 1: the full hₜ₋₁ is used — behaves like a standard RNN's candidate computation.
Update gate zₜ controls how much the hidden state actually changes — it plays the combined role of LSTM's forget and input gates:
- zₜ ≈ 0: hₜ stays essentially equal to hₜ₋₁ — long-term memory preserved, equivalent to LSTM's forget=1, input=0.
- zₜ ≈ 1: hₜ is almost fully replaced by ñₜ — full attention to the current input, equivalent to LSTM's forget=0, input=1.
Where LSTM lets forget and input vary independently (and post 07 noted this can let a model both "keep everything and add everything"), GRU's single zₜ ties them together by construction: whatever fraction isn't kept is exactly the fraction that gets replaced.
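The limiting cases of the update gate can be checked with a few lines of NumPy. This is a minimal sketch, assuming made-up values for h_prev and the candidate; the helper name gru_update is hypothetical and covers only the final blending step, not the full cell.

```python
import numpy as np

def gru_update(h_prev, n_tilde, z):
    """Blend old state and candidate: (1 - z) * h_prev + z * n_tilde."""
    return (1 - z) * h_prev + z * n_tilde

h_prev = np.array([0.8, -0.6, 0.3, 0.1])    # made-up previous hidden state
n_tilde = np.array([-0.2, 0.5, 0.0, 0.9])   # made-up candidate

h_keep = gru_update(h_prev, n_tilde, z=0.0)      # z = 0: state is carried over unchanged
h_replace = gru_update(h_prev, n_tilde, z=1.0)   # z = 1: state is replaced by the candidate
h_mix = gru_update(h_prev, n_tilde, z=0.25)      # in between: 75% old, 25% new

print(h_keep)
print(h_replace)
print(h_mix)
```

Because the keep and replace weights are 1−z and z, they sum to 1 in every dimension, which is the coupling described above.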
Numerical Walkthrough
With h₀ = [0,0,0,0] and x₁ = 0.5:
rₜ = σ(Wr·[h₀,x₁] + br) = [0.4971, 0.5068, 0.4785, 0.4824]
zₜ = σ(Wz·[h₀,x₁] + bz) = [0.4932, 0.4964, 0.5103, 0.5025]
ñₜ = tanh(Wn·[rₜ⊙h₀, x₁] + bn) = [-0.0738, -0.0879, 0.0515, 0.0487]
(Since h₀ = 0, rₜ⊙h₀ = 0 regardless of rₜ's value — at the very first timestep the reset gate has nothing to act on yet, same as the forget gate in the LSTM's first-step example from post 02.)
hₜ = (1−zₜ)⊙h₀ + zₜ⊙ñₜ = 0 + zₜ⊙ñₜ = [-0.0364, -0.0436, 0.0263, 0.0245]
| Phase | Formula | Values substituted (dim 0) | Result (all 4 dims) |
|---|---|---|---|
| Reset gate | rₜ = σ(Wr·[h₀,x₁]+br) | σ(Wr[0]·[0,0,0,0,0.5]+0) | [0.4971, 0.5068, 0.4785, 0.4824] |
| Update gate | zₜ = σ(Wz·[h₀,x₁]+bz) | σ(Wz[0]·[0,0,0,0,0.5]+0) | [0.4932, 0.4964, 0.5103, 0.5025] |
| Candidate state | ñₜ = tanh(Wn·[rₜ⊙h₀,x₁]+bn) | tanh(Wn[0]·[0,0,0,0,0.5]+0) since rₜ⊙h₀=0 | [-0.0738, -0.0879, 0.0515, 0.0487] |
| Hidden state update | hₜ = (1−zₜ)⊙h₀ + zₜ⊙ñₜ | (1−0.4932)·0 + 0.4932·(-0.0738) | [-0.0364, -0.0436, 0.0263, 0.0245] |
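The last row of the table can be re-derived from the rounded gate values alone. The sketch below is a quick arithmetic check, assuming only the four-decimal values printed in this post. Because it starts from rounded inputs, it is expected to agree with the stated result to four decimals rather than to full precision.

```python
import numpy as np

# Rounded values copied from the walkthrough
z = np.array([0.4932, 0.4964, 0.5103, 0.5025])
n_tilde = np.array([-0.0738, -0.0879, 0.0515, 0.0487])
h0 = np.zeros(4)

# h = (1 - z) * h0 + z * n_tilde; the first term vanishes because h0 = 0
h1 = (1 - z) * h0 + z * n_tilde

print(np.round(h1, 4))
```

Rounded to four decimals, this reproduces the hidden state [-0.0364, -0.0436, 0.0263, 0.0245] from the table.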
GRU vs LSTM: When to Use Which
| | LSTM | GRU |
|---|---|---|
| Parameters | More | Fewer (~75% of LSTM's) |
| Training speed | Slower | Faster |
| Long sequences (T>500) | Slightly better | Slightly worse |
| Short-medium sequences | Similar | Similar |
| Recommendation | When accuracy is critical | When speed/simplicity matters |
General rule of thumb: try GRU first — it's cheaper to train and often matches LSTM. Switch to LSTM only if GRU's accuracy turns out insufficient for the task.
Parameter Count Comparison
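LSTM (h=4, x=1): 4 components (f, i, C̃, o) × (h+x)×h + h = 4 × 5×4 + 4 = 84
GRU (h=4, x=1): 3 components (r, z, ñ) × (h+x)×h + h = 3 × 5×4 + 4 = 64
These formulas count every weight matrix but only one bias vector of size h. The code in this post gives each component its own bias (br, bz, bn), which makes the full counts 4 × (5×4 + 4) = 96 for LSTM and 3 × (5×4 + 4) = 72 for GRU.
GRU has (84−64)/84 = 23.8% fewer parameters under the simplified count, and exactly 25% fewer when per-component biases are included. The absolute gap grows quadratically with h_dim, since the removed component saves a full (h+x)×h weight matrix, while the relative saving stays at or just under 25%.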
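These counts are easy to script for other sizes. The helper below is a sketch: the function name rnn_param_count and its per_component_bias flag are hypothetical. With the flag off it follows the formula used above (weight matrices plus one bias vector of size h), and with it on it gives every component its own bias vector.

```python
def rnn_param_count(n_components, h_dim, x_dim, per_component_bias=False):
    """Count parameters of a gated cell with n_components weight matrices on [h, x]."""
    weights = n_components * (h_dim + x_dim) * h_dim
    biases = n_components * h_dim if per_component_bias else h_dim
    return weights + biases

for h_dim, x_dim in [(4, 1), (64, 1), (256, 1)]:
    lstm = rnn_param_count(4, h_dim, x_dim)  # f, i, candidate, o
    gru = rnn_param_count(3, h_dim, x_dim)   # r, z, candidate
    saving = 100 * (lstm - gru) / lstm
    print(f"h_dim={h_dim:4d} LSTM={lstm:7d} GRU={gru:7d} saving={saving:.1f}%")
```

With per_component_bias=True the anchor sizes give 96 and 72, a saving of exactly 25%, because both the weights and the biases then scale with the number of components.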
Hyperparameter Sensitivity: Hidden Dimension Size
h_dim controls how much state the GRU can carry between timesteps, and it's the first knob to tune when a GRU under- or over-fits the anchor's stock sequence.
import numpy as np
def sigmoid(z): return 1/(1+np.exp(-z))
x_dim = 1
for h_dim in [2, 4, 8, 16, 32]:
    params = 3 * (h_dim + x_dim) * h_dim + h_dim  # simplified count: 3 weight matrices, one bias vector
    Wr = np.random.randn(h_dim, h_dim + x_dim) * 0.1  # unseeded, so r differs on every run
    br = np.zeros(h_dim)
    inp = np.concatenate([np.zeros(h_dim), [0.5]])  # [h0, x1] with h0 = 0
    r = sigmoid(Wr @ inp + br)
    print(f"h_dim={h_dim:3d} params={params:5d} r_range=[{r.min():.4f}, {r.max():.4f}]")

The params column is deterministic:

| h_dim | params |
|---|---|
| 2 | 20 |
| 4 | 64 |
| 8 | 224 |
| 16 | 832 |
| 32 | 3200 |

Parameter count grows quadratically with h_dim (the 3×(h+x)×h term), so doubling the hidden size roughly quadruples the weight count once x_dim is small relative to h_dim.
The r_range column changes from run to run because the snippet is unseeded, but it stays close to 0.5 at every h_dim: with h₀ = 0 only the x₁ column of Wr contributes, so each pre-activation is 0.5·w with w drawn from N(0, 0.1²), and the sigmoid of a value that small lies within a few hundredths of 0.5. A larger h_dim draws more gate values and so tends to show a slightly wider min-max range, but that is a sampling effect of the untrained initialization, not evidence of extra capacity.
The capacity argument is about the state itself. At h_dim=2 the state is too narrow to encode much beyond a single trend direction. At h_dim=32 the network has more room to specialize per-dimension behavior, but at 3200 parameters for this toy setup the model is overkill for a single scalar input and risks overfitting a short stock sequence. The anchor's h_dim=4 sits in the useful middle: enough capacity to show distinct reset/update behavior per dimension without the parameter count dominating a small dataset.
Code
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

h_dim, x_dim = 4, 1
np.random.seed(42)
# One weight matrix and one bias per component, each acting on [h_prev, x]
Wr = np.random.randn(h_dim, h_dim+x_dim) * 0.1; br = np.zeros(h_dim)
Wz = np.random.randn(h_dim, h_dim+x_dim) * 0.1; bz = np.zeros(h_dim)
Wn = np.random.randn(h_dim, h_dim+x_dim) * 0.1; bn = np.zeros(h_dim)

def gru_step(x, h_prev):
    inp = np.concatenate([h_prev, [x]])
    r = sigmoid(Wr @ inp + br)  # reset gate
    z = sigmoid(Wz @ inp + bz)  # update gate
    inp_reset = np.concatenate([r * h_prev, [x]])  # reset is applied before the candidate
    n_tilde = np.tanh(Wn @ inp_reset + bn)  # candidate hidden state
    h = (1 - z) * h_prev + z * n_tilde  # blend old state and candidate
    return h, {'r': r, 'z': z, 'n_tilde': n_tilde}

h = np.zeros(h_dim)
h, gates = gru_step(0.5, h)
print("Reset gate r: ", np.round(gates['r'], 4))
print("Update gate z: ", np.round(gates['z'], 4))
print("Candidate ñ: ", np.round(gates['n_tilde'], 4))
print("New hidden state:", np.round(h, 4))

Output:
Reset gate r: [0.4971 0.5068 0.4785 0.4824]
Update gate z: [0.4932 0.4964 0.5103 0.5025]
Candidate ñ: [-0.0738 -0.0879 0.0515 0.0487]
New hidden state: [-0.0364 -0.0436 0.0263 0.0245]
Related Concepts
Where this builds from: LSTM's full architecture (post 02) and the variants overview (post 07), which previewed GRU as one of several structural modifications to the standard cell.
Where this leads: Both bidirectional RNNs and encoder-decoder architectures — covered in the advanced architectures section of this series — commonly use GRU cells as their default building block, precisely because of the parameter and speed advantages shown here.
Honest Limitations
GRU's single hidden state has less representational capacity than LSTM's separate cell state plus hidden state, and this gap tends to show up on very long sequences (T>500). With no dedicated "long-term-only" channel analogous to Cₜ, GRU can underperform LSTM on tasks with long-range dependencies, even though it matches or beats LSTM on short-to-medium sequences.
The update gate's coupling (zₜ for replace, 1−zₜ for keep) means that if the candidate ñₜ is noisy at a step where zₜ happens to be large, that noise directly overwrites a correspondingly large fraction of the previous hidden state. The keep and replace fractions always sum to 1, so there is no independent "keep all of the old state, add a little new" setting of the kind LSTM's separate forget and input gates allow (for example forget≈1, input≈0.1). A single bad candidate can therefore do more damage per step in GRU than in LSTM.
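The coupling can be made concrete with a toy comparison. The sketch below is a simplification under stated assumptions: the state and candidate values are made up, the function names are hypothetical, and the LSTM-style update is reduced to its cell-state line (forget·old + input·candidate) with no output gate. It shows only that GRU's keep fraction is forced to 1−z, while independent gates can pick the two fractions separately.

```python
import numpy as np

h_prev = np.array([0.8, -0.6, 0.3, 0.1])   # made-up state worth preserving
noisy = np.array([-0.9, 0.9, -0.9, 0.9])   # made-up noisy candidate

def coupled_update(prev, cand, z):
    """GRU-style: the keep fraction is tied to the write fraction."""
    return (1 - z) * prev + z * cand

def independent_update(prev, cand, f, i):
    """LSTM-style cell line: keep and write fractions are chosen separately."""
    return f * prev + i * cand

# A large update gate overwrites most of the old state with the noisy candidate
h_gru = coupled_update(h_prev, noisy, z=0.9)

# Independent gates can keep everything and still write only a little
c_lstm = independent_update(h_prev, noisy, f=1.0, i=0.1)

drift_gru = np.linalg.norm(h_gru - h_prev)
drift_lstm = np.linalg.norm(c_lstm - h_prev)
print(f"coupled drift: {drift_gru:.3f}, independent drift: {drift_lstm:.3f}")
```

This is not a benchmark. A GRU with z=0.1 would also write only a little, but it would then keep 90% of the old state, whereas f=1.0, i=0.1 keeps all of it.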
Test Your Understanding
- Why does the reset gate rₜ scale hₜ₋₁ before it enters the candidate computation, rather than scaling the candidate ñₜ afterward the way LSTM's gates operate on already-computed quantities?
- At the anchor's first timestep, rₜ⊙h₀ = 0 regardless of rₜ's actual values, because h₀=0. Recompute what the reset gate's effect would be at a hypothetical second timestep where h₁ = [-0.0364, -0.0436, 0.0263, 0.0245] (the result from this post) and rₜ = [0.1, 0.9, 0.5, 0.5].
- Using the parameter formulas given (LSTM: 4×(h+x)×h+h, GRU: 3×(h+x)×h+h), compute both parameter counts at h_dim=128, x_dim=16, and confirm whether the percentage reduction stays close to 24% or changes at larger scale.
- A team is choosing between LSTM and GRU for a task with sequences of length ~800 (sensor readings over multiple days) where long-range dependencies are known to matter. Based on the comparison table and limitations in this post, what would you recommend, and what would make you reconsider?
- Given that GRU's update gate zₜ plays the combined role of LSTM's forget and input gates, construct a specific hypothetical hidden-state scenario where LSTM's independent forget/input gating would produce a clearly different (and better) result than GRU's coupled zₜ/(1−zₜ), building on the "keep a lot and add a little" case that coupled gates (post 07) also couldn't represent.