GRU — In-depth Intuition

Jul 3, 2026 · 7 min read · By Mohammed Vasim
deep-learning · neural-networks · machine-learning · representation-learning

The previous post previewed GRU as one of several LSTM variants. It deserves its own deep dive because it isn't a minor tweak: it's a genuine simplification of the whole cell, introduced by Cho et al. (2014), and later empirical comparisons (notably Chung et al., 2014) found that it matches LSTM's performance on many tasks with noticeably fewer parameters. The core move: stop tracking cell state and hidden state as two separate vectors, and fold their jobs into one.

Anchor: continuing the stock price sequence, h_dim=4, x_dim=1. h₀=[0,0,0,0], x₁=0.5 (a normalized price).


GRU vs LSTM: The Core Simplification

LSTM tracks 2 states (hₜ, Cₜ) and computes 4 learned components to manage the relationship between them: 3 gates (fₜ, iₜ, oₜ) plus the candidate C̃ₜ. GRU merges Cₜ and hₜ into a single state hₜ and needs only 3 learned components: a reset gate rₜ, an update gate zₜ, and the candidate ñₜ. Fewer moving parts, fewer weight matrices, and, in the empirical comparisons that followed Cho et al.'s paper, often comparable accuracy.

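[Figure: GRU cell showing the reset gate, update gate, candidate, and blend. The concatenated input [hₜ₋₁, xₜ] feeds rₜ = σ(·) and zₜ = σ(·); rₜ⊙hₜ₋₁ feeds ñₜ = tanh(·); the output is hₜ = (1−zₜ)⊙hₜ₋₁ + zₜ⊙ñₜ. A single state hₜ carries forward, with no separate cell state.]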

GRU Equations

```text
rₜ = σ(Wr·[hₜ₋₁, xₜ] + br)         # reset gate
zₜ = σ(Wz·[hₜ₋₁, xₜ] + bz)         # update gate
ñₜ = tanh(Wn·[rₜ⊙hₜ₋₁, xₜ] + bn)   # candidate hidden state
hₜ = (1−zₜ)⊙hₜ₋₁ + zₜ⊙ñₜ           # hidden state update
```

Note where rₜ appears: it scales hₜ₋₁ before the candidate hidden state is computed, not after. No LSTM gate works this way: fₜ, iₜ, and oₜ each multiply a quantity that has already been computed (Cₜ₋₁, C̃ₜ, and tanh(Cₜ) respectively).
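
To make the placement concrete, here is a minimal NumPy sketch comparing the GRU's "gate before the transformation" with a hypothetical variant that gates the finished candidate instead. The weights, previous state, and gate values are made-up illustration values, not the ones from this post's walkthrough.

```python
import numpy as np

rng = np.random.default_rng(0)                            # arbitrary seed, illustration only
h_dim, x_dim = 4, 1
Wn = rng.normal(scale=0.5, size=(h_dim, h_dim + x_dim))   # hypothetical candidate weights
h_prev = np.array([0.8, -0.6, 0.3, -0.9])                 # hypothetical previous state
x = 0.5
r = np.array([0.0, 0.0, 1.0, 1.0])                        # hypothetical reset gate values

# GRU: the reset gate scales h_prev *before* the tanh transformation
cand_gru = np.tanh(Wn @ np.concatenate([r * h_prev, [x]]))

# Hypothetical alternative: compute the candidate first, then gate it
cand_after = r * np.tanh(Wn @ np.concatenate([h_prev, [x]]))

print("gate before:", np.round(cand_gru, 4))
print("gate after: ", np.round(cand_after, 4))
```

In the "gate after" variant, the dimensions with r = 0 get a candidate of exactly zero, so xₜ is thrown away along with the history. In the GRU, r = 0 only hides the chosen entries of hₜ₋₁ from the candidate computation; xₜ still contributes to every dimension.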


Gate Intuitions

Reset gate rₜ controls how much of the previous hidden state gets used when forming the new candidate:

  • rₜ ≈ 0: the candidate ñₜ is computed almost entirely from xₜ alone — effectively "start fresh," ignoring history for this step's proposal.
  • rₜ ≈ 1: the full hₜ₋₁ is used — behaves like a standard RNN's candidate computation.
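
The two extremes can be checked directly. This is a small sketch with made-up weights and two made-up histories: with r = 0 the candidate is identical for both histories, and with r = 1 it is the plain RNN candidate tanh(Wn·[hₜ₋₁, xₜ] + bn).

```python
import numpy as np

rng = np.random.default_rng(1)                            # arbitrary seed, illustration only
h_dim, x_dim = 4, 1
Wn = rng.normal(scale=0.5, size=(h_dim, h_dim + x_dim))   # hypothetical weights
bn = np.zeros(h_dim)
x = 0.5

def candidate(r, h_prev):
    # GRU candidate: reset gate applied to h_prev before the transformation
    return np.tanh(Wn @ np.concatenate([r * h_prev, [x]]) + bn)

h_a = np.array([0.9, -0.9, 0.5, -0.5])   # two different hypothetical histories
h_b = np.array([-0.2, 0.7, -0.8, 0.1])

r_zero, r_one = np.zeros(h_dim), np.ones(h_dim)

# r = 0: history is invisible, so both histories give the same candidate
same_when_reset = np.allclose(candidate(r_zero, h_a), candidate(r_zero, h_b))

# r = 1: history passes through unchanged, as in a standard RNN
rnn_candidate = np.tanh(Wn @ np.concatenate([h_a, [x]]) + bn)
matches_rnn = np.allclose(candidate(r_one, h_a), rnn_candidate)
differs_when_open = not np.allclose(candidate(r_one, h_a), candidate(r_one, h_b))

print(same_when_reset, matches_rnn, differs_when_open)
```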

Update gate zₜ controls how much the hidden state actually changes — it plays the combined role of LSTM's forget and input gates:

  • zₜ ≈ 0: hₜ stays essentially equal to hₜ₋₁ — long-term memory preserved, equivalent to LSTM's forget=1, input=0.
  • zₜ ≈ 1: hₜ is almost fully replaced by ñₜ — full attention to the current input, equivalent to LSTM's forget=0, input=1.

Where LSTM lets forget and input vary independently (and post 07 noted this can let a model both "keep everything and add everything"), GRU's single zₜ ties them together by construction: whatever fraction isn't kept is exactly the fraction that gets replaced.
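
A short sketch of that coupling, using made-up two-dimensional states. The function names are illustrative; `lstm_cell_update` is only the cell-state line of an LSTM (Cₜ = fₜ⊙Cₜ₋₁ + iₜ⊙C̃ₜ), not a full cell.

```python
import numpy as np

prev = np.array([0.6, -0.4])   # hypothetical previous state
cand = np.array([0.9, 0.5])    # hypothetical candidate

def lstm_cell_update(f, i, c_prev, c_tilde):
    # LSTM cell-state update: forget and input gates are independent
    return f * c_prev + i * c_tilde

def gru_update(z, h_prev, n_tilde):
    # GRU update: one gate, so keep-fraction and replace-fraction sum to 1
    return (1 - z) * h_prev + z * n_tilde

# The GRU extremes from the bullets above
keep_all = gru_update(0.0, prev, cand)      # equals prev
replace_all = gru_update(1.0, prev, cand)   # equals cand

# Tying the LSTM gates as f = 1 - z, i = z reproduces the GRU blend
z = 0.3
tied = lstm_cell_update(1 - z, z, prev, cand)
blend = gru_update(z, prev, cand)

# Untied gates can keep everything and add everything; the result can
# leave the range between prev and cand, which no z in [0, 1] can do
keep_and_add = lstm_cell_update(1.0, 1.0, prev, cand)

print(keep_all, replace_all)
print(tied, blend)
print(keep_and_add)
```

Because the GRU update is a convex combination, each dimension of hₜ always lies between the corresponding entries of hₜ₋₁ and ñₜ. Here the first entry of `keep_and_add` is 1.5, outside the interval [0.6, 0.9] that the GRU blend is confined to.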


Numerical Walkthrough

With h₀ = [0,0,0,0] and x₁ = 0.5:

rₜ = σ(Wr·[h₀,x₁] + br) = [0.4971, 0.5068, 0.4785, 0.4824]

zₜ = σ(Wz·[h₀,x₁] + bz) = [0.4932, 0.4964, 0.5103, 0.5025]

ñₜ = tanh(Wn·[rₜ⊙h₀, x₁] + bn) = [-0.0738, -0.0879, 0.0515, 0.0487]

(Since h₀ = 0, rₜ⊙h₀ = 0 regardless of rₜ's value — at the very first timestep the reset gate has nothing to act on yet, same as the forget gate in the LSTM's first-step example from post 02.)

hₜ = (1−zₜ)⊙h₀ + zₜ⊙ñₜ = 0 + zₜ⊙ñₜ = [-0.0364, -0.0436, 0.0263, 0.0245]

| Phase | Formula | Values substituted (dim 0) | Result (all 4 dims) |
|---|---|---|---|
| Reset gate | rₜ = σ(Wr·[h₀,x₁]+br) | σ(Wr[0]·[0,0,0,0,0.5]+0) | [0.4971, 0.5068, 0.4785, 0.4824] |
| Update gate | zₜ = σ(Wz·[h₀,x₁]+bz) | σ(Wz[0]·[0,0,0,0,0.5]+0) | [0.4932, 0.4964, 0.5103, 0.5025] |
| Candidate state | ñₜ = tanh(Wn·[rₜ⊙h₀,x₁]+bn) | tanh(Wn[0]·[0,0,0,0,0.5]+0) since rₜ⊙h₀=0 | [-0.0738, -0.0879, 0.0515, 0.0487] |
| Hidden state update | hₜ = (1−zₜ)⊙h₀ + zₜ⊙ñₜ | (1−0.4932)·0 + 0.4932·(-0.0738) | [-0.0364, -0.0436, 0.0263, 0.0245] |
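
The last row can be checked by hand from the rounded gate values alone. This is a quick sanity check rather than a re-run of the cell, so agreement is only expected to about four decimal places.

```python
import numpy as np

# Rounded values copied from the walkthrough
z = np.array([0.4932, 0.4964, 0.5103, 0.5025])
n_tilde = np.array([-0.0738, -0.0879, 0.0515, 0.0487])
h0 = np.zeros(4)

# Hidden state update; the (1 - z) * h0 term vanishes because h0 = 0
h1 = (1 - z) * h0 + z * n_tilde
print(np.round(h1, 4))
```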

GRU vs LSTM: When to Use Which

| | LSTM | GRU |
|---|---|---|
| Parameters | More | Fewer (~75% of LSTM's) |
| Training speed | Slower | Faster |
| Long sequences (T>500) | Slightly better | Slightly worse |
| Short-medium sequences | Similar | Similar |
| Recommendation | When accuracy is critical | When speed/simplicity matters |

General rule of thumb: try GRU first — it's cheaper to train and often matches LSTM. Switch to LSTM only if GRU's accuracy turns out insufficient for the task.


Parameter Count Comparison

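LSTM (h=4, x=1): 4 components (f, i, C̃, o) × (h+x)×h + h = 4 × 5×4 + 4 = 84

GRU (h=4, x=1): 3 components (r, z, ñ) × (h+x)×h + h = 3 × 5×4 + 4 = 64

GRU has (84−64)/84 = 23.8% fewer parameters at these dimensions. Each removed component saves a full (h+x)×h weight matrix, so the absolute saving grows quadratically with h_dim, while the relative saving only creeps up toward 25%.

Note that these formulas count a single bias vector of size h per cell. With one bias vector per component, which is how the code in this post allocates br, bz, and bn, the counts become 4 × (5×4 + 4) = 96 and 3 × (5×4 + 4) = 72, an exact 25% reduction at any size.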
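
The counts above can be reproduced with a small helper. The function name and the `bias_per_component` flag are illustrative choices: the flag switches between the single-bias count used in the formulas above and a count with one bias vector per gate or candidate, which is how the code in this post allocates br, bz, and bn.

```python
def param_count(n_components, h_dim, x_dim, bias_per_component=False):
    # One (h_dim, h_dim + x_dim) weight matrix per component
    weights = n_components * (h_dim + x_dim) * h_dim
    # Either one shared bias vector or one bias vector per component
    biases = (n_components if bias_per_component else 1) * h_dim
    return weights + biases

for per_component in (False, True):
    lstm = param_count(4, 4, 1, per_component)
    gru = param_count(3, 4, 1, per_component)
    print(per_component, lstm, gru, f"{(lstm - gru) / lstm:.1%}")
# False 84 64 23.8%
# True 96 72 25.0%
```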


Hyperparameter Sensitivity: Hidden Dimension Size

h_dim controls how much state the GRU can carry between timesteps, and it's the first knob to tune when a GRU under- or over-fits the anchor's stock sequence.

```python
import numpy as np

def sigmoid(z): return 1 / (1 + np.exp(-z))

x_dim = 1
for h_dim in [2, 4, 8, 16, 32]:
    # 3 weight matrices of shape (h_dim, h_dim + x_dim) plus one bias vector
    params = 3 * (h_dim + x_dim) * h_dim + h_dim
    Wr = np.random.randn(h_dim, h_dim + x_dim) * 0.1   # unseeded: values differ per run
    br = np.zeros(h_dim)
    inp = np.concatenate([np.zeros(h_dim), [0.5]])     # h0 = 0, x1 = 0.5
    r = sigmoid(Wr @ inp + br)
    print(f"h_dim={h_dim:3d}  params={params:5d}  r_range=[{r.min():.4f}, {r.max():.4f}]")
```

The params column is deterministic:

```text
h_dim=  2  params=   20
h_dim=  4  params=   64
h_dim=  8  params=  224
h_dim= 16  params=  832
h_dim= 32  params= 3200
```

Parameter count grows quadratically with h_dim (the 3×(h+x)×h term), so doubling hidden size roughly quadruples the weight count once x_dim is small relative to h_dim: 832 becomes 3200 going from h_dim=16 to h_dim=32.

The r_range column, by contrast, is not a measure of capacity. With h₀ = 0, only the last column of Wr reaches the gate, so every entry of r is σ(0.05·w) with w drawn from a standard normal. Nearly all such values land between about 0.46 and 0.54 at every h_dim, and a larger h_dim widens the printed range only because more entries are drawn. (Calling np.random.seed(42) immediately before drawing Wr at h_dim=4 reproduces the walkthrough's range, [0.4785, 0.5068].) Gates near 0.5 at initialization say nothing about how they will behave after training.

What h_dim does change is how much the state can hold. At h_dim=2, the state is too narrow to encode much beyond a single trend direction. At h_dim=32, the 3200 parameters are overkill for a single scalar input and risk overfitting a short stock sequence. The anchor's h_dim=4 sits in the useful middle: enough capacity for distinct reset/update behavior per dimension without the parameter count dominating a small dataset.


Code

```python
import numpy as np

def sigmoid(z): return 1/(1+np.exp(-z))

h_dim, x_dim = 4, 1
np.random.seed(42)
# One weight matrix and one bias vector per component: reset, update, candidate
Wr = np.random.randn(h_dim, h_dim+x_dim) * 0.1; br = np.zeros(h_dim)
Wz = np.random.randn(h_dim, h_dim+x_dim) * 0.1; bz = np.zeros(h_dim)
Wn = np.random.randn(h_dim, h_dim+x_dim) * 0.1; bn = np.zeros(h_dim)

def gru_step(x, h_prev):
    inp = np.concatenate([h_prev, [x]])            # [h_{t-1}, x_t]
    r = sigmoid(Wr @ inp + br)                     # reset gate
    z = sigmoid(Wz @ inp + bz)                     # update gate
    inp_reset = np.concatenate([r * h_prev, [x]])  # reset applied before the candidate
    n_tilde = np.tanh(Wn @ inp_reset + bn)         # candidate hidden state
    h = (1 - z) * h_prev + z * n_tilde             # blend old state and candidate
    return h, {'r': r, 'z': z, 'n_tilde': n_tilde}

h = np.zeros(h_dim)
h, gates = gru_step(0.5, h)
print("Reset gate r:    ", np.round(gates['r'], 4))
print("Update gate z:   ", np.round(gates['z'], 4))
print("Candidate ñ:     ", np.round(gates['n_tilde'], 4))
print("New hidden state:", np.round(h, 4))
```

```text
Reset gate r:     [0.4971 0.5068 0.4785 0.4824]
Update gate z:    [0.4932 0.4964 0.5103 0.5025]
Candidate ñ:      [-0.0738 -0.0879  0.0515  0.0487]
New hidden state: [-0.0364 -0.0436  0.0263  0.0245]
```
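
A single step extends to a whole sequence by feeding each hₜ back in as the next step's hₜ₋₁. The sketch below repeats the same seed, draw order, and step logic so that it runs on its own. Only the first price (0.5) comes from the anchor; the other four are made-up values for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

h_dim, x_dim = 4, 1
np.random.seed(42)  # same seed and draw order as the listing above
Wr = np.random.randn(h_dim, h_dim + x_dim) * 0.1; br = np.zeros(h_dim)
Wz = np.random.randn(h_dim, h_dim + x_dim) * 0.1; bz = np.zeros(h_dim)
Wn = np.random.randn(h_dim, h_dim + x_dim) * 0.1; bn = np.zeros(h_dim)

def gru_step(x, h_prev):
    inp = np.concatenate([h_prev, [x]])
    r = sigmoid(Wr @ inp + br)                                       # reset gate
    z = sigmoid(Wz @ inp + bz)                                       # update gate
    n_tilde = np.tanh(Wn @ np.concatenate([r * h_prev, [x]]) + bn)   # candidate
    return (1 - z) * h_prev + z * n_tilde                            # new hidden state

# 0.5 is the anchor's x1; the remaining prices are hypothetical
prices = [0.5, 0.52, 0.48, 0.55, 0.60]

h = np.zeros(h_dim)
states = []
for x in prices:
    h = gru_step(x, h)      # the new state becomes the next step's h_prev
    states.append(h)
states = np.stack(states)   # shape (T, h_dim)

print(np.round(states[0], 4))   # first row reproduces the single-step result
print(states.shape)
```

From the second step on, hₜ₋₁ is no longer zero, so the reset gate finally has something to act on.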

Where this builds from: LSTM's full architecture (post 02) and the variants overview (post 07), which previewed GRU as one of several structural modifications to the standard cell.

Where this leads: Both bidirectional RNNs and encoder-decoder architectures — covered in the advanced architectures section of this series — commonly use GRU cells as their default building block, precisely because of the parameter and speed advantages shown here.


Honest Limitations

GRU's single hidden state has less representational capacity than LSTM's separate cell state plus hidden state, and the gap tends to show up on very long sequences (T>500). With no dedicated long-term channel analogous to Cₜ, GRU can underperform LSTM on tasks with long-range dependencies, even though it matches or beats LSTM on short-to-medium sequences.

The update gate's coupling (zₜ for replace, 1−zₜ for keep) means that if the candidate ñₜ is noisy at some step where zₜ happens to be large, that noise directly overwrites a correspondingly large fraction of the previous hidden state. GRU has no way to keep the old state in full while also writing a large update, which LSTM's separate forget and input gates allow (forget≈1 with input≈1), so a single bad candidate can erase more of the stored state per step in GRU than in LSTM.


Test Your Understanding

  1. Why does the reset gate rₜ scale hₜ₋₁ before it enters the candidate computation, rather than scaling the candidate ñₜ afterward the way LSTM's gates operate on already-computed quantities?

  2. At the anchor's first timestep, rₜ⊙h₀ = 0 regardless of rₜ's actual values, because h₀=0. Recompute what the reset gate's effect would be at a hypothetical second timestep where h₁ = [-0.0364, -0.0436, 0.0263, 0.0245] (the result from this post) and rₜ = [0.1, 0.9, 0.5, 0.5].

  3. Using the parameter formulas given (LSTM: 4×(h+x)×h+h, GRU: 3×(h+x)×h+h), compute both parameter counts at h_dim=128, x_dim=16, and confirm whether the percentage reduction stays close to 24% or changes at larger scale.

  4. A team is choosing between LSTM and GRU for a task with sequences of length ~800 (sensor readings over multiple days) where long-range dependencies are known to matter. Based on the comparison table and limitations in this post, what would you recommend, and what would make you reconsider?

  5. Given that GRU's update gate zₜ plays the combined role of LSTM's forget and input gates, construct a specific hypothetical hidden-state scenario where LSTM's independent forget/input gating would produce a clearly different (and better) result than GRU's coupled zₜ/(1−zₜ), building on the "keep everything and add everything" case that coupled gates (post 07) also couldn't represent.
