Variants of LSTM

Jul 3, 2026 · 8 min read · By Mohammed Vasim
Tags: deep-learning, neural-networks, machine-learning, representation-learning

The standard LSTM cell from the last four posts isn't the only design that solves the vanishing gradient problem — it's the one that stuck as the default. Several structural variations trade off expressiveness, parameter count, and training speed differently, and knowing what each one changes (and why) helps decide when the standard cell is overkill or insufficient.

Anchor for the numeric examples below: hₜ₋₁ = [0.3, -0.1, 0.2, 0.5], xₜ = [0.8, 0.1, -0.3, 0.6], Cₜ₋₁ = [0.6, -0.4, 0.8, 0.2] — the same setup used throughout this LSTM section.


Peephole Connections (Gers & Schmidhuber, 2000)

Standard LSTM gates only look at [hₜ₋₁, xₜ] — the previous hidden state and current input. They never see the cell state directly, even though the cell state is what they're gating. Peephole connections fix that by letting gates also see Cₜ₋₁:

fₜ = σ(Wf·[Cₜ₋₁, hₜ₋₁, xₜ] + bf)

With Cₜ₋₁ concatenated in, the input grows from 8-dim ([hₜ₋₁,xₜ]) to 12-dim ([Cₜ₋₁,hₜ₋₁,xₜ]). Computing fₜ with this extended input on the anchor:

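| Dim | Cₜ₋₁ | hₜ₋₁ | xₜ | fₜ (peephole) |
|---|---|---|---|---|
| 1 | 0.6 | 0.3 | 0.8 | 0.5288 |
| 2 | -0.4 | -0.1 | 0.1 | 0.5508 |
| 3 | 0.8 | 0.2 | -0.3 | 0.4868 |
| 4 | 0.2 | 0.5 | 0.6 | 0.4933 |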

The Cₜ₋₁ column is the direct contribution the standard LSTM's forget gate never sees — it only reaches the gate indirectly, through however much of it hₜ₋₁ managed to summarize.
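To make the shapes concrete, here is a minimal NumPy sketch of the peephole forget gate on the anchor vectors. The post does not list the weights behind the table, so `W_f` and `b_f` below are random placeholders (an assumption) and the printed gate values will not match the table. What the sketch does show is the 12-dim gate input and which slice of the weight matrix carries the cell-state contribution.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Anchor vectors from the post
h_prev = np.array([0.3, -0.1, 0.2, 0.5])
x_t = np.array([0.8, 0.1, -0.3, 0.6])
c_prev = np.array([0.6, -0.4, 0.8, 0.2])
h_dim = h_prev.size

# Hypothetical weights: small random values, NOT the ones behind the table
rng = np.random.default_rng(0)
W_f = rng.normal(scale=0.1, size=(h_dim, c_prev.size + h_prev.size + x_t.size))
b_f = np.zeros(h_dim)

# Peephole gate input: [C_{t-1}, h_{t-1}, x_t]
gate_input = np.concatenate([c_prev, h_prev, x_t])
f_t = sigmoid(W_f @ gate_input + b_f)

# The first h_dim columns of W_f multiply C_{t-1}; a standard gate has no such term
peephole_term = W_f[:, :h_dim] @ c_prev

print(gate_input.shape)  # (12,)
print(W_f.shape)         # (4, 12)
print(f_t)
print(peephole_term)
```

Zeroing the first four columns of `W_f` recovers a standard forget gate, which is one way to see that the peephole form is a strict superset of it.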

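Parameter count per gate grows correspondingly. With the anchor's h_dim = x_dim = 4, a standard gate has (h_dim+x_dim)×h_dim + h_dim = (4+4)×4+4 = 36 parameters, while a peephole gate has (h_dim+x_dim+h_dim)×h_dim + h_dim = (4+4+4)×4+4 = 52. That is h_dim×h_dim = 16 extra weights per gate, because this concatenated form connects every cell dimension to every gate unit. A diagonal peephole, as commonly implemented, would instead add only one extra weight per gate per cell dimension.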

Peephole connections matter most when precise timing is important — for example, counting beats in music generation, where the gate needs to know the exact current cell-state magnitude (not just what the previous hidden state summarized about it) to decide precisely when to reset.


Coupled Forget and Input Gates

In the standard LSTM, fₜ and iₜ are computed independently — nothing prevents both from being close to 1 simultaneously (keep everything old and add everything new), which post 04's limitations flagged as a potential source of unchecked cell-state growth.

The coupled variant forces complementarity: iₜ = 1 − fₜ. If the forget gate decides to forget 30% (fₜ=0.3), the input gate automatically adds 70% of new information (iₜ=0.7) — there's no scenario where the model can both hoard everything old and pile on everything new.

This is simpler (one fewer weight matrix to learn) and works well on many tasks where the "erase and replace" framing is a reasonable match for what the data needs — but it removes a degree of freedom: the model can no longer express "forget almost everything old, but also don't add much new" (which would require both fₜ and iₜ near 0), a state that is sometimes genuinely useful, for instance right after emitting an end-of-sequence signal.
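The coupling constraint can be sketched in a few lines, assuming the usual LSTM update Cₜ = fₜ⊙Cₜ₋₁ + iₜ⊙C̃ₜ with iₜ replaced by 1 − fₜ. The gate activations `f_t` and candidate `c_cand` below are made-up illustrative values, not outputs of a trained cell.

```python
import numpy as np

c_prev = np.array([0.6, -0.4, 0.8, 0.2])   # anchor cell state from the post
f_t = np.array([0.3, 0.9, 0.5, 0.1])       # hypothetical forget-gate activations
c_cand = np.array([0.5, -0.2, 0.1, 0.9])   # hypothetical candidate values (tanh range)

# Coupled variant: the input gate is not learned, it is derived from f_t
i_t = 1.0 - f_t
c_coupled = f_t * c_prev + i_t * c_cand
print(c_coupled)  # approximately [0.53, -0.38, 0.45, 0.83]

# Each element is a weighted average, so it stays between c_prev and c_cand
lo = np.minimum(c_prev, c_cand)
hi = np.maximum(c_prev, c_cand)
print(np.all((c_coupled >= lo) & (c_coupled <= hi)))  # True

# Independent gates can both sit near 1, which lets the cell state grow
c_uncoupled = 0.95 * c_prev + 0.95 * c_cand
print(c_uncoupled)  # approximately [1.045, -0.57, 0.855, 1.045]
```

The bounded interpolation is what rules out unchecked growth, and it is also why "both gates near 0" is no longer reachable.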

Peephole and coupled gates pull in opposite directions on the expressiveness/cost tradeoff — peephole adds parameters for more precise gating, coupling removes parameters (and a degree of freedom) for simplicity.


GRU (Gated Recurrent Unit) — Preview

GRU goes further than coupling: it merges the cell state and hidden state into a single state vector hₜ, and reduces four gates to two. Counting the candidate transform as well, that is three weight matrices instead of the LSTM's four:

```text
rₜ = σ(Wr·[hₜ₋₁, xₜ])           # reset gate
zₜ = σ(Wz·[hₜ₋₁, xₜ])           # update gate
ñₜ = tanh(Wn·[rₜ⊙hₜ₋₁, xₜ])     # candidate hidden state
hₜ = (1−zₜ)⊙hₜ₋₁ + zₜ⊙ñₜ        # blended update
```

Computing one forward step on the anchor:

| Dim | rₜ | zₜ | ñₜ | hₜ |
|---|---|---|---|---|
| 1 | 0.4860 | 0.5412 | -0.0107 | 0.1319 |
| 2 | 0.5036 | 0.5204 | -0.0063 | -0.0513 |
| 3 | 0.5318 | 0.5240 | 0.0460 | 0.1193 |
| 4 | 0.4894 | 0.4919 | 0.0840 | 0.2954 |

[Figure: GRU (2 gates, 1 state) vs LSTM (4 gates, 2 states). The LSTM cell has fₜ, iₜ, C̃ₜ, oₜ and two separate state vectors, Cₜ (cell state) and hₜ (hidden state). The GRU cell has rₜ, zₜ and hₜ only (merged state), with fewer weight matrices.]

Fewer parameters than the standard LSTM, and empirically similar performance on many tasks. Full derivation and reasoning behind each GRU equation is the next post.
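The four GRU equations above translate almost line for line into NumPy. This is a sketch under assumptions: the weight matrices `W_r`, `W_z`, `W_n` are random placeholders (the post does not give the weights used for its table, so the numbers will differ), and bias terms are omitted to match the equations as written.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(h_prev, x_t, W_r, W_z, W_n):
    """One GRU step, following the post's equations (no bias terms)."""
    hx = np.concatenate([h_prev, x_t])
    r_t = sigmoid(W_r @ hx)                                   # reset gate
    z_t = sigmoid(W_z @ hx)                                   # update gate
    n_t = np.tanh(W_n @ np.concatenate([r_t * h_prev, x_t]))  # candidate hidden state
    h_t = (1.0 - z_t) * h_prev + z_t * n_t                    # blended update
    return h_t, (r_t, z_t, n_t)

# Anchor vectors from the post
h_prev = np.array([0.3, -0.1, 0.2, 0.5])
x_t = np.array([0.8, 0.1, -0.3, 0.6])

# Hypothetical weights: each matrix maps the 8-dim [h, x] input to 4 units
rng = np.random.default_rng(0)
W_r, W_z, W_n = (rng.normal(scale=0.1, size=(4, 8)) for _ in range(3))

h_t, (r_t, z_t, n_t) = gru_step(h_prev, x_t, W_r, W_z, W_n)
print(h_t)
```

Note that only one state vector is passed in and returned, in contrast to the (hₜ, Cₜ) pair an LSTM step carries.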


Bidirectional LSTM

Two separate LSTMs process the same sequence in opposite directions — one left-to-right, one right-to-left — and their hidden states are concatenated: hₜ = [h→ₜ; h←ₜ]. This gives every position access to both past and future context, at the cost of roughly double the parameters and the requirement that the full sequence be available upfront (no streaming). Full detail is covered later in this series' advanced architectures section.
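A rough sketch of the bidirectional wiring, assuming a plain NumPy LSTM step with randomly initialized weights and a toy input sequence (all hypothetical). The part that matters is the last three lines: the backward pass runs on the reversed sequence, is flipped back into the original time order, and is then concatenated with the forward pass.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x_t, W, b):
    """Standard LSTM step; W stacks the f, i, candidate, o weights row-wise."""
    n = h_prev.size
    z = W @ np.concatenate([h_prev, x_t]) + b
    f, i = sigmoid(z[:n]), sigmoid(z[n:2 * n])
    g, o = np.tanh(z[2 * n:3 * n]), sigmoid(z[3 * n:])
    c = f * c_prev + i * g
    return o * np.tanh(c), c

def run_lstm(xs, W, b, h_dim):
    """Run one LSTM over a (T, x_dim) sequence and return all hidden states (T, h_dim)."""
    h, c = np.zeros(h_dim), np.zeros(h_dim)
    hs = []
    for x_t in xs:
        h, c = lstm_step(h, c, x_t, W, b)
        hs.append(h)
    return np.stack(hs)

h_dim, x_dim, T = 4, 4, 5
rng = np.random.default_rng(0)
xs = rng.normal(size=(T, x_dim))  # toy input sequence

# Two independent parameter sets, one per direction
W_fwd = rng.normal(scale=0.1, size=(4 * h_dim, h_dim + x_dim))
W_bwd = rng.normal(scale=0.1, size=(4 * h_dim, h_dim + x_dim))
b_fwd, b_bwd = np.zeros(4 * h_dim), np.zeros(4 * h_dim)

hs_fwd = run_lstm(xs, W_fwd, b_fwd, h_dim)
hs_bwd = run_lstm(xs[::-1], W_bwd, b_bwd, h_dim)[::-1]  # flip back to original time order
hs_bi = np.concatenate([hs_fwd, hs_bwd], axis=1)

print(hs_bi.shape)  # (5, 8)
```

Because `hs_bwd[0]` has already consumed the entire sequence, nothing can be emitted for position 0 until the last input has arrived, which is the streaming restriction mentioned above.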


Stacked (Deep) LSTM

The output hidden-state sequence of one LSTM layer becomes the input sequence to a second LSTM layer, and so on. Each additional layer can learn more abstract temporal features built on the representations the layer below extracted — similar in spirit to stacking conv layers in a CNN. Typically 2–4 layers are used in practice; deeper stacks need more regularization (dropout between layers) to avoid overfitting, since the parameter count grows linearly with depth.
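Stacking can be sketched the same way: each layer's hidden-state sequence is fed to the next layer as its input sequence. The code below is illustrative only, with random weights, a toy sequence, and hypothetical sizes (`h_dim = 8`, `x_dim = 3`, three layers). One detail it makes visible is that layers above the first take h_dim-sized inputs rather than x_dim-sized ones.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x_t, W, b):
    """Standard LSTM step; W stacks the f, i, candidate, o weights row-wise."""
    n = h_prev.size
    z = W @ np.concatenate([h_prev, x_t]) + b
    f, i = sigmoid(z[:n]), sigmoid(z[n:2 * n])
    g, o = np.tanh(z[2 * n:3 * n]), sigmoid(z[3 * n:])
    c = f * c_prev + i * g
    return o * np.tanh(c), c

def run_lstm(xs, W, b, h_dim):
    """Run one LSTM layer over a (T, in_dim) sequence; returns (T, h_dim)."""
    h, c = np.zeros(h_dim), np.zeros(h_dim)
    hs = []
    for x_t in xs:
        h, c = lstm_step(h, c, x_t, W, b)
        hs.append(h)
    return np.stack(hs)

h_dim, x_dim, T, n_layers = 8, 3, 6, 3
rng = np.random.default_rng(0)
xs = rng.normal(size=(T, x_dim))  # toy input sequence

# Build one (W, b) pair per layer; only the first layer sees x_dim inputs
layers, in_dim = [], x_dim
for _ in range(n_layers):
    W = rng.normal(scale=0.1, size=(4 * h_dim, h_dim + in_dim))
    layers.append((W, np.zeros(4 * h_dim)))
    in_dim = h_dim

seq = xs
for W, b in layers:
    seq = run_lstm(seq, W, b, h_dim)  # output sequence becomes the next layer's input

print([W.shape for W, _ in layers])  # [(32, 11), (32, 16), (32, 16)]
print(seq.shape)                     # (6, 8)
```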


Hyperparameter Sensitivity: Stack Depth

The one hyperparameter this post's variants actually introduce is stack depth — how many LSTM layers to chain. Parameter count (and overfitting risk) grows linearly with it:

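```python
h, x = 64, 10
first_layer = 4 * (h + x) * h + 4 * h   # layer 1 reads the x-dim input
upper_layer = 4 * (h + h) * h + 4 * h   # layers 2+ read the h-dim output of the layer below

for n_layers in [1, 2, 4, 6]:
    total = first_layer + upper_layer * (n_layers - 1)
    print(f"{n_layers} layer(s): {total:,} params")
```
```text
1 layer(s): 19,200 params
2 layer(s): 52,224 params
4 layer(s): 118,272 params
6 layer(s): 184,320 params
```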

At 1–2 layers the added depth usually pays for itself in richer temporal features. Past 4 layers, on a modest dataset, the parameter count grows faster than the data can constrain it — training loss keeps dropping while validation loss stalls or rises, which is why deeper stacks need dropout between layers rather than just "add more layers" as the default fix.
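For reference, dropout between layers means randomly zeroing units in one layer's hidden-state sequence before the next layer reads it. Below is a minimal sketch of inverted dropout. The function name `dropout`, the rate `p=0.3`, and the random stand-in for layer 1's output are all illustrative assumptions rather than anything prescribed by this post.

```python
import numpy as np

def dropout(hs, p, rng, training=True):
    """Inverted dropout: zero each unit with probability p, rescale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return hs
    mask = rng.random(hs.shape) >= p
    return hs * mask / (1.0 - p)

rng = np.random.default_rng(0)
hs_layer1 = rng.normal(size=(6, 8))  # stand-in for layer 1's hidden states (T=6, h_dim=8)

hs_dropped = dropout(hs_layer1, p=0.3, rng=rng)           # what layer 2 would see in training
hs_eval = dropout(hs_layer1, p=0.3, rng=rng, training=False)  # unchanged at evaluation time

print(np.mean(hs_dropped == 0))  # fraction of zeroed units, near 0.3 on average
```

The rescaling by 1/(1−p) keeps the expected input to the next layer the same in training and evaluation, so no adjustment is needed at inference time.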


Comparison Table

| Variant | Gates | Parameters | Key addition | Use case |
|---|---|---|---|---|
| Standard LSTM | 4 | 4×(h+x)×h | | General purpose |
| Peephole | 4 | 4×(h+x+h)×h | Cell state in gates | Precise timing |
| Coupled f/i | 3 (effectively) | Fewer | iₜ = 1−fₜ | Simpler training |
| GRU | 2 | 3×(h+x)×h | No cell state | Faster, similar perf |
| Bidirectional | 8 (2×LSTM) | | Both directions | NLP sequence labeling |
| Stacked | 4 per layer | n×LSTM | Depth | Complex temporal patterns |

Code

```python
# Parameter count: standard LSTM vs GRU, h=64, x=10
h, x = 64, 10
lstm_params = 4 * (h + x) * h + 4 * h   # 4 weight matrices + 4 bias vectors
gru_params  = 3 * (h + x) * h + 3 * h   # 3 weight matrices + 3 bias vectors

print(f"LSTM params: {lstm_params}")
print(f"GRU params:  {gru_params}")
print(f"Reduction:   {(1 - gru_params/lstm_params)*100:.1f}%")
```
```text
LSTM params: 19200
GRU params:  14400
Reduction:   25.0%
```

Where this builds from: Every variant here modifies the standard LSTM cell described in post 02 — peephole extends the gate inputs, coupling constrains the forget/input relationship, GRU restructures the whole cell, and bidirectional/stacked are compositional changes on top of any of these base cells.

Where this leads: GRU gets a full derivation in the next post — its 2-gate design is the most widely adopted simplification in practice. Bidirectional LSTM gets a dedicated deep-dive in the advanced architectures section of this series.


Honest Limitations

Peephole connections add roughly 44% more parameters per gate (36 → 52 at the anchor's dimensions, h_dim = x_dim = 4) for a benefit that is rarely decisive outside tasks with genuinely precise timing requirements. For typical sequence classification or forecasting tasks, the added parameters increase training cost and overfitting risk, often without a measurable accuracy gain.

Coupling the forget and input gates (iₜ = 1−fₜ) removes the model's ability to express "forget a lot and add little" simultaneously — a state that's structurally impossible under the coupled constraint but genuinely useful in some tasks, such as a brief pause or reset point in a sequence where neither old nor new information should dominate. On tasks that need this, coupled gates underperform the standard independent-gate LSTM.


Test Your Understanding

  1. Why does adding Cₜ₋₁ to the gate's input (peephole) make sense conceptually — what can a gate learn with direct access to the cell state that it cannot learn from hₜ₋₁ alone, given that hₜ₋₁ was itself partly derived from Cₜ₋₁ via the output gate?

  2. Using the parameter formulas in this post, compute the parameter count for a peephole LSTM with h_dim=64, x_dim=10 (all 4 gates). How much larger is it than the standard LSTM's 19,200 parameters at the same dimensions?

  3. In the coupled-gate variant, if a model needs to represent "keep 90% of old memory and add 90% of new information" (both high), what is the closest a coupled-gate cell can get in a single step (iₜ = 1−fₜ)? Would this cause the model to underfit, or would it find a multi-step workaround?

  4. A model trained with a standard LSTM significantly outperforms a GRU version on a task involving precise event counting over long sequences, despite GRU's fewer parameters. Given GRU's merged cell/hidden state, propose a specific reason precise counting might be harder for GRU to learn.

  5. A stacked LSTM with 6 layers trains to near-zero training loss but performs poorly on a validation set. Given the honest limitations of deep stacks noted in this post, what would you check or change first, and why would dropout between layers specifically be relevant here (rather than, say, adding peephole connections)?
