LSTM Architecture

Jul 3, 2026 · 8 min read · By Mohammed Vasim
Tags: deep-learning, neural-networks, machine-learning, representation-learning

The previous post established why a vanilla RNN's single hidden state breaks down over long sequences — gradients shrink multiplicatively at every timestep. LSTM's fix is structural: split memory into two separate vectors with two different jobs, and control how information moves between them with learned gates instead of a single blended update. Understanding what each state vector is for makes the four gates that follow far easier to place.

Anchor: same stock sequence [100, 102, 105, 103, 108]. This post processes timestep t=1 (x₁ = 100, normalized to 100/110 = 0.9091) starting from all-zero initial states — h₀ = 0, C₀ = 0.


The Two State Vectors

hₜ (hidden state) is the LSTM's short-term output — what it "says" at this timestep. It's what gets fed to the next layer or prediction head, and it's also fed back into the cell at the next timestep.

Cₜ (cell state) is the long-term memory — what the LSTM "knows." It's the conveyor belt from the previous post: information can sit in Cₜ largely unchanged across many timesteps, only lightly modified by the gates.

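The distinction matters because they update differently: hₜ is recomputed fresh at every step from Cₜ and the output gate, while Cₜ carries forward additively from Cₜ₋₁. Along that additive path, the gradient flowing from Cₜ back to Cₜ₋₁ is scaled by the forget gate fₜ alone, rather than by a weight matrix and an activation derivative at every step, which is what lets it survive across many timesteps.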
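A rough numerical sketch of that difference, under simplifying assumptions: along the cell state the per-step gradient factor is taken to be the forget gate alone (ignoring the gates' own dependence on hₜ₋₁), while a scalar vanilla RNN multiplies by w·tanh′(z) at every step. The values `forget_gate = 0.95`, `w = 0.9`, `z = 1.0`, and `T = 50` are illustrative choices, not numbers from this series.

```python
import numpy as np

T = 50  # number of timesteps to backpropagate through (assumed)

# Vanilla RNN (scalar): each step multiplies the gradient by w * tanh'(z).
w, z = 0.9, 1.0                          # hypothetical weight and pre-activation
rnn_factor = w * (1 - np.tanh(z) ** 2)   # tanh'(z) = 1 - tanh(z)^2
rnn_grad = rnn_factor ** T

# LSTM cell-state path: each step multiplies the gradient by the forget gate only.
forget_gate = 0.95                       # hypothetical, gate mostly open
lstm_grad = forget_gate ** T

print(f"RNN  per-step factor={rnn_factor:.4f}  after {T} steps={rnn_grad:.3e}")
print(f"LSTM per-step factor={forget_gate:.4f}  after {T} steps={lstm_grad:.3e}")
```

The cell-state gradient still decays whenever the forget gate is below 1. The difference is that the decay rate is a learned, per-dimension quantity rather than being tied to the recurrent weight and the activation slope.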


The Four Components (Overview)

  1. Forget gate fₜ — decides what to erase from the existing cell state
  2. Input gate iₜ — decides what new information is worth adding
  3. Candidate memory C̃ₜ — proposes what that new information actually is
  4. Output gate oₜ — decides what part of the cell state to expose as this step's output

Each gets a dedicated post (03, 04, 05). Here's the full picture first.


Full LSTM Equations

```text
fₜ = σ(Wf·[hₜ₋₁, xₜ] + bf)         # forget gate
iₜ = σ(Wi·[hₜ₋₁, xₜ] + bi)         # input gate
C̃ₜ = tanh(WC·[hₜ₋₁, xₜ] + bC)    # candidate memory
Cₜ = fₜ ⊙ Cₜ₋₁ + iₜ ⊙ C̃ₜ          # cell state update
oₜ = σ(Wo·[hₜ₋₁, xₜ] + bo)         # output gate
hₜ = oₜ ⊙ tanh(Cₜ)                  # hidden state
```

[hₜ₋₁, xₜ] is concatenation — the previous hidden state and the current input are stacked into one vector before each gate's weight matrix is applied, so every gate sees both "what I remembered" and "what just arrived" together.

⊙ is element-wise (Hadamard) multiplication: each entry of one vector multiplies the corresponding entry of the other, not a matrix product. This is what makes the gates act like per-dimension valves: fₜ ⊙ Cₜ₋₁ scales each dimension of the cell state independently, rather than mixing dimensions together.
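The two operations can be sketched in NumPy as follows. All vectors here (`h_prev`, `f_t`, `C_prev`) and the mixing matrix `W` are hypothetical values picked to make the contrast visible; only the input 0.9091 comes from the anchor.

```python
import numpy as np

h_prev = np.array([0.2, -0.1])        # hypothetical previous hidden state (h_dim=2)
x_t = np.array([0.9091])              # current input (x_dim=1)

# Concatenation: one vector of length h_dim + x_dim that every gate reads.
concat = np.concatenate([h_prev, x_t])

f_t = np.array([1.0, 0.0])            # hypothetical gate: keep dim 0, erase dim 1
C_prev = np.array([0.5, -0.8])        # hypothetical previous cell state

# Element-wise product: each dimension is scaled on its own.
gated = f_t * C_prev

# Matrix product, for contrast: dimensions get mixed together.
W = np.array([[0.5, 0.5],
              [0.5, 0.5]])            # hypothetical mixing matrix
mixed = W @ C_prev

print("concat:", concat)
print("gated :", gated)
print("mixed :", mixed)
```

With the element-wise product, dimension 0 passes through untouched and dimension 1 is erased. With the matrix product, both outputs become the same blend of the two inputs.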

Substituting the anchor (h₀=0, C₀=0, x₁=0.9091) with small random weights:

| Phase | Formula | Substituted values | Result |
|---|---|---|---|
| Forget gate | fₜ = σ(Wf·[hₜ₋₁, xₜ] + bf) | σ(0.0497·0 + (-0.0138)·0.9091) = σ(-0.0125) | 0.4969 |
| Input gate | iₜ = σ(Wi·[hₜ₋₁, xₜ] + bi) | σ(0.0648·0 + 0.1523·0.9091) = σ(0.1385) | 0.5346 |
| Candidate memory | C̃ₜ = tanh(WC·[hₜ₋₁, xₜ] + bC) | tanh(-0.0234·0 + (-0.0234)·0.9091) = tanh(-0.0213) | -0.0213 |
| Cell state update | Cₜ = fₜ ⊙ Cₜ₋₁ + iₜ ⊙ C̃ₜ | 0.4969 × 0 + 0.5346 × (-0.0213) | -0.0114 |
| Output gate | oₜ = σ(Wo·[hₜ₋₁, xₜ] + bo) | σ(0.1579·0 + 0.0767·0.9091) = σ(0.0697) | 0.5174 |
| Hidden state | hₜ = oₜ ⊙ tanh(Cₜ) | 0.5174 × tanh(-0.0114) = 0.5174 × (-0.0114) | -0.0059 |

At t=1 with zero initial state, the forget gate has nothing to forget yet (Cₜ₋₁=0) — the cell state is built entirely from the input gate scaling the candidate memory.

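[Figure: LSTM Cell, Full Architecture. The cell state runs from Cₜ₋₁ = 0 to Cₜ = -0.0114, first multiplied by the forget gate (fₜ = 0.4969), then incremented by iₜ ⊙ C̃ₜ. The gates fₜ, iₜ, oₜ (σ) and the candidate C̃ₜ (tanh) all read [hₜ₋₁, xₜ] = [0, 0.9091]. The output path applies tanh to Cₜ and multiplies by oₜ to give hₜ = -0.0059.]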

Why Sigmoid for Gates, Tanh for Memory?

Gates (fₜ, iₜ, oₜ) use sigmoid because its output range is (0, 1) — that maps directly onto "valve" semantics: 0 means block completely, 1 means pass through entirely, and anything between is a partial pass. That's the right shape for a gate.

Memory content (C̃ₜ, and Cₜ via the tanh in hₜ) uses tanh because its range is (-1, 1) — memory needs to represent both positive and negative signal, centered at zero, not just an on/off intensity. This is a design choice made by the original LSTM authors, not a theorem forcing sigmoid and tanh specifically — but it's stuck because the semantics fit so cleanly.
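A quick way to see the two ranges side by side is to evaluate both functions on the same inputs. The sample points below are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-6, 6, 7)             # sample pre-activations: -6, -4, ..., 6
gate_values = sigmoid(z)              # stays inside (0, 1): valve opening
memory_values = np.tanh(z)            # stays inside (-1, 1): signed content

for zi, g, m in zip(z, gate_values, memory_values):
    print(f"z={zi:+.1f}  sigmoid={g:.4f}  tanh={m:+.4f}")
```

At z = 0 the sigmoid sits at 0.5 (a half-open valve) while tanh sits at 0 (no signal), which matches the near-0.5 gate values and near-zero candidate seen in the worked example with small weights.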


Parameter Count

Each gate has its own weight matrix over the concatenated [hₜ₋₁, xₜ] input, plus a bias. For hidden dimension h_dim=4 and input dimension x_dim=1:

Each gate: (h_dim + x_dim) × h_dim weights + h_dim biases = (4+1) × 4 + 4 = 20 + 4 = 24 parameters

4 components (f, i, C̃, o) → 4 × 24 = 96 parameters total

Compare to a vanilla RNN with the same dimensions: Wₕ (4×4) + Wₓ (1×4) + b (4) = 16 + 4 + 4 = 24 parameters — the LSTM has 4× as many parameters for the same hidden size, because it maintains 4 separate weight matrices instead of 1.


Code

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Minimal 1-dim LSTM step (for illustration): h, C, and x are scalars,
# so each weight vector has 2 entries, one for h_prev and one for x.
def lstm_step(x, h_prev, C_prev, Wf, Wi, WC, Wo, bf, bi, bC, bo):
    inp = np.concatenate([[h_prev], [x]])   # [h_prev, x]
    f = sigmoid(Wf @ inp + bf)              # forget gate
    i = sigmoid(Wi @ inp + bi)              # input gate
    C_tilde = np.tanh(WC @ inp + bC)        # candidate memory
    C = f * C_prev + i * C_tilde            # cell state update
    o = sigmoid(Wo @ inp + bo)              # output gate
    h = o * np.tanh(C)                      # hidden state
    return h, C, {'f': f, 'i': i, 'C_tilde': C_tilde, 'o': o}

np.random.seed(42)
h_prev, C_prev = 0.0, 0.0
x = 100.0 / 110  # normalized
# Random small weights (2 inputs: h + x), drawn in the order Wf, Wi, WC, Wo
params = {k: np.random.randn(2) * 0.1 for k in ['Wf', 'Wi', 'WC', 'Wo']}
biases = {k: 0.0 for k in ['bf', 'bi', 'bC', 'bo']}
h, C, gates = lstm_step(x, h_prev, C_prev, **params, **biases)
print("Gates:", {k: round(float(v), 4) for k, v in gates.items()})
print(f"h={h:.4f}, C={C:.4f}")
```

```text
Gates: {'f': 0.4969, 'i': 0.5346, 'C_tilde': -0.0213, 'o': 0.5174}
h=-0.0059, C=-0.0114
```
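The scalar step above generalizes to vector states by giving each component a weight matrix of shape (h_dim, h_dim + x_dim). The following is a minimal sketch of that, assuming h_dim=4 and x_dim=1 as in the parameter count section. The function name `lstm_step_vec`, the dict layout for `W` and `b`, and the seed are illustrative choices, so the printed values will not match the scalar example.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step_vec(x, h_prev, C_prev, W, b):
    """One LSTM step for vector states; W and b are dicts keyed by 'f', 'i', 'C', 'o'."""
    inp = np.concatenate([h_prev, x])          # shape (h_dim + x_dim,)
    f = sigmoid(W['f'] @ inp + b['f'])         # forget gate
    i = sigmoid(W['i'] @ inp + b['i'])         # input gate
    C_tilde = np.tanh(W['C'] @ inp + b['C'])   # candidate memory
    C = f * C_prev + i * C_tilde               # element-wise cell state update
    o = sigmoid(W['o'] @ inp + b['o'])         # output gate
    h = o * np.tanh(C)                         # hidden state
    return h, C

h_dim, x_dim = 4, 1
rng = np.random.default_rng(0)                 # seed chosen arbitrarily
W = {k: rng.normal(scale=0.1, size=(h_dim, h_dim + x_dim)) for k in 'fiCo'}
b = {k: np.zeros(h_dim) for k in 'fiCo'}

x1 = np.array([100.0 / 110])                   # anchor input at t=1
h, C = lstm_step_vec(x1, np.zeros(h_dim), np.zeros(h_dim), W, b)

n_params = sum(W[k].size + b[k].size for k in 'fiCo')
print("h shape:", h.shape, " C shape:", C.shape, " parameters:", n_params)
```

Counting the array sizes directly gives 4 × (20 + 4) = 96, the same total as the hand calculation in the parameter count section.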

Hyperparameter Sensitivity: Hidden Dimension Size

The main architectural choice when sizing an LSTM cell is h_dim, the size of hₜ and Cₜ. It controls how much the cell can remember and how many parameters it costs.

```python
def lstm_param_count(h_dim, x_dim=1):
    per_gate = (h_dim + x_dim) * h_dim + h_dim
    return 4 * per_gate

def rnn_param_count(h_dim, x_dim=1):
    return h_dim * h_dim + x_dim * h_dim + h_dim

for h_dim in [2, 4, 8, 16, 64]:
    lstm_p = lstm_param_count(h_dim)
    rnn_p = rnn_param_count(h_dim)
    print(f"h_dim={h_dim:3d}  LSTM={lstm_p:6d}  RNN={rnn_p:5d}  ratio={lstm_p/rnn_p:.2f}x")
```

```text
h_dim=  2  LSTM=    32  RNN=    8  ratio=4.00x
h_dim=  4  LSTM=    96  RNN=   24  ratio=4.00x
h_dim=  8  LSTM=   320  RNN=   80  ratio=4.00x
h_dim= 16  LSTM=  1152  RNN=  288  ratio=4.00x
h_dim= 64  LSTM= 16896  RNN= 4224  ratio=4.00x
```

The ratio holds at exactly 4× for every h_dim: each gate's parameter formula (h_dim+x_dim)×h_dim+h_dim is algebraically identical to the vanilla RNN's, so stacking 4 of them always costs exactly 4× regardless of hidden size. What changes with h_dim is absolute cost, not the ratio.

At h_dim=2, the cell state barely has room to hold both "recent trend" and "longer-term level" for a sequence like the anchor stock prices. The gates end up competing for the same one or two dimensions, and the model is likely to underfit. Past roughly h_dim≈16 for a single-feature sequence like this one, a 5-point sequence leaves nothing for the extra capacity to model: parameter count grows quadratically in h_dim while the data doesn't grow at all, so the LSTM tends to overfit or train more slowly for no accuracy gain.


Where this builds from: The vanishing gradient problem in vanilla RNNs (post 01) motivates every design choice here — the cell state's additive update exists specifically to avoid the multiplicative gradient chain. Sigmoid and tanh, and element-wise operations, come from the activation functions section earlier in this series.

Where this leads: The next three posts dissect each gate individually — forget gate (03), input gate and candidate memory (04), output gate (05) — with the reasoning behind each gate's role explained in depth rather than just shown as a formula.


Honest Limitations

An LSTM has roughly 4× the parameters of a vanilla RNN with the same hidden size, which means slower training per step and a greater data requirement to avoid overfitting — for short sequences where vanilla RNNs don't suffer vanishing gradients, the extra LSTM parameters buy nothing and just add cost.

LSTM computation is inherently sequential — hₜ depends on hₜ₋₁, which depends on hₜ₋₂, and so on — so timesteps cannot be computed in parallel the way Transformer attention can. For long sequences on modern hardware, this sequential dependency is often the actual bottleneck, not gradient quality, which is part of why Transformers replaced LSTMs for many large-scale sequence tasks.
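The sequential dependency is easy to see when the scalar cell is unrolled over the whole anchor sequence. This is a sketch under a few assumptions: it reuses seed 42 and the same weight draw order as the code section, drops the biases (they were all zero there), and normalizes every price by 110 the way x₁ was. The name `history` and the compact `lstm_step` signature are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x, h_prev, C_prev, p):
    """Scalar LSTM step with zero biases; p maps 'Wf', 'Wi', 'WC', 'Wo' to 2-vectors."""
    inp = np.array([h_prev, x])
    f = sigmoid(p['Wf'] @ inp)
    i = sigmoid(p['Wi'] @ inp)
    C_tilde = np.tanh(p['WC'] @ inp)
    C = f * C_prev + i * C_tilde
    o = sigmoid(p['Wo'] @ inp)
    return o * np.tanh(C), C

np.random.seed(42)
p = {k: np.random.randn(2) * 0.1 for k in ['Wf', 'Wi', 'WC', 'Wo']}

prices = [100, 102, 105, 103, 108]     # anchor sequence
h, C = 0.0, 0.0                        # h0 = 0, C0 = 0
history = []
for t, price in enumerate(prices, start=1):
    # Step t cannot start until step t-1 has produced h and C.
    h, C = lstm_step(price / 110, h, C, p)
    history.append((float(h), float(C)))
    print(f"t={t}  h={h:.4f}  C={C:.4f}")
```

Each loop iteration consumes the h and C produced by the previous one, so the five steps form a chain rather than five independent computations. That chain is what prevents parallelizing an LSTM across time.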


Test Your Understanding

  1. Why does the cell state Cₜ use element-wise multiplication and addition (fₜ⊙Cₜ₋₁ + iₜ⊙C̃ₜ) rather than a full matrix multiplication like the hidden state update in a vanilla RNN?

  2. At t=1 in the anchor, C₀=0 makes the forget gate's contribution (fₜ⊙Cₜ₋₁) exactly zero regardless of fₜ's value. Compute what Cₜ would be at t=2 if C₁=-0.0114 (from this post) and the forget gate at t=2 evaluates to fₜ=0.7, with iₜ⊙C̃ₜ contributing an additional 0.05.

  3. The parameter count comparison shows LSTM has 4× the parameters of a vanilla RNN at h_dim=4. If h_dim were increased to 128 (input dim still 1), what would the parameter count be for each architecture, and does the 4× ratio change?

  4. Two of the four gates (forget and input) both use sigmoid activation and both look at the same concatenated input [hₜ₋₁, xₜ]. Given that, why do they typically learn different gate values (as seen here: fₜ=0.4969 vs iₜ=0.5346) instead of converging to the same function?

  5. A colleague suggests removing the output gate entirely and just using hₜ = tanh(Cₜ) directly, arguing it would save 24 parameters with minimal impact. What capability would the LSTM lose by doing this, given that the output gate's whole purpose is to control what part of long-term memory gets exposed at each step?
