Bidirectional RNN — Architecture & Intuition

Jul 3, 2026 · 8 min read · By Mohammed Vasim
deep-learning, neural-networks, machine-learning, representation-learning

Every LSTM and GRU covered so far processes a sequence in one direction — left to right, using only what's already been seen. That's the right constraint for generating text one token at a time, where the future genuinely doesn't exist yet. But for tasks where the entire sequence is available upfront, that constraint throws away useful information: sometimes the word that disambiguates a token comes after it, not before.

Anchor: Named Entity Recognition (NER) on "John works at Apple in California."

  • Sequence: [John, works, at, Apple, in, California] (6 tokens)
  • Tags: [PER, O, O, ORG, O, LOC]

To tag "Apple" correctly, both directions of context matter — "works at" (what precedes it) and "in California" (what follows it).


Why Bidirectional?

A standard left-to-right RNN processing "Apple" at position 4 has only seen "John works at" — it has no idea whether the sentence continues "...Apple pie" (fruit) or "...Apple in California" (the company, headquartered there). "Apple" alone, with only left context, is genuinely ambiguous. The word "California" two tokens later resolves it — but a unidirectional RNN never gets to use that information when making its decision at position 4.

Bidirectionality means giving every position access to the full sequence — both what came before and what comes after — before making a decision about that position.
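A small check makes this asymmetry concrete. The sketch below assumes a vanilla tanh RNN with random, untrained weights and random stand-in embeddings (the names `run`, `seq_a`, and `seq_b` are illustrative, not from this post). It changes only the last two tokens of a six-token sequence and compares the states at position 4.

```python
import numpy as np

rng = np.random.default_rng(0)
x_dim, h_dim, T = 3, 2, 6

# Random, untrained weights: enough to show information flow, not tagging quality.
W_xh = rng.normal(size=(h_dim, x_dim)) * 0.1
W_hh = rng.normal(size=(h_dim, h_dim)) * 0.1
b = np.zeros(h_dim)

def run(xs):
    """Run a tanh RNN over xs in the order given; return one state per input."""
    h, states = np.zeros(h_dim), []
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h + b)
        states.append(h)
    return np.stack(states)

seq_a = rng.normal(size=(T, x_dim))      # stand-in for one six-token sentence
seq_b = seq_a.copy()
seq_b[4:] = rng.normal(size=(2, x_dim))  # same first four tokens, different ending

pos = 3  # zero-based index of the fourth token ("Apple")
fwd_a, fwd_b = run(seq_a)[pos], run(seq_b)[pos]
# Backward states: run over the reversed sequence, then flip back to token order.
bwd_a, bwd_b = run(seq_a[::-1])[::-1][pos], run(seq_b[::-1])[::-1][pos]

print("forward state unchanged:", np.allclose(fwd_a, fwd_b))
print("backward state unchanged:", np.allclose(bwd_a, bwd_b))
```

The forward state at position 4 is identical for both endings, because it has only read the first four tokens. The backward state differs, and that difference is exactly the information a left-to-right tagger never gets to use.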


Architecture

Two RNNs (either can be LSTM or GRU cells) run over the same sequence in opposite directions:

  • Forward: h→ₜ = RNN(xₜ, h→ₜ₋₁) — processes x₁, x₂, ..., x_T in order
  • Backward: h←ₜ = RNN(xₜ, h←ₜ₊₁) — processes x_T, ..., x₂, x₁ in reverse

At each position, the two hidden states are concatenated: hₜ = [h→ₜ; h←ₜ] — doubling the hidden dimension, but giving every position a representation informed by the entire sequence.
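The two recurrences and the concatenation can be sketched in NumPy as follows. This is a minimal illustration, assuming vanilla tanh cells, random untrained weights, and a hypothetical `init_params` helper. The point is that each direction owns its own parameters and the output width is 2 × h_dim.

```python
import numpy as np

def init_params(rng, x_dim, h_dim):
    """Hypothetical helper: one independent weight set for a vanilla tanh RNN."""
    return {
        "W_xh": rng.normal(size=(h_dim, x_dim)) * 0.1,
        "W_hh": rng.normal(size=(h_dim, h_dim)) * 0.1,
        "b": np.zeros(h_dim),
    }

def run_direction(xs, p):
    """Apply h_t = tanh(W_xh x_t + W_hh h_prev + b) over xs in the given order."""
    h = np.zeros(p["b"].shape[0])
    states = []
    for x in xs:
        h = np.tanh(p["W_xh"] @ x + p["W_hh"] @ h + p["b"])
        states.append(h)
    return np.stack(states)

def bidirectional(xs, p_fwd, p_bwd):
    """Return h_t = [h_fwd_t; h_bwd_t] for every position, shape (T, 2 * h_dim)."""
    h_fwd = run_direction(xs, p_fwd)              # reads x_1 ... x_T
    h_bwd = run_direction(xs[::-1], p_bwd)[::-1]  # reads x_T ... x_1, flipped back to token order
    return np.concatenate([h_fwd, h_bwd], axis=1)

rng = np.random.default_rng(0)
xs = rng.normal(size=(6, 3))      # 6 tokens with 3-dim stand-in embeddings
p_fwd = init_params(rng, 3, 2)    # forward direction
p_bwd = init_params(rng, 3, 2)    # backward direction, separate weights
H = bidirectional(xs, p_fwd, p_bwd)
print(H.shape)  # (6, 4)
```

Row `H[t]` is the representation a downstream layer would read for token t+1: the first h_dim entries come from the forward pass and the last h_dim from the backward pass.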

[Figure: Bidirectional RNN, forward and backward passes over the 6 tokens "John works at Apple in California". The forward pass runs left to right and the backward pass right to left; h₄ = [h→₄; h←₄] sees "at" (left) and "in California" (right).]

Numerical Example (Simplified)

Using 1-dim states for illustration: after processing left context, h→₄ = 0.6 (forward state at "Apple"); after processing right context in reverse, h←₄ = 0.8 (backward state at "Apple"). Combined: h₄ = [0.6, 0.8] — a 2-dimensional representation feeding into a downstream classifier that decides the tag for "Apple."


Forward Pass for "Apple" (Conceptual)

The forward LSTM has processed "John works at" by the time it reaches "Apple" — h→₄ encodes something like "this token immediately follows the preposition 'at'," which is consistent with either a place or a company, but rules out most other categories.

The backward LSTM, running right-to-left, has processed "California in" (in reverse) by the time it reaches "Apple" — h←₄ encodes "this token immediately precedes 'in California'," a pattern strongly associated with organizations headquartered somewhere, or geographic references.

A classifier reading both h→₄ and h←₄ together has enough combined evidence to correctly decide tag = ORG — neither direction alone would have been as confident.
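The classifier step can be sketched as a linear layer followed by a softmax over the tag set. This assumes the four tags from the anchor example, borrows the "Apple" vector from the h_dim=2 trace shown later in this post, and uses random untrained weights (`W_out` and `b_out` are illustrative names). The resulting probabilities therefore show the mechanics only and say nothing about the correct tag.

```python
import numpy as np

tags = ["PER", "O", "ORG", "LOC"]

# Combined state for "Apple": [h_fwd_4; h_bwd_4] from the h_dim=2 trace.
h_apple = np.array([-0.0350, 0.0876, -0.0489, 0.0603])

rng = np.random.default_rng(0)
W_out = rng.normal(size=(len(tags), h_apple.size)) * 0.1  # untrained, illustrative
b_out = np.zeros(len(tags))

def softmax(z):
    """Turn a vector of scores into probabilities that sum to 1."""
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# One score per tag, computed from both directions' states at once.
probs = softmax(W_out @ h_apple + b_out)
for tag, p in zip(tags, probs):
    print(f"{tag}: {p:.3f}")
```

In a trained tagger, `W_out` would be learned jointly with both RNNs, so that the columns reading the backward half of the vector can pick up on cues such as "in California".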


Numeric Trace

Running the forward and backward RNN cells over the full 6-token sequence (seed fixed, h_dim=2) gives an exact hidden state at every position — no eyeballing:

| Token | Forward formula | h→ₜ (forward) | h←ₜ (backward) | Combined hₜ = [h→ₜ; h←ₜ] |
|---|---|---|---|---|
| John | h→₁ = tanh(W_xh·x₁ + W_hh·h→₀ + b) | [0.0692, -0.1040] | [0.0762, -0.0889] | [0.0692, -0.1040, 0.0762, -0.0889] |
| works | h→₂ = tanh(W_xh·x₂ + W_hh·h→₁ + b) | [-0.1435, -0.0145] | [-0.1217, 0.0344] | [-0.1435, -0.0145, -0.1217, 0.0344] |
| at | h→₃ = tanh(W_xh·x₃ + W_hh·h→₂ + b) | [-0.3031, 0.0523] | [-0.3070, 0.0443] | [-0.3031, 0.0523, -0.3070, 0.0443] |
| Apple | h→₄ = tanh(W_xh·x₄ + W_hh·h→₃ + b) | [-0.0350, 0.0876] | [-0.0489, 0.0603] | [-0.0350, 0.0876, -0.0489, 0.0603] |
| in | h→₅ = tanh(W_xh·x₅ + W_hh·h→₄ + b) | [-0.0017, 0.2305] | [-0.0178, 0.1962] | [-0.0017, 0.2305, -0.0178, 0.1962] |
| California | h→₆ = tanh(W_xh·x₆ + W_hh·h→₅ + b) | [0.2381, -0.0301] | [0.2356, -0.0389] | [0.2381, -0.0301, 0.2356, -0.0389] |

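Each row comes from running the forward recurrence left to right and the backward recurrence right to left, then concatenating the two states. The backward states follow the mirrored formula h←ₜ = tanh(W_xh·xₜ + W_hh·h←ₜ₊₁ + b). To keep the demo small, both directions here reuse the same randomly initialized weights, which is why the forward and backward columns look so similar; a real bidirectional RNN trains a separate weight set for each direction. Even so, the "Apple" row's combined vector carries state accumulated from both sides of the token.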


Parameter Count

A standard unidirectional LSTM at h_dim=64 has 4×((64+x_dim)×64+64) parameters (post 02's formula, with one bias vector per gate).

A bidirectional LSTM runs two independent LSTMs of this size, one forward and one backward, which doubles the parameter count relative to the unidirectional total.

Output dimension also changes: since hₜ = [h→ₜ; h←ₜ], a 64-dim forward state and 64-dim backward state concatenate into a 128-dim representation at each position — any downstream layer (classifier, another RNN layer) must be sized to accept this doubled dimension.
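The arithmetic can be checked with a few lines of Python. This is a sketch that counts one weight matrix and one bias vector for each of the four LSTM gates; `lstm_params` is a hypothetical helper, and x_dim=100 is an assumed embedding size chosen only for illustration.

```python
def lstm_params(x_dim, h_dim):
    """Parameters of one LSTM layer: 4 gates, each with an
    (h_dim, h_dim + x_dim) weight matrix and an h_dim bias vector."""
    return 4 * ((h_dim + x_dim) * h_dim + h_dim)

x_dim, h_dim = 100, 64  # x_dim=100 is an assumption, not a value from this post
uni = lstm_params(x_dim, h_dim)
bi = 2 * uni            # two independent LSTMs, one per direction
out_dim = 2 * h_dim     # forward and backward states concatenated

print(uni, bi, out_dim)  # 42240 84480 128
```

Some frameworks store two bias vectors per gate (PyTorch's `nn.LSTM` does), so the counts they report can be slightly higher than this formula gives.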


Hyperparameter Sensitivity: Hidden Dimension

h_dim controls how much each direction can encode before concatenation — too small and neither direction has room to represent its context; too large and the 2× bidirectional cost compounds.

```python
import numpy as np

np.random.seed(42)
x_dim = 3
words = np.random.randn(6, x_dim)  # 6 tokens, random stand-in embeddings


def rnn_step(x, h, W_xh, W_hh, b):
    """One vanilla (tanh) RNN step."""
    return np.tanh(W_xh @ x + W_hh @ h + b)


def run_bidirectional(h_dim):
    """Count parameters and output width for a vanilla RNN at a given h_dim."""
    W_xh = np.random.randn(h_dim, x_dim) * 0.1
    W_hh = np.random.randn(h_dim, h_dim) * 0.1
    b = np.zeros(h_dim)

    # Forward pass only, as a sanity check that the shapes line up;
    # the counts below depend on the weight shapes, not on the states.
    h_fwd = [np.zeros(h_dim)]
    for x in words:
        h_fwd.append(rnn_step(x, h_fwd[-1], W_xh, W_hh, b))

    uni_params = W_xh.size + W_hh.size + b.size
    bi_params = 2 * uni_params  # a second, independent weight set for the backward direction
    combined_dim = 2 * h_dim    # forward and backward states concatenated
    return uni_params, bi_params, combined_dim


for h_dim in [1, 2, 8, 64]:
    uni, bi, combined = run_bidirectional(h_dim)
    print(f"h_dim={h_dim:>3}: unidirectional={uni:>5} params, bidirectional={bi:>5} params, combined output dim={combined}")
```

```text
h_dim=  1: unidirectional=    5 params, bidirectional=   10 params, combined output dim=2
h_dim=  2: unidirectional=   12 params, bidirectional=   24 params, combined output dim=4
h_dim=  8: unidirectional=   96 params, bidirectional=  192 params, combined output dim=16
h_dim= 64: unidirectional= 4352 params, bidirectional= 8704 params, combined output dim=128
```

At h_dim=1 (the toy example used above), each direction can carry only a single scalar of context: enough to illustrate the concept, but nowhere near enough to disambiguate real NER cases with more than one competing hypothesis. At h_dim=64 (a more realistic size), the combined representation reaches 128 dimensions, giving the downstream classifier enough room to separate entity types. Even for the vanilla RNN cell and 3-dim inputs counted here, the parameter count passes 8,700 for this single layer (an LSTM of the same size has about four times as many), and the memory needed to hold intermediate states for the backward pass over long documents scales with sequence length on top of that. Once h_dim is well above x_dim, doubling h_dim roughly quadruples the parameter count for each direction, because the W_hh term is quadratic in h_dim, so h_dim is usually tuned conservatively rather than scaled up freely.
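The quadratic growth is easy to verify numerically. The sketch below uses the same vanilla RNN count as the table above (W_xh, W_hh, and b); `rnn_params` is a hypothetical helper, and the h_dim values are arbitrary sample points.

```python
def rnn_params(x_dim, h_dim):
    """Vanilla RNN cell: W_xh (h_dim x x_dim) + W_hh (h_dim x h_dim) + b (h_dim)."""
    return h_dim * x_dim + h_dim * h_dim + h_dim

x_dim = 3  # matches the toy embedding size used above
for h_dim in [1, 8, 64, 512]:
    ratio = rnn_params(x_dim, 2 * h_dim) / rnn_params(x_dim, h_dim)
    print(f"h_dim {h_dim:>3} -> {2 * h_dim:>4}: params x{ratio:.2f}")
```

The ratios come out to 2.40, 3.33, 3.88, and 3.98. Doubling a tiny hidden state costs well under 4×, and the factor approaches 4 only once the W_hh term dominates the count.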


When to Use

Use it for: NLP sequence labeling (NER, POS tagging), text classification where the entire document is available upfront, and any offline (non-streaming) sequential task where accuracy from full context outweighs the 2× cost.

Avoid it for: language generation (needs strictly left-to-right, autoregressive processing) and real-time prediction where future context genuinely isn't available yet at decision time.


Code

```python
import numpy as np


def rnn_step(x, h, W_xh, W_hh, b):
    """One vanilla (tanh) RNN step."""
    return np.tanh(W_xh @ x + W_hh @ h + b)


np.random.seed(42)
h_dim, x_dim = 2, 3  # small for illustration
tokens = ['John', 'works', 'at', 'Apple', 'in', 'California']
words = np.random.randn(len(tokens), x_dim)  # 6 tokens, 3-dim random stand-in embeddings

# Simplification: one weight set is shared by both directions.
# A real bidirectional RNN learns separate weights for the backward pass.
W_xh = np.random.randn(h_dim, x_dim) * 0.1
W_hh = np.random.randn(h_dim, h_dim) * 0.1
b = np.zeros(h_dim)

# Forward pass: h_fwd[t + 1] is the state after reading token t
# (h_fwd[0] is the zero initial state)
h_fwd = [np.zeros(h_dim)]
for x in words:
    h_fwd.append(rnn_step(x, h_fwd[-1], W_xh, W_hh, b))

# Backward pass: read the tokens right to left, then flip the list so that
# h_bwd[t] is the backward state at token t (the zero initial state ends up last)
h_bwd = [np.zeros(h_dim)]
for x in reversed(words):
    h_bwd.append(rnn_step(x, h_bwd[-1], W_xh, W_hh, b))
h_bwd = list(reversed(h_bwd))

# Concatenate forward and backward states at each position
for t, token in enumerate(tokens):
    combined = np.concatenate([h_fwd[t + 1], h_bwd[t]])
    print(f"Token {t+1} ('{token}'): {np.round(combined, 4)}")
```

```text
Token 1 ('John'): [ 0.0692 -0.104   0.0762 -0.0889]
Token 2 ('works'): [-0.1435 -0.0145 -0.1217  0.0344]
Token 3 ('at'): [-0.3031  0.0523 -0.307   0.0443]
Token 4 ('Apple'): [-0.035   0.0876 -0.0489  0.0603]
Token 5 ('in'): [-0.0017  0.2305 -0.0178  0.1962]
Token 6 ('California'): [ 0.2381 -0.0301  0.2356 -0.0389]
```

The "Apple" row (token 4) has its own distinct 4-dimensional combined representation, distinguishable from every other token's — it's this per-token combination of forward and backward state that a classifier reads to make its tagging decision.


Where this builds from: Bidirectional RNN wraps any of the recurrent cells covered in this series' LSTM section — LSTM or GRU — running two independent copies rather than introducing new cell mechanics.

Where this leads: Encoder-decoder architectures (next post) commonly use a bidirectional encoder specifically because encoding an input sequence is an offline task where full context is available — the same reasoning developed here. Transformers, covered later, achieve full-context awareness without recurrence at all, using self-attention instead.


Honest Limitations

Bidirectional RNNs cannot be used for streaming or autoregressive generation tasks — anything that must produce output before the full input sequence exists is fundamentally incompatible with a backward pass that requires the end of the sequence to already be known.

The 2× parameter and memory cost is only worth paying when future context measurably helps the task — for tasks where left-to-right information is already sufficient (many simple classification tasks), a bidirectional model adds cost without a corresponding accuracy gain, and the extra parameters can increase overfitting risk on smaller datasets.


Test Your Understanding

  1. Why can't a single RNN simply be run "both ways" using the same weights, instead of training two independent RNNs (forward and backward) with separate parameters?

  2. Given the parameter formula 4×((h+x)×h+h) for a unidirectional LSTM, compute the total parameter count for a bidirectional LSTM at h_dim=128, x_dim=50. What output dimension would a downstream layer need to accept?

  3. In the NER example, "Apple" is disambiguated by "in California" appearing 2 tokens later. Construct a sentence where the disambiguating context is 5+ tokens after the ambiguous word, and explain whether a bidirectional LSTM (as opposed to a bidirectional vanilla RNN) would still reliably carry that signal back to the ambiguous position.

  4. A team wants to use a bidirectional LSTM for a live chatbot that must respond to user messages as they're typed, character by character. Explain specifically why this application is incompatible with bidirectionality, and what architecture they should use instead.

  5. A bidirectional model achieves excellent NER accuracy on a validation set but the team discovers inference latency is far higher than expected in production, where documents can be very long (10,000+ tokens). Given what this post says about the backward pass's requirements, explain why document length specifically (not just parameter count) drives this latency problem.
