~/blog

CTC Loss

Jul 3, 20267 min readBy Mohammed Vasim
deep-learningneural-networksmachine-learningrepresentation-learning

Standard cross-entropy requires aligned labels: for each input position, you must know the correct output. In speech recognition, a 3-second audio clip at 100 frames/second gives 300 input frames. The transcript might be "cat" — 3 characters. Which of the 300 frames corresponds to "c"? Which to "a"? Which to "t"? You don't know — and getting this alignment requires expensive human annotation.

CTC (Connectionist Temporal Classification) avoids the alignment problem entirely. Instead of requiring aligned labels, it sums over all possible alignments that produce the correct output string. The model is trained to assign high total probability to the correct string across all valid ways of expressing it in the sequence of frame-level predictions.

Anchor: speech recognition, 4 time frames (T=4). Vocabulary: {a, b, −, <blank>}. Target: "ab".

text
t=1: [a=0.6, b=0.1, -=0.1, blank=0.2]
t=2: [a=0.1, b=0.7, -=0.1, blank=0.1]
t=3: [a=0.1, b=0.1, -=0.1, blank=0.7]
t=4: [a=0.1, b=0.5, -=0.1, blank=0.3]

The Alignment Problem

Given T=4 frames and output "ab" (L=2), at every time step the model outputs probabilities over the vocabulary. We need a path through 4 time steps that collapses to "ab". But there are many valid paths:

  • [a, b, blank, blank] → remove blanks → [a, b] → merge duplicates → "ab" ✓
  • [a, blank, b, blank] → "ab" ✓
  • [a, blank, blank, b] → "ab" ✓
  • [blank, a, b, blank] → "ab" ✓
  • [a, a, b, blank] → remove blanks → [a, a, b] → merge consecutive → "ab" ✓ (the repeated 'a' merges!)
  • [blank, a, blank, b] → "ab" ✓

Collapsing rules:

  1. Remove all <blank> tokens
  2. Merge consecutive duplicate characters (but not characters separated by a blank)

So "a, a, b" → "ab" (merge consecutive a's). But "a, blank, a, b" → remove blank → [a, a, b] → "ab" (also collapses, because the blank-separated a's are still consecutive after blank removal). To keep two 'a's, you need "a, blank, a" in the original only if the final target is "aa".


CTC Loss Formula

P(y|x) = Σ_{π: B(π)=y} Π_t p(πₜ|t)

where B is the collapse function and the sum is over all valid paths π that collapse to the target y.

CTC Loss = −log P(y|x)

We want P("ab"|x) to be large — equivalently, CTC loss to be small.


Computing Path Probabilities

Using anchor model outputs:

Path [a, b, blank, blank]: p(a|t=1) × p(b|t=2) × p(blank|t=3) × p(blank|t=4) = 0.6 × 0.7 × 0.7 × 0.3 = 0.0882

Path [a, blank, b, blank]: p(a|t=1) × p(blank|t=2) × p(b|t=3) × p(blank|t=4) = 0.6 × 0.1 × 0.1 × 0.3 = 0.0018

Path [a, blank, blank, b]: p(a|t=1) × p(blank|t=2) × p(blank|t=3) × p(b|t=4) = 0.6 × 0.1 × 0.7 × 0.5 = 0.0210

Path [blank, a, b, blank]: p(blank|t=1) × p(a|t=2) × p(b|t=3) × p(blank|t=4) = 0.2 × 0.1 × 0.1 × 0.3 = 0.0006

Path [a, a, b, blank]: p(a|t=1) × p(a|t=2) × p(b|t=3) × p(blank|t=4) = 0.6 × 0.1 × 0.1 × 0.3 = 0.0018


Trace Table (Top 5 Valid Paths)

Patht=1t=2t=3t=4product
[a, b, blank, blank]0.60.70.70.30.088200
[a, blank, blank, b]0.60.10.70.50.021000
[a, blank, b, blank]0.60.10.10.30.001800
[a, a, b, blank]0.60.10.10.30.001800
[blank, a, b, blank]0.20.10.10.30.000600

These 5 paths alone contribute ≈ 0.1134. The full sum over all valid paths P("ab"|x) is computed by exhaustive enumeration (code below).


Forward-Backward DP

Naive enumeration is O(V^T) — for T=4, V=4: only 256 paths total, manageable. But for T=300, V=10000 in real ASR, V^T is astronomical.

CTC uses dynamic programming with two variables:

Forward α(t, s): probability of producing the first s characters of the extended label (target with blanks inserted) in the first t frames.

Backward β(t, s): probability of producing the remaining characters from frame t to T.

For target "ab" with blanks inserted: ⟨blank⟩a⟨blank⟩b⟨blank⟩ — this extended label has length 5 (L=2 characters + 3 blanks = 2L+1 = 5 positions).

DP table structures=1 (blank)s=2 (a)s=3 (blank)s=4 (b)s=5 (blank)
t=1α(1,1)α(1,2)000
t=2α(2,1)α(2,2)α(2,3)00
t=3α(3,1)α(3,2)α(3,3)α(3,4)0
t=4α(4,1)α(4,2)α(4,3)α(4,4)α(4,5)

P(y|x) = α(T, 2L+1) + α(T, 2L) — the probability of being at either the final blank or the final character at time T.

This DP runs in O(T × L) instead of O(V^T).


Code

python
import numpy as np
from itertools import product

# Anchor outputs
probs = np.array([
    [0.6, 0.1, 0.1, 0.2],  # t=1: a, b, -, blank
    [0.1, 0.7, 0.1, 0.1],  # t=2
    [0.1, 0.1, 0.1, 0.7],  # t=3
    [0.1, 0.5, 0.1, 0.3],  # t=4
])
vocab = ['a', 'b', '-', '<blank>']
blank_idx = 3

def collapse(path):
    # Remove blanks then merge consecutive duplicates
    no_blank = [c for c in path if c != '<blank>']
    result = []
    for c in no_blank:
        if not result or c != result[-1]:
            result.append(c)
    return ''.join(result)

target = "ab"
T = 4
total_prob = 0.0
valid_paths = []

for path_indices in product(range(4), repeat=T):
    path_chars = [vocab[i] for i in path_indices]
    if collapse(path_chars) == target:
        prob = np.prod([probs[t, path_indices[t]] for t in range(T)])
        valid_paths.append((path_chars, prob))
        total_prob += prob

valid_paths.sort(key=lambda x: -x[1])
print(f"Top 5 valid paths for '{target}':")
for path, prob in valid_paths[:5]:
    print(f"  {path} → prob = {prob:.6f}")
print(f"\nP('{target}'|x) = {total_prob:.6f}")
print(f"CTC Loss = -log P = {-np.log(total_prob):.4f}")
text
Top 5 valid paths for 'ab':
  ['a', 'b', '<blank>', '<blank>'] → prob = 0.088200
  ['a', '<blank>', '<blank>', 'b'] → prob = 0.021000
  ['a', 'b', '<blank>', 'b'] → prob = 0.010500
  ['a', '<blank>', 'b', '<blank>'] → prob = 0.001800
  ['a', 'a', 'b', '<blank>'] → prob = 0.001800

P('ab'|x) = 0.130500
CTC Loss = -log P = 2.0356

The path [a, b, blank, b] is also valid: remove blanks → [a, b, b] → merge consecutive b's → "ab" ✓. The model naturally finds the high-probability path (a=0.6, b=0.7, blank=0.7, blank=0.3) which contributes 0.0882 — nearly 68% of the total probability mass.


Where CTC Is Used

DeepSpeech (Baidu, 2014): End-to-end ASR with RNN + CTC. No forced alignment, no phoneme dictionary required — the model learns to map audio frames directly to character sequences via CTC.

OCR without bounding boxes: Given an image of a word, CTC can train a CNN-RNN to output the character sequence without knowing which pixel column corresponds to which character.

Wav2Vec (early versions): CTC was used to train representations from raw audio before attention-based architectures took over.


CTC applies cross-entropy (03-classification-losses.md) at each time step to produce frame-level probabilities, then sums those probabilities across valid paths. The softmax function (03-activations/07-softmax.md) is applied at each time step to convert logits to probabilities. The modern alternative to CTC is attention-based encoder-decoder (11-advanced/02-encoder-decoder.md): instead of marginalization over paths, the decoder attends to encoder states and predicts one output token at a time. Attention-based models generally outperform CTC on long sequences because they can model dependencies between output tokens explicitly.

Honest Limitations

CTC assumes conditional independence between output tokens given the input: the probability of the output sequence factorizes as a product of per-token probabilities at each frame. This means CTC cannot model "after predicting 'c', the next character is more likely to be a vowel." Language models must be integrated separately (beam search with an LM) to capture this. Attention-based decoders handle this natively.

CTC requires T ≥ L: the number of input frames must be at least as large as the output length. For very compressed audio or very long transcripts, this can be violated. In practice, CTC is applied after a subsampling layer (e.g., pooling) that reduces T by 2–4×, but this makes the T ≥ L constraint tighter.

CTC struggles with long-range output dependencies. If the model must decide whether to output "their" or "there" (homophones), it needs context from surrounding words — information that CTC's frame-level independence assumption cannot capture. This is the primary reason Whisper and modern ASR systems use attention-based decoders rather than CTC alone.


Test Your Understanding

  1. The path [a, blank, a, blank] collapsed: what does it produce? Is it a valid path for target "a"? For target "aa"?

  2. The top path [a, b, blank, blank] has probability 0.0882 = 0.6×0.7×0.7×0.3. Why does this single path dominate? What does the model learn from training examples where this path dominates?

  3. Compute the probability of path [blank, a, blank, b] on the anchor. The model assigns blank=0.2 at t=1. Show all 4 multiplications.

  4. CTC loss on anchor is 2.0356 nats. If the model were perfect (P("ab"|x)=1.0), what would the CTC loss be? What would it be for a random model assigning 0.25 to every token at every step? How many valid paths would a random model have?

  5. A CTC model processes a 100-frame audio input and produces a 50-character transcript. Is this valid? Now compress the audio by 4× (output 25 frames from the encoder before CTC). Is the model still valid? What is the constraint that determines validity?

Comments (0)

No comments yet. Be the first to comment!

Leave a comment