NLL Loss and Perplexity
A language model assigns a probability to every possible next token given the context seen so far. Its job during training is to assign high probability to the actual next token in the training data. Negative log-likelihood (NLL) measures how well it does this: for each position in the sequence, take the negative log of the probability assigned to the true token, and sum across all positions. Lower NLL means higher probability assigned to the right tokens.
Perplexity is NLL's interpretable cousin. After computing average NLL per token, exponentiate it: PPL = exp(avg NLL). Intuitively, perplexity is the effective number of equally likely choices the model feels it has at each step. A perplexity of 5 means the model is as uncertain as if it had to randomly choose among 5 equally likely tokens at every position.
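Anchor: a language model predicting the next tokens in "The cat sat". Vocabulary: {The, cat, sat, dog, ran}, V=5. The end-of-sequence marker <EOS> predicted at the last step is not counted in this toy vocabulary, so V=5 covers word tokens only.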
P(cat | The) = 0.4
P(sat | The, cat) = 0.3
P(<EOS> | The, cat, sat) = 0.25
Negative Log-Likelihood
A sentence probability factorizes by the chain rule:
P(sentence) = P(w₁) · P(w₂|w₁) · P(w₃|w₁,w₂) · ... = Π P(wₜ|w₁,...,wₜ₋₁)
Taking the log turns the product into a sum:
log P = Σₜ log P(wₜ|w₁,...,wₜ₋₁)
Maximizing log P is equivalent to minimizing the negative log-likelihood:
NLL = −Σₜ log P(wₜ|context)
Computing on anchor:
- −log(0.4) = −(−0.9163) = 0.9163
- −log(0.3) = −(−1.2040) = 1.2040
- −log(0.25) = −(−1.3863) = 1.3863
Total NLL = 0.9163 + 1.2040 + 1.3863 = 3.5066
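A quick numerical check of the chain-rule step, as a minimal sketch: it confirms that summing per-token negative logs matches the negative log of the product on the anchor. The 2000-token sequence at the end is a made-up example (not from the anchor) to show why implementations work in log space.

```python
import math

# Probabilities the anchor model assigns to the true next tokens.
probs = [0.4, 0.3, 0.25]

# Chain rule: the sequence probability is the product of the per-token terms.
sequence_prob = math.prod(probs)

# Summing per-token negative logs gives the same quantity in log space.
nll_from_sum = -sum(math.log(p) for p in probs)
nll_from_product = -math.log(sequence_prob)

print(f"P(sequence)      = {sequence_prob:.4f}")
print(f"NLL from sum     = {nll_from_sum:.4f}")
print(f"NLL from product = {nll_from_product:.4f}")

# Hypothetical long sequence: 2000 tokens, each with probability 0.4.
# The raw product underflows to 0.0 in float64; the log-space sum does not.
long_probs = [0.4] * 2000
long_product = math.prod(long_probs)
long_nll = -sum(math.log(p) for p in long_probs)
print(f"long product = {long_product}, long NLL = {long_nll:.1f}")
```

The two NLL values agree on the anchor, while the long product collapses to zero and would make a direct log undefined. That is the practical reason to accumulate log-probabilities rather than probabilities.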
NLL Is Cross-Entropy
Standard cross-entropy for classification: CE = −Σ_k y_k · log(p_k)
When the label is one-hot (y_k=1 for the true class, 0 otherwise), all terms drop except the true class: CE = −log(p_{true class})
In a language model, each token prediction is exactly this: the label is the next actual token (one-hot), and the loss is −log of the probability assigned to that token. So:
NLL per position = CE per position
At position 1 (predicting "cat" after "The"):
- One-hot: [0, 1, 0, 0, 0] for {The, cat, sat, dog, ran}
- CE = −(0·log P(The) + 1·log P(cat) + 0·...) = −log(0.4) = 0.9163
This is identical to the NLL term at that position.
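The same equivalence can be sketched in NumPy. Only P(cat | The) = 0.4 comes from the anchor; the other four probabilities are invented here so that the distribution sums to 1.

```python
import numpy as np

vocab = ["The", "cat", "sat", "dog", "ran"]

# Hypothetical full distribution for P(. | The). Only P(cat | The) = 0.4
# comes from the anchor; the other four values are made up and sum to 0.6.
probs = np.array([0.05, 0.40, 0.15, 0.25, 0.15])

# One-hot label for the true next token "cat".
one_hot = np.zeros(len(vocab))
one_hot[vocab.index("cat")] = 1.0

# General cross-entropy: every term with y_k = 0 vanishes.
cross_entropy = -np.sum(one_hot * np.log(probs))

# NLL: read off the probability of the true token directly.
nll = -np.log(probs[vocab.index("cat")])

print(f"CE  = {cross_entropy:.4f}")
print(f"NLL = {nll:.4f}")
```

Changing the four invented probabilities leaves both numbers unchanged, since only the true-class entry survives the one-hot mask.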
Trace Table
| Pos | Token | P(token \| context) | log P | −log P (NLL) |
|---|---|---|---|---|
| 1 | cat | 0.4000 | −0.9163 | 0.9163 |
| 2 | sat | 0.3000 | −1.2040 | 1.2040 |
| 3 | <EOS> | 0.2500 | −1.3863 | 1.3863 |
| Total | | | | 3.5066 |
| Mean | | | | 1.1689 |
Average NLL per token = 3.5066 / 3 = 1.1689 nats
Perplexity
PPL = exp(NLL per token) = exp(−(1/T) Σ log P(wₜ|context))
Computing on anchor: PPL = exp(1.1689) = 3.218
The model has an effective branching factor of about 3.2 at each step — as uncertain as randomly choosing among ~3 equally likely tokens.
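Another way to read the formula is that perplexity is the inverse geometric mean of the probabilities given to the true tokens. The sketch below checks that both forms agree on the anchor; the variable names are illustrative.

```python
import math

probs = [0.4, 0.3, 0.25]  # anchor probabilities of the true tokens
T = len(probs)

# Definition used in the text: exponentiate the average NLL.
ppl_from_nll = math.exp(-sum(math.log(p) for p in probs) / T)

# Equivalent form: inverse geometric mean of the probabilities.
geometric_mean = math.prod(probs) ** (1 / T)
ppl_from_geomean = 1 / geometric_mean

print(f"exp(avg NLL)       = {ppl_from_nll:.4f}")
print(f"1 / geometric mean = {ppl_from_geomean:.4f}")
```

The geometric-mean view explains the "equally likely choices" reading: a model whose typical true-token probability is 1/k behaves like a uniform choice among k options.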
Three Reference Cases
Perfect model: P(true_token) = 1.0 at every position.
- NLL per token = −log(1.0) = 0
- PPL = exp(0) = 1
Random model over V=5: P = 0.2 for every token at every step.
- NLL per token = −log(0.2) = 1.6094
- PPL = exp(1.6094) = 5.0 = V (the vocabulary size)
Anchor model: PPL = 3.218, better than random (3.2 < 5) and worse than perfect (3.2 > 1).
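The three cases can be reproduced with one small helper. This is a sketch: the function name `perplexity`, the extra vocabulary sizes, and the sequence length of 10 are arbitrary choices for illustration.

```python
import math

def perplexity(true_token_probs):
    """Perplexity from the probabilities assigned to the true tokens."""
    avg_nll = -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)
    return math.exp(avg_nll)

# Uniform model: every token gets 1/V, so PPL should come out as V.
# The vocabulary sizes and the sequence length of 10 are arbitrary choices.
uniform_ppl = {V: perplexity([1.0 / V] * 10) for V in (5, 100, 50_000)}
for V, ppl in uniform_ppl.items():
    print(f"V = {V:>6}: PPL = {ppl:.1f}")

perfect_ppl = perplexity([1.0] * 10)        # lower bound
anchor_ppl = perplexity([0.4, 0.3, 0.25])   # sits between the two
print(f"perfect = {perfect_ppl:.1f}, anchor = {anchor_ppl:.4f}")
```

The uniform result does not depend on sequence length, because every position contributes the same log(V) to the average.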
Why LLMs Report Perplexity
NLL is in nats (or bits if log base 2 is used). Comparing "2.3 nats per token" vs "2.1 nats per token" requires knowing what scale of improvement is meaningful. Perplexity transforms this into an effective branching factor that is interpretable without domain knowledge.
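Reported perplexities for real models (the datasets and tokenizers differ, so these figures are not directly comparable):
- GPT-2 small (117M parameters): PPL ≈ 29 on WikiText-2 and ≈ 37.5 on WikiText-103, both zero-shot
- GPT-3 (175B parameters): PPL ≈ 20 on Penn Treebank, zero-shot
- LLaMA-2 70B: PPL ≈ 3.3 (on some domain-specific benchmarks)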
Improvements are multiplicative: going from PPL=30 to PPL=15 halves the effective branching factor. Going from PPL=4 to PPL=2 also halves it, but that second improvement is far harder to achieve.
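In log space, a multiplicative change in perplexity is an additive change in average NLL. The sketch below converts a few perplexity pairs into nats and bits saved per token. The helper names `nats_saved` and `nats_to_bits` are made up for this example.

```python
import math

def nats_saved(ppl_before, ppl_after):
    """Drop in average NLL (nats per token) implied by a perplexity change."""
    return math.log(ppl_before) - math.log(ppl_after)

def nats_to_bits(nats):
    """Convert natural-log units to base-2 units."""
    return nats / math.log(2)

for before, after in [(30, 15), (4, 2), (3, 2)]:
    delta = nats_saved(before, after)
    print(f"PPL {before} -> {after}: {delta:.4f} nats = {nats_to_bits(delta):.4f} bits per token")
```

Any halving of perplexity saves ln 2 ≈ 0.693 nats, which is exactly one bit per token, regardless of the starting point. A 3 to 2 change is a smaller step, about 0.405 nats.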
Code
```python
import numpy as np

# Anchor: probabilities the LM assigns to the true next tokens
true_token_probs = [0.4, 0.3, 0.25]  # P(cat|The), P(sat|The,cat), P(<EOS>|The,cat,sat)
tokens = ["cat", "sat", "<EOS>"]

# Per-token log-probabilities and their negatives (the NLL terms)
log_probs = [np.log(p) for p in true_token_probs]
nll_per_token = [-lp for lp in log_probs]

# Sum, average per token, then exponentiate to get perplexity
total_nll = sum(nll_per_token)
avg_nll = total_nll / len(tokens)
perplexity = np.exp(avg_nll)

# Trace table
print(f"{'Pos':>3} | {'Token':>6} | {'P(token)':>9} | {'log P':>8} | {'NLL':>8}")
for i, (tok, p, lp, nll) in enumerate(zip(tokens, true_token_probs, log_probs, nll_per_token)):
    print(f"{i+1:>3} | {tok:>6} | {p:>9.4f} | {lp:>8.4f} | {nll:>8.4f}")

print(f"\nTotal NLL: {total_nll:.4f}")
print(f"Avg NLL: {avg_nll:.4f}")
print(f"Perplexity: {perplexity:.4f}")

# Three reference cases
print("\nReference cases:")
print(f" Perfect (P=1.0): PPL = {np.exp(-np.log(1.0)):.1f}")
print(f" Random (P=0.2, V=5): PPL = {np.exp(-np.log(0.2)):.1f}")
print(f" Anchor model: PPL = {perplexity:.4f}")
```

Output:

```
Pos |  Token |  P(token) |    log P |      NLL
  1 |    cat |    0.4000 |  -0.9163 |   0.9163
  2 |    sat |    0.3000 |  -1.2040 |   1.2040
  3 |  <EOS> |    0.2500 |  -1.3863 |   1.3863

Total NLL: 3.5066
Avg NLL: 1.1689
Perplexity: 3.2183

Reference cases:
 Perfect (P=1.0): PPL = 1.0
 Random (P=0.2, V=5): PPL = 5.0
 Anchor model: PPL = 3.2183
```
Related Concepts
NLL is mechanically identical to cross-entropy loss (03-classification-losses.md) applied token-by-token. Softmax (03-activations/07-softmax.md) converts the model's logits into the probability distribution P over the vocabulary before NLL is computed. NLL is the training objective and perplexity is the standard intrinsic evaluation metric for language models, but neither directly measures generation quality. Models are also evaluated with BLEU (n-gram overlap with references), ROUGE, and human evaluation, because low perplexity on next-token prediction doesn't guarantee good generation.
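The logits-to-loss path can be sketched end to end in NumPy. The logits and target indices below are hypothetical, and `log_softmax` is a hand-written helper rather than a library call. Working with log-softmax directly avoids taking the log of a very small probability.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=-1, keepdims=True))

# Hypothetical logits for 3 positions over a 5-token vocabulary.
logits = np.array([
    [0.2, 1.5, 0.1, 0.9, 0.3],
    [0.0, 0.4, 1.2, 0.8, 0.5],
    [1.0, 0.2, 0.3, 0.1, 0.9],
])
targets = np.array([1, 2, 0])  # hypothetical indices of the true tokens

log_probs = log_softmax(logits)
# Pick out the log-probability of the true token at each position.
true_log_probs = log_probs[np.arange(len(targets)), targets]

avg_nll = -true_log_probs.mean()
perplexity = np.exp(avg_nll)
print(f"avg NLL = {avg_nll:.4f} nats, PPL = {perplexity:.4f}")
```

Subtracting the row maximum before exponentiating leaves the result unchanged mathematically but keeps `np.exp` from overflowing on large logits.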
Honest Limitations
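Perplexity is not comparable across models with different vocabularies or tokenizers. Per-token perplexity is normalized by the number of tokens, and that count depends on the tokenizer. GPT-2 with a roughly 50,000-token vocabulary and a model using BERT's roughly 30,000-token vocabulary cannot be meaningfully compared on raw perplexity: a larger vocabulary usually splits the same text into fewer, longer tokens, so each token carries more information and is harder to predict. A model with a 100,000-token vocabulary could have identical PPL to a 50k-token model while being strictly better at English, because it reaches that per-token score while covering the same text in fewer tokens. Comparisons across tokenizers need a tokenizer-independent unit, such as bits per character or per byte.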
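A small sketch of the normalization problem, using entirely hypothetical numbers: two models assign the same total NLL to the same text, but their tokenizers produce different token counts.

```python
import math

def per_token_ppl(total_nll, num_tokens):
    """Perplexity when the total NLL (nats) is averaged over tokens."""
    return math.exp(total_nll / num_tokens)

def bits_per_char(total_nll, num_chars):
    """Total NLL (nats) converted to bits and averaged over characters."""
    return total_nll / (num_chars * math.log(2))

# Hypothetical numbers: both models assign the same total NLL to the same
# 1000-character text, but their tokenizers split it differently.
num_chars = 1000
total_nll = 600.0
tokens_coarse = 200  # larger vocabulary, fewer and longer tokens
tokens_fine = 300    # smaller vocabulary, more and shorter tokens

ppl_coarse = per_token_ppl(total_nll, tokens_coarse)
ppl_fine = per_token_ppl(total_nll, tokens_fine)
bpc = bits_per_char(total_nll, num_chars)

print(f"per-token PPL, coarse tokenizer: {ppl_coarse:.2f}")
print(f"per-token PPL, fine tokenizer:   {ppl_fine:.2f}")
print(f"bits per character (both):       {bpc:.4f}")
```

Both models assign the text exactly the same probability, yet their per-token perplexities differ by a wide margin. The per-character figure is identical for both, which is why it is the safer unit for cross-tokenizer comparisons.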
Perplexity averages over all token positions equally. A single token with very low probability (like a rare proper noun with P=0.001) contributes NLL=6.9 to the sum — the same as 7 tokens with P=0.37. For language modeling of specialized domains (medical, legal, code), a handful of rare tokens can dominate the perplexity score while the model performs well on common language patterns.
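To see how much one rare token can move the score, here is a sketch with a made-up sequence: 20 well-predicted tokens plus a single token at P=0.001.

```python
import math

def perplexity(probs):
    """Perplexity from the probabilities assigned to the true tokens."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

# Hypothetical sequence: 20 common tokens predicted well, plus one rare token.
common = [0.6] * 20
rare = [0.001]

ppl_common = perplexity(common)
ppl_with_rare = perplexity(common + rare)

# Share of the total NLL that comes from the single rare token.
nll_common = -sum(math.log(p) for p in common)
nll_rare = -math.log(rare[0])
rare_share = nll_rare / (nll_common + nll_rare)

print(f"PPL without rare token: {ppl_common:.4f}")
print(f"PPL with rare token:    {ppl_with_rare:.4f}")
print(f"rare token share of total NLL: {rare_share:.1%}")
```

In this toy case one token out of 21 accounts for roughly 40% of the total loss. Reporting perplexity separately for frequent and rare tokens is one way to keep that effect visible.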
Low perplexity on a held-out set does not guarantee good generation quality. A model can achieve low perplexity by assigning moderate probability to many plausible continuations (safe predictions), while a model that generates coherent long-form text might assign higher probability to one specific good continuation and lower probability to many others. PPL and generation quality are correlated but not equivalent — evaluating generation requires BLEU, ROUGE, or human evaluation.
Test Your Understanding
- A language model assigns P(cat|The)=0.4, P(mat|The)=0.1. The true next word is "cat". Compute the NLL contribution at this position. Now compute for P(cat|The)=0.8. How much does the loss change?
- Compute perplexity for a 3-token sequence where the model assigns P=0.5, P=0.5, P=0.5 to the true tokens. What is the effective branching factor? What does this imply about the model's confidence?
- A model trained on English Wikipedia achieves PPL=15 on WikiText. Evaluated on Python code, it achieves PPL=150. What does this difference tell you about the model's generalization? Is this expected?
- Two models have the same perplexity of 20 on the test set. Model A achieved this with tokens averaging P=0.05 uniformly. Model B achieved this with most tokens at P=0.99 but a few extremely rare tokens at P=0.000001. Which model is better at "typical" text? Which would behave better in generation?
- Show algebraically that for a uniform model over V vocabulary items (P(token)=1/V for all tokens at all positions), PPL = V exactly. What happens to PPL as V → ∞ for a fixed "goodness" of predictions?