The Zero-Frequency Problem in Text Compression

import math
from collections import Counter, defaultdict

1. Training Corpus

We intentionally use a tiny corpus to force sparsity.

training_text = """
the cat sat on the mat
the cat sat on the mat
the cat sat on the mat
the cat sat on the
""".strip().split()

counts = Counter(training_text)
N = sum(counts.values())
counts, N
(Counter({'the': 8, 'cat': 4, 'sat': 4, 'on': 4, 'mat': 3}), 23)

2. Unigram MLE and Compression Failure

Maximum Likelihood Estimation (MLE) estimates probabilities by counting:

\[P(w) = \frac{\text{count}(w)}{\text{total words}}\]

This is intuitive: if “the” appears 10 times out of 25 words, we estimate \(P(\text{the}) = 10/25 = 0.4\).

The problem: What if a word never appeared in training? Its count is 0, so:

\[P(\text{rug}) = \frac{0}{25} = 0\]

A probability of zero seems reasonable (“we never saw it!”), but it creates a catastrophic problem for compression…

def mle_prob(word, counts, N):
    return counts[word] / N

novel_word = "rug"
print("MLE P(rug) =", mle_prob(novel_word, counts, N))
MLE P(rug) = 0.0

A Quick Detour: Why is this Catastrophic for Compression?

Before we see why zero probability breaks everything, we need to understand what we mean by “bits required to encode” something.

Shannon’s Key Insight (1948): Information is surprise.

  • If I tell you “the sun rose today” — no surprise, no information. You already knew that.
  • If I tell you “a meteor struck campus” — huge surprise, lots of information.

Bits measure surprise mathematically:

\[\text{bits}(w) = -\log_2 P(w)\]

| Event | Probability | Bits | Intuition |
|---|---|---|---|
| Coin lands heads | 0.5 | 1 bit | One yes/no question: “Heads?” |
| Roll a 6 on a die | 1/6 ≈ 0.167 | ~2.6 bits | Between 2 and 3 yes/no questions |
| Specific card from a deck | 1/52 ≈ 0.019 | ~5.7 bits | About 6 yes/no questions |
| Win the lottery | 0.0000001 | ~23 bits | Enormous surprise! |
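The table values follow directly from the formula; a quick check (the `bits` helper is just the formula above wrapped in a function):

```python
import math

def bits(p):
    """Shannon information content: -log2(p) bits of surprise."""
    return -math.log2(p)

print(f"coin flip     : {bits(0.5):.1f} bits")   # 1.0 bit
print(f"die roll (a 6): {bits(1/6):.1f} bits")   # ~2.6 bits
print(f"specific card : {bits(1/52):.1f} bits")  # ~5.7 bits
print(f"lottery win   : {bits(1e-7):.1f} bits")  # ~23 bits
```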

Why this matters for compression:

Compression algorithms like arithmetic coding and Huffman coding work by assigning shorter codes to common symbols and longer codes to rare ones. The formula \(-\log_2 P(w)\) is the theoretical minimum average code length: no lossless code can beat it on average (Huffman codes round each length up to a whole bit; arithmetic coding gets arbitrarily close to the bound).

  • Common words like “the” (high probability) → few bits → short codes
  • Rare words like “quixotic” (low probability) → more bits → longer codes
  • Unseen words (zero probability) → infinite bits → impossible to encode

This is why probability models matter for compression: your probability estimates directly determine your code lengths.

The Catastrophe: Zero Probability

Now we see the problem: when \(P = 0\), we get \(-\log_2(0) = \infty\) bits.

The encoder literally cannot proceed. Arithmetic coding works by subdividing the number line according to symbol probabilities. A zero-probability symbol gets zero space — there’s nowhere to put it. The algorithm crashes.
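To make “nowhere to put it” concrete, here is a minimal sketch of the interval-splitting step of arithmetic coding; `subdivide` and its probabilities are illustrative, not a full coder:

```python
def subdivide(interval, probs):
    """Split an interval into sub-intervals proportional to symbol
    probabilities. A zero-probability symbol gets a zero-width slice."""
    low, high = interval
    width = high - low
    slices, cum = {}, 0.0
    for symbol, p in probs.items():
        slices[symbol] = (low + cum * width, low + (cum + p) * width)
        cum += p
    return slices

slices = subdivide((0.0, 1.0), {"the": 0.5, "cat": 0.3, "rug": 0.0})
for sym, (lo, hi) in slices.items():
    print(f"{sym:>4}: [{lo:.2f}, {hi:.2f})  width = {hi - lo:.2f}")
```

Because “rug” gets a zero-width slice, there is no sub-interval the encoder could ever narrow into to represent it.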

def code_length(prob):
    return -math.log2(prob)

try:
    code_length(mle_prob(novel_word, counts, N))
except ValueError as e:
    print("Compression failure:", e)
Compression failure: math domain error

☝️ This is the zero-frequency problem in action. Python raises a ValueError because math.log2(0) is undefined.

The fix? We need to ensure every word has non-zero probability, even words we’ve never seen. This is called smoothing.

3. Laplace Smoothing (Unigrams)

Laplace smoothing (also called “add-one smoothing”) is the simplest fix: pretend every word was seen at least once.

\[P_{\text{Laplace}}(w) = \frac{\text{count}(w) + 1}{N + V}\]

Where:

  • +1 in the numerator: every word gets one “free” count
  • +V in the denominator: we add \(V\) (the vocabulary size) to keep the probabilities summing to 1

The tradeoff: We “steal” probability mass from frequent words and redistribute it to rare/unseen words. Frequent words get slightly lower probabilities than MLE would give them.

V = len(counts)

def laplace_prob(word, counts, N, V):
    return (counts[word] + 1) / (N + V)

for w in ["the", "cat", "sat", "mat", "rug"]:
    print(f"{w:>4}: {code_length(laplace_prob(w, counts, N, V)):.2f} bits")
 the: 1.64 bits
 cat: 2.49 bits
 sat: 2.49 bits
 mat: 2.81 bits
 rug: 4.81 bits

☝️ Now “rug” has a finite code length! It takes more bits than common words (as it should), but it’s no longer impossible to encode.

Notice the cost: “the” now takes slightly more bits than pure MLE would suggest, because we redistributed some of its probability to unseen words.
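The cost can be put in numbers with a quick side-by-side of the two estimates for “the” (the counts are rebuilt inline so the snippet runs on its own):

```python
import math
from collections import Counter

counts = Counter({"the": 8, "cat": 4, "sat": 4, "on": 4, "mat": 3})
N, V = sum(counts.values()), len(counts)   # N = 23, V = 5

p_mle = counts["the"] / N                  # 8/23 ≈ 0.348
p_lap = (counts["the"] + 1) / (N + V)      # 9/28 ≈ 0.321

print(f"MLE:     {-math.log2(p_mle):.2f} bits")   # 1.52 bits
print(f"Laplace: {-math.log2(p_lap):.2f} bits")   # 1.64 bits
```

The 0.12 extra bits per occurrence of “the” is the price paid so that unseen words like “rug” become encodable at all.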

4. Move to Bigrams

Now the sparsity problem becomes much worse.

Why Bigrams Make Everything Worse

With unigrams, we have \(V\) possible words. With bigrams, we have \(V^2\) possible pairs!

| Model | Possible Events | Our Vocab (V = 5) |
|---|---|---|
| Unigram | \(V\) | 5 words |
| Bigram | \(V^2\) | 25 pairs |
| Trigram | \(V^3\) | 125 triples |

Most of those \(V^2\) pairs will never appear in training data. The zero-frequency problem explodes combinatorially.

bigrams = list(zip(training_text[:-1], training_text[1:]))
bigram_counts = Counter(bigrams)
unigram_counts = Counter(training_text)

bigram_counts
Counter({('the', 'cat'): 4,
         ('cat', 'sat'): 4,
         ('sat', 'on'): 4,
         ('on', 'the'): 4,
         ('the', 'mat'): 3,
         ('mat', 'the'): 3})

☝️ Look at the bigram counts. We only see 6 of the 25 possible pairs. Most combinations, like (“mat”, “cat”) or (“on”, “sat”), have count zero.
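A quick count makes the sparsity concrete (the corpus is rebuilt inline so the snippet is self-contained):

```python
from collections import Counter

training_text = ("the cat sat on the mat\n" * 3 + "the cat sat on the").split()
V = len(set(training_text))
bigram_counts = Counter(zip(training_text[:-1], training_text[1:]))

seen, possible = len(bigram_counts), V ** 2
print(f"{seen} of {possible} possible bigrams seen "
      f"({seen / possible:.0%} coverage)")   # 6 of 25 (24%)
```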

5. Bigram MLE

Probability of w2 given w1.

Bigram MLE uses conditional probability — the probability of word \(w_2\) given that we just saw \(w_1\):

\[P(w_2 | w_1) = \frac{\text{count}(w_1, w_2)}{\text{count}(w_1)}\]

We divide by the count of the context word, not the total corpus size. This asks: “Of all the times we saw \(w_1\), how often was it followed by \(w_2\)?”

def bigram_mle(w1, w2, bigram_counts, unigram_counts):
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print("P(cat | the) =", bigram_mle("the", "cat", bigram_counts, unigram_counts))
print("P(rug | the) =", bigram_mle("the", "rug", bigram_counts, unigram_counts))
P(cat | the) = 0.5
P(rug | the) = 0.0

☝️ Key insight: Even though “the” is common and “rug” could plausibly follow it, we get \(P(\text{rug}|\text{the}) = 0\) because that specific combination never appeared in training.

This is worse than the unigram case — we don’t need a novel word to hit zero probability, just a novel sequence of known words!

6. Bigram Compression Failure

try:
    code_length(bigram_mle("the", "rug", bigram_counts, unigram_counts))
except ValueError as e:
    print("Bigram compression failure:", e)
Bigram compression failure: math domain error

☝️ Same crash, different cause. This time it’s not a novel word — both “the” and “rug” are in our vocabulary. It’s a novel bigram.

7. Laplace Smoothing for Bigrams

Bigram Laplace smoothing applies the same +1 trick, but now we add \(V\) to account for all possible next words:

\[P_{\text{Laplace}}(w_2 | w_1) = \frac{\text{count}(w_1, w_2) + 1}{\text{count}(w_1) + V}\]

Why +V? After seeing \(w_1\), there are \(V\) possible words that could follow. We’re adding 1 imaginary count for each possibility.

def bigram_laplace(w1, w2, bigram_counts, unigram_counts, V):
    return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + V)

for w2 in ["cat", "rug"]:
    p = bigram_laplace("the", w2, bigram_counts, unigram_counts, V)
    print(f"P({w2} | the) = {p:.4f} → {code_length(p):.2f} bits")
P(cat | the) = 0.3846 → 1.38 bits
P(rug | the) = 0.0769 → 3.70 bits

☝️ Both bigrams now have finite code lengths. “rug” following “the” costs more bits (it’s unexpected), but it’s encodable.

But there’s a problem… Laplace smoothing is quite aggressive. Adding 1 to every possible bigram means we’re giving a lot of probability mass to things that will probably never happen. This becomes wasteful with larger vocabularies.
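We can measure the waste on our toy corpus: even with only 5 words, Laplace hands a noticeable slice of probability to continuations of “the” that never occurred (with V = 50,000 this slice would dominate):

```python
from collections import Counter

training_text = ("the cat sat on the mat\n" * 3 + "the cat sat on the").split()
vocab = set(training_text)
unigram_counts = Counter(training_text)
bigram_counts = Counter(zip(training_text[:-1], training_text[1:]))
V = len(vocab)

def bigram_laplace(w1, w2):
    return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + V)

# Probability mass assigned to never-observed continuations of "the"
unseen = [w for w in vocab if bigram_counts[("the", w)] == 0]
mass = sum(bigram_laplace("the", w) for w in unseen)
print(f"mass given to {len(unseen)} unseen continuations of 'the': {mass:.3f}")
```

Roughly 23% of the probability after “the” goes to pairs like (“the”, “the”) that the corpus never produced.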

8. Why Good–Turing and Backoff Exist

Laplace smoothing has problems at scale. Real systems use smarter techniques:

Backoff: If we haven’t seen a bigram, fall back to the unigram probability. Only use the lower-order model when the higher-order one has no data.

\[P(w_i | w_{i-1}) = \begin{cases} P_{\text{bigram}}(w_i | w_{i-1}) & \text{if count}(w_{i-1}, w_i) > 0 \\ \alpha \cdot P_{\text{unigram}}(w_i) & \text{otherwise} \end{cases}\]

Interpolation: Instead of choosing one or the other, blend all n-gram levels together:

\[P(w_i | w_{i-1}) = \lambda_1 \cdot P_{\text{bigram}}(w_i | w_{i-1}) + \lambda_2 \cdot P_{\text{unigram}}(w_i)\]

where \(\lambda_1 + \lambda_2 = 1\). This way, even seen bigrams get some influence from unigram statistics.
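A minimal interpolation sketch on our toy corpus; `interpolated` is an illustrative helper and λ = 0.7 is an arbitrary choice (real systems tune the λs on held-out data):

```python
from collections import Counter

training_text = ("the cat sat on the mat\n" * 3 + "the cat sat on the").split()
N = len(training_text)
unigram_counts = Counter(training_text)
bigram_counts = Counter(zip(training_text[:-1], training_text[1:]))

def interpolated(w1, w2, lam=0.7):
    """Blend bigram and unigram MLE; lam weights the bigram term."""
    p_bi = bigram_counts[(w1, w2)] / unigram_counts[w1]
    p_uni = unigram_counts[w2] / N
    return lam * p_bi + (1 - lam) * p_uni

print(f"P(cat | the) = {interpolated('the', 'cat'):.3f}")  # seen bigram
print(f"P(sat | the) = {interpolated('the', 'sat'):.3f}")  # unseen bigram, nonzero
```

Note the unseen bigram (“the”, “sat”) now gets a nonzero probability from the unigram term; a word absent from the corpus entirely would still need smoothing at the unigram level.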

Discounting: Instead of adding counts (like Laplace), subtract a small amount from observed counts and redistribute that probability mass to unseen events. Good–Turing estimates how much to discount based on how many things occurred exactly once, twice, etc.

\[P_{\text{discounted}}(w) = \frac{\text{count}(w) - d}{N}\]
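A toy sketch of the discounting idea with an illustrative flat discount d = 0.5 (Good–Turing would instead derive the discount from count-of-counts statistics):

```python
from collections import Counter

counts = Counter({"the": 8, "cat": 4, "sat": 4, "on": 4, "mat": 3})
N, d = sum(counts.values()), 0.5          # d = 0.5 is an illustrative choice

discounted = {w: (c - d) / N for w, c in counts.items()}
freed = d * len(counts) / N               # mass now available for unseen events

print(f"mass freed for unseen events: {freed:.3f}")
print(f"check: {sum(discounted.values()) + freed:.3f}")  # sums to 1.0
```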

(Jurafsky & Martin chapter 3 covers these in detail.)

The Problem with Laplace Smoothing at Scale

Imagine a vocabulary of \(V = 50{,}000\) words:

  • Possible bigrams: \(V^2 = 2.5\) billion
  • Possible trigrams: \(V^3 = 125\) trillion

Laplace smoothing adds 1 to all of these — even absurd combinations like “the the the” or “banana quantum fishbowl.” We’re wasting probability mass on events that will never occur.

But there’s a subtler problem: what probability should we use when backing off?

The “San Francisco” Problem: Why Context Matters for Backoff

Consider these two words in a large corpus:

| Word | Raw Count | Contexts It Follows |
|---|---|---|
| Francisco | 5,000 | “San” (4,990), “Pope” (10) |
| Tuesday | 5,000 | “last”, “next”, “every”, “on”, “until”, … (hundreds) |

Both words appear 5,000 times. But look at their contexts:

  • “Francisco” almost only follows “San” — it’s trapped in one bigram
  • “Tuesday” follows hundreds of different words — it’s versatile

Now imagine we need to back off:

We encounter a novel bigram like “I visited Francisco” — we’ve never seen “visited Francisco” in training. We need to estimate P(Francisco | visited) by backing off to some unigram-like probability.

The problem with raw frequency: If we use raw counts, P(Francisco) ≈ P(Tuesday). But intuitively, “I visited Tuesday” makes more sense than “I visited Francisco” (without “San”).

Kneser-Ney’s insight: Instead of raw frequency, count how many different contexts a word appears in (its continuation count):

\[P_{\text{continuation}}(w) = \frac{|\{v : \text{count}(v, w) > 0\}|}{|\text{all bigram types}|}\]

  • “Francisco”: appears after ~2 distinct words → low continuation probability
  • “Tuesday”: appears after ~500 distinct words → high continuation probability

This is why Kneser-Ney outperforms Laplace: when backing off, it asks “how versatile is this word?” not “how frequent is this word?”
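Our toy corpus is far too small to show a “Francisco” effect, but the mechanics of continuation counting can be sketched directly (`contexts` is an illustrative helper name):

```python
from collections import defaultdict

training_text = ("the cat sat on the mat\n" * 3 + "the cat sat on the").split()
bigram_types = set(zip(training_text[:-1], training_text[1:]))

# How many distinct left-contexts does each word appear after?
contexts = defaultdict(set)
for w1, w2 in bigram_types:
    contexts[w2].add(w1)

for w in sorted(contexts):
    p_cont = len(contexts[w]) / len(bigram_types)
    print(f"P_continuation({w}) = {len(contexts[w])}/{len(bigram_types)}"
          f" = {p_cont:.3f}")
```

Even here, “the” (which follows both “on” and “mat”) gets a higher continuation probability than words locked into a single context.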

9. LLM Connection

  • Subword tokenization limits true novelty
  • Parameter sharing smooths across contexts
  • Softmax ensures no zero probabilities

Same zero-frequency problem — different machinery.

From N-grams to Neural Language Models

Modern LLMs like GPT and Claude face the exact same zero-frequency problem — but solve it differently:

| Classical N-grams | Modern LLMs |
|---|---|
| Discrete word counts | Continuous embeddings |
| Explicit smoothing (+1, backoff) | Implicit smoothing via parameter sharing |
| Fixed context (n−1 words) | Attention over long contexts |
| Zero prob for unseen n-grams | Softmax ensures all tokens > 0 |

The fundamental insight remains: you can’t assign zero probability to anything you might need to encode.
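A minimal sketch of why a softmax can never emit a zero: the exponential of any finite logit is strictly positive (the logit values here are made up; real implementations also subtract the max logit for numerical stability):

```python
import math

def softmax(logits):
    """exp of any finite logit is positive, so no token gets P = 0."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Even a very unlikely token (logit -20) keeps a tiny nonzero probability.
probs = softmax([5.0, 2.0, -20.0])
print([f"{p:.2e}" for p in probs])
print("min prob > 0:", min(probs) > 0)
```

In effect, the softmax is a built-in smoother: every token in the vocabulary gets a finite code length, no matter how improbable.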


🎯 Key Takeaways

  1. Zero counts → zero probability → infinite bits — compression is impossible
  2. Smoothing redistributes probability mass to ensure nothing has \(P = 0\)
  3. Higher-order n-grams make sparsity exponentially worse (\(V^n\) possible events)
  4. Laplace smoothing is simple but wasteful; real systems use smarter methods
  5. LLMs solve the same problem with neural networks, but the core issue is identical