Poirot's Judgement: Building an Automatic Scorer for Text Compression
“Perfection is attained not when there is nothing more to add, but when there is nothing more to remove.” - Antoine de Saint-Exupéry
Information overload confronts us all - I wanted a way to locally compress/distill messy notes, emails, calendar info and more. I thought I could train my own model for the purpose.
First things first: How do you know if a compression is good? It’s not enough to be short; it has to capture the important information in the original text.
“Useful Compression” Intuition
“Did I not tell you that I was, like you, a very puzzled man? But at least we can face our problem. We can arrange such facts as we have with order and method.” - Hercule Poirot
When we try to achieve a goal—whether it’s the “Observe” and “Orient” steps in an OODA loop, the <think> block in an LLM output, or solving a fictional murder mystery—the first step is always to gather and organize sprawling, messy facts into a condensed format.
For Poirot, Agatha Christie’s famously confident detective, the raw data is the narrative itself: every scene, every conversation, every red herring. His “compression” is his notes on the case. If his notes are perfect, rereading the novel offers zero surprises. The notes captured everything that mattered.
Let’s formalize Poirot’s intuition. The sprawling original text is O. The condensed notes are the compression C. If C is a high-quality compression, then O becomes highly predictable given C. In information theory terms, the conditional probability of the original text given the notes, p(O|C), should be significantly higher than the unconditional probability p(O).
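Written out, with O for the original text and C for the compression, the intuition says the log-probability "lift" from reading the notes first should be positive. The log form below is only a restatement of the inequality; the name "lift" is the one this post uses later for the same difference.

```latex
p(O \mid C) > p(O)
\quad\Longleftrightarrow\quad
\mathrm{lift}(O, C) \;=\; \log p(O \mid C) - \log p(O) \;>\; 0
```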
Golden Examples
Before building a scorer, I needed labeled data to test it against. I created 25 examples across 5 use cases:
- Messy notes - raw meeting bullets → organized summary
- Voice notes - rambling transcription → clear action items
- Web articles - long-form article → key points
- Multi-LLM context - multiple AI responses → unified synthesis
- LLM context windows - code/PR descriptions → condensed handoff
Each example has an input, one “good” compression, and two “less good” variants. The less-good variants are specific failure modes:
- Terse - too short, drops key information
- Verbose - too long, basically a copy-paste rewording
- Hallucinated - adds claims not in the original
- Missing key info - readable but omits critical facts
The scorer’s job: given (input, compression), rank “good” above “less good” on every pair. 50 total pairs. Pass rate target: ≥80%.
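A minimal sketch of that pass-rate check, assuming each pair is stored as an (input, good, less-good) tuple and that ties count as failures. The names `pass_rate` and `toy_score` are hypothetical, and `toy_score` is a word-overlap stand-in, not one of the scorers tried in this post.

```python
from typing import Callable, List, Tuple

# (input, good compression, less-good compression)
Pair = Tuple[str, str, str]
Scorer = Callable[[str, str], float]


def pass_rate(pairs: List[Pair], score: Scorer) -> float:
    """Fraction of pairs where the good compression strictly outscores the less-good one."""
    wins = sum(
        1
        for original, good, less_good in pairs
        if score(original, good) > score(original, less_good)
    )
    return wins / len(pairs)


def toy_score(original: str, compression: str) -> float:
    """Stand-in scorer: number of distinct original words the compression reuses."""
    original_words = set(original.lower().split())
    compression_words = set(compression.lower().split())
    return float(len(original_words & compression_words))


pairs = [
    ("ship the dashboard by end of May", "ship dashboard by May", "ship dashboard"),
    ("design review is on Thursday", "design review Thursday", "review"),
]
rate = pass_rate(pairs, toy_score)
print(f"pass rate: {rate:.0%}, target met: {rate >= 0.80}")
```

Any real scorer can be dropped in for `toy_score` as long as it takes (input, compression) and returns a number where higher means better.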
Here’s a concrete example. The input is raw meeting notes:
Monday meeting with Sarah and Tom about Q2 planning
- we need to ship the new dashboard by end of May
- Sarah said design review Thursday
- Tom thinks we should pull in Priya from platform
- also Q2 planning, targets are 20% growth
- dashboard blocker: API rate limits on staging
- action item: me to file infra ticket for rate limits
- Sarah: marketing wants to see dashboard before launch
- Tom will sync with Priya when she’s back
The “good” compression preserves key facts with structure:
Goal: Ship new dashboard by end of May. Q2 target is 20% MAU growth. Blocker: API rate limits on staging - I’ll file infra ticket. Design review: Thursday 2pm, marketing joining. People: Tom wants Priya from platform; she’s on PTO, Tom will sync when she’s back.
The “less good terse” variant drops too much:
- Ship dashboard by end of May
- Design review Thursday
- 20% growth target
A good scorer should rank the first above the second every time.
Hill climbing toward a scorer
The obvious approach: if the compression is good, the original should be unsurprising given the compression, and shorter is generally better. Use a language model to measure this. Each step below changes one thing from the previous attempt.
GPT-2 + raw perplexity: 60%
Formula: reward = -perplexity(O|C) - 0.5 * compression_ratio
Result: 30/50 pairs correct. Basically a coin-flip.
Failure modes: (1) verbose copy-paste wins (low perplexity, mild length penalty), (2) GPT-2 finds markdown/bullets “surprising” - it was trained on web prose.
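To see why the verbose variant wins under this formula, here is a small sketch. It assumes `compression_ratio` means compression tokens divided by original tokens (the post does not spell this out), and the per-token log-probabilities are made-up illustrative values, not measurements from GPT-2.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the negative mean per-token log-probability (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def raw_reward(
    logprobs_o_given_c: Sequence[float],
    n_tokens_c: int,
    n_tokens_o: int,
    length_weight: float = 0.5,
) -> float:
    """reward = -perplexity(O|C) - 0.5 * compression_ratio."""
    compression_ratio = n_tokens_c / n_tokens_o  # assumed definition
    return -perplexity(logprobs_o_given_c) - length_weight * compression_ratio


# Illustrative numbers for a 100-token original.
# A near-copy makes every original token easy to predict, but is 90 tokens long.
verbose = raw_reward([-0.1] * 100, n_tokens_c=90, n_tokens_o=100)
# A real summary leaves more uncertainty per token, but is only 30 tokens long.
good = raw_reward([-2.0] * 100, n_tokens_c=30, n_tokens_o=100)

print(f"verbose: {verbose:.2f}, good: {good:.2f}")
```

With these numbers the perplexity term spans several units while the length penalty can move the reward by at most 0.5, so the near-copy comes out ahead.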
GPT-2 + lift per output token: 40%
Changed the formula to reward density: lift / len_tokens(C).
Result: 20/50. Short garbage outputs game the per-token ratio - a 3-word output gets huge “lift per token.”
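The exploit is easy to reproduce with arithmetic alone. In this sketch the log-probabilities are invented for illustration, and `lift_per_token` is a hypothetical name for the formula above.

```python
def lift_per_token(logp_o: float, logp_o_given_c: float, n_tokens_c: int) -> float:
    """Lift (log p(O|C) - log p(O)) divided by the compression's token count."""
    lift = logp_o_given_c - logp_o
    return lift / n_tokens_c


# Illustrative values: the original alone has log p(O) = -1000 nats.
# A 3-token output barely helps (lift of 30), yet scores 10 per token.
three_words = lift_per_token(-1000.0, -970.0, n_tokens_c=3)
# A 60-token summary helps ten times as much (lift of 300), yet scores 5 per token.
full_summary = lift_per_token(-1000.0, -700.0, n_tokens_c=60)

print(three_words, full_summary)
```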
GPT-2 + normalized lift (frac): 70%
Normalize by the original’s self-information. This is essentially the fraction of the original’s entropy that the compression removes — an approximation of information gain:
frac = (log p(O|C) - log p(O)) / -log p(O)
- log p(O) - how surprised the LM is by the original (baseline entropy)
- log p(O|C) - how surprised after reading the compression first (conditional entropy)
- The difference is the lift - how much the compression helped
- Dividing by -log p(O) normalizes to a roughly 0-1 scale regardless of text difficulty (the value never exceeds 1, and goes negative only if the compression makes the original harder to predict)
Result: 35/50. Better — fixed the verbose exploit via a hard length cap. But GPT-2’s style bias on markdown remains. Structured good.md still loses to unstructured prose rewrites.
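As a sketch, frac is a one-liner once the two log-probabilities are available. The inputs here are illustrative numbers rather than model outputs, and the hard length cap mentioned above is left out because its value is not given.

```python
def frac(logp_o: float, logp_o_given_c: float) -> float:
    """Fraction of the original's self-information removed by the compression.

    logp_o          log p(O): summed token log-probs of the original alone
    logp_o_given_c  log p(O|C): the same sum with the compression as prefix
    """
    if logp_o >= 0:
        raise ValueError("log p(O) must be negative for a non-trivial text")
    return (logp_o_given_c - logp_o) / -logp_o


print(frac(-1000.0, -400.0))   # compression removes 60% of the surprise
print(frac(-1000.0, -1000.0))  # compression does not help at all
print(frac(-1000.0, -1100.0))  # compression makes the original harder to predict
```

Because log p(O|C) can never exceed 0, frac tops out at 1; the third call shows that it can fall below 0.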
GPT-2 + MDL (minimum description length): 46%
reward = -(α * len_tokens(C) + (-log p(O|C)))
The term in parentheses is the total number of bits to encode the compression plus the original given the compression.
Result: 23/50. Prefers tiny garbage outputs (short = low first term).
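How strongly this reward favors short outputs depends on α, which the post does not state. The sketch below uses invented log-probabilities and two arbitrary α values to show the break-even: once the per-token cost is high enough, a 3-token output beats a 60-token summary that explains far more.

```python
def mdl_reward(n_tokens_c: int, logp_o_given_c: float, alpha: float) -> float:
    """reward = -(alpha * len_tokens(C) + (-log p(O|C)))."""
    return -(alpha * n_tokens_c + (-logp_o_given_c))


# Illustrative values (nats), not measurements.
tiny = {"n_tokens_c": 3, "logp_o_given_c": -980.0}
good = {"n_tokens_c": 60, "logp_o_given_c": -400.0}

for alpha in (1.0, 12.0):  # arbitrary choices for illustration
    r_tiny = mdl_reward(alpha=alpha, **tiny)
    r_good = mdl_reward(alpha=alpha, **good)
    winner = "tiny" if r_tiny > r_good else "good"
    print(f"alpha={alpha}: tiny={r_tiny}, good={r_good}, winner={winner}")
```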
NLI entailment (DeBERTa): 84%
Completely different approach: extract claims from the compression, check if the original entails each one.
Result: 42/50. Better than anything GPT-2-based, but brittle.
The problem: claim extraction. “Extract individual claims from this markdown document” is itself a hard NLP problem. The extraction breaks on code blocks, nested bullets, and technical content. Two models to maintain (claim extractor + NLI verifier). No clean path to a differentiable reward signal for training.
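The shape of such a pipeline can be sketched as follows. This is not the setup used above: `extract_claims` is a deliberately naive splitter, `entails` is a hypothetical callable where a real NLI model such as DeBERTa would plug in, and `toy_entails` is a word-containment stub so the example runs without any model.

```python
import re
from typing import Callable, List

Entails = Callable[[str, str], bool]  # (premise, hypothesis) -> entailed?


def extract_claims(compression: str) -> List[str]:
    """Naive claim extraction: one claim per line or sentence, bullets stripped."""
    pieces = re.split(r"[\n.]+", compression)
    claims = [piece.strip(" -*\t") for piece in pieces]
    return [claim for claim in claims if claim]


def entailment_score(original: str, compression: str, entails: Entails) -> float:
    """Fraction of extracted claims that the original entails."""
    claims = extract_claims(compression)
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if entails(original, claim))
    return supported / len(claims)


def toy_entails(premise: str, hypothesis: str) -> bool:
    """Stub: 'entailed' if every hypothesis word appears in the premise."""
    premise_words = set(re.findall(r"\w+", premise.lower()))
    return all(word in premise_words for word in re.findall(r"\w+", hypothesis.lower()))


original = "we need to ship the new dashboard by end of May. design review Thursday"
compression = "- ship dashboard by end of May\n- budget approved"
print(extract_claims(compression))
print(entailment_score(original, compression, toy_entails))
```

Even this toy shows the brittleness: splitting on periods would cut a claim like "v2.5 released" in half, and code blocks or nested bullets would fare worse.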
Gemma-4-E2B + normalized lift (frac): 94%
Same frac formula as the 70% attempt; the only change is swapping GPT-2 for Gemma-4-E2B.
Result: 47/50 (94%).
One change. 70% → 94%. The formula was fine all along - the LM was the problem.
Gemma understands structured text - markdown, bullets, code, tables - because it was trained on it. The same formula that scored 70% with GPT-2 scores 94% with Gemma because Gemma’s training distribution matches the text I’m scoring.
Key insight: the LM matters more than the formula. The scorer IS the language model. If it can model your text distribution, the information-theoretic formula works. If it can’t, no formula will save you.
What does the formula actually measure?
frac is the answer to: “What fraction of the original’s surprise is explained by reading the compression first?”
Unpack it:
- log p(O) - how surprised the LM is by the original (baseline difficulty)
- log p(O|C) - how surprised the LM is by the original after reading the compression
- The difference is the lift - how much the compression helped
- Dividing by -log p(O) normalizes it: a frac of 0.5 means “the compression explains half the original’s information content”
Back to Poirot: if his notes are perfect, rereading the novel produces zero surprise - frac approaches 1.0. If his notes are useless (just “someone died”), the novel is just as surprising as before - frac stays near 0.
Does this work on real data?
My synthetic golden examples are a fine start, but I wanted to validate against actual data, ideally something that mimics the meandering messiness of real work. So I turned to agent trajectories — 34K-92K tokens of tool calls, file reads, and reasoning from coding agents.
Specifically, I tested it on a new dataset of 48 SWE-bench agent trajectories with both faithful compressions (from frontier LLMs) and hand-authored adversarial compressions spanning three failure modes: distortion, hallucination, and omission.
I tested two scoring directions:
- p(O|C), Poirot’s direction: “do the notes make the novel unsurprising?”
- p(C|O), the reverse: “is this compression the kind of thing you’d naturally write after reading the original?”
For each of the 48 trajectories, the bench contains one faithful compression (from a frontier LLM) and one hand-authored adversarial compression. A “win” means the scorer ranks the faithful compression above the adversarial one for that trajectory. Results via paired sign test:
| Failure mode | p(O|C) “Poirot” | p(C|O) “naturalness” |
|---|---|---|
| All (pooled) | 32/48, p=0.015 | 44/48, p<0.001 |
| Distorted info | 29/48, p=0.097 | 44/48, p<0.001 |
| Hallucinated info | 24/48, p=0.557 | 45/48, p<0.001 |
| Missing key info | 30/48, p=0.056 | 41/48, p<0.001 |
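For readers who want to check the p-values, the figures in the table appear consistent with a one-sided exact sign test: the probability of seeing at least that many wins out of 48 if each pair were a fair coin flip. The sketch below assumes no ties; `sign_test_p` is a hypothetical helper name.

```python
from math import comb


def sign_test_p(wins: int, n: int) -> float:
    """One-sided exact sign test: P(X >= wins) for X ~ Binomial(n, 0.5)."""
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


for wins in (32, 29, 24, 30, 44):
    print(f"{wins}/48 wins: p = {sign_test_p(wins, 48):.3g}")
```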
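Surprise: the Poirot direction loses. p(C|O), which asks “is this compression natural given the original?”, detects all three failure modes at p<0.001. Poirot’s own direction p(O|C) reaches significance only on the pooled row (p=0.015); at the 0.05 level it misses on every individual failure mode, narrowly for omission (p=0.056) and distortion (p=0.097), and entirely for hallucination (24/48, p=0.557).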
I didn’t expect this. The intuition that “good notes make the novel unsurprising” is compelling but empirically weaker than “good notes are the natural thing to write.” Why?
p(C|O) asks whether each token of the compression is a natural continuation given the original. A hallucination (inventing Z) forces the scorer to generate Z from context that doesn’t support it — low probability. An omission produces a compression that’s structurally unusual (deliberately incomplete summaries read differently from naturally terse ones). Every bad token gets individually flagged.
p(O|C) asks whether the compression helps predict 2048 tokens of original. Omissions just mean less help (fewer notes = less predictive power, not negative signal). The bar for detection is higher because you’re averaging a tiny compression’s lift over a massive original.
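The two directions differ only in which text is the prefix and which is scored. The sketch below assumes a hypothetical interface, `next_token_logprob(context, token)`, that any causal LM could sit behind; `toy_lm` is a stand-in that rewards tokens already seen in the context and is not a normalized distribution. Tokens are whitespace-split words for simplicity.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical interface: log p(token | context) from any causal LM.
NextTokenLogProb = Callable[[Sequence[str], str], float]


def sequence_logprob(
    target: Sequence[str], prefix: Sequence[str], lm: NextTokenLogProb
) -> float:
    """Sum of log p(target[i] | prefix + target[:i]) over the target tokens."""
    context: List[str] = list(prefix)
    total = 0.0
    for token in target:
        total += lm(context, token)
        context.append(token)
    return total


def poirot_score(original: Sequence[str], compression: Sequence[str], lm: NextTokenLogProb) -> float:
    """log p(O|C): score the original with the compression as prefix."""
    return sequence_logprob(original, compression, lm)


def naturalness_score(original: Sequence[str], compression: Sequence[str], lm: NextTokenLogProb) -> float:
    """log p(C|O): score the compression with the original as prefix."""
    return sequence_logprob(compression, original, lm)


def toy_lm(context: Sequence[str], token: str) -> float:
    """Toy stand-in: tokens already in the context are cheap, new ones are costly."""
    return math.log(0.5) if token in context else math.log(0.01)


original = "ship dashboard by end of may blocker api rate limits".split()
faithful = "ship dashboard by may blocker rate limits".split()
hallucinated = "ship dashboard by may budget approved doubled".split()

print(naturalness_score(original, faithful, toy_lm))
print(naturalness_score(original, hallucinated, toy_lm))
```

In the p(C|O) direction each invented token is scored on its own against the original, so the three unsupported tokens in `hallucinated` each pay the full penalty. The toy only illustrates the mechanics; it says nothing about how a real LM would rank these.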
For comparison, an embedding-cosine baseline (sentence-transformers) scored 6/48 — worse than chance. Surface similarity is actively misleading.
The twist for Part 2: p(C|O) is the better evaluator — but as we’ll see, it’s also the easiest direction to game under RL optimization. The very property that makes it sensitive to failures (it expects specific, grounded tokens) also makes it exploitable (the model can learn to generate tokens that are generically expected without being informative). Meanwhile, p(O|C) — the weaker evaluator — turns out to provide a more honest training signal. The best evaluator and the best training reward are not the same thing.
What’s next
The scorer works: 94% on labeled pairs with frac, and p<0.001 on all three trajectory failure modes in the p(C|O) direction. The natural next step: can we use it as an RL reward to train a compressor without curated examples?
Five reward formulations, five different failure modes, one near-success, and a fundamental question about whether per-token probability signals can ever train faithful generation. Part 2: Poirot Confronts Goodhart →