Poirot's Judgement

“Perfection is attained not when there is nothing more to add, but when there is nothing more to remove.” - Antoine de Saint-Exupéry

Information overload confronts us all - I wanted a way to locally compress/distill messy notes, emails, calendar info and more. I thought I could train my own model for the purpose.

First things first: how do you know if a compression is good? Being short isn’t enough; it also has to capture the important information in the original text.

“Useful Compression” Intuition

“Did I not tell you that I was, like you, a very puzzled man? But at least we can face our problem. We can arrange such facts as we have with order and method.” - Hercule Poirot

When we try to achieve a goal—whether it’s the “Observe” and “Orient” steps in an OODA loop, the <think> block in an LLM output, or solving a fictional murder mystery—the first step is always to gather and organize sprawling, messy facts into a condensed format.

For Poirot, Agatha Christie’s famously confident detective, the raw data is the narrative itself: every scene, every conversation, every red herring. His “compression” is his notes on the case. If his notes are perfect, rereading the novel offers zero surprises. The notes captured everything that mattered.

Let’s formalize Poirot’s intuition. The sprawling original text is $O$. The condensed notes are the compression $C$. If $C$ is a high-quality compression, then $O$ becomes highly predictable. In information theory terms, the conditional probability of the original text given the notes, $p(O|C)$, should be significantly higher than the unconditional probability $p(O)$.
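To make the comparison concrete, here is a minimal sketch of the chain-rule computation behind $p(O|C)$ versus $p(O)$. It assumes a deliberately tiny stand-in language model (an adaptive add-one unigram model over whitespace tokens, with a made-up vocabulary size), and the names `next_token_logprob`, `logp` and `lift` are illustrative rather than taken from any library. A real scorer would swap the stand-in for a neural LM.

```python
import math
from typing import List, Sequence

VOCAB_SIZE = 1000  # assumed vocabulary size for the toy model


def next_token_logprob(prefix: Sequence[str], token: str) -> float:
    """Toy LM: add-one smoothed unigram counts taken from the prefix."""
    return math.log((prefix.count(token) + 1) / (len(prefix) + VOCAB_SIZE))


def logp(text: List[str], context: Sequence[str] = ()) -> float:
    """log p(text | context) via the chain rule, one token at a time."""
    seen = list(context)
    total = 0.0
    for token in text:
        total += next_token_logprob(seen, token)
        seen.append(token)
    return total


def lift(original: List[str], compression: List[str]) -> float:
    """log p(O|C) - log p(O): positive when C makes O more predictable."""
    return logp(original, compression) - logp(original)


original = "ship the new dashboard by end of may blocker api rate limits on staging".split()
notes = "ship dashboard end of may blocker rate limits".split()
unrelated = "lovely weather in paris today".split()

print(lift(original, notes))      # positive: the notes prime the right words
print(lift(original, unrelated))  # slightly negative: unrelated context only adds noise
```

The toy model only rewards word overlap, so it says nothing about quality on its own; the point is the shape of the computation, which stays the same when the stand-in is replaced by a real model.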

Golden Examples

Before building a scorer, I needed labeled data to test it against. I created 25 examples across 5 use cases:

Messy notes - raw meeting bullets → organized summary
Voice notes - rambling transcription → clear action items
Web articles - long-form article → key points
Multi-LLM context - multiple AI responses → unified synthesis
LLM context windows - code/PR descriptions → condensed handoff

Each example has an input, one “good” compression, and two “less good” variants. Each less-good variant exhibits one of four failure modes:

Terse - too short, drops key information
Verbose - too long, basically a copy-paste rewording
Hallucinated - adds claims not in the original
Missing key info - readable but omits critical facts

The scorer’s job: given (input, compression), rank “good” above “less good” on every pair. 50 total pairs. Pass rate target: ≥80%.

Here’s a concrete example. The input is raw meeting notes:

Monday meeting with Sarah and Tom about Q2 planning

we need to ship the new dashboard by end of May

Sarah said design review Thursday

Tom thinks we should pull in Priya from platform

also Q2 planning, targets are 20% growth

dashboard blocker: API rate limits on staging

action item: me to file infra ticket for rate limits

Sarah: marketing wants to see dashboard before launch

Tom will sync with Priya when she’s back

The “good” compression preserves key facts with structure:

Goal: Ship new dashboard by end of May. Q2 target is 20% MAU growth. Blocker: API rate limits on staging - I’ll file infra ticket. Design review: Thursday 2pm, marketing joining. People: Tom wants Priya from platform; she’s on PTO, Tom will sync when she’s back.

The “less good terse” variant drops too much:

Ship dashboard by end of May

Design review Thursday

20% growth target

A good scorer should rank the first above the second every time.
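The pairwise check described above can be sketched as a small harness. This is a sketch under assumptions: `Pair`, `pass_rate` and `word_recall` are hypothetical names, the two pairs are shortened toy strings rather than the actual golden examples, and `word_recall` is a deliberately naive stand-in scorer, not one of the scorers evaluated below.

```python
from typing import Callable, List, NamedTuple


class Pair(NamedTuple):
    original: str
    good: str
    less_good: str


Scorer = Callable[[str, str], float]  # (original, compression) -> score


def pass_rate(pairs: List[Pair], scorer: Scorer) -> float:
    """Fraction of pairs where the good compression strictly outscores the less-good one."""
    wins = sum(
        1 for p in pairs
        if scorer(p.original, p.good) > scorer(p.original, p.less_good)
    )
    return wins / len(pairs)


def word_recall(original: str, compression: str) -> float:
    """Naive stand-in scorer: share of the original's distinct words that survive."""
    source = set(original.lower().split())
    kept = source & set(compression.lower().split())
    return len(kept) / len(source)


pairs = [
    # Terse failure: the less-good variant drops most of the facts.
    Pair("ship dashboard by end of may blocker is api rate limits",
         "ship dashboard end of may blocker api rate limits",
         "ship dashboard"),
    # Verbose failure: the less-good variant is a verbatim copy.
    Pair("tom will sync with priya when she is back",
         "tom syncs with priya on return",
         "tom will sync with priya when she is back"),
]

print(pass_rate(pairs, word_recall))  # 0.5: catches the terse variant, fooled by the copy
```

Even this toy run shows why the scorer needs more than overlap: a verbatim copy trivially keeps every word.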

Hill climbing toward a scorer

Building on the intuition above, if the compression is good, the original should be unsurprising given the compression, and shorter is generally better. We can use a language model to measure this. Each step below changes one thing from the previous attempt.

GPT-2 + raw perplexity: 60%

Formula: reward = -perplexity(O|C) - 0.5 * compression_ratio

Result: 30/50 pairs correct. Basically a coin-flip.

Failure modes: (1) verbose copy-paste wins (low perplexity, mild length penalty), (2) GPT-2 finds markdown/bullets “surprising”, likely because it was trained on web prose.
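As a rough illustration of the first failure mode, here is the formula above as a small function. The token counts and log-probabilities are made-up numbers, and two details are assumptions the formula does not spell out: perplexity is taken as exp of the mean negative log-likelihood per token of O, and compression_ratio as len_tokens(C) / len_tokens(O).

```python
import math


def raw_reward(logp_o_given_c: float, n_tokens_o: int, n_tokens_c: int,
               length_weight: float = 0.5) -> float:
    """reward = -perplexity(O|C) - 0.5 * compression_ratio."""
    # Assumption: perplexity = exp(mean negative log-likelihood per token of O).
    perplexity = math.exp(-logp_o_given_c / n_tokens_o)
    # Assumption: compression_ratio = tokens in C / tokens in O.
    compression_ratio = n_tokens_c / n_tokens_o
    return -perplexity - length_weight * compression_ratio


# Illustrative numbers only: a 200-token original.
good = raw_reward(logp_o_given_c=-280.0, n_tokens_o=200, n_tokens_c=60)
verbose = raw_reward(logp_o_given_c=-80.0, n_tokens_o=200, n_tokens_c=180)

print(good, verbose)  # the near-copy scores higher despite being 3x longer
```

With these numbers the perplexity term dominates: the length penalty can move the reward by at most 0.5 per unit of ratio, which is too mild to offset a near-copy's low perplexity.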

GPT-2 + lift per output token: 40%

Changed the formula to reward density: lift / len_tokens(C).

Result: 20/50. Short garbage outputs game the per-token ratio - a 3-word output gets huge “lift per token.”
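A quick sketch of why the per-token ratio is gameable, using made-up log-probabilities (in nats) for a single original. The function name `lift_per_token` is illustrative.

```python
def lift_per_token(logp_o: float, logp_o_given_c: float, n_tokens_c: int) -> float:
    """(log p(O|C) - log p(O)) / len_tokens(C)."""
    lift = logp_o_given_c - logp_o
    return lift / n_tokens_c


LOGP_O = -600.0  # illustrative baseline log p(O)

good = lift_per_token(LOGP_O, -280.0, n_tokens_c=60)  # lift of 320 spread over 60 tokens
terse = lift_per_token(LOGP_O, -570.0, n_tokens_c=3)  # lift of 30 spread over 3 tokens

print(good, terse)  # the 3-token output wins on density while explaining far less
```

The terse output recovers under a tenth of the lift, but dividing by three tokens instead of sixty more than makes up for it.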

GPT-2 + normalized lift (frac): 70%

Normalize by the original’s self-information. This is essentially the fraction of the original’s entropy that the compression removes — an approximation of information gain:

frac = (log p(O|C) - log p(O)) / -log p(O)

log p(O): how surprised the LM is by the original (baseline entropy)
log p(O|C): how surprised after reading the compression first (conditional entropy)
The difference is the lift (how much the compression helped)
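Dividing by -log p(O) caps the score at 1 and puts texts of different difficulty on the same scale (the value can dip below 0 if the compression makes the original less predictable)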
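The frac formula is small enough to write out directly. This is a minimal sketch with made-up log-probabilities (in nats); in practice both values would come from the same language model scoring the same original.

```python
def frac(logp_o: float, logp_o_given_c: float) -> float:
    """Fraction of the original's self-information removed by the compression."""
    if logp_o >= 0:
        raise ValueError("log p(O) must be negative for a non-trivial text")
    return (logp_o_given_c - logp_o) / -logp_o


print(frac(-600.0, -280.0))  # a good compression removes a bit over half the surprise
print(frac(-600.0, -570.0))  # a terse output barely helps
print(frac(-600.0, -630.0))  # negative: this "compression" made O harder to predict
```

Because the denominator depends only on the original, a very short output can no longer inflate its score by shrinking the thing it is divided by.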

Result: 35/50. Better, but GPT-2’s style bias on markdown remains. Structured good examples still lose to unstructured prose rewrites.

GPT-2 + MDL (minimum description length): 46%

reward = -(α * len_tokens(C) + (-log p(O|C))), i.e. the negated total cost of encoding the compression plus the original given the compression.

Result: 23/50. Prefers tiny garbage outputs (short = low first term).
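To see how the first term can dominate, here is a sketch of the MDL reward with made-up numbers. The value of α is not stated above, so the one used here is purely an assumption: the cost in nats of one token under a uniform code over GPT-2's 50,257-token vocabulary.

```python
import math


def mdl_reward(logp_o_given_c: float, n_tokens_c: int, alpha: float) -> float:
    """reward = -(alpha * len_tokens(C) + (-log p(O|C)))."""
    description_length = alpha * n_tokens_c + (-logp_o_given_c)
    return -description_length


# Assumption: alpha = ln(50257), a uniform per-token code over GPT-2's vocabulary.
ALPHA = math.log(50257)

good = mdl_reward(-280.0, n_tokens_c=60, alpha=ALPHA)
terse = mdl_reward(-570.0, n_tokens_c=3, alpha=ALPHA)

print(good, terse)  # the 3-token output has the shorter total description
```

Under this α each extra token in C has to buy back more than about 10.8 nats of surprise to pay for itself, which the illustrative "good" compression does not.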

NLI entailment (DeBERTa): 84%

Tried a completely different approach: extract claims from the compression, check if the original entails each one.

Result: 42/50. Better than anything GPT-2-based, but brittle.

The problem: claim extraction. “Extract individual claims from this markdown document” is itself a hard NLP problem. The extraction breaks on code blocks, nested bullets, and technical content. Two models to maintain (claim extractor + NLI verifier).
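The two-stage shape of this approach can be sketched with pluggable parts. Everything below is a stand-in: `split_claims` is a naive line-and-sentence splitter and `words_subset` is a toy replacement for an NLI model, both hypothetical names chosen for illustration. The real pipeline would call a claim extractor and a DeBERTa-style entailment model in their place.

```python
from typing import Callable, List


def split_claims(compression: str) -> List[str]:
    """Naive claim extractor: one claim per non-empty line or sentence."""
    claims = []
    for line in compression.splitlines():
        for sentence in line.split(". "):
            sentence = sentence.strip(" -*.\t")
            if sentence:
                claims.append(sentence)
    return claims


def words_subset(premise: str, hypothesis: str) -> bool:
    """Toy entailment check: every hypothesis word appears in the premise."""
    return set(hypothesis.lower().split()) <= set(premise.lower().split())


def entailment_score(original: str, compression: str,
                     extract_claims: Callable[[str], List[str]],
                     entails: Callable[[str, str], bool]) -> float:
    """Share of extracted claims that the original supports."""
    claims = extract_claims(compression)
    if not claims:
        return 0.0
    return sum(1 for c in claims if entails(original, c)) / len(claims)


original = "sarah said design review thursday\ntom will sync with priya"
faithful = "design review thursday\ntom will sync with priya"
hallucinated = "design review thursday\npriya approved the budget"

print(entailment_score(original, faithful, split_claims, words_subset))      # 1.0
print(entailment_score(original, hallucinated, split_claims, words_subset))  # 0.5
```

Note that the score is only as good as `extract_claims`: if the splitter mangles a code block or nested bullets, the verifier never sees a well-formed claim to check.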

Gemma-4-E2B + normalized lift (frac): 94%

Since I suspected GPT-2 simply wasn’t used to bullets, I kept the same formula as the 70% attempt (frac) and just swapped GPT-2 for Gemma-4-E2B.

Result: 47/50 (94%). Large improvement, from 70% to 94%. The formula was fine all along - the LM was the problem.

Gemma understands structured text - markdown, bullets, code, tables - because it was trained on it.

Key insight: the LM matters more than the formula. The scorer IS the language model. If it can model your text distribution, the information-theoretic formula works. If it can’t, no formula will save you.

Does this work on real data?

My synthetic golden examples were a fine start, but I wanted to validate against actual data. Ideally something that mimics the meandering messiness of real work. So I turned to agent trajectories: 34K-92K tokens of tool calls, file reads, and reasoning from coding agents.

Specifically, I tested it on a dataset I made of 48 SWE-bench agent trajectories with both faithful compressions (from frontier LLMs) and adversarial compressions spanning three failure modes: distortion, hallucination, and omission.

A “win” means the scorer ranks the faithful compression above the adversarial one for that trajectory. I tested two scoring directions:

p(O|C) — Poirot’s direction: “do the notes make the novel unsurprising?”
p(C|O) — the reverse: “is this compression the kind of thing you’d naturally write after reading the original?”

Results via paired sign test:

| Failure mode | p(O\|C) “Poirot” | p(C\|O) “naturalness” |
| --- | --- | --- |
| All (pooled) | 32/48, p=0.015 | 44/48, p<0.001 |
| Distorted info | 29/48, p=0.097 | 44/48, p<0.001 |
| Hallucinated info | 24/48, p=0.557 | 45/48, p<0.001 |
| Missing key info | 30/48, p=0.056 | 41/48, p<0.001 |
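For readers who want to check the p-values, here is a minimal sketch of a paired sign test. It assumes a one-sided exact binomial test with no ties; the sidedness is an assumption, but with it the win counts in the p(O|C) column come out at the reported values after rounding.

```python
from math import comb


def sign_test_p(wins: int, n: int) -> float:
    """One-sided exact sign test: P(X >= wins) for X ~ Binomial(n, 0.5)."""
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


for wins in (32, 29, 24, 30, 44):
    print(f"{wins}/48 wins: p = {sign_test_p(wins, 48):.3g}")
```

Under the null hypothesis the scorer is a coin flip on each trajectory, so 24/48 lands just above 0.5 and anything in the mid-40s is vanishingly unlikely.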

For comparison, an embedding-cosine baseline (sentence-transformers) scored 6/48, worse than chance. Surface similarity is actively misleading.

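Surprise: the Poirot direction loses. p(C|O), which asks “is this compression natural given the original?”, detects all three failure modes at p<0.001. Poirot’s own direction p(O|C) reaches significance only when the failure modes are pooled (p=0.015); taken individually, distortion (p=0.097), omission (p=0.056), and hallucination (p=0.557) all miss the 0.05 threshold.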

I didn’t expect this. The intuition that “good notes make the novel unsurprising” is compelling, but as an evaluation signal it is empirically weaker than “good notes are the natural thing to write.” Why? My guess is that it comes down to the relative sizes of the two texts and the direction of prediction. We intentionally limited the compression to be smaller than the original. So p(C|O) conditions on a large O and predicts a short C, which is heavily “overdetermined” by everything in O. Conversely, p(O|C) asks a short C to predict, and expand into, a much larger O. Continuation is easier than compression.

What’s next: Poirot Confronts Goodhart

The scorer works: 94% on labeled pairs, and p<0.001 on all three trajectory failure modes in the p(C|O) direction. The natural next step: can we use it as an RL reward to train a compressor without curated labels? Part 2: Poirot Confronts Goodhart →