Poirot Confronts Goodhart
In Part 1, we built a scorer and found a surprise: the p(C|O) direction (“is this compression natural given the original?”) detects all three failure modes (distortion, hallucination, omission) at p<0.001 on 48 real agent trajectories. This was the opposite of what the Poirot intuition predicted; p(O|C) (“do the notes make the novel unsurprising?”) was empirically weaker as an evaluator.
The natural next step: use our strongest evaluator as an RL reward to train a compressor. If p(C|O) reliably identifies good compression, it should be able to teach a model to produce good compressions, right? That tension between “best evaluator” and “best training signal” becomes the central finding of this post.
Enter Goodhart
“When a measure becomes a target, it ceases to be a good measure.” - Charles Goodhart
Can we use p(C|O) as a reward signal to train a compressor with Reinforcement Learning? Because it doesn’t require human labels, this would be a scalable, self-supervised path. But does it work, or does the model learn to hack the score?
Method
Train a compressor using RL against a frozen LM scorer, using various rewards described below. For each run, track its reward trajectory over N steps and compare with blind judge scores to identify where divergence occurs.
- Dataset: 48 curated software engineering agent trajectories from real SWE-bench problem-solving sessions. The compressor sees a truncated prefix (up to 2048 tokens) and generates a summary (up to 256 tokens).
- Compressor: Qwen3-1.7B-Instruct + LoRA (6.4M params)
- Scorer: Frozen Qwen3-1.7B-Base
- Training: GRPO, group size 4, KL penalty 0.02
- Eval: Claude Sonnet 4 as blind judge (faithfulness 1-5, coherence 1-5)
- Baseline: Instruct model with no RL scores 2.1/5 faithfulness
- Cost: ~25 total)
- AI assistants: Claude, Gemini
Reward 1: p(C|O) - Our best evaluator as a training signal
In Part 1, p(C|O) was our strongest evaluator - 44/48 paired wins detecting failures. The natural first RL reward: “is this compression the kind of thing you’d write after reading the original?”
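A minimal sketch of how a conditional score like p(C|O) could be turned into a scalar reward from a frozen scorer's logits. It assumes the reward is the mean log-probability per compression token; the post does not say whether scores are length-normalized, and `log_softmax`, `mean_token_logprob` and the toy inputs are illustrative names rather than the actual training code.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]

def mean_token_logprob(step_logits, target_ids):
    """Average log-probability the scorer assigns to each target token.

    step_logits[t] is assumed to hold the scorer's logits for position t of
    the compression, computed with the original O (and the compression tokens
    before t) as context.
    """
    if not target_ids or len(step_logits) != len(target_ids):
        raise ValueError("need one logit vector per target token")
    total = sum(log_softmax(logits)[tok]
                for logits, tok in zip(step_logits, target_ids))
    return total / len(target_ids)

# Toy example: a 4-token vocabulary and a scorer that is completely unsure.
uniform = [[0.0, 0.0, 0.0, 0.0]] * 3
score = mean_token_logprob(uniform, [2, 0, 1])  # equals -log(4), about -1.386
```

Under this assumption, a reward near -1.4 corresponds to the scorer spreading its belief over roughly four equally likely tokens per position.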
| Step | Faithfulness | Coherence | Reward | Chars |
|---|---|---|---|---|
| 10 | 2/5 | 3/5 | -1.37 | 1194 |
| 20 | 2/5 | 2/5 | -1.11 | 1163 |
| 30 | 1/5 | 1/5 | -1.08 | 1127 |
| 50 | 1/5 | 1/5 | -2.44 | 256 |
| 70 | 1/5 | 1/5 | -1.11 | 277 |
It collapses in 30 steps. The model discovers an exploit ladder - a progression through four classes of tokens, each more generically predictable and less informative than the last:
1. Plausible Hallucination (step 10) - Looks great. Real <think> blocks, correct class names. Judge says 2/5: “hallucinates error messages not in the trajectory.” The output reads like a real summary but fabricates details.
2. Identifier Repetition (step 30) - Repeats real identifiers from the trajectory: FilteringBoundLogger’s FilteringBoundLogger’s FilteringBoundLogger’s… Judge: 1/5. High p(C|O) because these tokens are contextually plausible, but zero marginal information per repetition.
3. Numeric Gibberish (step 90) - 6377900000300018010000... Digits are high-probability continuations of any technical text.
4. Terminal Delimiters (step 190) - The global optimum. Reward -0.31 (one of the best scores in the entire run):
—"—"—"—"—"—"—"—"—"—"—"—"—"—"—"—"—"—"—"—"—"—"—"—"—"—"—"—"—"
Each rung is more generic, less dependent on the specific input, and less informative. The very property that made p(C|O) a good evaluator - sensitivity to specific token choices - makes it easy to game. The model just needs to find tokens that are generically predictable given any code text.
Stronger KL (5x) doesn’t help: Coefficient 0.10 instead of 0.02. Output stays readable longer but faithfulness is identical (still 2/5). The leash controls drift speed, not drift direction.
Reward 2: min(p(O|C), p(C|O)) - Score in both directions
Hypothesis: the model can’t game both directions at once.
| Step | Faithfulness | Coherence | Reward | Chars |
|---|---|---|---|---|
| 10 | 2/5 | 3/5 | -1.68 | 1261 |
| 20 | 1/5 | 1/5 | -1.75 | 758 |
| 30 | 1/5 | 1/5 | -1.84 | 768 |
| 50 | 1/5 | 1/5 | -2.10 | 1792 |
| 70 | 1/5 | 1/5 | -1.36 | 999 |
Result: same collapse, different tokens. The model finds “universal tokens” ([dir, serial, @qq) that are generically probable in both directions. Collapse at step 20 - faster than either direction alone.
Why faster? When optimizing min(A, B), the reward is set entirely by the smaller value, so the learning signal only reflects whichever direction is currently worse. To improve the overall reward, the model must improve that direction. But pushing for quality in one direction can penalize the other: genuine compression tokens that score well on p(O|C) might score poorly on p(C|O), or vice versa. The path of least resistance is to find “universal tokens” that are equally mediocre in both directions. The model settles into this saddle point because any step toward actual compression hurts one of the two metrics.
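The argument above can be made concrete with a few lines of arithmetic. This is only a sketch: `min_reward` is a hypothetical name and the scores are invented for illustration, not taken from the runs.

```python
def min_reward(logp_o_given_c, logp_c_given_o):
    """Reward 2: the worse of the two directional scores."""
    return min(logp_o_given_c, logp_c_given_o)

# Hypothetical scores for one compression; p(O|C) is the bottleneck here.
base = min_reward(-1.8, -1.2)           # -1.8

# Improving the direction that is NOT the bottleneck changes nothing.
better_strong = min_reward(-1.8, -0.6)  # still -1.8

# Improving the bottleneck helps only if the other side does not pay for it.
better_weak = min_reward(-1.5, -1.2)    # -1.5
traded = min_reward(-1.5, -2.0)         # -2.0, worse than where we started
```

Only changes that lift the worse score without dragging down the other are rewarded, which is why tokens that are moderately probable in both directions are an easy place to settle.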
I haven’t found prior work documenting this specific failure mode, though the broader concern about MI objectives being underspecified is discussed in Tschannen et al. (ICLR 2020, “On Mutual Information Maximization for Representation Learning”).
Reward 3: Self-perplexity gate
The exploits in Rewards 1-2 all use high-frequency tokens. What if we keep p(C|O) as the reward but block the obvious exploits? If unconditional p(C) is too high, replace the reward with -10.
| Step | Faithfulness | Coherence | Reward | Chars |
|---|---|---|---|---|
| 10 | 2/5 | 2/5 | -3.85 | 1370 |
| 20 | 2/5 | 3/5 | -3.38 | 1173 |
| 30 | 4/5 | 4/5 | -3.66 | 1342 |
| 40 | 2/5 | 1/5 | -3.98 | 986 |
| 50 | 1/5 | 1/5 | -3.31 | 724 |
| 70 | 1/5 | 1/5 | -8.03 | 1032 |
Step 30 scores 4/5 faithfulness - “accurately captures the key issue, correct file location, correct fix.” But by step 50 the model learns to stay just below the threshold while still gaming the conditional scorer.
The gate is the right idea, wrong implementation - a binary threshold when you need a continuous signal.
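A sketch of what such a gate might look like, to show where the cliff sits. The -10 penalty comes from the description above; the threshold value, the use of mean per-token log-probabilities, and the name `gated_reward` are assumptions made for illustration.

```python
def gated_reward(logp_c_given_o, logp_c, threshold=-2.0, penalty=-10.0):
    """Return the p(C|O) score unless the compression is too likely on its own.

    logp_c is the unconditional score of the compression (assumed here to be
    a mean per-token log-probability). `threshold` is a hypothetical value.
    """
    if logp_c > threshold:
        return penalty
    return logp_c_given_o

# Two almost identical compressions on either side of the threshold.
just_over = gated_reward(logp_c_given_o=-1.0, logp_c=-1.99)   # -10.0
just_under = gated_reward(logp_c_given_o=-1.0, logp_c=-2.01)  # -1.0
```

Everything below the threshold is scored exactly as in Reward 1, so an output that is only slightly less generic than the cutoff pays no price for being nearly generic.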
Reward 4: PMI (Li et al. 2016)
The gate’s partial success pointed to the fix: don’t gate generic tokens, continuously penalize them.
reward = log p(C|O) - log p(C)
If a token is equally probable with or without the original, it scores zero. Only tokens specifically predictable because of this original score positively.
This is Pointwise Mutual Information (PMI). Li et al. (NAACL 2016) popularized it as the “Maximum Mutual Information” (MMI) objective. Li, Monroe et al. (EMNLP 2016) used it as an RL reward. I rediscovered a decade-old idea.
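The formula above, sketched at the token level. Averaging over tokens is an assumption (the post gives only the sequence-level expression), and `pmi_reward` plus the example log-probabilities are made up for illustration.

```python
def pmi_reward(cond_logprobs, uncond_logprobs):
    """Mean per-token PMI: log p(c_t | O, c_<t) - log p(c_t | c_<t)."""
    if not cond_logprobs or len(cond_logprobs) != len(uncond_logprobs):
        raise ValueError("need matching, non-empty per-token log-probs")
    diffs = [c - u for c, u in zip(cond_logprobs, uncond_logprobs)]
    return sum(diffs) / len(diffs)

# Generic output (e.g. delimiter spam): just as likely with or without O.
generic = pmi_reward([-0.3, -0.3, -0.3, -0.3], [-0.3, -0.3, -0.3, -0.3])  # 0.0

# Input-specific output: unlikely on its own, likely once O is in context.
specific = pmi_reward([-1.0, -2.0], [-4.0, -5.0])  # 3.0
```

The generic output would have scored well under plain p(C|O) (-0.3 per token) but earns nothing here, while the input-specific output is rewarded for the probability it gains from seeing the original.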
Results across 4 seeds (75-126 steps each)
| Step | Mean faithfulness (n=4) | Range | Reward |
|---|---|---|---|
| Baseline | 2.1 | 1-3 | - |
| 10 | 3.0 | 2-4 | +0.67 |
| 20 | 2.0 | 2-2 | +0.50 |
| 30 | 3.0 | 2-4 | +1.00 |
| 40 | 1.5 | 1-2 | +0.44 |
| 70 | 1.5 | 1-2 | +1.67 |
What PMI gets right: Zero adversarial collapse in any seed. No identifier repetition, no gibberish, no delimiter spam. Compression length stays healthy (850-1260 chars). Mean faithfulness at steps 10 and 30 (3.0) is above the 2.1 baseline, though individual seeds range from 2 to 4.
What PMI doesn’t solve: Quality drops below baseline by step 40. Reward ends well above where it started (+0.67 → +1.67, though not monotonically) while mean quality drops (3.0 → 1.5) - reward hacking in slow motion. The model drifts toward meta-commentary and broken formatting. Real words, not adversarial, but not useful.
Caveat: Under GRPO’s within-group normalization, the log p(C) subtraction may be partially absorbed. I haven’t run the ablation that would isolate PMI’s contribution.
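A small sketch of why this caveat matters. It assumes advantages are computed as (reward - group mean) / group std, which is the usual GRPO formulation but may differ in detail from the implementation used here; `group_advantages` and all numbers are illustrative.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (assumed GRPO-style)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical log p(C|O) for a group of 4 samples from the same prompt.
logp_c_given_o = [-1.0, -1.5, -0.8, -2.0]
adv_plain = group_advantages(logp_c_given_o)

# Case 1: log p(C) is the same for every sample. Subtracting it shifts all
# rewards equally, so the advantages are unchanged.
shared = [-3.0, -3.0, -3.0, -3.0]
adv_shared = group_advantages([a - b for a, b in zip(logp_c_given_o, shared)])

# Case 2: log p(C) varies within the group. Now the subtraction reorders
# the samples and the PMI term actually changes the update.
varied = [-3.0, -1.0, -0.5, -4.0]
adv_varied = group_advantages([a - b for a, b in zip(logp_c_given_o, varied)])
```

Only the within-group variation in log p(C) survives normalization, so how much PMI contributes depends on how different the samples in a group are from each other.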
Reward 5: p(O|C) - Back to Poirot’s intuition
After seeing four variations of p(C|O) all Goodhart in different ways, I went back to the direction Poirot suggested all along: “do the notes make the novel unsurprising?”
| Step | Faithfulness | Coherence | Reward | Chars |
|---|---|---|---|---|
| 10 | 2/5 | 3/5 | -1.02 | 1300 |
| 30 | 4/5 | 4/5 | -1.05 | 1076 |
| 40 | 2/5 | 3/5 | -0.85 | 900 |
| 60 | 1/5 | 1/5 | -1.35 | 474 |
| 80 | 2/5 | 3/5 | -0.99 | 843 |
| 100 | 1/5 | 1/5 | -1.64 | 548 |
Something different happens here. The reward doesn’t smoothly increase while quality collapses. Instead, reward appears to correlate with quality - when the model produces something good (step 30, reward -1.05), it scores well; when it produces garbage (step 100, reward -1.64), it scores poorly.
In this single-seed run, p(O|C) is the only reward where the signal appears honest. It doesn’t mislead - consistent with Part 1’s finding that p(O|C) is an informative evaluator (32/48, p=0.015).
The problem is the model can’t consistently optimize it. There’s no easy exploit - repeating a class name doesn’t help predict 2048 tokens of original. The model thrashes rather than converging. Training dynamics are chaotic (KL spikes to 12.9, chars oscillate wildly). With group size 4 and 1 trajectory per step, the gradients are too noisy to solve this hard optimization problem.
But this is a different kind of failure - not Goodhart (reward deceiving you) but underfitting (reward too hard to optimize with this setup).
The fundamental asymmetry: in p(C|O), the model controls the tokens being evaluated (C) - it can tweak them to directly manipulate the score. In p(O|C), the model is evaluated on how well its output predicts a massive, fixed 2048-token target (O). Unless the summary is already highly accurate, the gradient signal pointing toward “how to fix this summary to better predict token 1,402 of the original” is effectively noise.
This means SFT warmup isn’t just a nice-to-have - it may be mandatory. The model must start inside the attractor basin of “decent compressions” for the p(O|C) gradients to provide directional guidance rather than chaotic thrashing.
Summary
| Reward | Seeds | Peak quality | Collapse to 1/5 | Honest signal? | Failure mode |
|---|---|---|---|---|---|
| p(C|O) | 1 | 2/5 (step 10) | Step 30 | No (reward misleads) | Exploit ladder |
| p(C|O) + 5x KL | 1 | 2/5 (step 10) | Slower | No (same pattern) | Same hallucination |
| min(both) | 1 | 2/5 (step 10) | Step 20 | No (same pattern) | Universal tokens |
| Self-perp gate | 1 | 4/5 (step 30) | Step 50 | Partially | Below-threshold gaming |
| PMI | 4 | 3.0/5 mean | No adversarial collapse (mean 1.5 by step 40) | No (drifts) | Semantic drift |
| p(O|C) | 1 | 4/5 (step 30) | Step 60 | Appears yes | Underfitting |
The most surprising finding: the best reward for training may not be the best for evaluation. p(C|O) is the stronger evaluator (44/48) but Goodharts immediately. p(O|C) is the weaker evaluator (32/48) but in this single-seed run provides an honest training signal - the model just can’t optimize it yet. (More seeds needed to confirm this.)
Why p(O|C) is hard to optimize (and what would fix it)
Three factors work against smooth optimization:
- Extremely noisy gradients. Group size 4, 1 trajectory per step. Step-to-step variation dominated by which trajectory was sampled, not by policy improvement.
- No attractor basins. p(C|O) has easy exploits that create strong gradient signal (repetition patterns). p(O|C) has no cheap shortcuts - every direction looks equally hard.
- KL oscillation. The model pushes toward higher reward; KL yanks it back. With no basin to settle into, it thrashes.
What would fix it:
- Larger group size (16-64). DeepSeek-R1 uses 16+. More samples = better signal-to-noise.
- More trajectories per step (4-8). Reduces variance from trajectory difficulty.
- SFT warmup → light RL. Start from a better policy so the optimization landscape is smoother.
- Backward PMI: log p(O|C) - log p(O). The honest signal of p(O|C) with PMI’s exploit resistance. Untested.
- Bigger GPU, longer runs. See what happens at step 500+ without OOM interruptions.
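To give a rough sense of what a larger group buys, here is a toy simulation of how noisy the group-mean reward (the baseline each sample is compared against) is at different group sizes. It assumes rewards are independent Gaussian draws with unit standard deviation, which real rewards are not; `baseline_spread` and its parameters are hypothetical.

```python
import random
import statistics

def baseline_spread(group_size, trials=2000, reward_std=1.0, seed=0):
    """Std. dev. of the group-mean reward across many simulated groups."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        group = [rng.gauss(0.0, reward_std) for _ in range(group_size)]
        means.append(statistics.fmean(group))
    return statistics.pstdev(means)

spread_4 = baseline_spread(4)    # close to 1 / sqrt(4)  = 0.5
spread_16 = baseline_spread(16)  # close to 1 / sqrt(16) = 0.25
spread_64 = baseline_spread(64)  # close to 1 / sqrt(64) = 0.125
```

Under these assumptions the noise shrinks with the square root of the group size, so going from 4 to 16 samples halves it and going to 64 halves it again, at 4x and 16x the sampling cost.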
The direction this points (pending more seeds): p(O|C) may be the right reward. It would need more optimization power behind it, but that’s an engineering problem, not a fundamental limitation, unlike p(C|O), which appears gameable regardless of compute.