Poirot Confronts Goodhart

In Part 1, we built a scorer and found a surprise: the $p(C|O)$ direction — “is this compression natural given the original?” — detects all three failure modes (distortion, hallucination, omission) at p<0.001 on 48 real agent trajectories. This was the opposite of what my Poirot-inspired intuition predicted; p(O|C) (“do the notes make the novel unsurprising?”) was empirically weaker as an evaluator.

The natural next step: use our strongest evaluator as an RL reward to train a compressor. If p(C|O) robustly ranks good compressions over bad ones, it should be able to teach a model to produce good compressions — right?

Enter Goodhart

“When a measure becomes a target, it ceases to be a good measure.” - Charles Goodhart

Can we use $p(C|O)$ as a reward signal to train a compressor with Reinforcement Learning? Because it doesn’t require human labels, this would be a scalable, self-supervised path. But does it work, or does the model learn to hack the score?

Method

Train a compressor with RL against a frozen LM scorer, using each of the rewards described below in turn. For each run, track the reward over training steps and compare it with blind judge scores to see where the two diverge.

Dataset: 48 curated software engineering agent trajectories from real SWE-bench problem-solving sessions. The compressor sees a truncated prefix (up to 2048 tokens) and generates a summary (up to 256 tokens).
Compressor: Qwen3-1.7B-Instruct + LoRA (6.4M params)
Scorer: Frozen Qwen3-1.7B-Base
Training: GRPO, group size 4, KL penalty 0.02
Eval: Claude Sonnet 4 as blind judge (faithfulness 1-5, coherence 1-5)
Baseline: Instruct model with no RL scores 2.1/5 faithfulness
Cost: ~$2/run on a single A10G spot instance via SkyPilot (~$25-50 total)
AI assistants: Claude, Gemini

Reward 1: p(C|O) - Our best evaluator as a training signal

In Part 1, p(C|O) was our strongest evaluator - 44/48 paired wins detecting failures. The natural first RL reward: “is this compression the kind of thing you’d write after reading the original?”

Step	Faithfulness	Coherence	Reward	Chars
10	2/5	3/5	-1.37	1194
20	2/5	2/5	-1.11	1163
30	1/5	1/5	-1.08	1127
50	1/5	1/5	-2.44	256
70	1/5	1/5	-1.11	277

Reward 1: p(C|O) — reward and faithfulness over training

It collapsed in 30 steps. The model discovered an exploit ladder - a progression through four classes of tokens, each more generically predictable and less informative than the last:

1. Plausible Hallucination (step 10) - Looks great. <think> blocks, correct class names. Judge says 2/5: “hallucinates error messages not in the trajectory.” The output reads like a real summary but fabricates details.

2. Identifier Repetition (step 30) - Repeats real identifiers from the trajectory: FilteringBoundLogger’s FilteringBoundLogger’s FilteringBoundLogger’s… Judge: 1/5. High p(C|O) because these tokens are contextually plausible, but zero marginal information per repetition.

3. Numeric Gibberish (step 90) - 6377900000300018010000... Digits are high-probability continuations of any technical text.

4. Terminal Delimiters (step 190) - The end state of the run, and quite possibly the global optimum, is em-dashes. Reward -0.31 (one of the best scores in the entire run):

—"—"—"—"—"—"—"—"—"—"—"—"—"—"—"—"—"—"—"—"—"—"—"—"—"—"—"—"—"

Each rung is more generic, less dependent on the specific input, and less informative. The very property that made p(C|O) a good evaluator - sensitivity to specific token choices - makes it easy to game. The model just needs to find tokens that are generically predictable given any code text.
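To make the gaming mechanism concrete, here is a toy calculation. It assumes the reward is the length-normalized log-likelihood of the compression tokens under the frozen scorer (the normalization is an assumption suggested by the reward scale in the table, not something stated above). The helper name and the per-token log-probs are made up for illustration.

```python
from typing import Sequence


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Average of log p(c_t | O, c_<t) over the tokens of the compression C."""
    if not token_logprobs:
        raise ValueError("need at least one scored token")
    return sum(token_logprobs) / len(token_logprobs)


# Hypothetical per-token log-probs from the frozen scorer.
# An informative summary has to commit to specific, less predictable tokens.
specific_summary = [-2.1, -0.4, -1.9, -0.8, -1.6]
# A repeated identifier is expensive once, then nearly free on every repeat.
repeated_token = [-1.5, -0.2, -0.1, -0.1, -0.1]

score_specific = mean_logprob(specific_summary)
score_repeated = mean_logprob(repeated_token)
print(f"specific: {score_specific:.2f}  repeated: {score_repeated:.2f}")
```

Under these made-up numbers the repetition outscores the real summary, which is the shape of the exploit: the scorer rewards predictability, and nothing in the objective asks for information.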

Stronger KL (5x) doesn’t help: I tried coefficient 0.10 instead of 0.02. The output stays readable longer but faithfulness is identical (still 2/5). The leash controls drift speed, but doesn’t prevent it.

Reward 2: min(p(O|C), p(C|O)) - Score in both directions

Hypothesis: the model can’t game both directions at once.

Step	Faithfulness	Coherence	Reward	Chars
10	2/5	3/5	-1.68	1261
20	1/5	1/5	-1.75	758
30	1/5	1/5	-1.84	768
50	1/5	1/5	-2.10	1792
70	1/5	1/5	-1.36	999

Reward 2: min(both) — faster collapse than either direction alone

Result: same collapse, different tokens. The model finds “universal tokens” (such as “[dir”, “serial”, and “@qq”) that are generically probable in both directions. It collapsed at step 20 - faster than p(C|O) alone.

Why faster? When optimizing min(A, B), gradients only flow through the smaller value. To improve the overall reward, the model must improve whichever direction is currently worse. But pushing for quality in one direction can penalize the other - genuine compression tokens that score well on p(O|C) might score poorly on p(C|O) or vice versa. The path of least resistance is to find “universal tokens” that are equally mediocre in both directions. The model settles into this saddle point because any step toward actual compression hurts one of the two metrics.
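The "gradients only flow through the smaller value" point can be checked numerically. This is a minimal sketch on scalar scores, with hypothetical function names and example values; the real reward is computed from LM log-probs, not from two free-standing floats.

```python
def min_reward(logp_o_given_c: float, logp_c_given_o: float):
    """Return the min of the two directional scores and which one is active."""
    if logp_o_given_c <= logp_c_given_o:
        return logp_o_given_c, "p(O|C)"
    return logp_c_given_o, "p(C|O)"


def sensitivity(f, a: float, b: float, eps: float = 1e-6):
    """Finite-difference partial derivatives of f's value w.r.t. each input."""
    base = f(a, b)[0]
    d_a = (f(a + eps, b)[0] - base) / eps
    d_b = (f(a, b + eps)[0] - base) / eps
    return d_a, d_b


# Hypothetical scores: p(O|C) is currently the worse direction.
value, active = min_reward(-2.0, -1.2)
d_pogc, d_pcgo = sensitivity(min_reward, -2.0, -1.2)
print(value, active, round(d_pogc, 3), round(d_pcgo, 3))
```

Improving the inactive direction changes nothing until it becomes the smaller one, so the update signal keeps switching between the two objectives.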

Reward 3: p(C|O) with a self-perplexity gate

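The exploits in Rewards 1-2 all use high-frequency tokens. What if we keep p(C|O) as the reward but block the obvious exploits? The gate: if the unconditional likelihood p(C) of the output is too high, meaning the text is generically predictable even without the original, replace the reward with -10.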
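A minimal sketch of such a gate, assuming both scores are length-normalized log-probs. The function name and `gate_threshold` are hypothetical; the cutoff used in the actual runs is not given here, so the value below is only a placeholder.

```python
def gated_reward(
    logp_c_given_o: float,
    logp_c: float,
    gate_threshold: float,
    penalty: float = -10.0,
) -> float:
    """p(C|O) reward with a hard gate on unconditional likelihood.

    logp_c_given_o: log p(C|O), the score being optimized
    logp_c:         log p(C) with no original in context
    gate_threshold: hypothetical cutoff (placeholder value in the demo below)
    """
    if logp_c > gate_threshold:
        return penalty  # output is too generic: flat penalty
    return logp_c_given_o


THRESHOLD = -1.0  # placeholder, not the value used in the experiments

blocked = gated_reward(-0.3, logp_c=-0.2, gate_threshold=THRESHOLD)
passed = gated_reward(-3.4, logp_c=-4.0, gate_threshold=THRESHOLD)
# Just under the cutoff: the gate no longer fires and the good score goes through.
borderline = gated_reward(-0.5, logp_c=-1.01, gate_threshold=THRESHOLD)
print(blocked, passed, borderline)
```

The borderline case shows the weakness of a hard cutoff: any output that sits slightly on the allowed side is treated exactly like a clearly specific one.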

Step	Faithfulness	Coherence	Reward	Chars
10	2/5	2/5	-3.85	1370
20	2/5	3/5	-3.38	1173
30	4/5	4/5	-3.66	1342
40	2/5	1/5	-3.98	986
50	1/5	1/5	-3.31	724
70	1/5	1/5	-8.03	1032

Reward 3: Self-perplexity gate — brief success at step 30

This unlocked our highest judge score so far. Step 30 scores 4/5 faithfulness - “accurately captures the key issue, correct file location, correct fix.” But by step 50 the model learns to stay just below the threshold while still gaming the conditional scorer.

The gate is the right idea, wrong implementation - a binary threshold when you need a continuous signal.

Reward 4: PMI

The gate’s partial success pointed to a potential fix: instead of gating generic tokens with a strict binary threshold, penalize them continuously.

reward = log p(C|O) - log p(C)

If a token is equally probable with or without the original, it scores zero. Only tokens that become more predictable because of this particular original score positively.
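By the chain rule, the sequence-level difference above is a sum of per-token differences, which makes the "generic tokens score zero" property easy to see. A minimal sketch follows; averaging over tokens rather than summing is an assumption, and the function name and log-probs are illustrative.

```python
from typing import Sequence


def pmi_reward(cond_logprobs: Sequence[float], uncond_logprobs: Sequence[float]) -> float:
    """Mean per-token PMI: log p(c_t | O, c_<t) - log p(c_t | c_<t)."""
    if not cond_logprobs or len(cond_logprobs) != len(uncond_logprobs):
        raise ValueError("need two non-empty, equal-length log-prob lists")
    diffs = [c - u for c, u in zip(cond_logprobs, uncond_logprobs)]
    return sum(diffs) / len(diffs)


# Delimiter spam: equally likely with or without the original.
generic = pmi_reward([-0.1, -0.1, -0.1], [-0.1, -0.1, -0.1])
# Identifiers that are only predictable once the original is in context.
specific = pmi_reward([-0.5, -0.7, -0.6], [-6.0, -5.5, -7.0])
print(generic, round(specific, 2))
```

Each run needs two forward passes of the scorer per candidate: one with the original in context and one without.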

This is Pointwise Mutual Information (PMI). Li et al. (NAACL 2016) popularized it as the “Maximum Mutual Information” (MMI) objective. Li, Monroe et al. (EMNLP 2016) used it as an RL reward. I reimplemented a decade-old idea.

Results across 4 seeds (75-126 steps each)

Step	Mean faithfulness (n=4)	Range	Reward
Baseline	2.1	1-3	-
10	3.0	2-4	+0.67
20	2.0	2-2	+0.50
30	3.0	2-4	+1.00
40	1.5	1-2	+0.44
50	1.5	1-2	+0.87
60	1.8	1-2	+0.81
70	1.5	1-2	+1.67

Reward 4: PMI — no adversarial collapse, but slow drift

What PMI gets right: Zero adversarial collapse in any seed. No identifier repetition, no gibberish, no delimiter spam. Compression length stays healthy (850-1260 chars). Mean faithfulness at steps 10 and 30 (3.0) is above the 2.1 baseline, although the per-seed range of 2-4 shows that not every seed clears it.

What PMI doesn’t solve: Mean quality drops below baseline by step 40 and stays there. Reward trends upward, noisily (+0.67 at step 10, +1.67 at step 70), while quality falls (3.0 → 1.5) - reward hacking in slow motion. The model drifts toward meta-commentary and broken formatting. Real words, not adversarial, but not useful.

Caveat: Under GRPO’s within-group normalization, the log p(C) subtraction may be partially absorbed. I haven’t run the ablation that would isolate PMI’s contribution.
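A toy illustration of what "absorbed" means here, assuming GRPO-style advantages that standardize rewards within each group (implementations differ on details such as population vs. sample standard deviation). All numbers are hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps: float = 1e-8):
    """Standardize rewards within one sampled group (GRPO-style)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


logp_c_given_o = [-1.2, -0.9, -1.5, -1.1]  # hypothetical group of 4 samples

# Case 1: log p(C) is about the same for every sample in the group.
flat_logp_c = [-3.0, -3.0, -3.0, -3.0]
pmi_flat = [a - b for a, b in zip(logp_c_given_o, flat_logp_c)]

# Case 2: log p(C) varies across samples in the group.
varied_logp_c = [-3.0, -0.5, -4.0, -2.0]
pmi_varied = [a - b for a, b in zip(logp_c_given_o, varied_logp_c)]

adv_plain = group_advantages(logp_c_given_o)
adv_flat = group_advantages(pmi_flat)
adv_varied = group_advantages(pmi_varied)
print([round(a, 2) for a in adv_plain])
print([round(a, 2) for a in adv_flat])
print([round(a, 2) for a in adv_varied])
```

When log p(C) is constant within a group, the PMI advantages match the plain p(C|O) advantages, so the subtraction has no effect on the update. It only matters to the extent that log p(C) differs between samples of the same group, which is what the missing ablation would measure.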

Reward 5: p(O|C) - Back to Poirot’s intuition

After seeing four variations of p(C|O) all Goodhart in different ways, I went back to the direction Poirot suggested all along: “do the notes make the novel unsurprising?”

Results across 3 seeds

Step	Mean faithfulness (n=3)	Range	Mean reward
Baseline	2.1	1-3	-
10	2.7	2-3	-0.96
20	2.7	2-4	-1.43
30	3.7	3-4	-1.14
40	1.7	1-2	-1.14
50	2.0	1-3	-1.59
60	1.0	1-1	-1.24
70	1.3	1-2	-1.14

Reward 5: p(O|C) — honest signal, chaotic optimization

Something different happens here. The reward doesn’t smoothly increase while quality collapses. Instead, reward roughly correlates with quality across all 3 seeds - better rewards correspond to better compressions, worse rewards to worse ones.

p(O|C) is the only reward where the signal is honest across multiple seeds. It doesn’t mislead - consistent with Part 1’s finding that p(O|C) is an informative evaluator (32/48, p=0.015).
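As a sanity check on the paired-win numbers, here is a one-sided exact sign test. That Part 1 used this particular test is an assumption; it is simply the most direct test for "k wins out of n pairs".

```python
from math import comb


def sign_test_p(wins: int, n: int) -> float:
    """One-sided exact sign test: P(X >= wins) for X ~ Binomial(n, 0.5)."""
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2**n


p_32 = sign_test_p(32, 48)  # p(O|C) paired wins
p_44 = sign_test_p(44, 48)  # p(C|O) paired wins
print(f"32/48: p = {p_32:.4f}")
print(f"44/48: p = {p_44:.2e}")
```

Under that assumption, 32/48 comes out at roughly 0.015 and 44/48 far below 0.001, consistent with the figures quoted from Part 1.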

Step 30 replicates at 3.7/5 mean - every seed produces quality well above baseline here. But all three seeds also collapse to 1/5 by step 60.

The problem is that the model can’t consistently optimize it. There’s no easy exploit - repeating a class name doesn’t help predict 2048 tokens of original. The model thrashes rather than converging. Training dynamics are chaotic (KL spikes to 12.9, and the character count oscillates wildly). The noisy gradients (group size 4, 1 trajectory per step) don’t carry enough signal to solve this hard optimization problem.
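The group-size part of this can be illustrated with a small simulation. It draws rewards i.i.d. from a unit Gaussian, which is purely an illustrative assumption and not a model of the actual reward distribution, and measures how noisy the group-mean baseline is for different group sizes.

```python
import random
from statistics import mean, pstdev


def baseline_noise(group_size: int, trials: int = 2000, seed: int = 0) -> float:
    """Std. dev. of the group-mean reward across many simulated groups."""
    rng = random.Random(seed)
    group_means = [
        mean([rng.gauss(0.0, 1.0) for _ in range(group_size)])
        for _ in range(trials)
    ]
    return pstdev(group_means)


noise_4 = baseline_noise(4)
noise_16 = baseline_noise(16)
print(f"group size 4:  {noise_4:.3f}")
print(f"group size 16: {noise_16:.3f}")
```

The baseline noise shrinks roughly as one over the square root of the group size, so going from 4 to 16 samples about halves it.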

But this seems like a different kind of failure - not Goodhart (reward deceiving you) but underfitting (reward too hard to optimize with this setup).

The fundamental asymmetry: in p(C|O), the model controls the tokens being evaluated (C) - it can tweak them to directly manipulate the score. In p(O|C), the model is evaluated on how well its output predicts a massive, fixed 2048-token target (O). Unless the summary is already highly accurate, the gradient signal pointing toward “how to fix this summary to better predict token 1,402 of the original” is effectively noise. Again we find continuation is easier than compression.

This means SFT warmup may be mandatory. The model must start inside the attractor basin of “decent compressions” for the p(O|C) gradients to provide directional guidance rather than chaotic thrashing.

Summary

Faithfulness across all reward formulations

Reward	Seeds	Peak quality	Collapse to 1/5	Honest signal?	Failure mode
p(C|O)	1	2/5 (step 10)	Step 30	No (reward misleads)	Exploit ladder
p(C|O) + 5x KL	1	2/5 (step 10)	Slower	No (same pattern)	Same hallucination
min(both)	1	2/5 (step 10)	Step 20	No (same pattern)	Universal tokens
Self-perp gate	1	4/5 (step 30)	Step 50	Partially	Below-threshold gaming
PMI	4	3.0/5 mean	Never	No (drifts smoothly)	Semantic drift (never gibberish)
p(O|C)	3	3.7/5 mean (step 30)	Step 60	Yes (replicates)	Underfitting

PMI vs p(O|C): Complementary properties

These two rewards represent opposite tradeoffs. Looking at them side-by-side:

Step	PMI faithfulness (4 seeds)	p(O|C) faithfulness (3 seeds)
10	3.0	2.7
20	2.0	2.7
30	3.0	3.7
40	1.5	1.7
50	1.5	2.0
60	1.8	1.0
70	1.5	1.3

PMI provides stability. It never collapses to gibberish: the worst mean score is 1.5/5, and although individual seeds do bottom out at 1/5 (range 1-2), the output is still fragmented real words. But its reward trends upward (+0.67 → +1.67) while quality drops (3.0 → 1.5). You can’t trust the reward curve.

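p(O|C) provides honesty. The reward roughly tracks quality: when output is good (step 30, reward -1.14), it scores well; when output is bad (step 60, reward -1.24), it scores worse. The gap is small, though, and the mean reward is also -1.14 at steps 40 and 70, where mean faithfulness is 1.7 and 1.3, so the step-level means support this only loosely. And the output collapses to actual gibberish by step 60.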

Property	PMI	p(O|C)
Peak quality	3.0/5	3.7/5
Worst quality	1.5/5 (never gibberish)	1.0/5 (gibberish)
Reward honest?	No (climbs while quality drops)	Yes (tracks quality)
Reward improves?	Yes (+0.67 → +1.67)	No (flat, thrashes)
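One rough way to put a number on "reward honest?" is the correlation between reward and faithfulness across the reported steps. The sketch below uses only the step-level means from the Reward 4 and Reward 5 tables (steps 10-70); per-seed values are not listed in this post, so this is a coarse check on seven points per reward, not a proper test.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Step-level values copied from the tables above (steps 10 through 70).
pmi_reward = [0.67, 0.50, 1.00, 0.44, 0.87, 0.81, 1.67]
pmi_faith = [3.0, 2.0, 3.0, 1.5, 1.5, 1.8, 1.5]
pogc_reward = [-0.96, -1.43, -1.14, -1.14, -1.59, -1.24, -1.14]
pogc_faith = [2.7, 2.7, 3.7, 1.7, 2.0, 1.0, 1.3]

r_pmi = pearson(pmi_reward, pmi_faith)
r_pogc = pearson(pogc_reward, pogc_faith)
print(f"PMI:    r = {r_pmi:+.2f}")
print(f"p(O|C): r = {r_pogc:+.2f}")
```

On these seven points the correlation is slightly negative for PMI and slightly positive for p(O|C), and weak in both cases. The signs agree with the table above, but the per-seed trajectories would be needed to say anything stronger.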

What’s next

PMI gives stable training but drifts silently; p(O|C) gives honest signal but is hard to optimize from a cold start. To address underfitting, we can experiment with:

SFT warmup → light RL with p(O|C). Start from a model that already generates decent compressions (trained on frontier-model outputs). From inside that basin, p(O|C)’s honest gradient can refine quality without thrashing.
Larger group size (16-64). The noisy gradient problem (group size 4, 1 trajectory/step) hamstrings all reward formulations equally. More samples = better signal regardless of which reward you use.
Iterative distillation. Avoid online RL entirely. Generate N=16, score with p(O|C), filter, retrain. Each round is a fresh SFT — no optimization dynamics to fight, no Goodhart drift to worry about.
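The iterative distillation option is essentially best-of-N filtering followed by SFT. Here is a minimal sketch of one round; `distillation_round`, `generate`, `score`, and `keep_top` are hypothetical names, and the toy stand-ins exist only so the example runs end to end. A real run would sample from the compressor and score with the frozen LM's log p(O|C).

```python
from typing import Callable, List, Tuple


def distillation_round(
    originals: List[str],
    generate: Callable[[str, int], List[str]],
    score: Callable[[str, str], float],
    n_samples: int = 16,
    keep_top: int = 1,
) -> List[Tuple[str, str]]:
    """Sample candidates per original and keep the top scorers as SFT pairs."""
    sft_pairs = []
    for original in originals:
        candidates = generate(original, n_samples)
        ranked = sorted(candidates, key=lambda c: score(original, c), reverse=True)
        sft_pairs.extend((original, c) for c in ranked[:keep_top])
    return sft_pairs


# Toy stand-ins (not the real compressor or scorer).
def toy_generate(original: str, n: int) -> List[str]:
    words = original.split()
    return [" ".join(words[:k]) for k in range(1, n + 1)]


def toy_score(original: str, candidate: str) -> float:
    vocab = set(original.split())
    return len(set(candidate.split()) & vocab) / len(vocab)


pairs = distillation_round(["fix logger filter bug"], toy_generate, toy_score, n_samples=3)
print(pairs)
```

The returned pairs would feed a fresh SFT run, after which the next round samples from the retrained model.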