Austin
I work on the measurement problem in AI.
Currently a PM at Google Search, working on what makes a "good" AI suggestion. Background spans aerospace, ML engineering, and product. The common thread across roles: defining what systems get optimized for, and watching what happens when those definitions break. Looking to bring this work to frontier AI development, where measurement decisions shape what these systems become.
Writing
Poirot's Judgement
Part 1: An information-theoretic scorer for text compression that catches faithfulness failures at p<0.001 on real agent trajectories — and why the language model matters more than the formula.
Poirot Confronts Goodhart
Part 2: Using RL to train a text compressor — five reward formulations, five failure modes, and why the best evaluator makes the worst training signal.