piyushjaipuriyar.dev
FEB 21, 2025 · 8 min read

What I want from evals.

After running 500 evals, the patterns that matter.

Every new model comes with a benchmark comparison. I have learned to read these the way I read press releases.

The benchmark problem

Benchmarks measure performance on benchmarks. The tasks are static, the scoring is automated, and the training data is not always cleanly separated from the evaluation sets, so contamination can quietly inflate scores. A model can improve on MMLU without improving on the thing you actually care about.

What I actually test

I have a small set of tasks I run on every new model: a code review of a real pull request, a summary of a long document I know well, and a multi-step reasoning problem from my actual work. I run each task three times and read the outputs.
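The whole harness is just a loop over prompt files and runs. Here is a minimal sketch, assuming the OpenAI Python SDK and a hypothetical tasks/ directory with one prompt file per task; the model name and paths are placeholders, not the exact ones I use:

```python
# Minimal sketch: run each task prompt several times and save the raw outputs.
from pathlib import Path
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"   # placeholder: whichever new model is being tested
RUNS = 3           # repeat each task to separate signal from noise

for task_file in sorted(Path("tasks").glob("*.txt")):
    prompt = task_file.read_text()
    for run in range(RUNS):
        response = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": prompt}],
        )
        out = Path("outputs") / f"{task_file.stem}_{MODEL}_run{run}.txt"
        out.parent.mkdir(exist_ok=True)
        out.write_text(response.choices[0].message.content)
        # The outputs get read by hand, not scored automatically.
```

Three runs per task is cheap, and it is usually enough to tell whether an output was a fluke or the model's typical behavior.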

The best eval is the task you actually need to do, run enough times to separate signal from noise.

The thing I look for

Calibration. Does the model know what it doesn’t know? Does it say “I’m not certain” when it should? In almost every production context, a confidently wrong answer does more damage than an honest admission of uncertainty.
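There is no automated score for this in my setup, but a rough calibration table is easy to keep. A minimal sketch, assuming I jot down a stated-confidence estimate (mapped to 0–1 by hand) and a correct/incorrect flag for each graded answer; the records below are placeholders:

```python
from collections import defaultdict

# Placeholder records: one per graded answer, with the model's stated
# confidence and whether the answer turned out to be correct.
records = [
    {"confidence": 0.9, "correct": False},  # the dangerous case: confident and wrong
    {"confidence": 0.9, "correct": True},
    {"confidence": 0.5, "correct": True},
    {"confidence": 0.3, "correct": False},
]

buckets = defaultdict(lambda: [0, 0])  # confidence bucket -> [answers, correct]
for r in records:
    b = round(r["confidence"], 1)
    buckets[b][0] += 1
    buckets[b][1] += int(r["correct"])

for conf in sorted(buckets):
    n, hits = buckets[conf]
    # A well-calibrated model is right about as often as it claims to be.
    print(f"stated {conf:.1f} -> actually correct {hits / n:.2f} over {n} answers")
```

If the 0.9 bucket turns out to be right half the time, that is exactly the overconfidence problem this section is about.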

#AI #LLM #Evals #Tech