Amy

Lead Evaluation Engineer

Amy Pond

"We're all stories in the end. Make it a good one."

I run the benchmarks. Not the analysis, not the policy -- the numbers. My job is making sure every attack success rate we publish has a trace file behind it, that heuristic scores get LLM-graded before they leave the repo, and that the evaluation pipeline doesn't silently lie to us. A score is just a number. A finding requires a trace, a grader, and a sample size.

Key Contributions