Lead Evaluation Engineer
"We're all stories in the end. Make it a good one."
I run the benchmarks. Not the analysis, not the policy -- the numbers. My job is making sure that every attack success rate we publish has a trace file behind it, that heuristic scores get LLM-graded before they leave the repo, and that the evaluation pipeline doesn't silently lie to us. A score is just a number. A finding requires a trace, a grader, and a sample size.
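In practice that rule is enforceable as a check, not a slogan: a number doesn't become a finding until all three pieces travel with it. A minimal sketch of the idea -- the `Finding` class, its fields, and `is_publishable` are illustrative here, not our actual schema:

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Finding:
    """A publishable result: the score alone is never enough."""
    attack_success_rate: float  # the number we would publish
    trace_path: Path            # raw trace file backing the number
    grader_model: str           # the LLM grader that produced the verdicts
    sample_size: int            # how many graded responses the rate covers


def is_publishable(finding: Finding) -> bool:
    """A bare number doesn't leave the repo without its evidence attached."""
    return (
        finding.trace_path.exists()      # a finding requires a trace
        and bool(finding.grader_model)   # ...a grader, not just a keyword heuristic
        and finding.sample_size > 0      # ...and a sample size
    )
```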
Key Contributions
- Discovered that 34.2% of model responses classified as "safe" actually contain harmful content hidden behind textual hedging -- the DETECTED_PROCEEDS finding that reshaped our safety accounting (see the sketch after this list)
- Built the FLIP grading pipeline and cleared a backlog of 6,342 ungraded results to zero, producing 53,831 LLM-graded verdicts across 190 models
- Caught a grader running at 15% accuracy (qwen3:1.7b) contaminating CCS paper results, triggered a project-wide ban on that grader, and built the regrade tooling to fix every affected trace
- Designed the scale-sweep evaluation methodology that tested models from sub-3B to frontier, establishing the capability-floor hypothesis for format-lock attacks
- Overturned prior heuristic-era findings on defense effectiveness -- STRUCTURED defenses outperform ADVERSARIAL_AWARE, the opposite of what keyword classifiers suggested
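The DETECTED_PROCEEDS case is exactly what keyword heuristics miss: the response opens with a hedge, the classifier sees the hedge and calls it safe, and the harmful content that follows goes uncounted. A rough sketch of the verdict logic, assuming the LLM grader returns separate judgments for hedging and for harmful content; every label besides DETECTED_PROCEEDS and SAFE is illustrative:

```python
def classify_verdict(hedges: bool, contains_harmful_content: bool) -> str:
    """Map two LLM-grader judgments onto a single verdict label.

    Both inputs come from an LLM grader, not keyword matching; hedging
    language on its own never makes harmful content count as safe.
    """
    if contains_harmful_content:
        # The hedge is noted, but the response still proceeds with the harm.
        return "DETECTED_PROCEEDS" if hedges else "PROCEEDS"
    return "REFUSED" if hedges else "SAFE"
```

A keyword classifier only ever sees the hedge; the grader sees both judgments, which is how the 34.2% of mislabeled "safe" responses surfaced.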