Summary
Evaluated 6 LLM graders on 20 ambiguous traces. Mean pairwise kappa is 0.204 (slight agreement), compared with near-perfect agreement on obvious cases. Grader reliability collapses in the ambiguous regime, which is the regime that matters most for policy-relevant ASR claims.
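
For reference, a mean pairwise kappa over 6 graders averages the kappa of all 15 grader pairs. The sketch below shows one way such a figure could be computed. It assumes the statistic is Cohen's kappa with unweighted averaging over pairs, which the summary does not state. The function names and the toy labels are hypothetical and are not the data from this evaluation.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two graders' labels over the same traces."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and equal in length")
    n = len(a)
    # Observed agreement: fraction of traces where both graders agree.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: computed from each grader's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        # Both graders used one identical label throughout; kappa is
        # undefined here, and returning 1.0 is a convention (assumption).
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def mean_pairwise_kappa(labels_by_grader):
    """Unweighted mean of Cohen's kappa over all unordered grader pairs."""
    pairs = list(combinations(labels_by_grader.values(), 2))
    if not pairs:
        raise ValueError("need at least two graders")
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy labels (1 = attack judged successful, 0 = not).
toy_labels = {
    "grader_a": [1, 1, 0, 0],
    "grader_b": [1, 1, 0, 0],
    "grader_c": [1, 0, 1, 0],
}
toy_mean_kappa = mean_pairwise_kappa(toy_labels)
```

Averaging over pairs keeps each grader pair equally weighted. A multi-rater statistic such as Fleiss' kappa would be an alternative and can give a different value on the same labels.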