Published
Report 257 Research — Empirical Study

Summary

We evaluated 6 LLM graders on 20 ambiguous traces. Mean pairwise kappa is 0.204 (slight agreement), compared with near-perfect agreement on obvious cases. Grader reliability collapses in the ambiguous regime, which is the regime that matters most for policy-relevant ASR claims.
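
For readers unfamiliar with the statistic, here is a minimal sketch of how a mean pairwise kappa can be computed. It assumes Cohen's kappa on categorical labels, averaged with equal weight over every pair of graders (6 graders give 15 pairs). The summary does not state which kappa variant or label scheme the study used, so the function names and the choice of Cohen's kappa are illustrative assumptions, not the report's method.

```python
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two graders' labels over the same traces.

    Assumption: labels are categorical (e.g. 0/1 for attack failed/succeeded).
    """
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(a)
    labels = set(a) | set(b)
    # Observed agreement: fraction of traces where both graders agree.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: from each grader's marginal label frequencies.
    p_e = sum((a.count(label) / n) * (b.count(label) / n) for label in labels)
    if p_e == 1.0:
        # Both graders gave one identical label to every trace; kappa is
        # formally undefined here. Returning 1.0 is a convention chosen
        # for this sketch only.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def mean_pairwise_kappa(ratings):
    """Average Cohen's kappa over all unordered pairs of graders.

    `ratings` is a list with one label list per grader (hypothetical layout).
    """
    pairs = list(combinations(ratings, 2))
    if not pairs:
        raise ValueError("need at least two graders")
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)
```

Because kappa subtracts chance agreement, two graders can agree on half of the traces and still score 0 when their label frequencies make that much agreement expected by chance.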
