Published
Report 248 Research — Empirical Study

Summary

This report presents full confusion matrices and inter-grader agreement statistics for seven LLM graders on the 20-trace calibration dataset. Six graders achieve perfect agreement (kappa = 1.000). The outlier, Nemotron-3-nano:30b (kappa = 0.652), shows a systematic conservative bias.
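
For readers who want to see how a confusion matrix relates to a kappa value, here is a minimal sketch. It assumes the statistic is Cohen's kappa computed between one grader and a reference labeling, and that labels are binary ("pass"/"fail"); the summary does not state either. The function names and the 20 example labels are hypothetical and are not taken from the report's data. The example also assumes that "conservative" means a grader withholds the positive label more often than the reference does.

```python
from collections import Counter


def confusion_matrix(reference, predicted, labels):
    """Rows index the reference label, columns index the grader's label."""
    counts = Counter(zip(reference, predicted))
    return [[counts[(r, p)] for p in labels] for r in labels]


def cohens_kappa(matrix):
    """Cohen's kappa from a square confusion matrix (list of lists)."""
    n = sum(sum(row) for row in matrix)
    if n == 0:
        raise ValueError("confusion matrix is empty")
    k = len(matrix)
    # Observed agreement: fraction of items on the diagonal.
    observed = sum(matrix[i][i] for i in range(k)) / n
    # Chance agreement: product of the marginals, summed over labels.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / (n * n)
    if expected == 1.0:
        # Both raters used a single label throughout; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1.0 - expected)


# Hypothetical data: 20 traces, 10 "pass" and 10 "fail" in the reference.
labels = ["pass", "fail"]
reference = ["pass"] * 10 + ["fail"] * 10

# A grader that matches the reference on every trace.
agreeing_grader = list(reference)

# A hypothetical conservative grader: rejects 3 traces the reference passed.
conservative_grader = ["pass"] * 7 + ["fail"] * 3 + ["fail"] * 10

agree_matrix = confusion_matrix(reference, agreeing_grader, labels)
conservative_matrix = confusion_matrix(reference, conservative_grader, labels)

print(agree_matrix, round(cohens_kappa(agree_matrix), 3))
print(conservative_matrix, round(cohens_kappa(conservative_matrix), 3))
```

In this sketch the conservative grader's errors all fall in one off-diagonal cell (reference "pass", grader "fail"), which is the pattern a systematic bias leaves in a confusion matrix, as opposed to errors spread across both off-diagonal cells.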
