Summary
Full confusion matrices and inter-grader agreement statistics for seven LLM graders on the 20-trace calibration dataset. Six graders achieve perfect agreement (kappa=1.000). The seventh, Nemotron-3-nano:30b, is the outlier (kappa=0.652), with a systematic conservative bias.
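
To make the kappa figures concrete, here is a minimal sketch of how a confusion matrix and an agreement score can be computed for one grader against a reference labelling. It assumes Cohen's kappa (the summary does not say which kappa variant was used) and binary pass/fail labels. The `reference` and `conservative` label lists are invented for illustration and are not the calibration data, so the resulting value is not expected to match the reported 0.652.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both graders.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each grader's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)
    if p_e == 1.0:
        # Both graders used one identical label throughout; kappa is formally
        # undefined here, and returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels for 20 traces (illustrative only, not the real dataset).
reference = ["pass"] * 10 + ["fail"] * 10
# A "conservative" grader, assumed here to mean it fails 4 traces the reference passes.
conservative = ["pass"] * 6 + ["fail"] * 14

# Confusion matrix as counts keyed by (reference_label, grader_label).
confusion = Counter(zip(reference, conservative))
kappa = cohen_kappa(reference, conservative)

print(dict(confusion))
print(round(kappa, 3))
```

With these made-up labels the grader disagrees with the reference on 4 of 20 traces, all in the pass-to-fail direction, which is the kind of one-sided pattern a confusion matrix exposes and a single kappa value does not.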