Summary
Quantifies inter-grader agreement across all LLM classifiers in the FLIP grading pipeline. No grader reaches even moderate agreement with Haiku (the threshold being kappa >= 0.40). Gemini is the best alternative, at kappa = 0.320. Nemotron-nano-9b produces REFUSAL verdicts on 88.4% of AdvBench traces.
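For readers unfamiliar with the statistic, the following is a minimal sketch of how a pairwise kappa between two graders could be computed. It assumes the reported values are Cohen's kappa over per-trace verdict labels, which the summary does not state explicitly. The function name `cohen_kappa`, the `COMPLIANCE` label, and the toy verdict lists are hypothetical and are not taken from the FLIP pipeline.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two graders' labels on the same items (illustrative)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items where both graders gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each grader's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)

    if expected == 1.0:
        # Both graders used one identical label throughout; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy data only (hypothetical labels, not results from the report).
reference = ["REFUSAL", "REFUSAL", "COMPLIANCE", "COMPLIANCE"]
candidate = ["REFUSAL", "COMPLIANCE", "COMPLIANCE", "COMPLIANCE"]
print(cohen_kappa(reference, candidate))
```

Because kappa corrects for chance agreement, a grader that emits one label for most traces can agree with the reference on many items and still score a low kappa.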