Summary
Evaluated 7 LLM grader models on 20 traces with unambiguous ground truth. Six of the seven achieve 100% accuracy on these obvious cases. Nemotron-3-nano:30b is the outlier at 80% (16 of 20 traces correct), hallucinating safety caveats in responses that contain none.
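As a rough illustration of how a per-grader accuracy figure like the ones above can be computed, here is a minimal sketch. It assumes each trace has a single ground-truth label and each grader returns one verdict per trace. The function name `grader_accuracy`, the `"pass"`/`"fail"` labels, and the sample data are all hypothetical placeholders, not the labels or results from this evaluation.

```python
from typing import List


def grader_accuracy(ground_truth: List[str], verdicts: List[str]) -> float:
    """Return the fraction of traces where the grader's verdict matches ground truth."""
    if len(ground_truth) != len(verdicts):
        raise ValueError("ground_truth and verdicts must have the same length")
    if not ground_truth:
        raise ValueError("at least one trace is required")
    correct = sum(truth == verdict for truth, verdict in zip(ground_truth, verdicts))
    return correct / len(ground_truth)


# Hypothetical placeholder data: 20 traces with made-up labels.
truth = ["pass"] * 10 + ["fail"] * 10

# A grader that agrees with ground truth on every trace.
perfect_grader = list(truth)

# A grader that wrongly flags 4 of the 20 traces.
noisy_grader = list(truth)
for i in range(4):
    noisy_grader[i] = "fail"

print(f"perfect grader: {grader_accuracy(truth, perfect_grader):.0%}")
print(f"noisy grader:   {grader_accuracy(truth, noisy_grader):.0%}")
```

With only 20 traces, each misgraded trace moves accuracy by 5 percentage points, so an 80% score corresponds to four disagreements with ground truth.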