Grader Comparison Guide

Technical guide on automated grading tiers (Heuristic vs. LLM) for safety benchmarking

evaluation Last updated: February 6, 2026

Grader Comparison Guide

This guide describes the different automated grading tiers used in the FERT framework, providing researchers with the necessary information to choose the right approach for their benchmarking.

1. Grading Tier Overview

TierNameMethodSpeedCostAccuracy
Tier 1Code DetectorRegex/Keyword matchingInstant$0~70-85%
Tier 2LLM JudgeSingle-call semantic evaluationSlow$0.01+~90-95%
Tier 3Consensus3-model majority voteVery Slow$0.03+~98%+

2. Tier 1: Code-Based Detector

The code-based detector uses heuristic pattern matching to flag success or refusal.

How it Works

  • Refusal Detection: Searches for common phrases like “I cannot”, “Against my guidelines”, “As an AI”, etc.
  • Jailbreak Detection: Searches for structural indicators of compliance, such as “Step 1:”, “Here is the code”, or the presence of requested formatting (JSON/YAML).

Strengths

  • Zero Cost: Ideal for large-scale initial screening of thousands of results.
  • Deterministic: Always gives the same result for the same text.
  • Explainable: High visibility into why a match occurred via result flags.

Weaknesses

  • Style Over Substance: Flags “Helpful” style as a jailbreak even if the model is refusing (e.g., “Step 1: Understand why I cannot help you”).
  • Misses Subtle Refusals: Often fails to catch nuanced hedging or “refuse-then-comply” patterns.
  • High False Positive Rate: Keyword matching detects response style (helpful, step-by-step), not semantic harm.

The LLM Judge uses a sophisticated model (typically Gemini 1.5 Pro or Flash) to read the prompt and response and make a semantic judgment.

How it Works

  • The judge is provided with a rubric defining Compliance, Refusal, and Partial.
  • It reasons about the intent of the response rather than just keywords.

Strengths

  • Semantic Understanding: Can detect when a model is following the spirit of an attack vs. just the format.
  • Handles Hedging: Excellent at correctly classifying “Refuse-then-comply” patterns.
  • Context Aware: Knows the difference between a real jailbreak and a discussion about jailbreaks.

Weaknesses

  • API Latency: Evaluation of a full benchmark can take minutes or hours.
  • Financial Cost: Costs scale linearly with benchmark size.
  • Recursive Risk: Using an LLM to judge an LLM can sometimes introduce systematic biases (e.g., preference for longer responses).

4. Decision Guidance: Which to Use?

ScenarioRecommended Grader
Rapid IterationCode Detector (Screening)
Large-Scale Trends (10K+ traces)Code Detector + 10% LLM Judge Validation
Standard Benchmark RunsLLM Judge (Single)
Final Research Findings / PapersLLM Judge (Single) or Consensus
Calibrating new attack classesHuman-Verified Consensus

5. Known Issues & Lessons Learned

Before relying on automated grading, researchers should be aware of these findings:

  • Unvalidated Heuristics: Heuristics must be manually spot-checked on every new model family.
  • Disclaimers ≠ Refusals: A response can contain a disclaimer and still be a 100% successful jailbreak.
  • Keyword False Positives: Keyword matching often detects helpfulness rather than harm.