Report 330 Technical Analysis

Executive Summary

This report audits the grading infrastructure across the entire jailbreak corpus (133,416 results, 236 models). It quantifies grading coverage, inter-method agreement, calibration set adequacy, and grader quality.

Key findings:

  • 51,160 results (38.3%) have LLM verdicts. Non-OBLITERATUS coverage is 70.6% (8,814/12,485). 3,671 non-OBLITERATUS results lack any LLM verdict.
  • 42,628 results are dual-graded (both LLM and heuristic). Raw exact agreement: 91.7%. Binary Cohen’s kappa: 0.834.
  • Gold-standard calibration set: 76 samples — insufficient for CCS claims about 5-category FLIP classification reliability.
  • 50 qwen3:1.7b verdicts remain and should be regraded (known 15% accuracy).

1. Grading Coverage

1.1 Corpus-Level

Category          Total     LLM-Graded   % Graded   Ungraded
OBLITERATUS       120,931   42,346       35.0%      78,585
Non-OBLITERATUS   12,485    8,814        70.6%      3,671
Total             133,416   51,160       38.3%      82,256

The OBLITERATUS ungraded count is not a priority — these are abliterated model results with predictable COMPLIANCE verdicts. The non-OBLITERATUS backlog (3,671) is the actionable gap, with 67.7% in benchmark_traces and 18.9% in AdvBench.
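
The coverage table is a simple aggregation over the results store. A minimal pandas sketch, assuming a hypothetical `results.parquet` with `category` and `llm_verdict` columns (`llm_verdict` null where no LLM grade exists):

```python
import pandas as pd

# Hypothetical schema: one row per result; llm_verdict is NaN when ungraded.
results = pd.read_parquet("results.parquet")

coverage = (
    results.assign(graded=results["llm_verdict"].notna())
           .groupby("category")
           .agg(total=("graded", "size"), llm_graded=("graded", "sum"))
)
coverage["pct_graded"] = 100 * coverage["llm_graded"] / coverage["total"]
coverage["ungraded"] = coverage["total"] - coverage["llm_graded"]
print(coverage)
```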

1.2 LLM Verdict Distribution (Non-OBLITERATUS, n=8,814)

Verdict                 Count   %
REFUSAL                 1,993   22.6%
NOT_GRADEABLE           1,677   19.0%
BENIGN_QUERY            1,646   18.7%
COMPLIANCE              1,213   13.8%
ERROR                   978     11.1%
PARTIAL                 728     8.3%
HALLUCINATION_REFUSAL   529     6.0%

30.7% of non-OBLITERATUS LLM verdicts are non-evaluable (NOT_GRADEABLE + ERROR + PARSE_ERROR + INFRASTRUCTURE_ERROR; the latter two total 50 results and fall below the table's cutoff).


2. Grader Agreement

2.1 Dual-Graded Results

Raw exact agreement: 91.7% (39,098/42,628). Binary Cohen's kappa: 0.834 (almost-perfect on the Landis–Koch scale; see the caveat below).

Important context: The 0.834 figure includes the OBLITERATUS corpus, where both graders trivially agree. The canonical kappa of 0.126 reported in CANONICAL_METRICS.md was computed on n=1,989 independently dual-graded non-OBLITERATUS results; that 0.126 figure is the correct one for CCS claims.
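
Both kappa figures are reproducible with scikit-learn once verdicts are collapsed to binary. A sketch, assuming the hypothetical schema above plus a `heuristic_verdict` column; the binary collapse rule shown is an assumption, not taken from this report:

```python
import pandas as pd
from sklearn.metrics import cohen_kappa_score

results = pd.read_parquet("results.parquet")  # hypothetical schema, as above
dual = results.dropna(subset=["llm_verdict", "heuristic_verdict"])

COMPLIANT = {"COMPLIANCE", "PARTIAL"}  # assumed collapse rule for binary kappa

def to_binary(verdict: str) -> int:
    return int(verdict in COMPLIANT)

# Corpus-wide kappa: inflated by OBLITERATUS rows where graders trivially agree.
kappa_all = cohen_kappa_score(dual["llm_verdict"].map(to_binary),
                              dual["heuristic_verdict"].map(to_binary))

# Canonical kappa: independently dual-graded non-OBLITERATUS results only.
nonobl = dual[dual["category"] != "OBLITERATUS"]
kappa_canonical = cohen_kappa_score(nonobl["llm_verdict"].map(to_binary),
                                    nonobl["heuristic_verdict"].map(to_binary))
```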

2.2 Key Asymmetry

The heuristic classifier calls COMPLIANCE on 1,164 results where the LLM grader calls REFUSAL (758) or HALLUCINATION_REFUSAL (406). This is consistent with heuristic classifiers systematically over-reporting compliance.
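
The asymmetry shows up directly in a cross-tabulation of the two graders' verdicts; a sketch on the same hypothetical columns:

```python
import pandas as pd

results = pd.read_parquet("results.parquet")  # hypothetical schema, as above
dual = results.dropna(subset=["llm_verdict", "heuristic_verdict"])

# Rows: heuristic verdict; columns: LLM verdict. Per the counts above, the
# COMPLIANCE row should show 758 in REFUSAL and 406 in HALLUCINATION_REFUSAL.
confusion = pd.crosstab(dual["heuristic_verdict"], dual["llm_verdict"])
print(confusion.loc["COMPLIANCE", ["REFUSAL", "HALLUCINATION_REFUSAL"]])
```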

2.3 LLM Classifiers in Use

Classifier                   Count    Notes
obliteratus-import           42,346   Reliable for abliterated models
anthropic/claude-haiku-4.5   4,179    Primary production grader
ollama (deepseek-r1:1.5b)    1,258    Local grading
qwen3:1.7b                   50       Known 15% accuracy; regrade required

3. Calibration Set Status

Gold standard: 76 samples, 4 models, binary labels only (jailbreak/safe). 84.2% agreement with the automated classifier [Wilson 95% CI: 74.7%, 90.6%].
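
The interval is reproducible from first principles. A sketch assuming 64/76 agreements (84.2%); it yields roughly [74.4%, 90.7%], a hair off the reported bounds, which likely reflects rounding or a slightly different implementation:

```python
from math import sqrt

def wilson_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion of k successes out of n."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - margin, center + margin

lo, hi = wilson_ci(64, 76)  # assumed: 84.2% agreement implies 64 of 76 samples
print(f"[{lo:.1%}, {hi:.1%}]")
```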

Critical gap: There is no human-labeled gold standard for the 5-category FLIP taxonomy (COMPLIANCE/PARTIAL/REFUSAL/HR/BENIGN_QUERY) used in all CCS paper ASR claims. The paper can defensibly claim “low inter-method agreement” but cannot claim specific grader accuracy rates without a FLIP gold standard.

3.1 Grader Meta-Evaluation

Grader behavior varies systematically, both across models and across case difficulty:

  • nemotron-3-nano:30b: REFUSAL bias
  • devstral-small-2:24b: REFUSAL bias
  • On ambiguous cases, grader agreement collapses to kappa = 0.204 (vs. 1.0 on obvious cases)

4. Recommendations

  1. Regrade 50 qwen3:1.7b verdicts with deepseek-r1:1.5b
  2. Grade the 694 AdvBench results; the entire dataset currently has zero LLM verdicts
  3. Create 100-sample FLIP gold standard with human labels across all 5 verdict categories
  4. Restart weekly calibration pipeline
  5. Never use qwen3:1.7b for grading (15% accuracy)
  6. CCS paper should disclose absence of human-labeled FLIP gold standard as a limitation

Report #330 | F41LUR3-F1R57 Adversarial AI Research
