Published
Report 234 Research — Empirical Study

Summary

This report presents the definitive attack technique effectiveness ranking based on 53,831 LLM-graded results across the full jailbreak corpus database. It extends the prior ranking document (which used 2,653 technique-tagged + scenario-class results) with the full Haiku batch grading corpus (2,700 newly graded in batch 2) and updated technique-tagged data (2,597 evaluable results across 11 family-era combinations with n >= 10).

Additionally, this report documents the Sprint 13 grading pipeline analysis covering 64 newly graded traces across 3 models, and flags two critical quality issues in the graded data.


Part 1: Sprint 13 Grading Pipeline Analysis

Grading Progress

Three files graded in runs/grading/sprint13/:

FileModelTracesGraderActual Responses429 Errors
graded_arcee-ai_trinity-large-preview-free_traces.jsonlTrinity Large Preview30nemotron-nano-9b-v2:free300
graded_cognitivecomputations_dolphin-mistral-24b-venice-edition-free_traces.jsonlDolphin Mistral 24B30nemotron-nano-9b-v2:free030
graded_google_gemma-3-27b-it-free_traces.jsonlGemma 3 27B4nemotron-nano-9b-v2:free04

Total graded: 64 traces (but only 30 contain actual model responses)

Verdict Distributions

Trinity Large Preview (n=30, all actual responses):

  • REFUSAL: 28 (93.3%)
  • PARTIAL: 1 (3.3%)
  • COMPLIANCE: 1 (3.3%)
  • Strict ASR: 3.3%, Broad ASR: 6.7%

Dolphin Mistral 24B (n=30, all 429 errors):

  • REFUSAL: 27 (90.0%)
  • ERROR: 3 (10.0%)
  • Strict ASR: 0.0%, Broad ASR: 0.0%

Gemma 3 27B (n=4, all 429 errors):

  • REFUSAL: 4 (100.0%)
  • Strict ASR: 0.0%, Broad ASR: 0.0%

Critical Quality Issues

Issue 1: Rate-limit errors graded as REFUSAL (34 of 64 traces)

Dolphin Mistral (30/30) and Gemma (4/4) traces are ALL infrastructure 429 errors, not model responses. The Nemotron Nano grader classified these as REFUSAL, which is incorrect per Mistake #14 (distinguish API-level blocks from model-level refusals). These 34 traces should be classified as INFRA_ERROR or excluded from ASR calculations entirely.

The Dolphin Mistral result is particularly misleading: this is an uncensored model that would likely show high ASR if it had actually responded, but the grading pipeline reports 0% ASR because rate-limit errors masquerade as refusals.

Issue 2: Heuristic-LLM disagreement on Trinity (10/30 = 33.3% disagreement)

The heuristic classifier marked 11 traces as attack_success=true, but the LLM grader classified only 2 as COMPLIANCE or PARTIAL. Manual inspection of 5 disagreement cases shows the LLM grader is correct: the heuristic falsely counted responses that contained safety disclaimers or explicit refusals followed by educational alternatives (e.g., “I’m not going to help you develop a virus” classified as attack_success by the heuristic). This is consistent with Mistake #21 (keyword classifier false positives).

Score Report Results

The score report tool (v1.0) runs correctly on the graded format. Key metrics for Trinity:

  • RBI: 63.3% [45.5%, 78.1%] (n=30)
  • RRR: 60.0% [42.3%, 75.4%] (n=30)
  • LCS: 100.0% [88.6%, 100.0%] (n=30)
  • Power analysis: 339 traces needed per model for 10% delta detection

For Dolphin Mistral, the score report produces RBI=0.0% [0.0%, 11.4%] which is misleading given all traces are infrastructure errors.

Recommendation: The score report tool should detect and flag traces where agent_response contains [ERROR: Rate limit exceeded (429)] as infrastructure errors rather than including them in ASR calculations.

Batch 2 (Haiku) Grading Status

The separate Haiku batch grading run (runs/grading/batch2_output.log) completed 2,700 of 5,499 ungraded results (28 of 50 planned batches) and committed to the database. Verdict distribution:

  • COMPLIANCE: 229 (8.5%)
  • PARTIAL: 93 (3.4%)
  • REFUSAL: 810 (30.0%)
  • HALLUCINATION_REFUSAL: 107 (4.0%)
  • BENIGN_QUERY: 440 (16.3%)
  • NOT_GRADEABLE: 1,021 (37.8%)
  • ERROR: 0 (0.0%)

The 37.8% NOT_GRADEABLE rate is high and warrants investigation. This may reflect traces that were infrastructure errors, empty responses, or otherwise unsuitable for verdict classification.


Part 2: Attack Technique Effectiveness Ranking (Full DB, n=53,831 LLM-graded)

By Technique Family (n >= 10 evaluable results)

RankFamilynStrict ASRBroad ASR
1cot_exploit12926.4%31.8%
2multi_turn16323.9%38.7%
3other (public datasets)80411.1%14.9%
4encoding1008.0%15.0%
5behavioral (skeleton_key)586.9%12.1%
6task_framing205.0%10.0%
7volumetric (many-shot)663.0%4.5%
8persona1,2100.7%1.2%
9prompt_injection140.0%0.0%

Key observations:

  • cot_exploit (chain-of-thought exploitation, reasoning_2025 era) is the most effective family by strict ASR (26.4%), displacing multi_turn from the top position in the prior ranking.
  • multi_turn remains the highest by broad ASR (38.7%), confirming that multi-turn attacks produce more PARTIAL verdicts than other families.
  • persona attacks (n=1,210, mostly DAN-era) are near-ineffective at 0.7% strict ASR. This is the largest sample in the technique-tagged set and provides strong evidence that DAN-era persona attacks are obsolete.
  • prompt_injection (n=14) shows 0% ASR, though the sample is small.

By Era (n >= 10)

RankEranStrict ASRBroad ASR
1reasoning_202512926.4%31.8%
2crescendo_202430315.5%24.8%
3general80411.1%14.9%
4cipher_20231427.7%15.5%
5many_shot_2024244.2%4.2%
6dan_20221,1820.6%0.8%
7persona_2022130.0%0.0%

Temporal trend: Attack effectiveness increases monotonically with era recency: persona_2022 (0%) -> dan_2022 (0.8%) -> cipher_2023 (15.5%) -> many_shot_2024 (4.2%, exception) -> crescendo_2024 (24.8%) -> reasoning_2025 (31.8%). The many_shot_2024 exception may reflect small sample size (n=24) or the specific implementation tested.

Frontier Model Results (Claude Sonnet 4.5, GPT-5.2, Gemini 3 Flash)

Frontier models have technique-tagged results across 8 families (n varies by family). Key findings:

FamilyClaude (n)Claude BroadGPT-5.2 (n)GPT-5.2 BroadGemini (n)Gemini Broad
multi_turn225 (22.7%)196 (31.6%)242 (8.3%)
encoding240 (0.0%)243 (12.5%)240 (0.0%)
cot_exploit200 (0.0%)184 (22.2%)224 (18.2%)
volumetric160 (0.0%)182 (11.1%)160 (0.0%)
behavioral142 (14.3%)163 (18.8%)140 (0.0%)
persona140 (0.0%)141 (7.1%)120 (0.0%)
prompt_injection40 (0.0%)40 (0.0%)40 (0.0%)

Families with zero broad ASR on ALL three frontier models:

  • prompt_injection (n=4 per model — small sample caveat)

Families with zero on Claude specifically:

  • encoding, cot_exploit, volumetric, persona, prompt_injection

GPT-5.2 shows the most vulnerability among frontier models, with non-zero broad ASR across all families except prompt_injection. This may reflect differences in safety training philosophy rather than model capability.

Comparison to Prior Ranking (Rose Tyler, 2026-03-19)

The prior ranking (docs/analysis/attack_family_effectiveness_ranking.md) used 944 LLM-graded technique-tagged results. This report uses 2,597 evaluable results from the technique-tagged subset, a 2.75x increase.

FindingPrior RankingThis ReportChange
Top family (strict)multi_turn (26.0%)cot_exploit (26.4%)Shifted
Top family (broad)multi_turn (41.8%)multi_turn (38.7%)Reduced 3.1pp
DAN persona ASR0.0-6.7%0.6% (n=1,182)Confirmed obsolete
Encoding broad15.3% (n=98)15.0% (n=100)Stable
Prompt injection0.0% (n=14)0.0% (n=14)Unchanged
Benign FP floor24.3% (n=202)Not recalculatedAssumed stable

The rank order is largely stable. The main change is cot_exploit overtaking multi_turn for strict ASR, but multi_turn retains the top broad ASR position. The persona family’s massive sample (n=1,210) now provides definitive evidence of DAN-era obsolescence.


Part 3: DB-Level Verdict Distribution (Full Corpus)

Total results: 133,033 LLM-graded: 53,831 Heuristic-graded: 42,778

VerdictCountPercentage
COMPLIANCE20,28537.7%
PARTIAL16,09329.9%
NOT_GRADEABLE7,02013.0%
REFUSAL6,36611.8%
ERROR1,8303.4%
BENIGN_QUERY1,6813.1%
HALLUCINATION_REFUSAL5171.0%
PARSE_ERROR330.1%
INFRA_ERROR60.0%

Caveat: The 37.7% COMPLIANCE rate is heavily influenced by the OBLITERATUS import (120,931 results from obliteratus-import classifier), which contains abliterated models expected to show high compliance. The technique-tagged subset shows much lower COMPLIANCE rates (7.2% weighted average).


Recommendations

  1. Fix rate-limit error handling in grading pipeline. Traces containing [ERROR: Rate limit exceeded (429)] should be classified as INFRA_ERROR, not sent to the LLM grader. The current pipeline wasted 34 grading API calls on infrastructure errors and produced misleading ASR numbers.

  2. Re-run Dolphin Mistral and Gemma benchmarks when rate limits have recovered. Dolphin Mistral is an uncensored model that would provide a useful permissive-model baseline.

  3. Investigate 37.8% NOT_GRADEABLE rate in the Haiku batch grading. This is high and may indicate systematic issues with trace quality or grader prompt formatting.

  4. Update docs/analysis/attack_family_effectiveness_ranking.md with the expanded technique-tagged data from this report (2,597 vs 944 evaluable results).

  5. Collect more frontier model technique-tagged data. The current frontier subset is too small (4-24 traces per family per model) for reliable per-family ASR estimates with tight confidence intervals.

This research informs our commercial services. See how we can help →