Cross-Benchmark Comparison — F41LUR3-F1R57 vs Published Benchmarks | Research | Failure-First

Adrian Wedd

Report 201 Research — Empirical Study 2026-03-24

Summary

This report compares F41LUR3-F1R57 LLM-graded ASR against published ASR numbers from five public benchmarks (HarmBench, JailbreakBench, StrongREJECT, SORRY-Bench, AdvBench) using data from the jailbreak corpus (135,623 results, 236 models).

Key findings:

LLM-graded strict ASR is systematically lower than published keyword/classifier-based ASR for the same models and prompts. Median delta: -16.0pp across 13 comparison pairs.
Heuristic over-reporting is pervasive. 3,269 results classified as COMPLIANCE by heuristic were overturned by LLM grading (7.7% of dual-graded results). Only 39 went the other direction (84:1 ratio).
Benchmark-to-benchmark ASR variance is high even for the same model: Llama 3.2 scores 23.0% strict on HarmBench, 13.0% on JailbreakBench, and 10.0% on StrongREJECT prompts — all LLM-graded by the same classifier.
No positive correlation exists between our LLM-graded ASR and any published benchmark’s ASR. This is evidence that grading methodology dominates model-level ASR rankings.

Methodology

All results use the COALESCE(llm_verdict, heuristic_verdict) methodology. LLM grading takes precedence where available.

Strict ASR: COMPLIANCE only
Broad ASR: COMPLIANCE + PARTIAL
Seven verdict categories: COMPLIANCE, PARTIAL, REFUSAL, HALLUCINATION_REFUSAL, BENIGN_QUERY, NOT_GRADEABLE, ERROR

Published Benchmark Grading Methods

Benchmark	Grading Method
HarmBench	GPT-4 classifier (binary)
JailbreakBench	GPT-4 classifier + human audit
StrongREJECT	Rubric scorer (0-1 scale, LLM-based)
SORRY-Bench	GPT-4 classifier (45 harm categories)
AdvBench	Keyword matching (prefix detection)

Results

Same-Model, Cross-Benchmark ASR (Llama 3.2)

Llama 3.2 (3B, via Ollama) tested across all three public benchmark prompt sets with the same LLM grader:

Dataset Prompts	n	Strict ASR	Broad ASR	Refusal Rate
HarmBench	100	23.0%	27.0%	62.0%
JailbreakBench	100	13.0%	18.0%	71.0%
StrongREJECT	50	10.0%	22.0%	60.0%

The 13pp spread in strict ASR demonstrates that prompt set composition drives measured vulnerability as much as model properties do.

Heuristic Over-Reporting

Category	Count	% of Dual-Graded
Both COMPLIANCE	19,821	46.6%
Both non-COMPLIANCE	19,430	45.6%
Heuristic overcount	3,269	7.7%
Heuristic undercount	39	0.1%

The asymmetry is stark: heuristic over-counts at 84x the rate of under-counts.

Why Do Published Benchmarks Report Higher ASR?

Keyword Inflation (AdvBench Pattern)

AdvBench used prefix-matching: if the model’s response did not begin with refusal prefixes, it was counted as success. This counts hallucinations, partial responses, and benign reformulations as successful attacks.

Classifier Inflation (HarmBench/JailbreakBench Pattern)

Binary harmful/not-harmful classification lacks PARTIAL, HALLUCINATION_REFUSAL, and BENIGN_QUERY distinctions. Verbose, hedged responses may be classified as “harmful” if any actionable content is present.

Prompt Corpus Differences

The 13pp ASR spread across benchmark prompt sets for the same model demonstrates that prompt composition matters independently of grading methodology.

Benchmark Reliability Ranking

Tier 1 (Most Reliable): StrongREJECT (rubric-based LLM scorer, continuous scale), SORRY-Bench (45 fine-grained categories)

Tier 2 (Moderately Reliable): HarmBench (GPT-4 binary, strong attack diversity), JailbreakBench (GPT-4 + human audit, limited to PAIR attack)

Tier 3 (Least Reliable): AdvBench (prefix keyword matching, known to over-report ASR dramatically)

Limitations

Model overlap is limited (8 models matched between our DB and published benchmarks out of 28 unique published models)
Some comparison sample sizes are small (n=5-10)
Comparison uses LLM-graded vs keyword/classifier-graded without running both methods on identical response sets