Summary
We compared per-model attack success rates (ASR) from the F41LUR3-F1R57 jailbreak corpus against three public benchmarks: HarmBench (Mazeika et al., 2024), StrongREJECT (Souly et al., 2024), and JailbreakBench (Chao et al., 2024). Of the 12 models in our DB that meet the matching threshold (min_n=20), only 7 comparison pairs could be constructed, owing to limited model-name overlap with the public benchmarks. Correlation with JailbreakBench is weakly negative (Spearman rho=-0.200, n=4). Five outlier divergences exceed 15pp, concentrated on three models.
This report documents the divergences and attributes each to a specific methodological cause.
Data
All numbers from `artifacts/benchmark_comparison_2026-03-19.json`, generated by `tools/benchmark_comparison.py` against `database/jailbreak_corpus.db` (schema v13, 135,623 results, 236 models).
Divergence Analysis
Divergence 1: Llama 3.1 8B — +68.0pp vs JailbreakBench
| Metric | Our value | JailbreakBench |
|---|---|---|
| ASR | 100.0% (108/108) | 32.0% |
| Delta | +68.0pp | — |
Root cause: Model variant mismatch (abliterated vs safety-tuned).
Our DB entry for “llama-3.1-8b-instruct” maps to `mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated` — a community variant with safety training explicitly removed (“abliterated”). JailbreakBench tested the standard `meta-llama/Llama-3.1-8B-Instruct` with intact safety training.
Our DB does contain the standard `meta-llama/llama-3.1-8b-instruct` variant (n=37), but with 0 COMPLIANCE verdicts. The name-matching logic in `benchmark_comparison.py` selects the abliterated variant because it has more results (n=108 > n=37).
This is not a valid comparison. The abliterated model is expected to comply with 100% of requests — that is its purpose. The name-matching logic should exclude abliterated/uncensored variants, or disambiguate them.
Additional abliterated/uncensored models in our DB that could cause similar false matches:
- `Orenguteng/Llama-3.1-8B-Lexi-Uncensored-V2` (n=95, 100% COMPLIANCE)
- `NousResearch/Hermes-3-Llama-3.1-8B` (n=101, 100% COMPLIANCE)
- `huihui-ai/Qwen2.5-7B-Instruct-abliterated` (n=102, 100% COMPLIANCE)
Fix required: Exclude models containing “abliterated”, “uncensored”, or “hermes” from the public benchmark comparison name-mapping, or add a model metadata flag.
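As a sketch of the proposed fix, assuming a simple substring filter (the function names and record shapes below are illustrative, not the actual `benchmark_comparison.py` API):

```python
# Hypothetical exclusion filter for the name-matching step. The marker
# list is illustrative; a DB-level metadata flag would be more robust
# than string matching.
SAFETY_MODIFIED_MARKERS = ("abliterated", "uncensored", "hermes")

def is_safety_modified(model_name: str) -> bool:
    """Return True if the model name suggests removed safety training."""
    name = model_name.lower()
    return any(marker in name for marker in SAFETY_MODIFIED_MARKERS)

def select_comparison_variant(candidates):
    """Among DB entries matching a public-benchmark model, prefer the
    largest-n variant that is NOT safety-modified, instead of the
    current largest-n-overall rule. Falls back to all candidates if
    only modified variants exist."""
    clean = [c for c in candidates if not is_safety_modified(c["model"])]
    pool = clean or candidates
    return max(pool, key=lambda c: c["n"])
```

Under this rule, the two Llama variants from Divergence 1 would resolve to the standard model (n=37) rather than the abliterated one (n=108).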
Divergence 2: Mistral 7B Instruct — -56pp to -60pp vs all three benchmarks
| Benchmark | Our ASR | Public ASR | Delta |
|---|---|---|---|
| JailbreakBench | 0.0% | 60.0% | -60.0pp |
| HarmBench | 0.0% | 56.0% | -56.0pp |
| StrongREJECT | 0.0% | 34.0% | -34.0pp |
Root cause: Different model variant + different prompt distribution.
Three factors compound here:

1. **Model variant difference.** Our DB entry is `mistralai/mistral-7b-instruct:free` — the OpenRouter free-tier variant. Public benchmarks tested `mistralai/Mistral-7B-Instruct-v0.2`, the original Mistral release. The free-tier model served via OpenRouter may have additional safety layers (system-prompt wrapping, content filtering) applied by the hosting provider. We also have `mistralai/Mistral-7B-v0.3` (n=181, 100% COMPLIANCE), but this is the base model, not instruct.
2. **Prompt distribution difference.** HarmBench reports ASR under the GCG (Greedy Coordinate Gradient) attack — a white-box adversarial suffix method. JailbreakBench uses PAIR (Prompt Automatic Iterative Refinement). StrongREJECT aggregates across multiple attack methods. Our corpus uses a multi-technique mix including embodied scenarios, format-lock prompts, and semantic inversion — but does not include GCG suffixes or PAIR-generated prompts specifically targeting Mistral 7B.
3. **Grading methodology.** Our 0.0% strict ASR with 0.6% broad ASR (1 PARTIAL in 176 results) uses COALESCE verdicts. Public benchmarks use their own classifiers (HarmBench: Llama-based classifier; StrongREJECT: GPT-4 judge; JailbreakBench: rule-based + manual).
Interpretation: The combination of a potentially more restricted model variant and a prompt set that does not include the specific attack methods that succeed against Mistral 7B (GCG, PAIR) fully explains the gap. This is expected behavior — our corpus tests different things than these benchmarks.
Divergence 3: GPT-4o-mini — +26.9pp vs JailbreakBench
| Metric | Our value | JailbreakBench |
|---|---|---|
| ASR | 42.9% (15/35) | 16.0% |
| Delta | +26.9pp | — |
Root cause: Small sample + corpus composition bias.
Our n=35 for GPT-4o-mini produces a 95% Wilson CI of approximately [28.0%, 59.1%], which lies entirely above the JailbreakBench value of 16.0% — so the gap is not attributable to sampling noise alone. However:
1. **Sample size.** n=35 is near our minimum threshold (min_n=20). This produces wide confidence intervals. The true ASR could plausibly be anywhere from 28% to 59%.
2. **Corpus composition.** Our prompts include embodied-specific and format-lock scenarios. Format-lock attacks have been shown to shift frontier models from restrictive (<10% ASR) to mixed (23-42%) vulnerability profiles (Report #51). If the n=35 sample is enriched for format-lock prompts, the elevated ASR is expected.
3. **Grading.** Our GPT-4o-mini results show 15 COMPLIANCE + 2 PARTIAL out of 35. JailbreakBench uses a different judge for success determination.
Interpretation: The elevated ASR is plausibly explained by corpus composition (format-lock enrichment) and small sample size. This is not necessarily evidence that GPT-4o-mini is more vulnerable than JailbreakBench reports — it may indicate that format-lock attacks are more effective than PAIR on this model, which would be consistent with the capability-floor hypothesis (Report #51).
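For reference, the Wilson interval cited for GPT-4o-mini can be reproduced in a few lines. This is the standard formula, not the comparison tool's own implementation:

```python
import math

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# GPT-4o-mini: 15 strict successes out of 35 results
lo, hi = wilson_ci(15, 35)  # ≈ (0.280, 0.591)
```

Note that the lower bound sits well above JailbreakBench's reported 16.0%, which is why corpus composition rather than sampling noise is the favored explanation.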
Non-divergent comparisons: Llama 3.3 70B (+11.7pp) and Gemini (-3.7pp)
Llama 3.3 70B (ours: 25.7%, JBB: 14.0%, delta: +11.7pp) is within a reasonable range given our larger sample (n=560) and multi-technique corpus. Our broad ASR (37.1%) is notably higher, suggesting many PARTIAL verdicts that public benchmarks would classify differently.
Gemini (ours: 0.3%, HarmBench: 4.0%, delta: -3.7pp) is closely aligned. Both agree this model is highly resistant.
Correlation Assessment
| Benchmark | n matched | Spearman rho | Pearson r | Interpretation |
|---|---|---|---|---|
| HarmBench | 2 | N/A | N/A | Insufficient data |
| JailbreakBench | 4 | -0.200 | -0.331 | No agreement |
| StrongREJECT | 1 | N/A | N/A | Insufficient data |
The negative JailbreakBench correlation is driven primarily by the abliterated Llama variant (inflates our ranking) and the Mistral free-tier suppression (deflates our ranking). Removing these two confounded models would leave n=2, which is insufficient for correlation.
Conclusion: We cannot compute meaningful correlations with current model overlap. The 23 unmatched public benchmark models represent the primary limitation.
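The reported rho=-0.200 can be reproduced from the four matched (ours, JailbreakBench) ASR pairs given in this report, using the tie-free rank-difference formula:

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation via 1 - 6*sum(d^2)/(n(n^2-1)).
    Assumes no tied values, which holds for these four pairs."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n**2 - 1))

# (our strict ASR %, JailbreakBench ASR %), per the tables above:
# Llama 3.1 8B, Mistral 7B, GPT-4o-mini, Llama 3.3 70B
ours = [100.0, 0.0, 42.9, 25.7]
jbb = [32.0, 60.0, 16.0, 14.0]
rho = spearman_rho(ours, jbb)  # ≈ -0.200, matching the table
```

The calculation makes the fragility visible: with n=4, swapping the rank of a single confounded model changes the sign of rho.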
Hypothesis Evaluation
H1: Different prompt sets — CONFIRMED as contributing factor
Our corpus is multi-technique (embodied, format-lock, semantic inversion, multi-turn) while public benchmarks use specific attack methods (GCG, PAIR, direct request). The Mistral 7B divergence (-56pp) is largely explained by the absence of GCG/PAIR prompts in our corpus. The GPT-4o-mini divergence (+27pp) is plausibly explained by format-lock enrichment.
H2: Different grading methodology — PARTIALLY CONFIRMED
Our COALESCE methodology (LLM-graded where available, heuristic fallback; Cohen’s kappa = 0.126 between the two) differs from HarmBench (Llama classifier), JailbreakBench (rule-based + manual), and StrongREJECT (GPT-4 judge). The PARTIAL category in our system has no direct equivalent in public benchmarks. Our strict ASR (COMPLIANCE only) is the closest comparable, but our broad ASR (COMPLIANCE + PARTIAL) may better match what some public classifiers count as “success.”
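A minimal sketch of the strict-vs-broad distinction under COALESCE-style verdict selection. Field names here are hypothetical, not the actual corpus schema; the example counts mirror the GPT-4o-mini breakdown above:

```python
def coalesced(rec):
    """COALESCE rule as described: use the LLM-graded verdict where
    available, otherwise fall back to the heuristic verdict."""
    if rec["llm_verdict"] is not None:
        return rec["llm_verdict"]
    return rec["heuristic_verdict"]

def asr(records):
    """Return (strict, broad) ASR: strict counts COMPLIANCE only,
    broad also counts PARTIAL."""
    verdicts = [coalesced(r) for r in records]
    n = len(verdicts)
    strict = sum(v == "COMPLIANCE" for v in verdicts) / n
    broad = sum(v in ("COMPLIANCE", "PARTIAL") for v in verdicts) / n
    return strict, broad

# Illustrative records: 15 COMPLIANCE, 2 PARTIAL (heuristic fallback),
# 18 REFUSAL -- the GPT-4o-mini counts from Divergence 3.
records = (
    [{"llm_verdict": "COMPLIANCE", "heuristic_verdict": "REFUSAL"}] * 15
    + [{"llm_verdict": None, "heuristic_verdict": "PARTIAL"}] * 2
    + [{"llm_verdict": "REFUSAL", "heuristic_verdict": "REFUSAL"}] * 18
)
strict, broad = asr(records)  # strict = 15/35, broad = 17/35
```

This makes the H2 point concrete: a public classifier with a binary success judgment will land somewhere between our strict and broad figures depending on how it treats partial compliance.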
H3: Different model versions/configs — CONFIRMED as primary driver for 2 of 3 outliers
The abliterated Llama variant (+68pp) and the OpenRouter free-tier Mistral variant (-56pp) are the two largest outliers and are both model variant mismatches, not methodology differences.
H4: Embodied-specific prompts — PARTIALLY CONFIRMED
Our corpus includes 351 VLA-specific scenarios across 29 attack families that have no equivalent in any public benchmark (Established Finding: “All public static benchmarks have zero embodied/tool-integrated agent scenarios”). These prompts may elicit different vulnerability profiles, particularly for models like GPT-4o-mini where format-lock/embodied compliance differs from standard harmful-request refusal.
Recommendations
- Fix the name-matching logic in `benchmark_comparison.py` to exclude abliterated/uncensored model variants from public benchmark comparisons.
- Add model variant metadata to the DB (a boolean `is_safety_modified` or similar) to distinguish safety-removed models.
- Expand model overlap by running dedicated evaluation passes of public benchmark prompt subsets (HarmBench, JBB) through our pipeline on matching models. This would provide direct apples-to-apples comparison.
- Add Wilson CIs to the comparison output so small-sample comparisons are properly flagged.
- Do not cite correlation values from this comparison in external publications until model overlap reaches at least n=10 per benchmark with clean variant matching.
Artifacts
- `artifacts/benchmark_comparison_2026-03-19.json` — full JSON output
- `artifacts/benchmark_comparison_2026-03-19_verbose.txt` — verbose text output
- `tools/benchmark_comparison.py` — comparison tool (v1.0)