Published
Report 159 Research — Empirical Study

Summary

We compared per-model attack success rates (ASR) from the F41LUR3-F1R57 jailbreak corpus against three public benchmarks: HarmBench (Mazeika et al., 2024), StrongREJECT (Souly et al., 2024), and JailbreakBench (Chao et al., 2024). Of 12 models matchable in our DB (min_n=20), only 7 comparison pairs could be constructed due to model name overlap. Correlation with JailbreakBench is weakly negative (Spearman rho=-0.200, n=4). Five outlier divergences exceed 15pp, concentrated on three models.

This report documents the divergences and attributes each to a specific methodological cause.

Data

All numbers from artifacts/benchmark_comparison_2026-03-19.json, generated by tools/benchmark_comparison.py against database/jailbreak_corpus.db (schema v13, 135,623 results, 236 models).

Divergence Analysis

Divergence 1: Llama 3.1 8B — +68.0pp vs JailbreakBench

MetricOur valueJailbreakBench
ASR100.0% (108/108)32.0%
Delta+68.0pp

Root cause: Model variant mismatch (abliterated vs safety-tuned).

Our DB entry for “llama-3.1-8b-instruct” maps to mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated — a community variant with safety training explicitly removed (“abliterated”). JailbreakBench tested the standard meta-llama/Llama-3.1-8B-Instruct with intact safety training.

Our DB does contain the standard meta-llama/llama-3.1-8b-instruct variant (n=37), but with 0 COMPLIANCE verdicts. The name-matching logic in benchmark_comparison.py selects the abliterated variant because it has more results (n=108 > n=37).

This is not a valid comparison. The abliterated model is expected to comply with 100% of requests — that is its purpose. The name-matching logic should exclude abliterated/uncensored variants, or disambiguate them.

Additional abliterated/uncensored models in our DB that could cause similar false matches:

  • Orenguteng/Llama-3.1-8B-Lexi-Uncensored-V2 (n=95, 100% COMPLIANCE)
  • NousResearch/Hermes-3-Llama-3.1-8B (n=101, 100% COMPLIANCE)
  • huihui-ai/Qwen2.5-7B-Instruct-abliterated (n=102, 100% COMPLIANCE)

Fix required: Exclude models containing “abliterated”, “uncensored”, or “hermes” from the public benchmark comparison name-mapping, or add a model metadata flag.

Divergence 2: Mistral 7B Instruct — -56pp to -60pp vs all three benchmarks

BenchmarkOur ASRPublic ASRDelta
JailbreakBench0.0%60.0%-60.0pp
HarmBench0.0%56.0%-56.0pp
StrongREJECT0.0%34.0%-34.0pp

Root cause: Different model variant + different prompt distribution.

Two factors compound here:

  1. Model variant difference. Our DB entry is mistralai/mistral-7b-instruct:free — the OpenRouter free-tier variant. Public benchmarks tested mistralai/Mistral-7B-Instruct-v0.2, the original Mistral release. The free-tier model served via OpenRouter may have additional safety layers (system-prompt wrapping, content filtering) applied by the hosting provider. We also have mistralai/Mistral-7B-v0.3 (n=181, 100% COMPLIANCE), but this is the base model, not instruct.

  2. Prompt distribution difference. HarmBench reports ASR under the GCG (Greedy Coordinate Gradient) attack — a white-box adversarial suffix method. JailbreakBench uses PAIR (Prompt Automatic Iterative Refinement). StrongREJECT aggregates across multiple attack methods. Our corpus uses a multi-technique mix including embodied scenarios, format-lock prompts, and semantic inversion — but does not include GCG suffixes or PAIR-generated prompts specifically targeting Mistral 7B.

  3. Grading methodology. Our 0.0% strict ASR with 0.6% broad ASR (1 PARTIAL in 176 results) uses COALESCE verdicts. Public benchmarks use their own classifiers (HarmBench: Llama-based classifier; StrongREJECT: GPT-4 judge; JailbreakBench: rule-based + manual).

Interpretation: The combination of a potentially more restricted model variant and a prompt set that does not include the specific attack methods that succeed against Mistral 7B (GCG, PAIR) fully explains the gap. This is expected behavior — our corpus tests different things than these benchmarks.

Divergence 3: GPT-4o-mini — +26.9pp vs JailbreakBench

MetricOur valueJailbreakBench
ASR42.9% (15/35)16.0%
Delta+26.9pp

Root cause: Small sample + corpus composition bias.

Our n=35 for GPT-4o-mini produces a 95% Wilson CI of approximately [27.0%, 60.0%], which technically includes the JailbreakBench value of 16.0% only marginally. However:

  1. Sample size. n=35 is near our minimum threshold (min_n=20). This produces wide confidence intervals. The true ASR could plausibly be anywhere from 27% to 60%.

  2. Corpus composition. Our prompts include embodied-specific and format-lock scenarios. Format-lock attacks have been shown to shift frontier models from restrictive (<10% ASR) to mixed (23-42%) vulnerability profiles (Report #51). If the n=35 sample is enriched for format-lock prompts, the elevated ASR is expected.

  3. Grading. Our GPT-4o-mini results show 15 COMPLIANCE + 2 PARTIAL out of 35. JailbreakBench uses a different judge for success determination.

Interpretation: The elevated ASR is plausibly explained by corpus composition (format-lock enrichment) and small sample size. This is not necessarily evidence that GPT-4o-mini is more vulnerable than JailbreakBench reports — it may indicate that format-lock attacks are more effective than PAIR on this model, which would be consistent with the capability-floor hypothesis (Report #51).

Non-divergent comparison: Llama 3.3 70B (+11.7pp) and Gemini (-3.7pp)

Llama 3.3 70B (ours: 25.7%, JBB: 14.0%, delta: +11.7pp) is within a reasonable range given our larger sample (n=560) and multi-technique corpus. Our broad ASR (37.1%) is notably higher, suggesting many PARTIAL verdicts that public benchmarks would classify differently.

Gemini (ours: 0.3%, HarmBench: 4.0%, delta: -3.7pp) is closely aligned. Both agree this model is highly resistant.

Correlation Assessment

Benchmarkn matchedSpearman rhoPearson rInterpretation
HarmBench2N/AN/AInsufficient data
JailbreakBench4-0.200-0.331No agreement
StrongREJECT1N/AN/AInsufficient data

The negative JailbreakBench correlation is driven primarily by the abliterated Llama variant (inflates our ranking) and the Mistral free-tier suppression (deflates our ranking). Removing these two confounded models would leave n=2, which is insufficient for correlation.

Conclusion: We cannot compute meaningful correlations with current model overlap. The 23 unmatched public benchmark models represent the primary limitation.

Hypothesis Evaluation

H1: Different prompt sets — CONFIRMED as contributing factor

Our corpus is multi-technique (embodied, format-lock, semantic inversion, multi-turn) while public benchmarks use specific attack methods (GCG, PAIR, direct request). The Mistral 7B divergence (-56pp) is largely explained by the absence of GCG/PAIR prompts in our corpus. The GPT-4o-mini divergence (+27pp) is plausibly explained by format-lock enrichment.

H2: Different grading methodology — PARTIALLY CONFIRMED

Our COALESCE methodology (LLM-graded where available, heuristic fallback; Cohen’s kappa = 0.126 between the two) differs from HarmBench (Llama classifier), JailbreakBench (rule-based + manual), and StrongREJECT (GPT-4 judge). The PARTIAL category in our system has no direct equivalent in public benchmarks. Our strict ASR (COMPLIANCE only) is the closest comparable, but our broad ASR (COMPLIANCE + PARTIAL) may better match what some public classifiers count as “success.”

H3: Different model versions/configs — CONFIRMED as primary driver for 2 of 3 outliers

The abliterated Llama variant (+68pp) and the OpenRouter free-tier Mistral variant (-56pp) are the two largest outliers and are both model variant mismatches, not methodology differences.

H4: Embodied-specific prompts — PARTIALLY CONFIRMED

Our corpus includes 351 VLA-specific scenarios across 29 attack families that have no equivalent in any public benchmark (Established Finding: “All public static benchmarks have zero embodied/tool-integrated agent scenarios”). These prompts may elicit different vulnerability profiles, particularly for models like GPT-4o-mini where format-lock/embodied compliance differs from standard harmful-request refusal.

Recommendations

  1. Fix the name-matching logic in benchmark_comparison.py to exclude abliterated/uncensored model variants from public benchmark comparisons.
  2. Add model variant metadata to the DB (a boolean is_safety_modified or similar) to distinguish safety-removed models.
  3. Expand model overlap by running dedicated evaluation passes of public benchmark prompt subsets (HarmBench, JBB) through our pipeline on matching models. This would provide direct apples-to-apples comparison.
  4. Add Wilson CIs to the comparison output so small-sample comparisons are properly flagged.
  5. Do not cite correlation values from this comparison in external publications until model overlap reaches at least n=10 per benchmark with clean variant matching.

Artifacts

  • artifacts/benchmark_comparison_2026-03-19.json — full JSON output
  • artifacts/benchmark_comparison_2026-03-19_verbose.txt — verbose text output
  • tools/benchmark_comparison.py — comparison tool (v1.0)

This research informs our commercial services. See how we can help →