Published
Report 212 Research — Empirical Study

Summary

Report #201 identified 520 AdvBench prompts with 0 results and 15,437 underutilized public prompts. This report provides a comprehensive audit of all 15 public datasets imported into the jailbreak corpus database, quantifying prompt counts, result coverage, model breadth, and priority for benchmark runs.

Key finding: 6 of 15 public datasets have zero results. The 4 P1 (Priority 1) datasets collectively hold 1,253 prompts but only 560 results (44.7% coverage), concentrated on just 2-3 models each. Closing the AdvBench gap alone would add the most-cited benchmark in the jailbreak literature to our empirical evidence base.


Complete Coverage Table

#DatasetPriorityPromptsResultsCoverage %Models TestedStatus
1SORRY-BenchP29,44660.1%1CRITICAL GAP
2BEAVERTAILSP33,43220.1%1LOW PRIORITY (large)
3DAN-In-The-WildP21,4051,16482.8%3GOOD
4obliteratus_prompt_corpus1,02400.0%0INTERNAL (separate workflow)
5WildJailbreakP31,000959.5%3LOW (sampled subset)
6AdvBenchP152000.0%0CRITICAL GAP
7ForbiddenQuestionsP239041.0%1GAP
8HarmBenchP132020363.4%3PARTIAL
9StrongREJECTP131310031.9%2PARTIAL
10HEx-PHIP229031.0%1GAP
11ToxicChatP211300.0%0GAP
12JailbreakBenchP1100257257.0%*2GOOD (multi-model)
13SimpleSafetyTestsP210000.0%0GAP
14TDC2023-RedTeamingP210000.0%0GAP
15LLM-Finetuning-SafetyP21700.0%0GAP (tiny dataset)

Totals: 18,570 public prompts, 1,834 results, 9.9% aggregate coverage.

*JailbreakBench coverage >100% because multiple models were tested per prompt (257 results / 100 prompts = ~2.6 models avg).


Verdict Breakdown (Datasets with Results)

DatasetCOMPLIANCEPARTIALREFUSALOTHERTotal
DAN-In-The-Wild7 (0.6%)3 (0.3%)988 (84.9%)166 (14.3%)1,164
JailbreakBench13 (5.1%)5 (1.9%)222 (86.4%)17 (6.6%)257
HarmBench23 (11.3%)4 (2.0%)138 (68.0%)38 (18.7%)203
StrongREJECT5 (5.0%)6 (6.0%)80 (80.0%)9 (9.0%)100
WildJailbreak1 (1.1%)2 (2.1%)88 (92.6%)4 (4.2%)95
SORRY-Bench3 (50.0%)2 (33.3%)1 (16.7%)06

Note: Low-result datasets (BEAVERTAILS=2, ForbiddenQuestions=4, HEx-PHI=3) omitted from table — sample sizes too small for meaningful breakdown.


Import Verification

All public dataset imports verified complete via dry-run:

DatasetExpectedImportedStatus
AdvBench520520COMPLETE
JailbreakBench100100COMPLETE
HarmBench320320COMPLETE
StrongREJECT313313COMPLETE
SORRY-Bench9,4469,446COMPLETE (full expanded set)
DAN-In-The-Wild1,4051,405COMPLETE
WildJailbreak1,0001,000COMPLETE (sampled subset of 262K)
BEAVERTAILS3,4323,432COMPLETE (sampled subset of 330K)
ForbiddenQuestions390390COMPLETE
HEx-PHI290290COMPLETE
ToxicChat113113COMPLETE
SimpleSafetyTests100100COMPLETE
TDC2023-RedTeaming100100COMPLETE
LLM-Finetuning-Safety1717COMPLETE

All 14 public datasets fully imported. No re-imports needed. The gap is in benchmark execution (results), not data ingestion (prompts).


Datasets Needing Benchmark Runs

Tier 1: CCS-Critical (Run Before April 22 Abstract Registration)

These datasets are cited by virtually every jailbreak paper reviewer. Having zero results on AdvBench is indefensible in peer review.

DatasetPrompts to RunModels NeededEst. Cost (OpenRouter)Est. TimeJustification
AdvBench52010 frontier + 10 mid-tier~28(freetier:2-8 (free tier: 0)2-4 hoursMost-cited jailbreak benchmark (Zou et al. 2023). 0 results = peer review red flag.
HarmBench320 (gap: ~117 prompts x new models)+7 models (3 already tested)~$1-41-2 hoursSecond most-cited. 63.4% coverage, 3 models only.
StrongREJECT313 (gap: ~213 prompts x new models)+8 models (2 already tested)~$1-41-2 hoursSouly et al. scoring methodology widely adopted.

Tier 1 total: ~416paid,4-16 paid, 0 free tier. ~5-8 hours.

Tier 2: Strengthens Paper (Run Within Sprint 13)

DatasetPrompts to RunModels NeededEst. CostEst. TimeJustification
SORRY-Bench450 (base subset)5-10 models~$1-62-4 hoursXie et al. 2024 fine-grained safety categories. 9,446 prompts imported (full expanded set) — run the 450 base prompts.
ForbiddenQuestions3905-10 models~$1-41-2 hoursWalled AI benchmark, growing citations.
HEx-PHI2905-10 models~$1-31-2 hoursQi et al. 2023 fine-tuning safety.
SimpleSafetyTests1005-10 models~$0.50-130 minQuick win — 100 prompts, fast to run.
TDC2023-RedTeaming1005-10 models~$0.50-130 minTDC competition prompts.

Tier 2 total: ~415paid,4-15 paid, 0 free tier. ~5-9 hours.

Tier 3: Nice-to-Have (Defer Unless Compute Grant Arrives)

DatasetPromptsWhy Defer
BEAVERTAILS3,432Very large. Run a 500-prompt stratified sample if needed.
WildJailbreak1,000 (of 262K)Already 9.5% coverage. Expand to 10 models incrementally.
ToxicChat113Toxicity detection, not jailbreak. Lower relevance to CCS framing.
LLM-Finetuning-Safety17Tiny. Run opportunistically.
DAN-In-The-Wild1,40582.8% coverage already. Expand model count if time allows.

Priority Ranking for CCS Reviewers

What a CCS reviewer expects to see:

  1. AdvBench results — Most-cited. Our gap here is the single most damaging omission. A reviewer searching for “AdvBench” in our paper and finding no cross-reference to our own evaluation would question experimental rigor.

  2. HarmBench results — Second most-cited post-2024. We have partial coverage (63.4%, 3 models). Expanding to 10 models makes this defensible.

  3. JailbreakBench results — We have good coverage here (257 results, 2 models). Expand to 5+ models for robustness.

  4. StrongREJECT results — Souly et al. scoring is becoming the standard. 31.9% coverage is insufficient. Need 10-model comparison.

  5. SORRY-Bench — Increasingly cited for fine-grained safety category analysis. Running the 450 base prompts on 5 models would give us defensible coverage.


Phase 1: AdvBench Baseline (Week of March 24)

# Export AdvBench prompts to benchmark-ready JSONL
python3 tools/database/export_jsonl.py \
  --filter "source_dataset.name = 'AdvBench'" \
  --format archaeology \
  --output data/splits/advbench_full.jsonl

# Run on 10 free-tier models (no cost)
python3 tools/benchmarks/run_benchmark_http.py \
  --scenarios data/splits/advbench_full.jsonl \
  --models \
    "google/gemini-2.0-flash-exp:free" \
    "meta-llama/llama-3.2-3b-instruct:free" \
    "mistralai/devstral-2512:free" \
    "mistralai/mistral-7b-instruct:free" \
    "qwen/qwen3-4b:free" \
    "nvidia/llama-3.1-nemotron-70b-instruct:free" \
  --output runs/advbench_baseline/ \
  --limit 520

# Import traces
python3 tools/database/import_traces.py --traces "runs/advbench_baseline/"

Phase 2: P1 Expansion (Week of March 31)

Run HarmBench, StrongREJECT, JailbreakBench on same 10-model set. Import all traces.

Phase 3: P2 Quick Wins (Sprint 13)

SimpleSafetyTests + TDC2023 + ForbiddenQuestions — small datasets, fast runs.


Cost Summary

PhaseDatasetsPrompts x ModelsFree TierPaid (OpenRouter)
Phase 1AdvBench520 x 6-10$0~$2-4
Phase 2HarmBench, StrongREJECT, JailbreakBench~733 x 10$0~$3-8
Phase 3SORRY-Bench (base), Forbidden, HEx-PHI, SST, TDC~1,330 x 5-10$0~$4-10
Total~12,830 calls$0$9-22

All Phase 1 work can be done entirely on free-tier models at zero cost.


Methodology

  • Database queries run against database/jailbreak_corpus.db (schema v13)
  • Import completeness verified via python3 tools/database/import_public.py --dataset X --dry-run for all 4 P1 datasets
  • Coverage = COUNT(DISTINCT results) / COUNT(DISTINCT prompts) per source_dataset
  • Model count derived from JOIN through evaluation_runs to models table
  • Cost estimates based on OpenRouter pricing as of 2026-03-24 (see docs/FREE_MODEL_RECOMMENDATIONS.md)

This research informs our commercial services. See how we can help →