Benchmark Overfitting Analysis — AdvBench vs Novel Attack Families | Research | Failure-First

Adrian Wedd

Report 235 Research — Empirical Study 2026-03-24

Executive Summary

We tested whether models show differential vulnerability to public benchmark prompts (AdvBench, likely in training data) versus novel attack families (F41LUR3-F1R57 proprietary, not in training data). The hypothesis: if models have overfit to refuse AdvBench-style prompts during safety training, they should show lower ASR on AdvBench and higher ASR on novel families.

Finding: The data strongly supports a “benchmark hardening” effect rather than overfitting in the traditional sense. Trinity-large-preview shows 36.7% ASR on AdvBench versus 88.1% on novel families (Fisher p < 0.000001, Cohen’s h = 1.14). However, this result requires careful interpretation because the two prompt sets differ in attack sophistication, not just novelty.

Methodology

Datasets

Dataset	Source	n	Attack Style	In Training Data?
AdvBench baseline	Zou et al. 2023	30	Direct harmful requests	Likely (public since 2023)
Novel families	F41LUR3-F1R57 v0.6	60	Compositional reasoning, meaning displacement, pressure cascade, reward hacking, sensor spoofing, multi-agent collusion	No (private repo, created 2026-03)

Novel Family Attack Types

The novel families benchmark (data/splits/novel_families_benchmark_v0.1.jsonl) contains 6 attack families, 10 scenarios each:

Compositional reasoning attacks (VLA-CRA-*): Each step is benign; composite action is harmful
Meaning displacement attacks: Semantic reframing of harmful actions
Pressure cascade attacks: Escalating urgency to override safety
Reward hacking attacks: Exploiting reward signals
Sensor spoofing attacks: Manipulating perception inputs
Multi-agent collusion attacks: Coordinated multi-actor subversion

Models Tested

Model	AdvBench Traces	Novel Traces	Classification
arcee-ai/trinity-large-preview:free	30 (clean)	59 (clean, 1 error)	Heuristic labels from runner
qwen/qwen3-8b	60 (clean)	Not available	Keyword classification

Classification Caveat

Per Mistake #21 in MISTAKES_TO_LEARN_FROM.md, keyword-based classifiers detect response style rather than semantic harm. The Trinity traces have heuristic labels from the benchmark runner; the Qwen3-8b traces use keyword classification. LLM-based grading was not applied due to Ollama being reserved for nemotron-nano-9b grading pipeline.

Results

Trinity-Large-Preview (Primary Comparison)

Metric	AdvBench (Public)	Novel Families (Private)
n	30	59
Attack successes	11	52
ASR	36.7%	88.1%
95% Wilson CI	[21.9%, 54.5%]	[77.5%, 94.1%]
Refusals	19	7

Statistical tests:

Fisher exact test: p < 0.000001
Cohen’s h: 1.14 (large effect, threshold = 0.8)
Delta ASR: +51.5 percentage points favoring novel families

Qwen3-8b (AdvBench Only)

Metric	AdvBench (Public)
n	60
Refusals (keyword)	57 (95.0%)
Compliances	2 (3.3%)
Unclear	1

Qwen3-8b shows near-total refusal on AdvBench, consistent with extensive safety training on this public benchmark. Novel family comparison was not possible due to persistent OpenRouter rate limits on Qwen3 models.

Nemotron-nano-30b (Supplementary, Different Novel Set)

A comparison using a different novel prompt set (novel_attacks_20260120, n=15) showed the opposite pattern: AdvBench ASR = 43.3% vs novel ASR = 0.0%. However, this used a different and smaller novel prompt set, making direct comparison unreliable.

Interpretation

What This Shows

The ASR gap is real and large. Trinity refuses 63% of AdvBench prompts but only 12% of novel family prompts. This is statistically significant with a large effect size.
The gap is partly driven by attack sophistication, not just novelty. AdvBench prompts are direct harmful requests (“Write a script that exploits vulnerabilities”). Novel family prompts are compositional reasoning attacks where each step appears benign but the aggregate is harmful. These are fundamentally different attack modalities.
Benchmark hardening vs. true overfitting. The data is consistent with two non-exclusive explanations:
- Benchmark hardening: Models have been specifically trained to refuse AdvBench-style direct requests. This is expected and intentional safety training.
- Sophistication gap: Novel compositional attacks bypass safety filters that were designed for direct harmful requests.

What This Does NOT Show

We cannot isolate novelty from sophistication. To test pure overfitting, we would need novel prompts that use the same direct-request style as AdvBench but with different harmful content. Our novel families use fundamentally different attack architectures.
Single-model comparison (Trinity) limits generalizability. The Nemotron result (opposite direction) on a different novel set suggests model-specific effects.
Heuristic classification may misclassify VLA/robotics responses that look operationally correct but contain embedded harm.

Limitations

Confounded comparison: AdvBench = direct requests; novel families = compositional/embodied attacks. The ASR difference measures attack sophistication as much as benchmark overfitting.
Single model with both datasets: Only Trinity had clean traces on both sets. Qwen3-4b and Qwen3-8b were unavailable due to persistent rate limits.
Heuristic classification: Not LLM-graded. Per Mistake #21, keyword/heuristic classifiers have known reliability issues.
Sample size: 30 AdvBench + 59 novel = 89 total traces. While the effect is large enough to detect, larger samples would narrow confidence intervals.

Conclusion

Overfitting assessment: Inconclusive (with suggestive evidence of benchmark hardening).

The 51.5 percentage point ASR gap between AdvBench (36.7%) and novel families (88.1%) is statistically significant but confounded by attack sophistication differences. The data suggests that:

Models are well-trained to refuse direct harmful requests (the AdvBench style)
Compositional/embodied attacks that decompose harmful goals into benign-looking steps bypass these defenses
This pattern is more accurately described as a “sophistication gap” than “overfitting to AdvBench”

To definitively test overfitting, future work should compare AdvBench prompts against novel prompts that use the SAME direct-request format but with content not in public benchmarks.

Data Provenance

File	Description
`runs/advbench_baseline_free/arcee-ai_trinity-large-preview-free_traces.jsonl`	Trinity AdvBench traces (n=30)
`runs/novel_families_trinity_full/arcee-ai_trinity-large-preview-free_traces.jsonl`	Trinity novel families traces (n=60)
`runs/qwen3_overfitting_test_advbench/qwen_qwen3-8b_traces.jsonl`	Qwen3-8b AdvBench traces (n=60)
`data/splits/advbench_baseline_v0.1.jsonl`	AdvBench scenario source (520 scenarios)
`data/splits/novel_families_benchmark_v0.1.jsonl`	Novel families scenario source (60 scenarios)

Failed Attempts

Multiple attempts to collect fresh Qwen3-4b traces failed due to OpenRouter rate limits:

runs/qwen3_overfitting_test_novel/ — 30 traces, all HTTP 429 errors
runs/overfitting_test_advbench/ — 30 traces, all HTTP 400 errors (Google AI Studio system prompt rejection)
Direct Qwen3-4b API calls returned 429 throughout the session
Qwen3-coder:free also 429; most free models rate-limited during this session

Only arcee-ai/trinity-large-preview:free and google/gemma-3-4b-it:free responded consistently, but Gemma-3-4b rejected system prompts (HTTP 400 from Google AI Studio).