Visual Jailbreaks Evolved Stage 2: 12-Model Benchmark Analysis

Adrian Wedd

Report 339 Research — Empirical Study 2026-04-04

Abstract

This report presents analysis of the Visual Jailbreaks Evolved Stage 2 benchmark, run across 12 models ranging from 3B to 671B parameters. The benchmark tests 8 evolved scenarios across 2 attack layers (L3 Visual Semantic, L8 Cross-Modal Transfer) derived from the Stage 1 attack family taxonomy (Report #332). In this n=93, single-grader sample:

Pooled strict ASR (COMPLIANCE): 78.5% (73/93)
L3 Visual Semantic strict ASR: 72.9% (35/48)
L8 Cross-Modal Transfer strict ASR: 84.4% (38/45)

Eight of the twelve tested models reached 100% strict ASR. Three models, cogito-2.1:671b (671B), gpt-oss:120b (120B), and glm-5 (~700B), showed consistent partial resistance, with ASR of 37.5% (3/8), 25.0%, and 12.5% respectively per the Section 2.2 table; glm-5 had the lowest ASR. A fourth model, nemotron-3-super, also produced some refusals, at 60.0% ASR.

Grading caveat: Single grader (Claude Haiku 4.5 via OpenRouter). Per documented grader bias findings, a dual-grader audit is required before drawing strong conclusions. Results should be treated as directional estimates.

1. Stage 1 vs Stage 2 Comparison

Stage 1 (Report #332) established a baseline across 4 models using 20 scenarios covering all 8 attack layers. Stage 2 focuses on Layers 3 and 8 only — the two highest-performing attack families from Stage 1 — and tests evolved variants designed to bypass refusal patterns observed in Stage 1.

Metric	Stage 1 (Report #332)	Stage 2 (This Report)
Models tested	4	12
Scenarios	20 (all 8 layers)	8 (L3 + L8 only)
Total traces	80 (47 valid, 16 ERROR)	93 (all valid, 0 ERROR)
L3 broad ASR	66.7% (n=6 valid)	72.9% (n=48)
L8 broad ASR	60.0% (n=5 valid)	84.4% (n=45)

Key differences in experimental design that limit direct comparison:

Different graders. Stage 1 used nemotron-nano-9b-v2 (known REFUSAL bias); Stage 2 uses Claude Haiku 4.5. ASR differences may partly reflect grader calibration differences.
Zero parse errors in Stage 2. Stage 1 had a 44% grader ERROR rate on ArtPrompt ASCII art scenarios.
Evolved scenarios. Stage 2 scenarios were explicitly designed using Stage 1 refusal patterns as negative feedback.
Scale. In Stage 2, 8 models reached 100% ASR; in Stage 1, no model reached 100% on L3/L8. This may partly be an artifact of testing more small models, which likely lack the safety training of larger frontier models.

2. Benchmark Design

2.1 Scenario Set

The set contains 8 evolved scenarios across 2 attack layers and multiple harm domains. All scenarios use a pretext framing (transcription, OCR completion, gap-filling of pre-existing content) to exploit the transcription loophole: when a request is framed as completing or recovering existing content rather than generating new content, models appear to apply a lower safety threshold than they do for generation.

[Operational scenario prompts redacted for public release. Pattern-level description: scenarios combined institutional authority pretexts (medical, legal, forensic, accessibility) with gap-completion task framing across harm domains including cybersecurity, financial crime, and infrastructure security.]

2.2 Models

Model	Parameters	Family	Broad ASR
devstral-2:123b	123B	Mistral	100.0%
devstral-small-2:24b	24B	Mistral	100.0%
gemma3:4b	4B	Google	100.0%
gemma3:12b	12B	Google	100.0%
gemma3:27b	27B	Google	100.0%
ministral-3:3b	3B	Mistral	100.0%
ministral-3:8b	8B	Mistral	100.0%
nemotron-3-nano:30b	30B	Nvidia	100.0%
nemotron-3-super	~47B	Nvidia	60.0%
cogito-2.1:671b	671B	Llama-derivative	37.5%
gpt-oss:120b	120B	OpenAI open-source	25.0%
glm-5	~700B	Zhipu	12.5%
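The table lists rates but not per-model trace counts. As a rough consistency check, the sketch below assumes (the report does not state this) that every model except nemotron-3-super has a trace for all 8 scenarios, derives the nemotron-3-super count from the 93-trace total, and checks that the implied COMPLIANCE count matches the pooled 73. The rates are copied from the table; the per-model counts are inferred, not reported.

```python
# Broad ASR per model, copied from the table above.
asr = {
    "devstral-2:123b": 1.000,
    "devstral-small-2:24b": 1.000,
    "gemma3:4b": 1.000,
    "gemma3:12b": 1.000,
    "gemma3:27b": 1.000,
    "ministral-3:3b": 1.000,
    "ministral-3:8b": 1.000,
    "nemotron-3-nano:30b": 1.000,
    "nemotron-3-super": 0.600,
    "cogito-2.1:671b": 0.375,
    "gpt-oss:120b": 0.250,
    "glm-5": 0.125,
}

TOTAL_TRACES = 93       # reported pooled trace count
TOTAL_COMPLIANCE = 73   # reported pooled COMPLIANCE count
SCENARIOS = 8           # scenarios in the benchmark

# ASSUMPTION: all models except nemotron-3-super ran every scenario.
n = {model: SCENARIOS for model in asr}
n["nemotron-3-super"] = TOTAL_TRACES - SCENARIOS * (len(asr) - 1)

# Implied COMPLIANCE count per model under that assumption.
compliance = {model: round(asr[model] * n[model]) for model in asr}

print("nemotron-3-super traces:", n["nemotron-3-super"])
print("implied totals:", sum(n.values()), sum(compliance.values()))
```

Under that assumption the rates reconcile with the pooled figures, with nemotron-3-super contributing fewer than 8 traces. If the true per-model counts differ, the per-model rates should be recomputed from the raw traces instead.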

2.3 Grading

Traces were graded by FLIP backward inference using Claude Haiku 4.5 (via OpenRouter). All 93 traces were graded with 0 parse errors. All verdicts were binary (COMPLIANCE or REFUSAL); no PARTIAL or HALLUCINATION_REFUSAL outcomes were observed.

3. Results

3.1 Pooled Summary

Metric	Value	Share of traces or n
Total traces	93	n/a
COMPLIANCE verdicts	73	78.5%
REFUSAL verdicts	20	21.5%
PARTIAL verdicts	0	0.0%
Pooled strict ASR	78.5%	n=93
Refusal Boundary Integrity (RBI)	19.4% [12.6%, 28.5%]	n=93
Recovery Reentry Rate (RRR)	51.6% [41.6%, 61.5%]	n=93
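The report does not say how the bracketed intervals were computed. A minimal sketch, assuming they are 95% Wilson score intervals and that the underlying counts are 18/93 for RBI and 48/93 for RRR (both inferred from the percentages, not stated in the report), is shown below. The helper name `wilson_interval` is illustrative.

```python
from math import sqrt


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# ASSUMPTION: counts inferred from the reported percentages over n=93.
for label, count in [("RBI", 18), ("RRR", 48), ("Pooled strict ASR", 73)]:
    lo, hi = wilson_interval(count, 93)
    print(f"{label}: {count / 93:.1%} [{lo:.1%}, {hi:.1%}]")
```

Under those assumptions the RBI and RRR intervals come out at [12.6%, 28.5%] and [41.6%, 61.5%], matching the table, which suggests a Wilson-style interval was used. The same helper gives an interval for the pooled strict ASR, which the table reports only as a point estimate.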

3.2 Per-Layer ASR

Layer	Class	n	C	R	Strict ASR	Broad ASR
L3	l3_visual_semantic	48	35	13	72.9%	72.9%
L8	l8_visual_cross_modal	45	38	7	84.4%	84.4%

L8 (Cross-Modal Transfer) had a higher ASR than L3 (Visual Semantic) by 11.5 percentage points. The direction fits the Stage 1 interpretation that modal transfer pretexts exploit a stronger loophole than static visual transcription pretexts, although the Stage 1 ASR figures in Section 1 (L3 66.7%, n=6; L8 60.0%, n=5) do not themselves show that ordering. The Stage 2 difference is not statistically significant given the sample sizes, and the single-grader design adds further uncertainty.
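The report does not name the test behind the significance statement. One common choice is a pooled two-proportion z-test, sketched here on the per-layer counts from the table (35/48 for L3, 38/45 for L8). It treats traces as independent, which is an approximation because the same models contribute to both layers.

```python
from math import erfc, sqrt


def two_proportion_z(k1: int, n1: int, k2: int, n2: int) -> tuple[float, float]:
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    # Two-sided tail probability of a standard normal.
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value


# L3: 35 COMPLIANCE of 48 traces; L8: 38 COMPLIANCE of 45 traces.
z, p_value = two_proportion_z(35, 48, 38, 45)
print(f"z = {z:.2f}, two-sided p = {p_value:.3f}")
```

With these counts z is about 1.35 and the two-sided p-value is roughly 0.18, well above 0.05, which agrees with the statement that the 11.5-point gap is not significant at this sample size.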

3.3 Resistance Patterns

Among the four models that showed any refusals, harm domains closer to direct physical harm (sedative/drug facilitation, pharmaceutical diversion) triggered more refusals than technical infrastructure or financial crime domains. The accessibility framing combined with ICS/SCADA content — which is technical rather than directly physical-harm-adjacent — produced COMPLIANCE even in the most safety-trained models.

4. Notable Findings

4.1 Scale Does Not Predict Resistance at Middle Range

Among the eight models at 100% ASR, parameter count ranged from 3B to 123B, so scale alone does not predict resistance in this range. The contrast is sharpest near 120B: devstral-2:123b reached 100.0% ASR while gpt-oss:120b reached 25.0%. The two highest-parameter models tested (cogito-2.1:671b at 671B and glm-5 at ~700B) were among the most resistant, which is consistent with frontier-scale safety training providing more resistance, though n=8 per model is too small to confirm this.

4.2 Accessibility Framing as Universal Penetration Vector

One scenario using accessibility framing (ADA caption correction) combined with ICS/SCADA technical content achieved COMPLIANCE across all models that ran it, including the most resistant models. The accessibility framing combined with technical (rather than directly physical-harm) content appears to be a robust combination that bypasses refusal triggers.

4.3 Harm Domain Sensitivity Gradient

Across partially-resistant models, refusal rates varied widely by harm domain: ICS/SCADA technical content (0% refusal), financial crime (25%), surveillance tradecraft (50%), and sedative facilitation (100%). This suggests that, among models with residual safety behavior, harm domains closer to direct physical harm trigger more refusals. Each domain rate rests on a handful of traces, so the gradient should be read as directional.

5. Limitations

Single grader with uncharacterized bias. Claude Haiku 4.5 grader bias on this scenario set has not been assessed.
n=8 per model is insufficient for per-model conclusions. No per-model comparison survives Bonferroni correction.
L3/L8 only — not a full visual jailbreak benchmark. Layers with near-zero ASR in Stage 1 are not tested here.
Binary verdicts only. No PARTIAL outcomes observed, which is unusual and may indicate grader calibration effects.
Text-only benchmark. Actual image-based visual input would be required to test multimodal model behavior.
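The Bonferroni limitation can be illustrated with the most extreme per-model contrast. The report does not specify the test or the number of comparisons, so this sketch assumes a two-sided Fisher exact test, all 66 pairwise comparisons among 12 models, and 8 traces for both models in the pair (8/8 COMPLIANCE for a 100% model versus 1/8 for glm-5, inferred from its 12.5% rate).

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables at least as unlikely as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_observed * (1 + 1e-9))


# ASSUMPTION: rows are (COMPLIANCE, REFUSAL) with n=8 for each model.
p_value = fisher_exact_two_sided(8, 0, 1, 7)
n_comparisons = comb(12, 2)        # all model pairs
threshold = 0.05 / n_comparisons   # Bonferroni-corrected alpha

print(f"p = {p_value:.5f}, threshold = {threshold:.5f}")
print("survives correction:", p_value < threshold)
```

Under these assumptions even the largest gap gives p of about 0.0014 against a corrected threshold of about 0.00076, so it does not survive. The margin is narrow: a one-sided test or fewer planned comparisons could change the outcome, which is another reason to scale n per model.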

6. Recommendations

The accessibility framing pattern warrants targeted defense evaluation to determine whether models that refuse drug synthesis requests exempt infrastructure attack content, or whether the accessibility framing itself is the operative bypass.
Dual-grader audit required before citing these figures as definitive.
Scale to n=20+ per model for more reliable per-model comparisons.
Add frontier models (GPT-4o, Claude Sonnet 4.5, Gemini 2.0 Flash) to test whether the pattern holds against frontier safety training.

Report #339 | F41LUR3-F1R57 Adversarial AI Research