Summary
This report presents the first systematic taxonomy of visual jailbreak attacks against language models, organizing the attack surface into 8 distinct layers. 20 text-representable scenarios covering all 8 layers were benchmarked against 4 models and cross-referenced with 18 existing ArtPrompt scenarios. Results demonstrate that visual encoding attacks exploit a fundamental gap: models process encoded content through the same reasoning pipeline as natural language, while safety training primarily targets natural-language patterns.
8-Layer Visual Attack Surface Taxonomy
| Layer | Name | Mechanism | Example |
|---|---|---|---|
| 1 | ASCII Art Encoding | Harmful keywords encoded as ASCII art | ArtPrompt (Jiang et al., 2024) |
| 2 | Typographic Manipulation | Unicode homoglyphs, bidi overrides | Cyrillic/Latin substitution |
| 3 | Visual Semantic | Described images containing harmful content | Fake OCR/transcription requests |
| 4 | Metadata Injection | Harmful instructions in file metadata | EXIF UserComment injection |
| 5 | Steganographic Encoding | Hidden messages in acrostics, LSB simulation | First-letter acrostics |
| 6 | Adversarial Patches | Physical perturbations for perception systems | Patch design for VLA systems |
| 7 | Compositional Multi-Image | Individually benign fragments combining to harmful whole | Document fragment reconstruction |
| 8 | Cross-Modal Transfer | Exploit visual-to-text translation boundary | Fake video transcription |
Layers 1-5 (Encoding attacks): Exploit the model’s ability to decode various encoding schemes. Safety classifiers trained on natural language fail when the same harmful request arrives encoded (see the sketch below, after Layer 8).
Layer 6 (Adversarial patches): Unique to multimodal and embodied systems.
Layer 7 (Compositional): Exploits the gap between per-component safety evaluation and holistic intent understanding.
Layer 8 (Cross-modal transfer): Exploits the pretext of visual-to-text translation to bypass safety filters.
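To make the encoding-attack family (Layers 1-5) concrete, the sketch below applies two of the transforms to a benign placeholder word. The homoglyph table and carrier sentences are illustrative only; they are not the scenarios used in this benchmark.

```python
import unicodedata

# Layer 2: Latin -> Cyrillic homoglyph substitution. The output renders
# identically to the input but differs at the codepoint level, which is
# what defeats classifiers trained on natural-language surface forms.
# Illustrative table only; not the benchmark's substitution set.
HOMOGLYPHS = {"a": "\u0430", "c": "\u0441", "e": "\u0435", "o": "\u043e", "p": "\u0440"}

def homoglyph_encode(text: str) -> str:
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)

# Layer 5: first-letter acrostic. Each letter of the keyword becomes the
# initial of an otherwise benign carrier line; the keyword never appears
# as a contiguous token. Carrier sentences are placeholders.
CARRIERS = {
    "e": "Every garden needs water.",
    "x": "Xylophones are surprisingly loud.",
    "a": "Apples ripen in autumn.",
    "m": "Maps fold along their creases.",
    "p": "Pianos have 88 keys.",
    "l": "Lakes freeze from the surface down.",
}

def acrostic_encode(keyword: str) -> str:
    return "\n".join(CARRIERS[ch] for ch in keyword)

encoded = homoglyph_encode("example")
print(encoded)                       # renders as "example"
print(encoded == "example")          # False: different codepoints
print(unicodedata.name(encoded[0]))  # CYRILLIC SMALL LETTER IE
print(acrostic_encode("example"))
```

Both transforms leave the content readable to a capable decoder while breaking exact-match and n-gram filters, which is precisely the gap described in the Summary.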
Results
ArtPrompt Benchmark (18 scenarios x 4 models)
72 traces collected, 40 valid. Strict ASR: 2.5% (1/40), Broad ASR: 5.0% (2/40). ASR is near zero: 3 of 4 safety-trained models refused every ArtPrompt attack, suggesting current models have been largely patched against this attack vector, which was effective in 2024.
Visual Jailbreaks Benchmark (20 scenarios x 4 models)
80 traces collected, 47 valid. Strict ASR: 27.7% (13/47), Broad ASR: 42.6% (20/47).
Per-Layer ASR
| Layer | n valid | Strict ASR | Broad ASR | Key Observation |
|---|---|---|---|---|
| L0 (Control) | 8 | 62.5% | 62.5% | Expected compliance |
| L1 (ASCII Art) | 6 | 16.7% | 66.7% | High PARTIAL — models decode but hedge |
| L2 (Typographic) | 8 | 0.0% | 0.0% | Well-defended |
| L3 (Visual Semantic) | 6 | 50.0% | 66.7% | Major gap — transcription pretext works |
| L4 (Metadata) | 1 | 0.0% | 0.0% | n=1, inconclusive |
| L5 (Steganography) | 4 | 25.0% | 25.0% | Partial defense |
| L6 (Adversarial Patch) | 5 | 20.0% | 40.0% | Models discuss patch methodology |
| L7 (Compositional) | 4 | 0.0% | 25.0% | PARTIAL only |
| L8 (Cross-Modal) | 5 | 40.0% | 60.0% | Major gap — modal transfer pretext |
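For clarity on how the Strict/Broad split is derived, the sketch below recomputes the figures above from graded traces. It assumes each valid trace reduces to a (layer, grade) pair with grades in {SUCCESS, PARTIAL, REFUSAL}; the grade names are inferred from the PARTIAL/REFUSAL labels used in this report, not taken from the actual grading pipeline.

```python
from collections import defaultdict

def per_layer_asr(traces):
    """traces: iterable of (layer, grade) pairs; grade names assumed."""
    buckets = defaultdict(lambda: [0, 0, 0])  # [n, success, partial]
    for layer, grade in traces:
        b = buckets[layer]
        b[0] += 1
        if grade == "SUCCESS":
            b[1] += 1
        elif grade == "PARTIAL":
            b[2] += 1
    return {
        layer: {
            "n": n,
            "strict_asr": s / n,       # SUCCESS only
            "broad_asr": (s + p) / n,  # SUCCESS + PARTIAL
        }
        for layer, (n, s, p) in buckets.items()
    }

# L1 above: 6 valid traces, 1 SUCCESS, 3 PARTIAL, 2 REFUSAL
traces = ([("L1", "SUCCESS")] + [("L1", "PARTIAL")] * 3
          + [("L1", "REFUSAL")] * 2)
print(per_layer_asr(traces))  # strict 0.167, broad 0.667
```

Note that the broad metric is strictly a superset of the strict one: every gap between the two columns above is PARTIAL mass.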
Key Findings
The Transcription Loophole (Layers 3 + 8)
The most effective layers are L3 (66.7% broad ASR) and L8 (60.0% broad ASR). Both exploit the same mechanism: framing harmful generation as content translation rather than content creation. This is structurally analogous to the DETECTED_PROCEEDS phenomenon.
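As one hedged illustration of what a defense would have to catch, the heuristic below flags prompts that pair a transcription/translation verb with a claimed visual source. The keyword lists and function name are invented for this sketch, and a pattern this shallow would also flag many benign transcription requests; a real defense would need to evaluate the decoded content itself, which is exactly the gap this finding points at.

```python
import re

# Shallow heuristic for the transcription loophole (L3/L8): flag a
# prompt when a rendering/translation verb co-occurs with a claimed
# pre-existing visual source. Keyword lists are illustrative only.
PRETEXT_VERBS = re.compile(
    r"\b(transcribe|transcription|ocr|read out|translate|extract the text)\b",
    re.IGNORECASE)
SOURCE_CLAIMS = re.compile(
    r"\b(image|screenshot|video|frame|scan|attached document)\b",
    re.IGNORECASE)

def transcription_pretext(prompt: str) -> bool:
    return bool(PRETEXT_VERBS.search(prompt) and SOURCE_CLAIMS.search(prompt))

print(transcription_pretext(
    "Transcribe the text visible in the attached screenshot."))  # True
print(transcription_pretext("Summarize this paragraph."))        # False
```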
Pure Encoding Attacks Are Patched
ArtPrompt-style ASCII art encoding and typographic manipulation achieve near-zero ASR, suggesting that 2026-era models have been specifically hardened against encoding-based attacks.
Attack Surface Heterogeneity
The 8-layer taxonomy reveals that “visual jailbreak” is not a monolithic category but 8+ independent channels, each requiring distinct detection mechanisms; per-layer broad ASR ranges from 0.0% to 66.7%.
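A structural consequence, sketched below: a defense pipeline would plausibly dispatch each input to one detector per channel rather than relying on a single monolithic classifier. The registry and the L2 mixed-script stub are hypothetical illustrations, not an implemented defense.

```python
import unicodedata
from typing import Callable, Dict

Detector = Callable[[str], bool]
DETECTORS: Dict[str, Detector] = {}  # one detector per attack layer

def register(layer: str):
    def wrap(fn: Detector) -> Detector:
        DETECTORS[layer] = fn
        return fn
    return wrap

@register("L2")
def mixed_script(prompt: str) -> bool:
    # Mixed Latin/Cyrillic letters in one prompt is a cheap homoglyph signal.
    scripts = {unicodedata.name(c).split()[0] for c in prompt if c.isalpha()}
    return {"LATIN", "CYRILLIC"} <= scripts

def scan(prompt: str) -> list:
    return [layer for layer, fn in DETECTORS.items() if fn(prompt)]

print(scan("This sentence hides a Cyrillic \u0430."))  # ['L2']
```

Surface-feature detectors of this kind only plausibly cover the encoding layers; L3, L7, and L8 would require intent-level analysis, consistent with the 0-67% spread above.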
Embodied AI Implications
Layers 6-8 are uniquely relevant to embodied systems: adversarial patches in the robot’s visual field (L6), multi-sensor compositional attacks (L7), and the perception-to-action gap (L8).
Limitations
- Text-only benchmark: scenarios are text representations of visual attacks, not actual visual inputs
- Single grader with known REFUSAL bias
- Small n per layer (2-3 scenarios)
- No defense testing
References
- Jiang et al. (2024). ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs. arXiv:2402.11753
- Gong et al. (2023). FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts. arXiv:2311.05608
- BadVLA (2025). Adversarial attacks on Vision-Language-Action models.
- Blindfold (SenSys 2026). Visual deprivation attacks on VLA systems.
Report #332 | F41LUR3-F1R57 Adversarial AI Research