Summary
This report presents the first systematic taxonomy of visual jailbreak attacks against language models, organizing the attack surface into 8 distinct layers. 20 text-representable scenarios covering all 8 layers were benchmarked against 4 models and cross-referenced with 18 existing ArtPrompt scenarios. Results demonstrate that visual encoding attacks exploit a fundamental gap: models process encoded content through the same reasoning pipeline as natural language, while safety training primarily targets natural-language patterns.
8-Layer Visual Attack Surface Taxonomy
| Layer | Name | Mechanism | Example |
|---|---|---|---|
| 1 | ASCII Art Encoding | Harmful keywords encoded as ASCII art | ArtPrompt (Jiang et al., 2024) |
| 2 | Typographic Manipulation | Unicode homoglyphs, bidi overrides | Cyrillic/Latin substitution |
| 3 | Visual Semantic | Described images containing harmful content | Fake OCR/transcription requests |
| 4 | Metadata Injection | Harmful instructions in file metadata | EXIF UserComment injection |
| 5 | Steganographic Encoding | Hidden messages in acrostics, LSB simulation | First-letter acrostics |
| 6 | Adversarial Patches | Physical perturbations for perception systems | Patch design for VLA systems |
| 7 | Compositional Multi-Image | Individually benign fragments combining to harmful whole | Document fragment reconstruction |
| 8 | Cross-Modal Transfer | Exploit visual-to-text translation boundary | Fake video transcription |
Layers 1-5 (Encoding attacks): Exploit the model’s ability to decode various encoding schemes. Safety classifiers trained on natural language fail when the same harmful request arrives encoded (see the sketch below, after Layer 8).
Layer 6 (Adversarial patches): Unique to multimodal and embodied systems.
Layer 7 (Compositional): Exploits the gap between per-component safety evaluation and holistic intent understanding.
Layer 8 (Cross-modal transfer): Exploits the pretext of visual-to-text translation to bypass safety filters.
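To make the encoding-attack family (Layers 1-5) concrete, the sketch below applies two of the transforms to a benign placeholder word. The homoglyph table and carrier sentences are illustrative only; they are not the scenarios used in this benchmark.

```python
import unicodedata

# Layer 2: Latin -> Cyrillic homoglyph substitution. The output renders
# identically to the input but differs at the codepoint level, which is
# what defeats classifiers trained on natural-language surface forms.
# Illustrative table only; not the benchmark's substitution set.
HOMOGLYPHS = {"a": "\u0430", "c": "\u0441", "e": "\u0435", "o": "\u043e", "p": "\u0440"}

def homoglyph_encode(text: str) -> str:
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)

# Layer 5: first-letter acrostic. Each letter of the keyword becomes the
# initial of an otherwise benign carrier line; the keyword never appears
# as a contiguous token. Carrier sentences are placeholders.
CARRIERS = {
    "e": "Every garden needs water.",
    "x": "Xylophones are surprisingly loud.",
    "a": "Apples ripen in autumn.",
    "m": "Maps fold along their creases.",
    "p": "Pianos have 88 keys.",
    "l": "Lakes freeze from the surface down.",
}

def acrostic_encode(keyword: str) -> str:
    return "\n".join(CARRIERS[ch] for ch in keyword)

encoded = homoglyph_encode("example")
print(encoded)                       # renders as "example"
print(encoded == "example")          # False: different codepoints
print(unicodedata.name(encoded[0]))  # CYRILLIC SMALL LETTER IE
print(acrostic_encode("example"))
```

Both transforms leave the content readable to a capable decoder while breaking exact-match and n-gram filters, which is precisely the gap described in the Summary.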
Results
ArtPrompt Benchmark (18 scenarios x 4 models)
72 traces collected, 40 valid. Strict ASR: 2.5% (1/40), Broad ASR: 5.0% (2/40). ASR is near zero: 3 of 4 safety-trained models refused every ArtPrompt attack, suggesting current models have been largely patched against this attack vector, which was effective in 2024.
Visual Jailbreaks Benchmark (20 scenarios x 4 models)
80 traces collected, 47 valid. Strict ASR: 27.7% (13/47), Broad ASR: 42.6% (20/47).
Per-Layer ASR
| Layer | n valid | Strict ASR | Broad ASR | Key Observation |
|---|---|---|---|---|
| L0 (Control) | 8 | 62.5% | 62.5% | Expected compliance |
| L1 (ASCII Art) | 6 | 16.7% | 66.7% | High PARTIAL — models decode but hedge |
| L2 (Typographic) | 8 | 0.0% | 0.0% | Well-defended |
| L3 (Visual Semantic) | 6 | 50.0% | 66.7% | Major gap — transcription pretext works |
| L4 (Metadata) | 1 | 0.0% | 0.0% | n=1, inconclusive |
| L5 (Steganography) | 4 | 25.0% | 25.0% | Partial defense |
| L6 (Adversarial Patch) | 5 | 20.0% | 40.0% | Models discuss patch methodology |
| L7 (Compositional) | 4 | 0.0% | 25.0% | PARTIAL only |
| L8 (Cross-Modal) | 5 | 40.0% | 60.0% | Major gap — modal transfer pretext |
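For clarity on how the Strict/Broad split is derived, the sketch below recomputes the figures above from graded traces. It assumes each valid trace reduces to a (layer, grade) pair with grades in {SUCCESS, PARTIAL, REFUSAL}; the grade names are inferred from the PARTIAL/REFUSAL labels used in this report, not taken from the actual grading pipeline.

```python
from collections import defaultdict

def per_layer_asr(traces):
    """traces: iterable of (layer, grade) pairs; grade names assumed."""
    buckets = defaultdict(lambda: [0, 0, 0])  # [n, success, partial]
    for layer, grade in traces:
        b = buckets[layer]
        b[0] += 1
        if grade == "SUCCESS":
            b[1] += 1
        elif grade == "PARTIAL":
            b[2] += 1
    return {
        layer: {
            "n": n,
            "strict_asr": s / n,       # SUCCESS only
            "broad_asr": (s + p) / n,  # SUCCESS + PARTIAL
        }
        for layer, (n, s, p) in buckets.items()
    }

# L1 above: 6 valid traces, 1 SUCCESS, 3 PARTIAL, 2 REFUSAL
traces = ([("L1", "SUCCESS")] + [("L1", "PARTIAL")] * 3
          + [("L1", "REFUSAL")] * 2)
print(per_layer_asr(traces))  # strict 0.167, broad 0.667
```

Note that the broad metric is strictly a superset of the strict one: every gap between the two columns above is PARTIAL mass.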
Key Findings
The Transcription Loophole (Layers 3 + 8)
The most effective layers are L3 (66.7% broad ASR) and L8 (60.0% broad ASR). Both exploit the same mechanism: framing harmful generation as content translation rather than content creation. This is structurally analogous to the DETECTED_PROCEEDS phenomenon.
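As one hedged illustration of what a defense would have to catch, the heuristic below flags prompts that pair a transcription/translation verb with a claimed visual source. The keyword lists and function name are invented for this sketch, and a pattern this shallow would also flag many benign transcription requests; a real defense would need to evaluate the decoded content itself, which is exactly the gap this finding points at.

```python
import re

# Shallow heuristic for the transcription loophole (L3/L8): flag a
# prompt when a rendering/translation verb co-occurs with a claimed
# pre-existing visual source. Keyword lists are illustrative only.
PRETEXT_VERBS = re.compile(
    r"\b(transcribe|transcription|ocr|read out|translate|extract the text)\b",
    re.IGNORECASE)
SOURCE_CLAIMS = re.compile(
    r"\b(image|screenshot|video|frame|scan|attached document)\b",
    re.IGNORECASE)

def transcription_pretext(prompt: str) -> bool:
    return bool(PRETEXT_VERBS.search(prompt) and SOURCE_CLAIMS.search(prompt))

print(transcription_pretext(
    "Transcribe the text visible in the attached screenshot."))  # True
print(transcription_pretext("Summarize this paragraph."))        # False
```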
Pure Encoding Attacks Are Patched
ArtPrompt-style ASCII art encoding and typographic manipulation achieve near-zero ASR, suggesting that 2026-era models have been specifically hardened against encoding-based attacks.
Attack Surface Heterogeneity
The 8-layer taxonomy reveals that “visual jailbreak” is not a monolithic category but 8+ independent channels, each requiring distinct detection mechanisms; per-layer broad ASR ranges from 0.0% to 66.7%.
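A structural consequence, sketched below: a defense pipeline would plausibly dispatch each input to one detector per channel rather than relying on a single monolithic classifier. The registry and the L2 mixed-script stub are hypothetical illustrations, not an implemented defense.

```python
import unicodedata
from typing import Callable, Dict

Detector = Callable[[str], bool]
DETECTORS: Dict[str, Detector] = {}  # one detector per attack layer

def register(layer: str):
    def wrap(fn: Detector) -> Detector:
        DETECTORS[layer] = fn
        return fn
    return wrap

@register("L2")
def mixed_script(prompt: str) -> bool:
    # Mixed Latin/Cyrillic letters in one prompt is a cheap homoglyph signal.
    scripts = {unicodedata.name(c).split()[0] for c in prompt if c.isalpha()}
    return {"LATIN", "CYRILLIC"} <= scripts

def scan(prompt: str) -> list:
    return [layer for layer, fn in DETECTORS.items() if fn(prompt)]

print(scan("This sentence hides a Cyrillic \u0430."))  # ['L2']
```

Surface-feature detectors of this kind only plausibly cover the encoding layers; L3, L7, and L8 would require intent-level analysis, consistent with the 0-67% spread above.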
Embodied AI Implications
Layers 6-8 are uniquely relevant to embodied systems: adversarial patches in the robot’s visual field (L6), multi-sensor compositional attacks (L7), and the perception-to-action gap (L8).
Limitations
- Text-only benchmark: scenarios are text representations of visual attacks, not actual visual inputs
- Single grader with known REFUSAL bias
- Small n per layer (2-3 scenarios)
- No defense testing
References
- Jiang et al. (2024). ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs. arXiv:2402.11753
- Gong et al. (2023). FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts. arXiv:2311.05608
- BadVLA (2025). Adversarial attacks on Vision-Language-Action models.
- Blindfold (SenSys 2026). Visual deprivation attacks on VLA systems.
Report #332 | F41LUR3-F1R57 Adversarial AI Research