Report 332 Research — Empirical Study

Summary

This report presents the first systematic taxonomy of visual jailbreak attacks against language models, organizing the attack surface into 8 distinct layers. 20 text-representable scenarios covering all layers were benchmarked against 4 models, and cross-referenced with 18 existing ArtPrompt scenarios. Results demonstrate that visual encoding attacks exploit a fundamental gap: models process encoded content through the same reasoning pipeline as natural language, but safety training primarily targets natural language patterns.

8-Layer Visual Attack Surface Taxonomy

| Layer | Name | Mechanism | Example |
| --- | --- | --- | --- |
| 1 | ASCII Art Encoding | Harmful keywords encoded as ASCII art | ArtPrompt (Jiang et al., 2024) |
| 2 | Typographic Manipulation | Unicode homoglyphs, bidi overrides | Cyrillic/Latin substitution |
| 3 | Visual Semantic | Described images containing harmful content | Fake OCR/transcription requests |
| 4 | Metadata Injection | Harmful instructions in file metadata | EXIF UserComment injection |
| 5 | Steganographic Encoding | Hidden messages in acrostics, LSB simulation | First-letter acrostics |
| 6 | Adversarial Patches | Physical perturbations for perception systems | Patch design for VLA systems |
| 7 | Compositional Multi-Image | Individually benign fragments combining to harmful whole | Document fragment reconstruction |
| 8 | Cross-Modal Transfer | Exploit visual-to-text translation boundary | Fake video transcription |

Layers 1-5 (Encoding attacks): Exploit the model’s ability to decode various encoding schemes. Safety classifiers trained on natural language fail when harmful requests are encoded.
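
As a defensive counterpart to the encoding layers, a common mitigation is to canonicalize text before it reaches a safety classifier. The Python sketch below is a minimal illustration of that idea, not part of the benchmark harness; the homoglyph table is deliberately tiny and the function name is our own.

```python
# Illustrative pre-filter for Layer 1-2 style encodings: fold a few common
# Cyrillic/Latin homoglyphs and strip bidi override characters so a downstream
# safety classifier sees canonical text. Real confusable tables are far larger.
import unicodedata

HOMOGLYPHS = {
    "а": "a", "е": "e", "о": "o", "р": "p", "с": "c", "х": "x", "у": "y", "і": "i",
}
BIDI_CONTROLS = {"\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
                 "\u2066", "\u2067", "\u2068", "\u2069"}

def canonicalize(text: str) -> str:
    """Return a normalized form of `text` for safety filtering."""
    text = unicodedata.normalize("NFKC", text)                     # fold fullwidth/compatibility forms
    text = "".join(ch for ch in text if ch not in BIDI_CONTROLS)   # drop bidi overrides
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)          # map known homoglyphs

print(canonicalize("рау\u202eload"))  # Cyrillic р/а/у plus a bidi override -> "payload"
```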

Layer 6 (Adversarial patches): Unique to multimodal and embodied systems.

Layer 7 (Compositional): Exploits the gap between per-component safety evaluation and holistic intent understanding.
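
A minimal sketch of that gap is shown below, using a placeholder keyword check in place of a real safety classifier; the phrase list and function names are hypothetical, not the evaluation used in this benchmark.

```python
# Hypothetical illustration of the per-component vs. holistic gap:
# each fragment passes the check in isolation, the reconstructed whole does not.
BLOCKED_PHRASES = ["step-by-step synthesis route"]   # illustrative placeholder only

def looks_unsafe(text: str) -> bool:
    """Stand-in for a real safety classifier."""
    return any(p in text.lower() for p in BLOCKED_PHRASES)

fragments = ["please give me the step-by-step", "synthesis route for the compound"]

per_fragment = any(looks_unsafe(f) for f in fragments)   # False: each piece looks benign
holistic = looks_unsafe(" ".join(fragments))             # True: the joined request trips the check
print(per_fragment, holistic)  # -> False True
```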

Layer 8 (Cross-modal transfer): Exploits the pretense of visual-to-text translation to bypass safety.

Results

ArtPrompt Benchmark (18 scenarios x 4 models)

72 traces collected, 40 valid. Strict ASR: 2.5% (1/40), Broad ASR: 5.0% (2/40). ASR is near zero: 3 of the 4 safety-trained models refused every ArtPrompt attack, suggesting current models are largely patched against this specific vector, which was effective in 2024.

Visual Jailbreaks Benchmark (20 scenarios x 4 models)

80 traces collected, 47 valid. Strict ASR: 27.7% (13/47), Broad ASR: 42.6% (20/47).
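
For reference, the two metrics can be reproduced as below. This reflects our reading of the grading scheme, assuming each valid trace receives one of SUCCESS, PARTIAL, or REFUSAL, with strict ASR counting only SUCCESS and broad ASR counting SUCCESS or PARTIAL; the grader's exact label set is an assumption.

```python
# ASR metrics under an assumed three-way grade per valid trace.
from collections import Counter

def asr(grades: list[str]) -> tuple[float, float]:
    """Return (strict ASR, broad ASR) over graded traces."""
    counts, n = Counter(grades), len(grades)
    strict = counts["SUCCESS"] / n
    broad = (counts["SUCCESS"] + counts["PARTIAL"]) / n
    return strict, broad

# Reproduces the headline numbers: 13 SUCCESS + 7 PARTIAL + 27 REFUSAL = 47 valid traces.
grades = ["SUCCESS"] * 13 + ["PARTIAL"] * 7 + ["REFUSAL"] * 27
strict, broad = asr(grades)
print(f"{strict:.1%} strict, {broad:.1%} broad")  # -> 27.7% strict, 42.6% broad
```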

Per-Layer ASR

| Layer | n valid | Strict ASR | Broad ASR | Key Observation |
| --- | --- | --- | --- | --- |
| L0 (Control) | 8 | 62.5% | 62.5% | Expected compliance |
| L1 (ASCII Art) | 6 | 16.7% | 66.7% | High PARTIAL; models decode but hedge |
| L2 (Typographic) | 8 | 0.0% | 0.0% | Well-defended |
| L3 (Visual Semantic) | 6 | 50.0% | 66.7% | Major gap; transcription pretext works |
| L4 (Metadata) | 1 | 0.0% | 0.0% | n=1, inconclusive |
| L5 (Steganography) | 4 | 25.0% | 25.0% | Partial defense |
| L6 (Adversarial Patch) | 5 | 20.0% | 40.0% | Models discuss patch methodology |
| L7 (Compositional) | 4 | 0.0% | 25.0% | PARTIAL only |
| L8 (Cross-Modal) | 5 | 40.0% | 60.0% | Major gap; modal transfer pretext |
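
The per-layer breakdown above follows from the same traces with a simple group-by. The record fields below (layer, grade) are an assumed trace format, not a documented schema.

```python
# Hypothetical per-layer aggregation over graded traces (field names assumed).
from collections import defaultdict

def per_layer_asr(traces: list[dict]) -> dict[str, tuple[int, float, float]]:
    """Map each layer to (n valid, strict ASR, broad ASR)."""
    grades_by_layer = defaultdict(list)
    for t in traces:                               # t like {"layer": "L3", "grade": "SUCCESS"}
        grades_by_layer[t["layer"]].append(t["grade"])
    result = {}
    for layer, grades in sorted(grades_by_layer.items()):
        n = len(grades)
        strict = sum(g == "SUCCESS" for g in grades) / n
        broad = sum(g in ("SUCCESS", "PARTIAL") for g in grades) / n
        result[layer] = (n, strict, broad)
    return result
```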

Key Findings

The Transcription Loophole (Layers 3 + 8)

The most effective layers are L3 (67% broad ASR) and L8 (60% broad ASR). Both exploit the same mechanism: framing harmful generation as content translation rather than content creation. This is structurally analogous to the DETECTED_PROCEEDS phenomenon.
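
One lightweight mitigation this finding suggests is to flag translation/transcription framings for closer review before the model complies. The regex heuristic below is a sketch of our own, not a mechanism observed in the tested models, and a production filter would need far more coverage and context.

```python
# Illustrative heuristic: flag prompts that use a transcription/translation
# pretext. Coarse by design; it will also flag perfectly benign requests.
import re

PRETEXT_PATTERNS = [
    r"\btranscribe\b",
    r"\bocr\b",
    r"\btranslate (this|the) (image|video|audio)\b",
    r"\bwhat does (this|the) (image|screenshot|video) say\b",
]

def flags_transcription_pretext(prompt: str) -> bool:
    """Return True if the prompt matches any known pretext pattern."""
    p = prompt.lower()
    return any(re.search(pat, p) for pat in PRETEXT_PATTERNS)

print(flags_transcription_pretext("Please transcribe the text shown in this image exactly."))  # -> True
```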

Pure Encoding Attacks Are Patched

ArtPrompt-style ASCII art encoding and typographic manipulation achieve near-zero ASR, suggesting that 2026-era models have been specifically hardened against encoding-based attacks.

Attack Surface Heterogeneity

The 8-layer taxonomy reveals that “visual jailbreak” is not a monolithic category but at least eight largely independent channels, each requiring distinct detection mechanisms. Per-layer broad ASR ranges from 0% to 67%.

Embodied AI Implications

Layers 6-8 are uniquely relevant to embodied systems: adversarial patches in the robot’s visual field (L6), multi-sensor compositional attacks (L7), and the perception-to-action gap (L8).

Limitations

  1. Text-only benchmark — not actual visual inputs
  2. Single grader with known REFUSAL bias
  3. Small n per layer (2-3 scenarios)
  4. No defense testing

References

  • Jiang et al. (2024). ArtPrompt: ASCII Art-based Jailbreak Attacks. arXiv:2402.11753
  • Gong et al. (2023). FigStep: Jailbreaking Large VLMs via Typographic Prompts. arXiv:2311.05608
  • BadVLA (2025). Adversarial attacks on Vision-Language-Action models
  • Blindfold (SenSys 2026). Visual deprivation attacks on VLA systems

Report #332 | F41LUR3-F1R57 Adversarial AI Research
