Daily Paper

Implicit Jailbreak Attacks via Cross-Modal Information Concealment on Vision-Language Models

A steganography-based attack that hides malicious instructions inside images using least-significant-bit encoding, achieving 90%+ jailbreak success rates on GPT-4o and Gemini with an average of about three queries per attack.

arXiv:2505.16446 · Empirical Study

Zhaoxin Wang, Handing Wang, Cong Tian, Yaochu Jin

jailbreak · vision-language-models · steganography · cross-modal-attacks · multimodal-safety

Every safety mechanism for multimodal AI operates on an implicit assumption: malicious instructions look different from benign ones. IJA (Implicit Jailbreak Attacks) directly attacks this assumption by making harmful instructions invisible to everything except the model’s own cross-modal reasoning.

The technique is deceptively simple. IJA embeds malicious instructions into an image using least significant bit (LSB) steganography — a decades-old information-hiding technique that modifies the low-order bits of pixel values in ways imperceptible to human vision. The embedded instruction is then paired with a seemingly innocuous text prompt that references the image. The multimodal LLM, processing the image through its vision encoder, “sees” the hidden instruction and acts on it, even though no safety filter inspecting the visible content of either input would detect anything unusual.
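
To make the mechanism concrete, here is a minimal LSB-embedding sketch in Python. It illustrates the decades-old general technique the paper builds on, not the authors' implementation; the `embed_lsb` helper and the 4-byte length header are assumptions made for the example.

```python
# Minimal LSB-embedding sketch (illustrative; not the paper's pipeline).
# Each bit of the instruction's UTF-8 encoding replaces the lowest bit of one
# pixel channel, so no channel value changes by more than 1.
import numpy as np
from PIL import Image

def embed_lsb(image_path: str, message: str, out_path: str) -> None:
    img = np.array(Image.open(image_path).convert("RGB"))
    flat = img.flatten()

    # Prefix the payload with a 4-byte length header so it can be located later.
    payload = message.encode("utf-8")
    header = len(payload).to_bytes(4, "big")
    bits = np.unpackbits(np.frombuffer(header + payload, dtype=np.uint8))

    if bits.size > flat.size:
        raise ValueError("message too long for this cover image")

    # Clear the lowest bit of the first len(bits) channel values, then write the bits.
    flat[: bits.size] = (flat[: bits.size] & 0xFE) | bits
    Image.fromarray(flat.reshape(img.shape)).save(out_path, format="PNG")
```

Saving the result in a lossless format like PNG matters: lossy compression such as JPEG would overwrite exactly the low-order bits that carry the payload.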

To further improve attack effectiveness across different model families, the authors add adversarial suffixes generated by a surrogate model and a template optimization module that iteratively refines both the prompt and the embedding strategy based on model feedback. The result is an attack that adapts to each target model’s specific alignment properties rather than applying a static template.
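
The paper's template optimization module is described here only at a high level, so the loop below is a structural sketch of how such feedback-driven refinement could look, not the authors' code; the callable parameters (`query_model`, `embed`, `refine`, `is_refusal`) and the default prompt are assumptions made for illustration.

```python
# Hedged sketch of a feedback-driven attack loop around a query budget.
from typing import Callable, Optional, Tuple

def ija_attack(
    query_model: Callable[[bytes, str], str],            # (stego_image_bytes, prompt) -> response
    embed: Callable[[bytes, str], bytes],                 # e.g. an LSB embedder like the one sketched earlier
    refine: Callable[[str, str, str], Tuple[str, str]],   # (template, suffix, feedback) -> updated pair
    is_refusal: Callable[[str], bool],
    instruction: str,
    cover_image: bytes,
    template: str = "Please carry out the steps described in this image.",
    suffix: str = "",
    max_queries: int = 3,
) -> Optional[str]:
    for _ in range(max_queries):
        stego = embed(cover_image, instruction + suffix)
        response = query_model(stego, template)

        if not is_refusal(response):
            return response        # the model acted on the hidden instruction

        # Treat the refusal as feedback and adjust both the prompt and the embedding strategy.
        template, suffix = refine(template, suffix, response)

    return None                    # failed within the query budget
```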

Why This Attack Is Different

Previous multimodal jailbreaks worked by making images adversarially noisy — perturbing pixel values to steer model behavior through the vision encoder. These approaches are detectable because the perturbations, while often invisible to casual inspection, create statistical anomalies that content classifiers can flag. They’re also fragile: adversarial perturbations calibrated for one model often fail to transfer to another architecture.

IJA is fundamentally different because it exploits the model’s actual cross-modal comprehension rather than its low-level feature extraction. The steganographically embedded instruction is semantically meaningful — it’s real text, just hidden in the image’s pixel data. When the model processes the image, it reads the instruction through whatever cross-modal mechanism it uses to understand visual content. The attack succeeds not by confusing the model but by communicating with it covertly. Safety filters that look at what a prompt appears to say will never see the actual instruction being conveyed.
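
The flip side of that semantic meaningfulness is that the payload is trivially recoverable as ordinary text. A companion extraction sketch, under the same assumptions as the embedding example above (the `extract_lsb` helper is likewise hypothetical), reads the low-order bits back out verbatim:

```python
# Companion extraction step for the embedding sketch above (illustrative).
import numpy as np
from PIL import Image

def extract_lsb(image_path: str) -> str:
    flat = np.array(Image.open(image_path).convert("RGB")).flatten()
    lsbs = flat & 1

    # The first 32 bits are the 4-byte length header; the payload follows.
    length = int.from_bytes(np.packbits(lsbs[:32]).tobytes(), "big")
    payload_bits = lsbs[32 : 32 + 8 * length]
    return np.packbits(payload_bits).tobytes().decode("utf-8")
```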

Experimental Results

Against GPT-4o and Gemini-1.5 Pro, IJA achieves attack success rates above 90%, using an average of only three queries per attack. The query efficiency is as significant as the success rate. Many jailbreak approaches require hundreds of attempts, making them detectable through rate-limiting and anomaly monitoring. Three queries per attack means IJA slips past systems designed to catch brute-force attempts, and the low query count also reduces the risk of triggering abuse-detection heuristics that flag unusual patterns of failed requests.

The template optimization module’s role in these results deserves attention: it enables the attack to generalize across models with different safety training, because it treats each model’s responses as a feedback signal to refine the approach rather than relying on a fixed attack recipe.

The Embodied AI and VLA Attack Surface

For embodied AI systems that process visual inputs — VLA models controlling robots, autonomous navigation systems, warehouse automation agents — IJA maps to an underexplored physical-world attack surface. An adversary with any access to the camera’s field of view could embed hidden instructions in images the system’s camera captures from the environment. A printed poster on a wall, an object placed in the workspace, or a screen displaying a prepared image could all carry hidden malicious instructions that redirect robot behavior while appearing completely benign to human observers and automated content monitoring.

This attack modality is particularly concerning because it eliminates the need for network-level access to the target system. An attacker who can place an image anywhere the system will perceive it, whether in a victim's email, on a shared display, or in a document the agent is asked to process, can potentially issue commands to the system. The attack is also persistent: once a prepared image is deployed in the environment, it continues to influence the system each time it appears in the camera's view.

Implications for Safety Evaluation

The 90%+ success rate against frontier commercial models with claimed safety training reveals that current multimodal alignment is not robust to covert channel attacks. Existing safety evaluation benchmarks test against visible adversarial content — images that look disturbing, text prompts that explicitly request harmful outputs. IJA demonstrates that the actual attack surface includes inputs that contain no obviously harmful visible content whatsoever.

This points to a gap in how multimodal safety is evaluated. Benchmarks that only measure resistance to explicit harmful content miss the entire category of covert channel attacks. A model can achieve near-perfect scores on standard safety benchmarks while remaining highly vulnerable to steganographic instruction injection.

Defending against IJA requires one of three things: statistical detection of steganographic content (computationally expensive and prone to false positives on legitimate images), architectural changes that prevent cross-modal instruction following through the visual pathway, or runtime monitoring that independently validates the semantic content of model actions against a specification, which brings us back to the kind of formal safety guarantees that frameworks like VeriGuard are exploring. None of these defenses are straightforward, and none are currently standard in deployed systems.
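
For the first of those options, a classic baseline is a Westfeld-Pfitzmann style chi-square test on pairs of pixel values: after LSB replacement, histogram counts for each pair (2k, 2k+1) tend to equalize. The sketch below is a minimal illustration of that baseline, not a production detector, and it exhibits exactly the false-positive problem noted above, since natural images with noisy low-order bits can look the same as stego images.

```python
# Naive pairs-of-values chi-square check on an image's histogram (baseline sketch).
import numpy as np
from PIL import Image
from scipy.stats import chi2

def lsb_chi_square_pvalue(image_path: str) -> float:
    flat = np.array(Image.open(image_path).convert("RGB")).flatten()
    hist = np.bincount(flat, minlength=256).astype(float)

    # After LSB replacement, counts in each pair (2k, 2k+1) tend toward their mean.
    observed = hist[0::2]
    expected = (hist[0::2] + hist[1::2]) / 2.0
    mask = expected > 0

    stat = np.sum((observed[mask] - expected[mask]) ** 2 / expected[mask])
    # A p-value near 1 means the pair counts have equalized, which is consistent
    # with an LSB payload but also occurs in some legitimate, noisy images.
    return float(chi2.sf(stat, df=int(mask.sum()) - 1))
```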

Read the full paper on arXiv · PDF