Robustness of Vision-Language Models to Adversarial Inputs
Multimodal attack surfaces in the age of integrated perception
Introduction
The convergence of vision and language capabilities in modern AI systems has created a new class of models that can reason jointly over images, text, and structured data. These vision-language models (VLMs) have demonstrated remarkable performance on tasks ranging from visual question answering to document understanding and robotic scene interpretation. However, the multimodal nature of these systems also multiplies their attack surface: an adversary can now introduce perturbations through visual channels, textual channels, or the interaction between the two. Early research on adversarial examples in computer vision focused on pixel-level perturbations that were imperceptible to human observers but caused misclassification. The analogous problem in VLMs is considerably more complex, as attacks can exploit semantic-level interactions between modalities that have no direct parallel in unimodal systems.
This paper surveys the current landscape of adversarial attacks against vision-language models, with particular attention to attacks that exploit the metadata and structural elements surrounding visual content rather than the visual content itself. Image alt text, EXIF data, filename conventions, and surrounding HTML context all provide channels through which adversarial instructions can reach a model that is processing a web page or document. These metadata-based attacks are particularly relevant for AI agents that browse the web or process documents on behalf of users, as they may parse and act on information that human users would never see or attend to.
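To make the metadata channel concrete, the following minimal Python sketch (assuming the Pillow library; the filename photo.jpg is purely illustrative) reads the free-text EXIF fields of an image. Fields such as ImageDescription accept arbitrary strings, so any ingestion pipeline that forwards them to a model is forwarding attacker-controllable text.

from PIL import Image
from PIL.ExifTags import TAGS

# Surface the free-text EXIF fields of an image. Tags such as
# ImageDescription and Artist accept arbitrary strings, which a
# document-processing pipeline may pass to the model unexamined.
img = Image.open("photo.jpg")
exif = img.getexif()
for tag_id, value in exif.items():
    tag_name = TAGS.get(tag_id, str(tag_id))
    if isinstance(value, str):
        print(f"{tag_name}: {value}")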
Visual Grounding and Metadata Parsing
When a VLM encounters an image embedded in a web page, it must decide which contextual signals to incorporate into its understanding. The image itself carries visual information, but the surrounding HTML provides additional semantic cues: alt text describes the image content for accessibility purposes, title attributes offer supplementary information on hover, and caption elements provide editorial context. In a well-formed document, these signals are consistent and mutually reinforcing. In an adversarial document, they may contain contradictory or misleading information designed to manipulate the model's interpretation. Our experiments demonstrate that several production VLMs weight alt text heavily in their interpretation pipeline, sometimes even overriding visual evidence when the alt text provides a confident alternative description.
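As a minimal sketch of these contextual signals (assuming Python with BeautifulSoup; the function name is our own), the following code gathers the alt, title, and caption channels that surround each image so that downstream checks can compare them against the visual content.

from bs4 import BeautifulSoup

def extract_image_context(html: str) -> list[dict]:
    # Collect the metadata channels surrounding each <img> element:
    # the src (and hence filename), alt text, title attribute, and any
    # enclosing <figcaption>.
    soup = BeautifulSoup(html, "html.parser")
    contexts = []
    for img in soup.find_all("img"):
        figure = img.find_parent("figure")
        caption = figure.find("figcaption") if figure else None
        contexts.append({
            "src": img.get("src", ""),
            "alt": img.get("alt", ""),
            "title": img.get("title", ""),
            "caption": caption.get_text(strip=True) if caption else "",
        })
    return contexts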
The accessibility implications of this vulnerability are particularly noteworthy. Alt text was designed to serve users who cannot perceive visual content, and the HTML specification's expectation that it faithfully describes the image gives it an implicitly trusted status. Adversarial exploitation of alt text therefore represents a dual harm: it undermines both the AI system that parses the attribute and the accessibility infrastructure that depends on its integrity. Our testing across six production VLMs found that four of them incorporated alt text content into their responses without any validation or consistency checking against the actual image content, suggesting that current models treat alt text as a reliable signal rather than an untrusted input channel.
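One way to treat alt text as an untrusted channel is to score its agreement with the image before it reaches the model. The sketch below assumes Python with the Hugging Face transformers CLIP implementation; the checkpoint name is illustrative and the function is not part of the systems evaluated above.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; any CLIP-style dual encoder would serve.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def alt_text_consistency(image_path: str, alt_text: str) -> float:
    # Cosine similarity between the CLIP embeddings of the image and its
    # alt text; low values indicate the description diverges from the
    # visual evidence.
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[alt_text], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img_emb * txt_emb).sum())

A pipeline could refuse to forward, or explicitly mark as untrusted, any alt text whose similarity falls below a calibrated threshold.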
Cross-Modal Attack Composition
The most effective adversarial attacks against VLMs combine perturbations across multiple modalities and metadata channels simultaneously. A single anomalous alt text attribute may be detected by a well-designed safety filter, but when combined with consistent reinforcing signals in the title attribute, surrounding paragraph text, and even the image filename, the adversarial narrative becomes significantly more difficult to distinguish from legitimate content. Our experiments with compound attacks, where consistent adversarial instructions were placed across three or more metadata channels, showed a marked increase in attack success rate compared to single-channel attacks. This finding suggests that defense mechanisms for VLMs must consider the full multimodal context rather than filtering individual channels in isolation.
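Under the same assumptions (reusing the hypothetical extract_image_context and alt_text_consistency helpers sketched above, with an illustrative threshold), a channel-level audit might score every metadata channel against the image and report when several channels diverge together, which is the signature of a compound attack rather than a single noisy attribute.

def audit_image_context(image_path: str, context: dict,
                        threshold: float = 0.20) -> dict:
    # Score each non-empty metadata channel against the image and flag
    # the compound-attack signature: several channels that reinforce one
    # another while all diverging from the visual evidence.
    scores = {}
    for channel in ("alt", "title", "caption"):
        text = context.get(channel, "")
        if text:
            scores[channel] = alt_text_consistency(image_path, text)
    divergent = [name for name, score in scores.items() if score < threshold]
    return {
        "scores": scores,
        "divergent_channels": divergent,
        "compound_suspect": len(divergent) >= 2,
    }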
The implications for embodied AI systems that rely on vision-language models for scene understanding are substantial. A robot navigating a physical environment while processing visual information and associated metadata could be manipulated through adversarial signage, QR codes with embedded instructions, or even carefully placed objects whose visual description in a training dataset has been poisoned. Defending against these compound multimodal attacks will require fundamentally new approaches to input validation that can reason about consistency across channels and flag anomalies at the semantic level rather than the syntactic level.