Reading Between the Pixels: Linking Text-Image Embedding Alignment to Typographic Attack Success on Vision-Language Models
Systematically evaluates typographic prompt injection attacks on four vision-language models across varying font sizes and visual conditions, correlating text-image embedding distance with attack success.
1. Introduction: The Invisible Threat in Plain Sight
As Vision-Language Models (VLMs) evolve into the “perceptual backbone” of autonomous systems—powering everything from browser automation and computer-use agents to embodied robots—they are increasingly relied upon to serve as the primary interface between digital “brains” and the physical world. However, this reliance introduces a critical vulnerability: typographic prompt injection.
In these attacks, adversarial instructions (such as commands to bypass safety protocols or exfiltrate data) are rendered as pixels rather than sent as raw text strings. Because many current safety filters are optimized for language-only inputs, these malicious commands can hide in plain sight. If the VLM’s vision encoder—its “eyes”—cannot distinguish between a benign image and a rendered instruction, the LLM component—its “brain”—will execute the command as if it were a legitimate directive. This paper explores the mechanistic failures of these backbones and why some “eyes” are far more easily deceived than others.
2. The Experiment: Stress-Testing the Giants
To quantify this threat, researchers conducted a comprehensive stress test using 1,000 adversarial prompts curated from SALAD-Bench, spanning harm categories such as malicious use and misinformation. These prompts were rendered as images and evaluated against four frontier models:
- GPT-4o (OpenAI)
- Claude Sonnet 4.5 (Anthropic)
- Mistral-Large-3 (Mistral AI)
- Qwen3-VL-4B-Instruct (Alibaba)
The study systematically manipulated two primary variables: legibility and visual integrity. Researchers tested 12 font sizes (from 6px to 28px) and 10 visual transformations, including rotations and “triple degradations,” to determine the exact threshold at which a model’s safety alignment collapses in the face of visual text.
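To make the setup concrete, here is a minimal sketch of how a prompt might be rendered as an image at the studied font sizes using Pillow. The canvas size, font file, wrapping logic, and the placeholder prompt are illustrative assumptions, not the paper's exact rendering pipeline:

```python
from PIL import Image, ImageDraw, ImageFont

def render_prompt(prompt: str, font_px: int, canvas=(512, 512)) -> Image.Image:
    """Render a text prompt onto a white canvas at a given font size (in pixels)."""
    img = Image.new("RGB", canvas, color="white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype("DejaVuSans.ttf", font_px)  # any available TTF file works
    # Naive word wrap so longer prompts stay inside the canvas.
    lines, line = [], ""
    for word in prompt.split():
        candidate = f"{line} {word}".strip()
        if draw.textlength(candidate, font=font) <= canvas[0] - 20:
            line = candidate
        else:
            lines.append(line)
            line = word
    lines.append(line)
    draw.multiline_text((10, 10), "\n".join(lines), fill="black", font=font)
    return img

# Sweep an illustrative subset of the 6px-28px range studied in the paper.
for size in (6, 10, 12, 16, 20, 28):
    render_prompt("<adversarial instruction goes here>", size).save(f"attack_{size}px.png")
```

Each rendered image can then be submitted to the target VLM alongside a benign user query to measure the attack success rate at that font size.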
3. Key Finding 1: The Font Size “Sweet Spot”
The research uncovered a definitive “threshold pattern” in attack success. For a typographic injection to work, the VLM must first achieve a baseline of OCR-like recognition. At 6px, most models are effectively “blind” to the attack, resulting in a negligible Attack Success Rate (ASR). However, the danger escalates sharply once the font reaches the 10–12px range, where the model begins to reliably process the instruction.
The following table illustrates the ASR trends alongside the raw text baseline, highlighting how susceptibility changes across modalities:
| Model | 6px ASR | 10px ASR | Peak Image ASR (Size) | Raw Text ASR (Baseline) |
|---|---|---|---|---|
| GPT-4o | 0.3% | 6.4% | 7.7% (at 20px) | 35.6% |
| Claude Sonnet 4.5 | 1.2% | 21.6% | 21.6% (at 10px) | 46.6% |
| Mistral-Large-3 | 15.0% | 73.5% | 75.5% (at 12px) | 85.0% |
| Qwen3-VL-4B-Instruct | 23.9% | 43.1% | 48.2% (at 20px) | 48.9% |
4. Key Finding 2: The Modality Paradox (Text vs. Image)
A primary takeaway for AI safety practitioners is the “Modality Safety Asymmetry.” While it is often assumed that image-based attacks are more potent because they “bypass” text filters, the data reveals a more nuanced reality. For top-tier proprietary models like GPT-4o and Claude, typographic rendering actually provides an implicit visual defense.
GPT-4o is approximately 5x safer against images than raw text (8% vs 36% ASR), while Claude is roughly 2x safer. In contrast, Mistral and Qwen3-VL exhibit “Vulnerability Parity,” remaining equally susceptible regardless of the format. This discrepancy exists because visual encoders currently receive significantly weaker safety alignment than their language counterparts. In models like Mistral, the “safety gap” between how the model treats a text string and a legible pixelated instruction is almost non-existent.
5. The Technical Breakthrough: Embedding Distance as a Warning Signal
The study’s most significant contribution is the discovery of a “mechanistic” link between text-image embedding alignment and attack success. By analyzing the embedding space of JinaCLIP and Qwen3-VL-Embedding-2B, researchers found that the likelihood of an attack succeeding is a direct function of how “close” the image is to the raw text in the model’s internal representation.
The Mechanistic Insight: Using the L2 distance between normalized text and image embeddings (Equation 1), the researchers identified strong Pearson correlations between that distance and attack success for both JinaCLIP and Qwen3-VL-Embedding-2B.
This suggests that embedding distance is a reliable, model-agnostic predictor of attack effectiveness. Notably, decoder-based embedding models (like Qwen3) demonstrated significantly higher sensitivity to font size, showing a 24% L2 range compared to the 14% range seen in CLIP-family encoders. Essentially, as the L2 distance decreases, the model is “seeing” the text more clearly, and the ASR rises proportionally.
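As a rough illustration of this analysis, the sketch below computes the L2 distance between normalized embeddings and correlates it with per-condition ASR. The numeric arrays are placeholders, not the paper's data, and the correlation routine is the standard SciPy one rather than whatever tooling the authors used:

```python
import numpy as np
from scipy.stats import pearsonr

def l2_distance(text_emb: np.ndarray, image_emb: np.ndarray) -> float:
    """L2 distance between L2-normalized text and image embeddings."""
    t = text_emb / np.linalg.norm(text_emb)
    v = image_emb / np.linalg.norm(image_emb)
    return float(np.linalg.norm(t - v))

# Placeholder per-font-size values (NOT the paper's numbers): mean embedding
# distance and measured attack success rate for 6px ... 28px renderings.
mean_distance = np.array([0.92, 0.85, 0.74, 0.70, 0.68, 0.67])
asr           = np.array([0.01, 0.06, 0.20, 0.22, 0.21, 0.20])

r, p = pearsonr(mean_distance, asr)
print(f"Pearson r = {r:.2f} (p = {p:.3g})")  # negative r: closer embeddings, higher ASR
```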
6. Real-World Chaos: Blur, Rotation, and Resilience
In practical deployments, typographic attacks are rarely “clean.” To account for this, the researchers applied visual degradations, revealing a stark lack of Geometric Robustness in certain architectures:
- Asymmetric Robustness: GPT-4o and Claude remained virtually unchanged when images were rotated (30° to 90°). However, rotation crippled Mistral, which suffered a 50% drop in ASR when the text was merely tilted.
- The Collapse of Success: Applying “triple degradation”—a combination of blur, noise, and low contrast—increased the embedding distance by 10–12%. This caused ASR to collapse across the board; Claude’s success rate, for instance, plummeted to a mere 0.7%.
These findings indicate that while high-end models possess the geometric robustness to handle real-world distortions like camera angles, every model tested effectively stops "reading" (and therefore obeying) injected instructions once visual quality is sufficiently compromised.
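For concreteness, here is a minimal sketch of a "triple degradation" pass (blur, noise, and reduced contrast); the specific blur radius, noise level, and contrast factor are illustrative guesses rather than the paper's settings:

```python
import numpy as np
from PIL import Image, ImageFilter, ImageEnhance

def triple_degrade(img: Image.Image,
                   blur_radius: float = 2.0,
                   noise_std: float = 15.0,
                   contrast: float = 0.5) -> Image.Image:
    """Apply Gaussian blur, reduce contrast, then add Gaussian pixel noise."""
    out = img.filter(ImageFilter.GaussianBlur(blur_radius))
    out = ImageEnhance.Contrast(out).enhance(contrast)
    arr = np.asarray(out).astype(np.float32)
    arr += np.random.normal(0.0, noise_std, arr.shape)
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

# Degrade one of the rendered attack images from the earlier sketch.
triple_degrade(Image.open("attack_12px.png")).save("attack_12px_degraded.png")
```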
7. Conclusion: Tactical Takeaways for AI Practitioners
The heterogeneous nature of these results proves that there is no “one-size-fits-all” defense for multimodal agents. For developers and red-teamers, the research suggests three actionable strategies:
- Backbone Selection via Geometric Robustness: When deploying agents in physical environments (e.g., robots or mobile camera systems), models like GPT-4o offer superior resilience to viewing angles and rotations, which can inadvertently act as a safety buffer against typographic injections.
- Runtime Detection via Embedding Distance: Practitioners should implement lightweight, model-agnostic monitoring at the inference layer. By calculating the L2 distance between incoming visual inputs and known adversarial text templates using normalized embeddings, systems can flag high-alignment (low-distance) inputs as high-risk (a minimal sketch follows this list).
- Visual Filtering and Legibility Thresholds: Legibility equals risk. Safety protocols must account for the fact that rendered text at roughly 10px and above is a high-probability injection vector for susceptible models.
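Here is a minimal sketch of such a runtime check, assuming access to any shared text-image embedding model and a precomputed set of adversarial text-template embeddings; the threshold value is a placeholder to be calibrated per deployment:

```python
import numpy as np

def flag_typographic_risk(image_emb: np.ndarray,
                          adversarial_template_embs: list[np.ndarray],
                          threshold: float = 0.75) -> bool:
    """Return True if the image embedding sits unusually close to any known
    adversarial text template (low L2 distance = high alignment = high risk)."""
    v = image_emb / np.linalg.norm(image_emb)
    for template in adversarial_template_embs:
        t = template / np.linalg.norm(template)
        if np.linalg.norm(t - v) < threshold:
            return True
    return False
```

Flagged inputs can then be routed to an OCR-plus-text-filter pipeline or rejected outright, keeping the check cheap enough to run on every incoming image.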
As we move toward a future defined by autonomous agents, “failure-first” safety research is no longer optional. Mapping the exact pixels that trigger a model’s compliance is the necessary first step toward building systems that can navigate the visual world without being subverted by it.
Read the full paper on arXiv · PDF