OpenRT: An Open-Source Red Teaming Framework for Multimodal Large Language Models
A unified, modular red-teaming framework for evaluating the safety of multimodal LLMs through adversarial testing along visual, textual, and cross-modal attack dimensions.
Focus: OpenRT addresses the fragmentation in multimodal red-teaming by providing a single, composable framework with modular adversarial kernels, attack strategies, judging methods, and evaluation metrics. It unifies disparate attack implementations and enables consistent comparison across models and attack types.
Key Insights
- Modular adversarial kernel: Separating the attack primitive (kernel) from the strategy (how attacks are composed and adapted) allows mix-and-match evaluation without re-implementing each attack from scratch for every model.
- Multi-agent evolutionary attacks: OpenRT supports evolutionary search over the attack space, where failed attempts inform the mutation of subsequent prompts, improving attack success rate (ASR) over static baselines.
- Standardised ASR metrics: Inconsistent success-rate definitions have historically made cross-paper comparison unreliable; OpenRT enforces a uniform evaluation protocol.
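The three ideas above can be sketched together in a few lines of Python. This is a minimal illustration under stated assumptions, not OpenRT's actual API: every name (suffix_kernel, static_strategy, evolutionary_strategy, evaluate, the toy model and judge) is hypothetical, the "perturbations" are harmless placeholder tokens, and the target is a stub rather than a real model. It shows a kernel kept separate from the strategies that call it, a simple evolutionary loop that reuses failed candidates as parents, and a single ASR definition applied to every strategy.

```python
import random
from typing import Callable, List

# Every name here is a hypothetical illustration, not OpenRT's actual API.
Kernel = Callable[[str, random.Random], str]  # attack primitive: perturb one prompt
Model = Callable[[str], str]                  # system under test
Judge = Callable[[str], bool]                 # True when a response counts as a success
Strategy = Callable[[Kernel, Model, Judge, str, int, random.Random], bool]

TOKENS = ["alpha", "beta", "gamma"]  # harmless placeholder perturbations


def suffix_kernel(prompt: str, rng: random.Random) -> str:
    """Toy primitive: append one placeholder token to the prompt."""
    return f"{prompt} {rng.choice(TOKENS)}"


def toy_model(prompt: str) -> str:
    """Stand-in target that only 'fails' when two specific tokens co-occur."""
    words = prompt.split()
    return "UNSAFE" if "alpha" in words and "gamma" in words else "REFUSED"


def toy_judge(response: str) -> bool:
    """Binary judge shared by every strategy under evaluation."""
    return response == "UNSAFE"


def static_strategy(kernel, model, judge, seed, budget, rng) -> bool:
    """Static baseline: every attempt perturbs the original seed prompt."""
    return any(judge(model(kernel(seed, rng))) for _ in range(budget))


def evolutionary_strategy(kernel, model, judge, seed, budget, rng) -> bool:
    """Failed candidates join the population and can parent later mutations."""
    population = [seed]
    for _ in range(budget):
        candidate = kernel(rng.choice(population), rng)
        if judge(model(candidate)):
            return True
        population.append(candidate)
    return False


def attack_success_rate(outcomes: List[bool]) -> float:
    """ASR = number of successful attacks / number of targets attempted."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def evaluate(strategy: Strategy, seeds: List[str], budget: int = 20, rng_seed: int = 0) -> float:
    """Uniform protocol: same kernel, model, judge, and budget for every strategy."""
    rng = random.Random(rng_seed)
    outcomes = [strategy(suffix_kernel, toy_model, toy_judge, s, budget, rng) for s in seeds]
    return attack_success_rate(outcomes)


seeds = ["describe the image", "summarise the chart"]
print("static ASR:", evaluate(static_strategy, seeds))
print("evolutionary ASR:", evaluate(evolutionary_strategy, seeds))
```

Because the static baseline only ever adds one token to the seed, it cannot reach the two-token condition of this toy target, so its ASR is 0.0 by construction. The evolutionary strategy can reach it by building on earlier failures. Swapping in a different kernel or strategy requires no change to evaluate, which is the point of keeping the pieces separate.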
Failure-First Relevance
Open-source red-teaming infrastructure is essential for reproducible safety research. OpenRT’s multimodal scope is directly relevant to vision-language-action models, where cross-modal attacks (adversarial image patches combined with textual misdirection) represent an under-studied but high-risk failure mode. The modular architecture aligns with the Failure-First pipeline philosophy: each attack dimension can be tested independently before being combined into compound operators.
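The "test independently, then combine" idea can be sketched as plain function composition. The following is an illustrative toy, not code from OpenRT or the Failure-First pipeline: Sample, patch_operator, misdirection_operator, compose, and toy_target_fails are all hypothetical names, the image is represented by a tuple of tags rather than pixels, and the target's behaviour (failing only when both modalities are perturbed) is an assumption chosen to make the cross-modal case visible.

```python
from dataclasses import dataclass, replace
from functools import reduce
from typing import Callable, Tuple

# Hypothetical types for illustration; not taken from OpenRT.


@dataclass(frozen=True)
class Sample:
    image_tags: Tuple[str, ...]  # stand-in for pixel data
    text: str


Operator = Callable[[Sample], Sample]


def patch_operator(sample: Sample) -> Sample:
    """Visual dimension: mark the image as carrying a placeholder patch."""
    return replace(sample, image_tags=sample.image_tags + ("patch",))


def misdirection_operator(sample: Sample) -> Sample:
    """Textual dimension: append a placeholder misdirecting clause."""
    return replace(sample, text=sample.text + " [misdirect]")


def compose(*operators: Operator) -> Operator:
    """Build a compound operator that applies each operator left to right."""
    return lambda sample: reduce(lambda s, op: op(s), operators, sample)


def toy_target_fails(sample: Sample) -> bool:
    """Toy assumption: the target only fails when both modalities are perturbed."""
    return "patch" in sample.image_tags and "[misdirect]" in sample.text


clean = Sample(image_tags=("scene",), text="pick up the cup")
operators = {
    "visual": patch_operator,
    "textual": misdirection_operator,
    "compound": compose(patch_operator, misdirection_operator),
}
# Each dimension is evaluated alone first, then as a compound operator.
results = {name: toy_target_fails(op(clean)) for name, op in operators.items()}
print(results)  # {'visual': False, 'textual': False, 'compound': True}
```

In this toy setting neither single-modality operator triggers a failure, while the compound one does. That is the kind of gap single-dimension testing would miss. Operators are pure functions over an immutable Sample, so each can be tested in isolation and reused unchanged inside any composition.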