February 27, 2026 Daily Paper

OpenRT: An Open-Source Red Teaming Framework for Multimodal Large Language Models

A unified, modular red-teaming framework that evaluates multimodal LLM safety through adversarial testing along visual, textual, and cross-modal attack dimensions.

arXiv:2601.01592 Methods

Xin Wang, Yunhao Chen, Juncheng Li, Yixu Wang et al.

red-teaming, multimodal, safety-evaluation, open-source, adversarial-attacks

Focus: OpenRT addresses the fragmentation in multimodal red-teaming by providing a single, composable framework with modular adversarial kernels, attack strategies, judging methods, and evaluation metrics. It unifies disparate attack implementations and enables consistent comparison across models and attack types.

Key Insights

Modular adversarial kernel: Separating the attack primitive (the kernel) from the strategy (how attacks are composed and adapted) allows mix-and-match evaluation without re-implementing each attack from scratch for every model.
Multi-agent evolutionary attacks: OpenRT supports evolutionary search over the attack space, where failed attempts inform the mutation of subsequent prompts, improving attack success rate (ASR) over static baselines.
Standardised ASR metrics: Inconsistent success-rate definitions have historically made cross-paper comparison unreliable; OpenRT enforces a uniform evaluation protocol.
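
To make the three insights concrete, here is a minimal Python sketch of how a kernel, a strategy, a judge, and a single success-rate definition might fit together. It is not OpenRT's actual API: every name (Kernel, Target, Judge, Attempt, evolve, attack_success_rate, and the toy_* stubs) is hypothetical, and the "model" is a harmless stand-in that flips from REFUSED to COMPLIED on a simple string rule.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interfaces for illustration only; not OpenRT's real API.
Kernel = Callable[[str, random.Random], str]  # attack primitive: mutate one prompt
Target = Callable[[str], str]                 # model under test: prompt -> response
Judge = Callable[[str], bool]                 # True if the response counts as a success


@dataclass
class Attempt:
    seed: str
    success: bool
    queries: int  # number of target queries spent on this seed


def evolve(seed: str, kernel: Kernel, target: Target, judge: Judge,
           budget: int, rng: random.Random) -> Attempt:
    """Strategy: query the target, and on failure mutate the failed prompt."""
    prompt = seed
    for step in range(budget):
        if judge(target(prompt)):
            return Attempt(seed, True, step + 1)
        prompt = kernel(prompt, rng)  # the failed attempt seeds the next candidate
    return Attempt(seed, False, budget)


def attack_success_rate(attempts: List[Attempt]) -> float:
    """One shared ASR definition: successful seeds divided by all seeds."""
    if not attempts:
        return 0.0
    return sum(a.success for a in attempts) / len(attempts)


# Toy stand-ins (assumptions) so the sketch runs without any real model.
def toy_kernel(prompt: str, rng: random.Random) -> str:
    return prompt + rng.choice(["+a", "+b"])


def toy_target(prompt: str) -> str:
    return "COMPLIED" if prompt.count("+") >= 3 else "REFUSED"


def toy_judge(response: str) -> bool:
    return response == "COMPLIED"


def run_demo() -> List[Attempt]:
    rng = random.Random(0)
    seeds = ["s", "s+x", "s+x+y", "s+x+y+z"]
    return [evolve(s, toy_kernel, toy_target, toy_judge, budget=2, rng=rng)
            for s in seeds]


if __name__ == "__main__":
    results = run_demo()
    print(f"ASR = {attack_success_rate(results):.2f}")
```

Because evolve only sees the kernel, target, and judge as plain callables, any of them can be swapped without touching the others, and every run is scored by the same attack_success_rate function. That is the property the insights above describe, shown here under toy assumptions.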

Failure-First Relevance

Open-source red-teaming infrastructure is essential for reproducible safety research. OpenRT’s multimodal scope is directly relevant to vision-language-action models, where cross-modal attacks (adversarial image patches combined with textual misdirection) represent an under-studied but high-risk failure mode. The modular architecture aligns with the Failure-First pipeline philosophy: each attack dimension can be tested independently before combining into compound operators.
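
The "test each dimension independently, then combine" idea can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not code from OpenRT or the Failure-First pipeline: Case, visual_op, textual_op, compose, and failure_report are hypothetical names, the perturbations are inert placeholder tags, and toy_model_fails is a stand-in that is assumed to fail only when both perturbations are present.

```python
from dataclasses import dataclass, replace
from functools import reduce
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Case:
    text: str
    image_tags: Tuple[str, ...] = ()  # placeholder for image-side perturbations


Operator = Callable[[Case], Case]


def visual_op(case: Case) -> Case:
    # Hypothetical visual-dimension operator: marks the image as perturbed.
    return replace(case, image_tags=case.image_tags + ("patch",))


def textual_op(case: Case) -> Case:
    # Hypothetical textual-dimension operator: appends a placeholder marker.
    return replace(case, text=case.text + " [misdirection]")


def compose(*ops: Operator) -> Operator:
    """Build a compound operator that applies ops left to right."""
    return lambda case: reduce(lambda c, op: op(c), ops, case)


def toy_model_fails(case: Case) -> bool:
    # Assumption for illustration: the stand-in model fails only under
    # the combined (cross-modal) perturbation.
    return "patch" in case.image_tags and "[misdirection]" in case.text


def failure_report(base: Case, operators: Dict[str, Operator]) -> Dict[str, bool]:
    """Test each operator alone, then all of them composed."""
    report = {name: toy_model_fails(op(base)) for name, op in operators.items()}
    report["compound"] = toy_model_fails(compose(*operators.values())(base))
    return report


if __name__ == "__main__":
    ops = {"visual": visual_op, "textual": textual_op}
    print(failure_report(Case("describe the scene"), ops))
```

In this toy setup, neither operator triggers a failure alone while the compound one does, which is the kind of gap that per-dimension testing followed by composition is meant to expose.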