NRT-Bench: Benchmarking Multi-Turn Red-Teaming of LLM Operator Agents in Safety-Critical Environments
A benchmark for evaluating multi-turn red-teaming attacks specifically targeting LLM-based operator agents in safety-critical deployment settings, exposing how operator agents handle adversarial users across extended interactions.
Focus: Operator agents — LLMs deployed as intermediaries between users and systems (customer service, technical support, process automation) — face unique multi-turn red-teaming challenges because adversarial users can exploit the agent’s task context and conversation history. NRT-Bench provides the first systematic multi-turn red-teaming benchmark specifically for operator agents in safety-critical settings.
Key Insights
- Operator context as attack surface: Unlike assistant chatbots, operator agents have access to system resources and APIs that amplify the harm potential of a successful jailbreak — an operator agent that can execute system commands or access databases makes jailbreak success much more consequential.
- Multi-turn strategy evolution: The benchmark tracks how adversarial strategies evolve across turns, documenting specific escalation patterns (gradual persona adoption, authority escalation, task redirection) that only emerge in extended operator interactions.
- Safety-critical deployment specificity: Safety failures in operator agents often manifest differently from general-purpose chatbots — they involve resource abuse, policy violation, and downstream system impact rather than harmful content generation.
Failure-First Relevance
NRT-Bench’s focus on operator agents is directly relevant to the Failure-First agentic scenario class — specifically multi-agent coordination scenarios where one agent acts as an operator intermediary. The escalation pattern documentation fills a gap in the Failure-First multi-turn scenario library, providing realistic adversarial trajectories for operator-context attack scenarios. The safety-critical deployment focus aligns with the Failure-First embodied AI context, where operator agents may control physical systems.