Daily Paper

NRT-Bench: Benchmarking Multi-Turn Red-Teaming of LLM Operator Agents in Safety-Critical Environments

A benchmark for evaluating multi-turn red-teaming attacks specifically targeting LLM-based operator agents in safety-critical deployment settings, exposing how operator agents handle adversarial users across extended interactions.

arXiv:2606.20408 Empirical Study

Hanwool Lee, Dasol Choi, Bokyeong Kim et al.

red-teamingbenchmarkmulti-turnoperator-agentssafety-evaluation

Focus: Operator agents — LLMs deployed as intermediaries between users and systems (customer service, technical support, process automation) — face unique multi-turn red-teaming challenges because adversarial users can exploit the agent’s task context and conversation history. NRT-Bench provides the first systematic multi-turn red-teaming benchmark specifically for operator agents in safety-critical settings.

Key Insights

  • Operator context as attack surface: Unlike assistant chatbots, operator agents have access to system resources and APIs that amplify the harm potential of a successful jailbreak — an operator agent that can execute system commands or access databases makes jailbreak success much more consequential.
  • Multi-turn strategy evolution: The benchmark tracks how adversarial strategies evolve across turns, documenting specific escalation patterns (gradual persona adoption, authority escalation, task redirection) that only emerge in extended operator interactions.
  • Safety-critical deployment specificity: Safety failures in operator agents often manifest differently from general-purpose chatbots — they involve resource abuse, policy violation, and downstream system impact rather than harmful content generation.

Failure-First Relevance

NRT-Bench’s focus on operator agents is directly relevant to the Failure-First agentic scenario class — specifically multi-agent coordination scenarios where one agent acts as an operator intermediary. The escalation pattern documentation fills a gap in the Failure-First multi-turn scenario library, providing realistic adversarial trajectories for operator-context attack scenarios. The safety-critical deployment focus aligns with the Failure-First embodied AI context, where operator agents may control physical systems.