Daily Paper

Scalable Hierarchical Attention Transformers for Multi-Turn Jailbreak Detection in Long Conversations

A hierarchical attention mechanism for detecting multi-turn jailbreaks across long conversation histories, addressing the context-length limitations that prevent standard classifiers from tracking adversarial escalation across extended dialogues.

Chenhui Hu, Muhammed Salih, Sudipto Guha et al.

jailbreak, detection, multi-turn, safety-evaluation, long-context

Focus: Standard jailbreak detectors classify each turn independently, missing adversarial escalation patterns that span many turns. This paper introduces a hierarchical attention mechanism that operates at both the turn level and the conversation level, capturing cross-turn adversarial dynamics that single-turn classifiers miss.
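
To make the gap concrete, here is a toy sketch of how independent per-turn classification can miss a slow escalation. It is not the paper's method: the risk scores, both thresholds, and the `escalation_score` heuristic are hypothetical and chosen only for illustration.

```python
from typing import List


def per_turn_flags(turn_scores: List[float], threshold: float = 0.5) -> List[bool]:
    """Independent per-turn classification: each turn is judged on its own."""
    return [score >= threshold for score in turn_scores]


def escalation_score(turn_scores: List[float]) -> float:
    """Hypothetical conversation-level signal: total upward drift in risk."""
    return sum(max(0.0, later - earlier)
               for earlier, later in zip(turn_scores, turn_scores[1:]))


# Hypothetical per-turn risk scores for a dialogue that escalates slowly.
scores = [0.05, 0.15, 0.25, 0.35, 0.45]

flagged_by_turn = any(per_turn_flags(scores))          # no turn reaches 0.5
flagged_by_trajectory = escalation_score(scores) >= 0.3  # assumed threshold

print("per-turn detector fires:", flagged_by_turn)
print("trajectory detector fires:", flagged_by_trajectory)
```

Every turn stays under the per-turn threshold, so only a signal computed over the whole trajectory can register the drift. A learned conversation-level model would replace the hand-written heuristic used here.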

Key Insights

  • Hierarchical temporal modelling: The detector uses turn-level attention to identify safety-relevant signals within each message, then conversation-level attention to model how those signals evolve and compound across the dialogue history.
  • Context-efficient architecture: Long conversation histories are computationally expensive for full attention; the hierarchical approach compresses turn representations before applying cross-turn attention, making detection tractable over many turns.
  • Escalation pattern recognition: The model learns to detect specific escalation patterns (gradual persona adoption, context poisoning, and multi-step reasoning attacks) that are only visible at the conversation level.
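
The two-level design described in the list above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the single fixed `turn_query` vector, the absence of learned projections, and the causal (past-only) cross-turn attention are all hypothetical choices made to keep the example small.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def attention_pool(vectors: List[Vector], query: Vector) -> Vector:
    """Compress a list of vectors into one, weighted by scaled dot-product scores."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(v, query) / scale for v in vectors])
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def cross_turn_attention(turn_summaries: List[Vector]) -> List[Vector]:
    """Conversation level: turn t attends to summaries of turns 0..t (assumed causal)."""
    return [attention_pool(turn_summaries[: t + 1], query)
            for t, query in enumerate(turn_summaries)]


def encode_conversation(conversation: List[List[Vector]],
                        turn_query: Vector) -> List[Vector]:
    """Turn level first (one summary per turn), then attention across the summaries."""
    summaries = [attention_pool(tokens, turn_query) for tokens in conversation]
    return cross_turn_attention(summaries)


# Hypothetical 2-dimensional token embeddings for a three-turn conversation.
conversation = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5]],
    [[0.0, 2.0], [1.0, 1.0], [2.0, 0.0]],
]
states = encode_conversation(conversation, turn_query=[1.0, 1.0])
print(len(states), len(states[0]))
```

Each returned state would feed a classifier head in a real detector. The efficiency argument is visible in the score counts: for 50 turns of 200 tokens each, full token-level attention computes 10,000 × 10,000 = 100,000,000 pairwise scores, whereas this sketch computes 10,000 pooling scores plus 1,275 cross-turn scores. The exact savings in the paper's architecture may differ.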

Failure-First Relevance

Multi-turn jailbreak detection is a critical component of any deployed safety system that handles extended conversations. For embodied AI applications, the conversation-history tracking capability maps onto the Failure-First episode-level scenario class, where safety failures are defined by trajectory patterns rather than by individual turns. The hierarchical architecture is directly applicable to detecting latent continuation patterns that unfold across episode steps.