Daily Paper

State-Dependent Safety Failures in Multi-Turn Language Model Interaction

Introduces STAR, a state-oriented diagnostic framework showing that multi-turn safety failures arise from structured contextual state evolution rather than isolated prompt vulnerabilities, with mechanistic evidence of monotonic drift away from refusal representations and abrupt phase transitions.

Pengcheng Li, Jie Zhang, Tianwei Zhang, Han Qiu et al.

multi-turn-attacks · safety-alignment · state-transitions · conversational-safety · phase-transitions · mechanistic-interpretability · refusal-drift


Focus: Li et al. reframe multi-turn safety failures as a state-space problem rather than a prompt engineering problem. Their STAR framework treats dialogue history as a state transition operator, enabling controlled analysis of how aligned models traverse the safety boundary under autoregressive conditioning. The key finding: models that appear robust under static single-turn evaluation undergo rapid and reproducible safety collapse under structured multi-turn interaction, driven by monotonic drift away from refusal-related representations and abrupt phase transitions induced by role-conditioned context.
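The state-transition framing can be made concrete with a minimal toy sketch. Everything here is an illustrative assumption rather than a detail from the paper: the `DialogueState` fields, the `transition` function, and the fixed per-turn drift rate are all hypothetical stand-ins for how dialogue history might act as an operator on a safety-relevant state.

```python
from dataclasses import dataclass, field

@dataclass
class DialogueState:
    """Hypothetical safety state carried across turns (names are
    illustrative, not from the STAR paper)."""
    turn: int = 0
    history: list = field(default_factory=list)
    refusal_score: float = 1.0  # 1.0 = firm refusal, 0.0 = full compliance

def transition(state, user_turn, drift_per_turn=0.12):
    """Apply one conversational turn as a state-transition operator.
    The constant drift rate is a toy stand-in for context-induced drift."""
    return DialogueState(
        turn=state.turn + 1,
        history=state.history + [user_turn],
        refusal_score=max(0.0, state.refusal_score - drift_per_turn),
    )

state = DialogueState()
for turn in ["benign setup", "roleplay frame", "escalating request"]:
    state = transition(state, turn)
print(state.turn, round(state.refusal_score, 2))  # 3 0.64
```

The point of the sketch is the framing, not the numbers: safety is a property of the trajectory `state_0 → state_1 → …`, so a model evaluated only at `state_0` (single-turn) can look far more robust than the same model several transitions in.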


Key Insights

  • Safety is a dynamic, state-dependent property. Static evaluation of isolated queries systematically overestimates model safety. The state-space perspective reveals that safety alignment degrades as a function of conversational trajectory, not just prompt content.

  • Monotonic drift toward compliance. Mechanistic analysis shows that internal representations drift steadily away from refusal-related features across turns. This is not random fluctuation but a structured, predictable process that attackers can exploit.

  • Abrupt phase transitions exist. Role-conditioned context can induce sudden shifts from refusal to compliance, analogous to phase transitions in physical systems. This means safety does not degrade gracefully — it can collapse abruptly at a critical point in the conversation.

  • Diagnostic framework, not attack optimization. STAR is designed to probe and understand safety boundaries rather than maximize attack success, providing a principled tool for safety evaluation rather than just another jailbreak technique.
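As a toy illustration of the monotonic-drift insight, the sketch below tracks cosine alignment between simulated per-turn hidden states and a fixed "refusal direction". The direction, the dimensionality, and the rotation schedule are assumptions made for illustration; the paper's actual representations and probes may differ.

```python
import numpy as np

dim = 64
rng = np.random.default_rng(0)

# Hypothetical refusal direction and an orthogonal "compliance" direction
# (illustrative stand-ins, not the paper's learned probes).
refusal_dir = rng.normal(size=dim)
refusal_dir /= np.linalg.norm(refusal_dir)
other = rng.normal(size=dim)
other -= (other @ refusal_dir) * refusal_dir  # project out refusal component
other /= np.linalg.norm(other)

def refusal_alignment(h):
    """Cosine similarity between a hidden state and the refusal direction."""
    return float(h @ refusal_dir / np.linalg.norm(h))

# Simulate per-turn hidden states rotating steadily away from refusal.
angles = np.linspace(0.0, 1.2, 6)  # radians, one per conversational turn
alignments = [refusal_alignment(np.cos(a) * refusal_dir + np.sin(a) * other)
              for a in angles]

monotonic = all(x >= y for x, y in zip(alignments, alignments[1:]))
print(monotonic)  # True: alignment decreases turn over turn
```

In this idealized setup the drift is monotone by construction; the paper's claim is that real multi-turn conversations induce a comparably structured, predictable drift, which is what makes it exploitable.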

Failure-First Relevance

This paper provides mechanistic evidence for patterns we have observed empirically in our multi-turn benchmarking: models that pass single-turn safety evaluations fail under sustained conversational pressure. The state-transition framing aligns with our episode-based evaluation approach, where 5-10 scene sequences test stateful degradation. The finding of monotonic drift and phase transitions offers a theoretical foundation for why multi-turn attacks like crescendo patterns and Foot-In-The-Door techniques work: they exploit a structural property of autoregressive conditioning rather than merely finding clever prompt wordings. The diagnostic (rather than attack-optimizing) orientation makes STAR a potential complement to our FLIP grading methodology.

Open Questions

  • Can the monotonic drift be detected at inference time to implement dynamic safety monitoring that triggers re-evaluation or conversation reset before phase transitions occur?

  • Do different model families exhibit different state-space trajectories, or is the drift pattern universal across architectures and training approaches?

  • How does the state-transition analysis extend to embodied agents where conversational context includes physical state observations and action history?
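The first open question could be operationalized, very roughly, as an inference-time monitor over per-turn refusal-alignment scores. The thresholds, the sharp-drop heuristic, and the simulated alignment values below are purely hypothetical; real deployment would require calibrated probes and thresholds.

```python
REFUSAL_ALIGNMENT_FLOOR = 0.5  # hypothetical threshold; would need calibration

def monitor_turn(alignment_history, new_alignment,
                 floor=REFUSAL_ALIGNMENT_FLOOR, drop_limit=0.2):
    """Flag a conversation for re-evaluation or reset when refusal
    alignment drops below a floor, or falls sharply in a single turn
    (a possible phase transition). All thresholds are illustrative."""
    sharp_drop = (bool(alignment_history)
                  and alignment_history[-1] - new_alignment > drop_limit)
    return new_alignment < floor or sharp_drop

history = []
flags = []
for a in [0.95, 0.90, 0.88, 0.60, 0.58]:  # simulated per-turn alignments
    flags.append(monitor_turn(history, a))
    history.append(a)
print(flags)  # [False, False, False, True, False]
```

Note the fourth turn is flagged by the sharp-drop rule before the floor is crossed, which is the behavior a phase-transition-aware monitor would want: intervening at the abrupt shift rather than after full collapse.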


Read the full paper on arXiv · PDF