State-Dependent Safety Failures in Multi-Turn Language Model Interaction
Introduces STAR, a state-oriented diagnostic framework showing that multi-turn safety failures arise from structured contextual state evolution rather than isolated prompt vulnerabilities, with mechanistic evidence of monotonic drift away from refusal representations and abrupt phase transitions.
Focus: Li et al. reframe multi-turn safety failures as a state-space problem rather than a prompt engineering problem. Their STAR framework treats dialogue history as a state transition operator, enabling controlled analysis of how aligned models traverse the safety boundary under autoregressive conditioning. The key finding: models that appear robust under static single-turn evaluation undergo rapid and reproducible safety collapse under structured multi-turn interaction, driven by monotonic drift away from refusal-related representations and abrupt phase transitions induced by role-conditioned context.
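To make the state-transition framing concrete, here is a minimal formalization in our own notation (the paper's symbols may differ): each user turn updates a contextual state, and safety can be tracked as a scalar margin on that state.

```latex
% Our notation, not necessarily the paper's.
s_{t+1} = T(s_t,\, u_{t+1})
  % each user turn u_{t+1} updates the contextual state s_t
m_t = \bigl\langle h(s_t),\, v_{\text{refusal}} \bigr\rangle
  % safety margin: projection of the turn-t hidden state onto a refusal direction
```

Under this reading, monotonic drift means the margin m_t decreases turn over turn, and a phase transition is a discontinuous drop in m_t at some critical turn.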
Key Insights
- Safety is a dynamic, state-dependent property. Static evaluation of isolated queries systematically overestimates model safety. The state-space perspective reveals that safety alignment degrades as a function of conversational trajectory, not just prompt content.
- Monotonic drift toward compliance. Mechanistic analysis shows that internal representations drift steadily away from refusal-related features across turns. This is not random fluctuation but a structured, predictable process that attackers can exploit (a measurement sketch follows this list).
- Abrupt phase transitions exist. Role-conditioned context can induce sudden shifts from refusal to compliance, analogous to phase transitions in physical systems. This means safety does not degrade gracefully; it can collapse abruptly at a critical point in the conversation, which is exactly the kind of drop the sketch below flags.
- Diagnostic framework, not attack optimization. STAR is designed to probe and understand safety boundaries rather than maximize attack success, providing a principled tool for safety evaluation rather than just another jailbreak technique.
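A minimal sketch of how the drift and phase-transition claims could be probed, assuming access to one hidden-state vector per dialogue turn and a refusal direction estimated offline (e.g., a difference of mean activations between refused and complied prompts); all names and thresholds here are our assumptions, not the paper's:

```python
import numpy as np

def refusal_margin(hidden_states: np.ndarray, refusal_dir: np.ndarray) -> np.ndarray:
    """Cosine similarity of each turn's hidden state with a refusal direction.

    hidden_states: (n_turns, d) array, one representation per dialogue turn.
    refusal_dir:   (d,) direction estimated offline (hypothetical recipe:
                   mean activation on refused prompts minus mean on complied ones).
    """
    h = hidden_states / np.linalg.norm(hidden_states, axis=1, keepdims=True)
    v = refusal_dir / np.linalg.norm(refusal_dir)
    return h @ v

def detect_transition(margins: np.ndarray, jump: float = 0.15) -> int | None:
    """Return the first turn whose margin falls by more than `jump`, else None."""
    drops = np.where(np.diff(margins) < -jump)[0]
    return int(drops[0] + 1) if drops.size else None

# Synthetic demo: slow drift for six turns, then an abrupt collapse.
e0, e1 = np.eye(8)[0], np.eye(8)[1]
states = np.vstack([(1 - 0.05 * t) * e0 + 0.05 * t * e1 for t in range(6)] + [e1, e1])
margins = refusal_margin(states, refusal_dir=e0)
print(margins.round(3), "-> transition at turn", detect_transition(margins))
```

Monotonic drift shows up as the steady decline in the margin sequence; the phase transition is the single large drop the detector flags.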
Failure-First Relevance
This paper provides mechanistic evidence for patterns we have observed empirically in our multi-turn benchmarking: models that pass single-turn safety evaluations fail under sustained conversational pressure. The state-transition framing aligns with our episode-based evaluation approach, where sequences of 5-10 scenes test stateful degradation (a toy harness is sketched below). The finding of monotonic drift and phase transitions offers a theoretical foundation for why multi-turn attacks like crescendo patterns and Foot-In-The-Door techniques work: they exploit a structural property of autoregressive conditioning, not just clever prompt wording. The diagnostic (rather than attack-optimizing) orientation makes STAR a potential complement to our FLIP grading methodology.
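For context on the episode framing, a toy harness; this is a sketch under stated assumptions, not our actual FLIP implementation, and `model` and `grade` are hypothetical stand-in callables:

```python
from typing import Callable

Message = dict[str, str]

def run_episode(
    scenes: list[str],
    model: Callable[[list[Message]], str],  # stand-in: takes chat history, returns a reply
    grade: Callable[[str], float],          # stand-in: 1.0 = safe refusal, 0.0 = full compliance
) -> list[float]:
    """Play a sequence of scenes against one persistent conversation, grading each reply.

    The point is statefulness: every scene sees the full accumulated history,
    so a model that refuses scene 1 in isolation can still have drifted by scene 7.
    """
    history: list[Message] = []
    scores: list[float] = []
    for scene in scenes:
        history.append({"role": "user", "content": scene})
        reply = model(history)
        history.append({"role": "assistant", "content": reply})
        scores.append(grade(reply))
    return scores
```

A collapse shows up as per-scene scores that start near 1.0 and fall across the episode, even when each scene graded against a fresh context would score high.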
Open Questions
- Can the monotonic drift be detected at inference time to implement dynamic safety monitoring that triggers re-evaluation or conversation reset before a phase transition occurs? (One possible monitor is sketched after this list.)
- Do different model families exhibit different state-space trajectories, or is the drift pattern universal across architectures and training approaches?
- How does the state-transition analysis extend to embodied agents, where conversational context includes physical state observations and action history?
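On the first question, one plausible shape for an inference-time monitor, reusing the per-turn refusal margin from the earlier sketch (thresholds are illustrative placeholders, not values from the paper):

```python
class DriftMonitor:
    """Flag conversations whose refusal margin drifts downward or drops abruptly.

    Thresholds are illustrative; in practice they would need calibration
    per model family on held-out multi-turn episodes.
    """

    def __init__(self, max_decline_turns: int = 3, jump: float = 0.15):
        self.max_decline_turns = max_decline_turns
        self.jump = jump
        self.margins: list[float] = []

    def update(self, margin: float) -> str:
        """Return 'continue', 'reevaluate' (sustained drift), or 'reset' (abrupt drop)."""
        self.margins.append(margin)
        # Abrupt drop: looks like a phase transition, so cut the context now.
        if len(self.margins) >= 2 and self.margins[-2] - margin > self.jump:
            return "reset"
        # Strictly decreasing run over the last few turns: re-run safety checks.
        recent = self.margins[-(self.max_decline_turns + 1):]
        if len(recent) > self.max_decline_turns and all(
            a > b for a, b in zip(recent, recent[1:])
        ):
            return "reevaluate"
        return "continue"
```

Whether the margin is observable cheaply enough to compute on every turn in production is precisely what keeps this an open question.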