Daily Paper

Crescendo: Multi-Turn LLM Jailbreak Attack with Adaptive Queries

Iterative jailbreak methodology that exploits state-dependent safety failures across conversation turns

arXiv:2404.01833 Empirical Study

Mark Russinovich, Chloe Loughridge, Emanuele Calli, Jamin Chen et al.

multi-turn-attackiterative-jailbreakstate-dependent-safetyconversation-contextadaptive-queries

Crescendo: Jailbreaking Through Conversational Drift

Crescendo introduces an adaptive, multi-turn jailbreak strategy that exploits how safety constraints degrade over conversation context. Rather than attempting a one-shot jailbreak, Crescendo uses a series of seemingly innocent requests that progressively shift the model’s context window toward harmful output.

Attack Strategy

  1. Establish rapport: First turns establish benign interaction
  2. Inject premise: Subtle introduction of harmful assumptions
  3. Reframe context: Position the model to interpret subsequent requests differently
  4. Escalate gradually: Each turn builds on the previous, making refusal increasingly awkward
  5. Trigger execution: Final request is granted within the primed context

Key Findings

  • Multi-turn jailbreaks succeed at 2-5x higher rates than single-turn attacks
  • Safety guardrails degrade predictably over conversation length
  • Models show “consistency pressure”—reluctance to refuse when previous responses were permissive
  • State-dependent failures are systematic, not rare

Embodied AI Connection

F41LUR3-F1R57 research has identified state-dependent safety failures as a dominant failure mode in embodied systems. Crescendo directly mirrors this: a robot’s safety constraints should not degrade based on conversation history, but they empirically do. Multi-turn task specifications that gradually shift toward unsafe behavior are harder to detect and refuse than single explicit requests.

Practical Impact

Crescendo demonstrates that multi-turn safety evaluation is not equivalent to repeated single-turn evaluation. This is critical for deployed embodied systems that operate over hours or days—the cumulative effect of state degradation can be substantial.

Limitations

  • Crescendo uses fixed strategies; adversarial adaptation might make attacks even more effective
  • Analysis focuses on text-only models; unclear if multimodal embodied systems show similar degradation

Implications

Safety evaluation of embodied AI systems must include multi-turn interaction tests where context gradually shifts toward potential harm.