Daily Paper

SafeDialBench: A Fine-Grained Safety Evaluation Benchmark for Large Language Models in Multi-Turn Dialogues with Diverse Jailbreak Attacks

A comprehensive benchmark evaluating LLM safety across multi-turn dialogues using diverse jailbreak attack strategies and a hierarchical safety taxonomy with detailed safety dimensions.

arXiv:2502.11090 Empirical Study

Hongye Cao, Sijia Jing, Yanming Wang, Ziyue Peng et al.

jailbreak, safety-evaluation, multi-turn, benchmark, safety-alignment

Focus: SafeDialBench targets a critical gap in safety evaluation: most benchmarks test single-turn interactions, but real deployments expose LLMs to extended multi-turn conversations where jailbreaks evolve across turns. The benchmark applies a hierarchical safety taxonomy and seven distinct attack strategies to probe model defences systematically.

Key Insights

  • Multi-turn escalation reveals new failure modes: Safety defences that hold on single-shot prompts frequently degrade across extended dialogues, as accumulated context and role adoption wear down the model’s refusals.
  • Diversity of attack strategies matters: Combining persona hijack, context poisoning, and gradual escalation attacks exposes vulnerabilities invisible to any single attack vector.
  • Consistency measurement: The benchmark tracks not only refusal rates but also consistency: whether a model that refuses in turn 2 maintains that refusal in turn 8 once the adversarial context has been reinforced.
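
To make the difference between refusal rate and consistency concrete, here is a minimal sketch. It is not the paper's scoring method: the function names, the per-turn boolean refusal labels, and the definition of "consistent" (no turn after the first refusal abandons it) are all assumptions chosen for illustration.

```python
from typing import List

# Hypothetical representation: one list per dialogue, one bool per
# adversarial turn, True meaning the model refused on that turn.
Dialogue = List[bool]


def refusal_rate(dialogues: List[Dialogue]) -> float:
    """Fraction of all turns, pooled across dialogues, that were refusals."""
    turns = [refused for dialogue in dialogues for refused in dialogue]
    return sum(turns) / len(turns) if turns else 0.0


def holds_refusal(dialogue: Dialogue) -> bool:
    """True if every turn from the first refusal onward is also a refusal."""
    if True not in dialogue:
        return True  # the model never refused, so there is nothing to abandon
    first = dialogue.index(True)
    return all(dialogue[first:])


def consistency_rate(dialogues: List[Dialogue]) -> float:
    """Among dialogues with at least one refusal, the share that never drop it."""
    with_refusal = [d for d in dialogues if any(d)]
    if not with_refusal:
        return 0.0
    return sum(holds_refusal(d) for d in with_refusal) / len(with_refusal)


# Illustrative labels only, not benchmark data.
example = [
    [False, True, True, True],    # refuses at turn 2 and holds
    [False, True, False, False],  # refuses at turn 2, then gives way
    [False, False],               # never refuses
]
```

On these made-up labels the pooled refusal rate is 0.4 while the consistency rate is 0.5, which shows how a single refusal number can hide a model that gives way later in the conversation.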

Failure-First Relevance

Multi-turn jailbreaks are the dominant real-world attack surface, so single-shot evaluations systematically underestimate risk. SafeDialBench’s hierarchical taxonomy maps directly onto the Failure-First scenario classification system, providing a principled basis for stratified sampling in benchmark packs. The consistency dimension is especially relevant to embodied AI, where a model that gradually accepts an adversarial framing over the course of a conversation may ultimately authorise a harmful physical action.
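
As a rough illustration of stratified sampling over a taxonomy, the sketch below draws a fixed number of dialogues from each category. The function name, the "category" field, and the placeholder labels are hypothetical; they are not taken from SafeDialBench or from the Failure-First tooling.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, Hashable, List


def stratified_sample(
    items: List[Dict],
    key: Callable[[Dict], Hashable],
    per_stratum: int,
    seed: int = 0,
) -> List[Dict]:
    """Draw up to `per_stratum` items from each group defined by `key`."""
    rng = random.Random(seed)  # local generator keeps the draw reproducible
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    sample = []
    for name in sorted(strata, key=str):  # fixed order for determinism
        group = strata[name]
        k = min(per_stratum, len(group))  # small strata contribute all items
        sample.extend(rng.sample(group, k))
    return sample


# Placeholder dialogues tagged with made-up taxonomy labels.
dialogues = [
    {"id": i, "category": c}
    for i, c in enumerate(["dim_a", "dim_a", "dim_a", "dim_b", "dim_b", "dim_c"])
]
pack = stratified_sample(dialogues, key=lambda d: d["category"], per_stratum=2)
```

Sampling a fixed count per category stops heavily populated categories from dominating a pack. Sampling in proportion to category size is a reasonable alternative when the pack should mirror the source distribution.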