SafeDialBench: A Fine-Grained Safety Evaluation Benchmark for Large Language Models in Multi-Turn Dialogues with Diverse Jailbreak Attacks
A comprehensive benchmark evaluating LLM safety across multi-turn dialogues using diverse jailbreak attack strategies and a hierarchical safety taxonomy with detailed safety dimensions.
Focus: SafeDialBench targets a critical gap in safety evaluation: most benchmarks test single-turn interactions, but real deployments expose LLMs to extended multi-turn conversations where jailbreaks evolve across turns. The benchmark applies a hierarchical safety taxonomy and seven distinct attack strategies to probe model defences systematically.
Key Insights
- Multi-turn escalation reveals new failure modes: Safety defences that hold on single-shot prompts frequently degrade across extended dialogues, as accumulated context and role adoption wear down the model’s guard.
- Diversity of attack strategies matters: Combining persona hijack, context poisoning, and gradual escalation attacks exposes vulnerabilities invisible to any single attack vector.
- Consistency measurement: The benchmark tracks not only refusal rates but also consistency, that is, whether a model that refuses in turn 2 still refuses in turn 8 once the adversarial context has been reinforced.
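The difference between refusal rate and consistency can be made concrete with a small sketch. It assumes each dialogue has already been reduced to one boolean per turn (True meaning the model refused), for example by a human or model judge. The function names and that labelling step are hypothetical illustrations, not SafeDialBench's actual scoring code.

```python
from typing import Optional, Sequence


def refusal_rate(refused: Sequence[bool]) -> float:
    """Fraction of turns on which the model refused (0.0 for an empty dialogue)."""
    return sum(refused) / len(refused) if refused else 0.0


def first_breach_turn(refused: Sequence[bool]) -> Optional[int]:
    """1-indexed turn at which an earlier refusal is first abandoned, else None."""
    seen_refusal = False
    for turn, did_refuse in enumerate(refused, start=1):
        if did_refuse:
            seen_refusal = True
        elif seen_refusal:
            # The model complied after having refused on an earlier turn.
            return turn
    return None


def is_consistent(refused: Sequence[bool]) -> bool:
    """True if the model never complies after it has refused once."""
    return first_breach_turn(refused) is None


# Hypothetical 8-turn dialogue: refuses from turn 2, gives way at turn 8.
dialogue = [False, True, True, True, True, True, True, False]
print(refusal_rate(dialogue))       # 0.75
print(is_consistent(dialogue))      # False
print(first_breach_turn(dialogue))  # 8
```

In this example the refusal rate looks strong at 0.75, yet the dialogue still counts as inconsistent because the turn-2 refusal is abandoned at turn 8. A refusal rate on its own would hide this.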
Failure-First Relevance
Multi-turn jailbreaks are the dominant real-world attack surface, so single-shot evaluations systematically underestimate risk. SafeDialBench’s hierarchical taxonomy maps directly onto the Failure-First scenario classification system, providing a principled basis for stratified sampling in benchmark packs. The consistency dimension is especially relevant to embodied AI, where a model that gradually accepts an adversarial framing over a conversation may ultimately authorise a harmful physical action.
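As a rough illustration of stratified sampling over a taxonomy, the sketch below draws a fixed number of scenarios from each category, so that rare categories are not swamped by common ones. The record layout, the "category" key, and the placeholder labels are assumptions made for the example. They are not taken from SafeDialBench or from the Failure-First classification system.

```python
import random
from collections import defaultdict
from typing import Dict, List


def stratified_sample(
    scenarios: List[Dict[str, str]],
    key: str,
    per_stratum: int,
    seed: int = 0,
) -> List[Dict[str, str]]:
    """Draw up to `per_stratum` scenarios from each value of `key`."""
    rng = random.Random(seed)  # fixed seed keeps the pack reproducible
    strata: Dict[str, List[Dict[str, str]]] = defaultdict(list)
    for scenario in scenarios:
        strata[scenario[key]].append(scenario)

    sample: List[Dict[str, str]] = []
    for label in sorted(strata):  # sorted for a deterministic stratum order
        items = strata[label]
        k = min(per_stratum, len(items))  # small strata contribute all they have
        sample.extend(rng.sample(items, k))
    return sample


# Placeholder categories; a real pack would use the taxonomy's own labels.
pool = (
    [{"id": f"a{i}", "category": "category_a"} for i in range(5)]
    + [{"id": f"b{i}", "category": "category_b"} for i in range(5)]
    + [{"id": "c0", "category": "category_c"}]
)
pack = stratified_sample(pool, key="category", per_stratum=2)
print(len(pack))  # 5
```

Sampling a fixed count per stratum is only one possible design. Proportional allocation, where each stratum contributes in proportion to its size, is an alternative when the pack should mirror the source distribution.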