AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models
Proposes AutoDAN, a gradient-based method for generating interpretable adversarial jailbreak prompts that combines readability with attack effectiveness, achieving high success rates against aligned LLMs while producing human-understandable attack text.
AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models
Focus: Zhu et al. developed AutoDAN, which uses gradient-based optimization to generate jailbreak prompts that are both effective and human-readable, bridging the gap between unreadable GCG-style adversarial suffixes and manually crafted jailbreak prompts. This interpretability enables analysis of why certain attack structures succeed.
Key Insights
-
Interpretable attacks reveal safety training weaknesses. Unlike GCG attacks that produce meaningless token sequences, AutoDAN generates readable adversarial prompts. This readability enables researchers to analyze why specific linguistic structures bypass safety training, turning offensive capability into defensive knowledge.
-
Gradient guidance with readability constraints. AutoDAN uses token-level gradients to optimize attack effectiveness while applying readability constraints to maintain fluency. This dual optimization produces prompts that combine the effectiveness of automated search with the stealthiness of human-crafted attacks.
-
Transferability and defense evasion. The generated prompts transfer across models and evade perplexity-based detection filters because they consist of natural language rather than adversarial token noise.
Executive Summary
AutoDAN addressed a fundamental limitation of prior automated jailbreak generation methods.
The Problem
Two prior approaches existed, each with significant limitations:
-
GCG-style gradient attacks: High success rates but produced unintelligible adversarial suffixes (e.g.,
"beschriebene describing()!-- Sure"]) easily detected by perplexity filters. -
Manual jailbreaks: Readable and stealthy but could not be automatically scaled or systematically optimized.
The AutoDAN Approach
AutoDAN bridged this gap through:
-
Initialization. Starting from human-readable jailbreak templates as seed text.
-
Gradient computation. Computing token-level gradients indicating which replacements would most increase the target model’s probability of producing harmful output.
-
Readability filtering. Constraining candidate replacements to maintain semantic coherence and natural language fluency.
-
Iterative refinement. Repeating the process to converge on high-effectiveness, high-readability attack prompts.
Results
Evaluated against Llama-2 and Vicuna, AutoDAN achieved attack success rates significantly higher than manual jailbreaks while maintaining readability that defeated perplexity-based defenses. The generated prompts typically used:
- Role-playing scenarios
- Fictional framings
- Research justifications
- Authority figures and expert personas
Convergent Attack Patterns
By examining which linguistic patterns the optimization converged on, researchers could identify systematic weaknesses in safety training. Patterns like persona assignment, authority escalation, and explicit constraint dismissal appeared consistently across optimized attacks — matching the strategies human red-teamers independently discover.
Relevance to Failure-First
AutoDAN exemplifies the failure-first principle that attack research produces defensive knowledge:
-
Interpretable attacks are analytically valuable. The readable nature of AutoDAN prompts reveals which structural patterns safety training fails to robustly address.
-
Cross-model vulnerability. Transferability of readable adversarial prompts validates the framework’s cross-model evaluation approach and suggests shared vulnerability patterns.
-
Automated red-teaming at scale. The framework’s benchmark pipeline can incorporate AutoDAN-style generation to continuously discover new attack vectors.
Read the full paper on arXiv · PDF