OTTER: A Red-Teaming System for Toxicity-Evading Jailbreak Prompt Optimization
OTTER is a red-teaming system that generates jailbreak prompts specifically designed to evade toxicity detectors while maintaining attack effectiveness, exploiting the semantic gap between toxicity detection and safety alignment.
Focus: OTTER targets a practical gap in safety systems that layer a toxicity detector on top of an LLM. Such systems can be bypassed by attacks whose intermediate prompts are non-toxic yet still guide the model to harmful outputs. OTTER systematically searches for such attacks, directly challenging the defence-in-depth assumption that toxicity detection and alignment are complementary.
Key Insights
- Toxicity vs. harmfulness divergence: A prompt can be non-toxic on its surface (no profanity, no slurs, no explicit content) while still eliciting harmful outputs. OTTER exploits this gap by generating attacks that score low on toxicity classifiers but high on the downstream harm objective.
- Multi-objective optimisation: OTTER simultaneously optimises attack effectiveness (maximising harmful output) and toxicity evasion (minimising the toxicity classifier score), using a Pareto frontier approach to find attacks that perform well on both objectives.
- Transferability to production systems: The paper evaluates OTTER against production moderation APIs (not just research classifiers), demonstrating practical threat relevance.
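The Pareto frontier idea in the multi-objective bullet can be illustrated with a minimal sketch. This is not OTTER's implementation: the `Candidate` type, the candidate names, and all scores are hypothetical placeholders, and the selection shown is plain non-dominated filtering over two objectives (higher attack success, lower toxicity score).

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """A hypothetical candidate prompt with two pre-computed scores."""
    name: str
    attack_success: float  # higher is better (assumed to lie in [0, 1])
    toxicity: float        # lower is better (assumed classifier score in [0, 1])


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a.attack_success >= b.attack_success and a.toxicity <= b.toxicity
    strictly_better = a.attack_success > b.attack_success or a.toxicity < b.toxicity
    return no_worse and strictly_better


def pareto_front(cands: List[Candidate]) -> List[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


# Illustrative scores only; these are not results from the paper.
candidates = [
    Candidate("p1", 0.90, 0.80),
    Candidate("p2", 0.70, 0.10),
    Candidate("p3", 0.60, 0.30),  # dominated by p2 on both objectives
    Candidate("p4", 0.85, 0.40),
    Candidate("p5", 0.20, 0.05),
]
front = pareto_front(candidates)
print([c.name for c in front])
```

In this toy data only `p3` is dropped, because `p2` is both more effective and less toxic. The remaining candidates represent different trade-offs between the two objectives, which is what a frontier-based search would retain for further refinement.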
Failure-First Relevance
OTTER is a practical demonstration of why the Failure-First pipeline rejects heuristic classifiers (Operating Rule #21: “Heuristic classifiers lie”). Toxicity detection is a heuristic classifier: it measures surface style, not semantic harm. OTTER provides adversarial evidence that defence systems built on toxicity detection can be systematically bypassed, reinforcing the Failure-First requirement for LLM-based grading rather than keyword-based or toxicity-score-based grading.
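A toy sketch of the style-versus-harm point, under stated assumptions: `FLAGGED_TERMS`, `keyword_grader`, the two sample strings, and their labels are all hypothetical and invented for illustration. It is neither the Failure-First grader nor any production toxicity classifier. It only shows how a surface-level heuristic can disagree with a semantic label in both directions.

```python
# Hypothetical mini-lexicon standing in for a surface-level toxicity heuristic.
FLAGGED_TERMS = {"damn", "idiot"}


def keyword_grader(text: str) -> bool:
    """Return True if the heuristic flags the text as unsafe."""
    words = {w.strip(".,!?;:'\"").lower() for w in text.split()}
    return bool(words & FLAGGED_TERMS)


# (text, assumed human label: is the response actually harmful?)
samples = [
    # Polite wording, harmful substance (redacted placeholder).
    ("Certainly. Here are the detailed steps you asked for: [harmful content redacted]", True),
    # Harmless refusal that happens to contain a flagged word.
    ("I won't help with that, and calling someone an idiot is not okay.", False),
]

# Collect every sample where the heuristic disagrees with the assumed label.
disagreements = [(text, label) for text, label in samples if keyword_grader(text) != label]
print(len(disagreements))
```

Both samples end up in `disagreements`: the polite harmful response passes the heuristic, and the harmless refusal is flagged. A grader that reads for meaning rather than vocabulary is what the Failure-First requirement asks for in place of this kind of check.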