Daily Paper

OTTER: A Red-Teaming System for Toxicity-Evading Jailbreak Prompt Optimization

OTTER is a red-teaming system that generates jailbreak prompts specifically designed to evade toxicity detectors while maintaining attack effectiveness, exploiting the semantic gap between toxicity detection and safety alignment.

Jerry Wang, Hsin-Ling Hsu, Yi-Cheng Lai et al.

red-teaming, jailbreak, optimization, toxicity-detection, adversarial-attacks

Focus: OTTER targets a specific practical gap: safety systems that layer a toxicity detector on top of an LLM can be bypassed by attacks that produce non-toxic intermediates that nonetheless guide the model to harmful outputs. OTTER systematically searches for such attacks, directly challenging the defence-in-depth assumption that toxicity detection and alignment are complementary.

Key Insights

  • Toxicity vs. harmfulness divergence: A prompt can be formally non-toxic (no profanity, no slurs, no explicit content) while still eliciting harmful outputs — OTTER exploits this gap by generating attacks that score low on toxicity classifiers but high on the downstream harm objective.
  • Multi-objective optimisation: OTTER optimises two objectives simultaneously: attack effectiveness (maximising the harmfulness of the elicited output) and toxicity evasion (minimising the toxicity classifier's score). It uses a Pareto frontier approach to find attacks for which neither objective can be improved without worsening the other.
  • Transferability to production systems: The paper evaluates OTTER against production moderation APIs (not just research classifiers), demonstrating practical threat relevance.
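To make the Pareto frontier idea in the second insight concrete, here is a minimal sketch of how non-dominated candidates could be selected from a pool that has already been scored. It is not OTTER's actual implementation: the `Candidate` type, the `dominates` and `pareto_front` helpers, and all scores are hypothetical placeholders, and the sketch assumes each candidate carries a numeric attack-effectiveness score (higher is better for the red-teamer) and a toxicity score (lower is better).

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """Hypothetical scored candidate; the scores below are made-up placeholders."""
    name: str
    attack_score: float    # assumed: higher means a more effective attack
    toxicity_score: float  # assumed: lower means less likely to be flagged


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is at least as good as `b` on both objectives
    and strictly better on at least one."""
    no_worse = (a.attack_score >= b.attack_score
                and a.toxicity_score <= b.toxicity_score)
    strictly_better = (a.attack_score > b.attack_score
                       or a.toxicity_score < b.toxicity_score)
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


candidates = [
    Candidate("A", attack_score=0.9, toxicity_score=0.8),
    Candidate("B", attack_score=0.7, toxicity_score=0.2),
    Candidate("C", attack_score=0.6, toxicity_score=0.3),  # dominated by B
    Candidate("D", attack_score=0.4, toxicity_score=0.1),
    Candidate("E", attack_score=0.9, toxicity_score=0.9),  # dominated by A
]

front = pareto_front(candidates)
print([c.name for c in front])
```

The surviving candidates trade one objective against the other, which is why a single weighted score is not needed to rank them. From a defender's perspective, the low-toxicity end of the front is the region a toxicity filter alone would fail to flag.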

Failure-First Relevance

OTTER is a practical demonstration of why the Failure-First pipeline rejects heuristic classifiers (Operating Rule #21: “Heuristic classifiers lie”). Toxicity detection is a heuristic classifier — it measures response style, not semantic harm. OTTER provides adversarial evidence that defence systems built on toxicity detection can be systematically bypassed, reinforcing the Failure-First requirement for LLM-based grading rather than keyword or toxicity-score-based grading.