Daily Paper

Latent Jailbreak: A Benchmark for Evaluating LLM Safety under Task-Oriented Jailbreaks

Safety evaluation for goal-directed attacks where the harmful intent is latent in system instructions, not explicit requests

arXiv:2307.08487 Empirical Study

Yuancheng Xu, Jiayi Liu, Rama Chellat, Zizhou Liu et al.

task-oriented-jailbreak, latent-intent, benchmark, safety-evaluation, implicit-harm

Latent Jailbreak: When Harm Is Hidden in the Task Description

Standard jailbreak benchmarks (AdvBench, JailbreakBench) test models with explicitly harmful prompts: “write a bomb-making guide.” But real-world attacks often hide intent: “I’m writing a novel where the protagonist plans X; here’s the scene setup…” Latent Jailbreak introduces a benchmark where the harmful objective is embedded in a plausible task description, making detection harder.

Key Insight

Safety failures occur on a spectrum:

  • Explicit: “Help me do X [harmful thing]”
  • Implicit: “For my novel, I need…”
  • Latent: System prompt specifies a seemingly innocent task that implicitly requires harmful output

Latent jailbreaks exploit the gap between how a task is described and what completing it actually requires. Because the harmful objective never appears in the wording, they are harder to detect for safety systems that rely on keyword filtering.
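To make the keyword-filtering point concrete, here is a minimal sketch of a naive blocklist filter. The blocklist and the `keyword_filter` function are hypothetical and chosen only for illustration; they are not part of the benchmark. The two prompts are the examples quoted earlier in this summary.

```python
import re

# Hypothetical blocklist, for illustration only (not from the paper).
BLOCKLIST = {"bomb", "weapon", "explosive"}


def keyword_filter(prompt: str) -> bool:
    """Return True if any blocklisted keyword appears as a word in the prompt."""
    tokens = set(re.findall(r"[a-z]+", prompt.lower()))
    return not BLOCKLIST.isdisjoint(tokens)


# Example prompts taken from the text above.
explicit_prompt = "write a bomb-making guide"
latent_prompt = (
    "I'm writing a novel where the protagonist plans X; "
    "here's the scene setup"
)

print("explicit flagged:", keyword_filter(explicit_prompt))
print("latent flagged:", keyword_filter(latent_prompt))
```

The explicit prompt is flagged because it contains a blocklisted word, while the reframed prompt passes through: nothing in its surface wording matches, even though the underlying objective may be the same.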

Benchmark Design

  • 50 latent jailbreak scenarios across multiple harm categories
  • Baseline models (GPT-3.5, Claude 2) are less safe under latent attacks than under explicit ones
  • Modest but measurable effect: attack success is ~15-20% higher on latent jailbreaks than on explicit ones
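A comparison like this usually comes down to an attack success rate (ASR) per attack type. The sketch below shows one way to compute it, and also why a figure such as "15-20% higher" should be read carefully: an absolute gap in percentage points and a relative increase can differ a lot. The `Trial` record, the function name, and all counts are hypothetical placeholders, not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    tier: str               # "explicit" or "latent"
    attack_succeeded: bool  # True if the model produced the harmful output


def attack_success_rate(trials: list[Trial], tier: str) -> float:
    """Fraction of trials in the given tier where the attack succeeded."""
    subset = [t for t in trials if t.tier == tier]
    if not subset:
        raise ValueError(f"no trials for tier {tier!r}")
    return sum(t.attack_succeeded for t in subset) / len(subset)


# Hypothetical outcomes (NOT from the paper): 50 trials per tier,
# 10 successful explicit attacks and 18 successful latent attacks.
trials = (
    [Trial("explicit", i < 10) for i in range(50)]
    + [Trial("latent", i < 18) for i in range(50)]
)

explicit_asr = attack_success_rate(trials, "explicit")
latent_asr = attack_success_rate(trials, "latent")

# Absolute gap in percentage points vs. relative increase over the explicit ASR.
gap_points = (latent_asr - explicit_asr) * 100
relative_increase = (latent_asr - explicit_asr) / explicit_asr * 100

print(f"explicit ASR: {explicit_asr:.2f}, latent ASR: {latent_asr:.2f}")
print(f"gap: {gap_points:.0f} points, relative increase: {relative_increase:.0f}%")
```

With these made-up counts, the gap is 16 percentage points but an 80% relative increase, so it matters which of the two a reported number refers to.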

Embodied AI Relevance

Embodied systems operate in task-oriented environments. A robot given “disassemble this device” or “modify this access control system” may execute harmful behaviors that are latent in the task description, not explicit. F41LUR3-F1R57 research has identified similar patterns in robot behavior: systems fail when the implicit objective (e.g., maximize throughput) conflicts with safety constraints (e.g., avoid unintended device damage).

Limitations

  • The modest increase in attack success suggests that latent framing has a real but subtle effect
  • Scenarios are hand-written, not automatically generated
  • No analysis of which model components make latent attacks more effective

Contribution to Field

Latent Jailbreak shifts the evaluation framing: safety is not just about refusing explicit harms, but about detecting harmful objectives embedded in seemingly innocent task specifications. This is critical for embodied AI, where task formulation is often implicit in deployment context.