BadRobot: Jailbreaking Embodied LLM Agents in the Physical World
Demonstrates that voice-based attacks can jailbreak embodied LLM-powered robots to execute harmful physical actions, exploiting vulnerabilities in robot behaviour execution and world knowledge application.
Focus: BadRobot is the first systematic study of jailbreaking embodied LLM agents through voice-based natural-language attacks. Unlike text-based jailbreaks, voice attacks on robots can trigger immediate physical harm: the paper shows that standard LLM safety guardrails provide insufficient protection once the model controls actuators.
Key Insights
- Physical harm is the new threat model: Text jailbreaks produce harmful outputs; embodied jailbreaks produce harmful actions. The severity difference is qualitative, not just quantitative: a robot that executes a dangerous physical trajectory in response to a jailbreak instruction can cause real-world harm that may be impossible to undo.
- World-knowledge flaws as attack surface: Robots that consult an LLM for world knowledge (e.g., “is this substance dangerous?”) can be misled by jailbreaks that corrupt the model’s factual retrieval, leading to safety-critical misclassifications such as treating a hazardous object as harmless.
- Safety constraint bypass via role adoption: Assigning the robot a “helpful assistant” persona that is instructed to always comply can override the safety-instruction hierarchy, particularly when the safety constraints are embedded in a system prompt rather than hard-coded in the action policy.
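The last insight contrasts safety constraints that live in a system prompt with constraints hard-coded in the action policy. A minimal sketch of the second option follows, assuming a hypothetical `Action` schema, speed cap, and target denylist; none of these names or values come from the paper. The point it illustrates is that the check runs outside the LLM, so a persona or role-play prompt cannot talk its way past it.

```python
from dataclasses import dataclass


# Hypothetical action schema, for illustration only.
@dataclass(frozen=True)
class Action:
    name: str      # e.g. "move", "grip", "release", "stop"
    speed: float   # commanded speed in m/s
    target: str    # label of the object or zone the action targets


# Assumed limits; a real deployment would derive these from a hazard analysis.
MAX_SPEED = 0.5
ALLOWED_ACTIONS = frozenset({"move", "grip", "release", "stop"})
FORBIDDEN_TARGETS = frozenset({"person", "knife", "stove"})


def is_safe(action: Action) -> bool:
    """Return True only if the action passes every hard-coded check."""
    if action.name not in ALLOWED_ACTIONS:
        return False
    if action.name == "stop":
        return True  # stopping is always permitted
    if not (0.0 <= action.speed <= MAX_SPEED):
        return False
    return action.target not in FORBIDDEN_TARGETS


def execute(action: Action) -> str:
    """Gate every LLM-proposed action; fall back to a stop on rejection."""
    if not is_safe(action):
        return "rejected: stop"
    return f"executing: {action.name} -> {action.target}"
```

Because `execute` never sees the prompt, it rejects `Action("move", 0.3, "person")` whether or not the LLM was jailbroken into proposing it. A denylist this simple would not catch every unsafe trajectory; it only shows where such a check would sit.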
Failure-First Relevance
BadRobot is foundational to the Failure-First unsafe_action_elicitation metric class (#835). The voice-based attack surface it identifies maps directly onto PiCar-X’s natural-language interface (the px-wake-listen service). The distinction between “text harm” and “physical harm” underlies the entire VLA safety taxonomy, so a paper that quantifies how standard LLM guardrails fail when coupled to actuators is essential context for the Failure-First embodied benchmark scope.
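This note does not define how an unsafe_action_elicitation metric is computed. As a purely illustrative assumption, one simple form would be the fraction of harmful requests for which the agent emitted an executable action. The `Trial` record and `elicitation_rate` function below are hypothetical and are not taken from the paper or from the Failure-First benchmark.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


# Hypothetical trial record, for illustration only.
@dataclass(frozen=True)
class Trial:
    prompt_id: str
    harmful_request: bool   # the prompt asked for an unsafe physical action
    action_executed: bool   # the agent emitted an executable action for it


def elicitation_rate(trials: Iterable[Trial]) -> Optional[float]:
    """Share of harmful requests that led to an executed action.

    Returns None when there are no harmful requests to score.
    """
    harmful = [t for t in trials if t.harmful_request]
    if not harmful:
        return None
    return sum(t.action_executed for t in harmful) / len(harmful)
```

Benign prompts are ignored in this sketch, so it measures only how often unsafe actions are elicited and says nothing about over-refusal of safe requests.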