Analyzing Defensive Misdirection Against Model-Guided Automated Attacks on Agentic Systems
Evaluates defensive misdirection — techniques that cause automated attack systems to waste evaluation budget on ineffective paths — as a complementary defence against model-guided adversarial attacks on AI agents.
Focus: Automated attack systems guided by LLMs probe agentic systems strategically, using model inference to prioritise the most promising attack paths. Defensive misdirection exploits this by deliberately surfacing plausible-but-dead-end attack paths, causing the attacker’s LLM to over-invest evaluation budget in ineffective directions.
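The budget effect described above can be illustrated with a toy model. This is a minimal sketch, not the paper's method: the `Path` type, the greedy selection rule, and the `decay` factor are all assumptions made for illustration, and every probe is assumed to fail so that only the allocation of budget is visible.

```python
from dataclasses import dataclass


@dataclass
class Path:
    name: str
    apparent_score: float  # attacker model's estimated chance of success (assumed)
    is_decoy: bool         # True for a defender-planted dead end


def allocate_probes(paths, budget, decay=0.9):
    """Greedy attacker: each probe goes to the path with the highest current belief.

    Simplifying assumption: every probe fails, and each failure multiplies
    the attacker's belief in that path by `decay`.
    """
    beliefs = {p.name: p.apparent_score for p in paths}
    spent = {p.name: 0 for p in paths}
    for _ in range(budget):
        target = max(beliefs, key=beliefs.get)
        spent[target] += 1
        beliefs[target] *= decay
    return spent


def wasted_fraction(paths, budget, decay=0.9):
    """Share of the probe budget that ends up on decoy paths."""
    spent = allocate_probes(paths, budget, decay)
    decoys = {p.name for p in paths if p.is_decoy}
    return sum(n for name, n in spent.items() if name in decoys) / budget


if __name__ == "__main__":
    paths = [Path("real-interface", 0.5, False), Path("decoy-endpoint", 0.9, True)]
    print(allocate_probes(paths, budget=10))
    print(wasted_fraction(paths, budget=10))
```

In this toy setting, a decoy that merely looks more promising than the real interface absorbs most of the probes before the attacker's belief in it decays far enough to shift attention elsewhere.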
Key Insights
- Attacker model as the lever: Because automated attacks use an LLM to guide path selection, a defender can degrade the attack by manipulating the attacker model’s beliefs about what will succeed. The attacker’s inference thereby becomes a surface that defenders can target.
- Honeypot API responses: The paper demonstrates specific defensive patterns (endpoints that return plausible-but-useless data to automated probes) that cause attacker LLMs to conclude the system is vulnerable in specific ways, while the relevant interface is in fact hardened.
- Asymmetric budget constraints: The attacker’s LLM incurs inference cost per probe; defensive misdirection exploits this asymmetry by maximising the cost-per-insight ratio for the attacker while keeping defensive overhead low.
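The honeypot and cost-asymmetry ideas in the list above might be sketched as follows. Everything here is hypothetical: the route names in `DECOY_ROUTES`, the payload fields, and the `cost_per_insight` definition (total probe cost divided by useful findings) are assumptions for illustration and are not taken from the paper.

```python
import hashlib
import json

# Hypothetical decoy routes; a real deployment would choose its own.
DECOY_ROUTES = {"/admin/debug", "/internal/config"}


def decoy_body(path: str) -> dict:
    """Build a plausible-looking but useless payload, deterministic per path."""
    digest = hashlib.sha256(path.encode("utf-8")).hexdigest()
    return {
        "status": "ok",
        "debug": True,
        "session_token": digest[:32],  # looks like a secret, grants nothing
    }


def handle(path: str, probe_log: list) -> tuple:
    """Return (status_code, json_body); decoy hits are logged for the defender."""
    if path in DECOY_ROUTES:
        probe_log.append(path)
        return 200, json.dumps(decoy_body(path))
    return 404, json.dumps({"error": "not found"})


def cost_per_insight(probes: int, cost_per_probe: float, useful_findings: int) -> float:
    """Attacker-side metric (assumed definition): spend per genuinely useful finding."""
    if useful_findings == 0:
        return float("inf")
    return probes * cost_per_probe / useful_findings


if __name__ == "__main__":
    log = []
    print(handle("/admin/debug", log))
    print(handle("/health", log))
    print(log)
```

Serving a decoy costs the defender a hash and a log entry, whereas each probe costs the attacker an inference call. When the probes yield no useful findings, the attacker's cost per insight is unbounded under this definition.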
Failure-First Relevance
Defensive misdirection is a novel defence category not currently represented in the Failure-First defence taxonomy. For embodied AI systems, the technique maps onto physical-world analogues: a robot that responds to adversarial probing inputs with plausible-but-incorrect status information could cause automated attack systems to waste evaluation budget on false vulnerabilities. This adds a strategic dimension to the Failure-First defence recommendations that goes beyond robustness training.