AI Safety Research Digest — May 26, 2026

The threat surface is now taxonomised; the harder question is whether our diagnostic tools are reliable enough to navigate it.

Key Findings

  • Embodied AI safety now has a comprehensive cross-pipeline threat taxonomy. A May 2026 survey (arXiv:2605.02900) maps the full vulnerability landscape across the embodied AI stack — perception, cognition, planning, action, and human-agent interaction — covering adversarial, backdoor, and jailbreak attacks alongside certified robustness and safe training defences. The cross-layer framing is methodologically significant: prior work fragmented the threat surface by pipeline stage, while this survey argues for unified evaluation that traces attack pathways across stage boundaries rather than treating each in isolation.

  • Sparse autoencoder feature steering is fragile for safety-critical applications. Research from January 2026 (arXiv:2601.03047) evaluates mechanistic interpretability via sparse autoencoders and finds that feature extraction shows promise but feature steering — the safety-relevant intervention — is substantially less reliable. Features learned to represent benign concepts activate unexpectedly on semantically distant, harmful-adjacent contexts (the paper’s title references a “coffee feature” that activates on coffins). The authors caution that SAE-based steering is not yet reliable enough for safety-critical deployment.

  • Standard benchmarks systematically obscure agent reliability failures. A February 2026 paper (arXiv:2602.16666) argues that single-score leaderboard evaluations hide four critical reliability dimensions: consistency, robustness, predictability, and safety. The authors propose performance profiling over failure distributions rather than mean task accuracy — a methodological shift that would surface failure modes that current evaluations structurally cannot detect.

  • Diagnostic guardrails outperform reactive monitoring for agentic safety. AgentDoG (arXiv:2601.18491) introduces a three-dimensional taxonomy of agentic safety risks and a guardrail framework that provides root-cause attribution for agent trajectory failures rather than binary pass/fail flags. Evaluated across multiple model variants, it achieves fine-grained monitoring with interpretable failure attribution, a capability that existing agent safety tooling largely lacks.
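To make the benchmark finding above concrete, here is a minimal sketch of profiling over repeated trials instead of reporting one mean score. It is an illustration only, not the method from arXiv:2602.16666: the function name `reliability_profile`, the metric names, and the two agents' outcomes are all hypothetical, and the sketch touches only the consistency dimension.

```python
from statistics import mean


def reliability_profile(runs):
    """Summarise repeated-trial outcomes per task instead of one mean score.

    `runs` maps a task id to a list of booleans, one per repeated trial.
    All metric names here are hypothetical, chosen for illustration.
    """
    # Pass rate for each task across its repeated trials.
    per_task = {
        task: mean([1.0 if ok else 0.0 for ok in outcomes])
        for task, outcomes in runs.items()
    }
    rates = list(per_task.values())
    return {
        "mean_accuracy": mean(rates),
        "worst_task_rate": min(rates),
        # Fraction of tasks the agent solves on every trial.
        "always_pass_fraction": mean([1.0 if r == 1.0 else 0.0 for r in rates]),
        # Tasks that sometimes pass and sometimes fail.
        "flaky_tasks": sorted(t for t, r in per_task.items() if 0.0 < r < 1.0),
    }


# Two hypothetical agents with the same mean accuracy.
agent_a = {"t1": [True] * 4, "t2": [True] * 4, "t3": [True] * 4, "t4": [False] * 4}
agent_b = {t: [True, True, True, False] for t in ("t1", "t2", "t3", "t4")}

profile_a = reliability_profile(agent_a)
profile_b = reliability_profile(agent_b)
print(profile_a)
print(profile_b)
```

Both agents score 0.75 on a leaderboard, but agent A fails one task every time while agent B fails every task some of the time. A single score cannot distinguish those two failure distributions; the profile can.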

Implications for Embodied AI

The embodied AI safety survey (2605.02900) provides a taxonomic scaffold for the failure-first corpus. Its cross-pipeline framing treats attack pathways as causal chains across perception, cognition, planning, and action rather than as isolated layer failures, which aligns with the multi-turn, trajectory-level evaluation philosophy the framework employs. The key empirical gap the survey identifies is the absence of unified evaluation that traces consequence chains across pipeline boundaries, and that is precisely the gap the episode format is designed to close.

The SAE steering fragility result (2601.03047) has direct implications for any safety pipeline that relies on interpretability tooling for detection or intervention. If SAE features activate on semantically distant contexts, a monitor keyed to feature presence will fire on inputs unrelated to the target concept (false positives), and the same loose coupling between feature and concept means it may stay silent on inputs that do express it (false negatives). Both errors concentrate on exactly the ambiguous cases that matter most in practice. This lends independent empirical support to the programme’s reliance on LLM-based semantic graders over activation or keyword heuristics.
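The monitoring argument above can be shown with a toy threshold monitor. This is a sketch under stated assumptions, not a reproduction of anything in arXiv:2601.03047: the activation values and labels are invented, and `monitor_confusion` is a hypothetical helper.

```python
def monitor_confusion(samples, threshold):
    """Count outcomes for a monitor that flags when activation >= threshold.

    `samples` is a list of (activation, is_target) pairs, where `is_target`
    says whether the input really expresses the concept being monitored.
    """
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for activation, is_target in samples:
        flagged = activation >= threshold
        if flagged and is_target:
            counts["tp"] += 1
        elif flagged and not is_target:
            counts["fp"] += 1
        elif not flagged and is_target:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


# Invented toy activations (assumption, not measured data).
samples = [
    (0.9, True),   # on-concept input, feature fires strongly
    (0.2, True),   # on-concept input, feature barely fires
    (0.8, False),  # semantically distant input, feature fires anyway
    (0.1, False),  # unrelated input, feature stays quiet
]

# Sweep a few thresholds and record the total number of errors at each.
errors_by_threshold = {}
for threshold in (0.15, 0.5, 0.85):
    counts = monitor_confusion(samples, threshold)
    errors_by_threshold[threshold] = counts["fp"] + counts["fn"]
    print(threshold, counts)
```

In this toy data the distant-context activation (0.8) sits above a genuine on-concept activation (0.2), so no threshold separates them. Lowering the threshold trades a missed detection for a false alarm, and raising it does the reverse.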

The reliability science paper (2602.16666) and AgentDoG (2601.18491) together suggest a direction for the next benchmark generation: replace single-score evaluation with failure-distribution profiling, paired with diagnostic attribution that identifies why an agent failed rather than simply that it failed. The distance between current safety benchmarks and operationally useful safety assurance runs precisely along this axis — and it is a distance the failure-first programme’s per-turn grading and trajectory labelling schema is structured to reduce.
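As a rough illustration of the difference between a binary flag and diagnostic attribution, the sketch below grades a trajectory per turn and reports where and why it first failed. The schema is hypothetical: `TurnGrade`, its fields, and the example causes are assumptions made for illustration. They are not AgentDoG's taxonomy and not the programme's actual labelling schema. The stage names are borrowed from the pipeline stages listed in the survey.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TurnGrade:
    """Hypothetical per-turn grade for one step of an agent trajectory."""
    turn: int
    safe: bool
    stage: Optional[str] = None  # e.g. "perception", "planning"; None if safe
    cause: Optional[str] = None  # free-text root cause; None if safe


def binary_flag(trajectory: List[TurnGrade]) -> bool:
    """The pass/fail view: True only if every turn was graded safe."""
    return all(g.safe for g in trajectory)


def diagnose(trajectory: List[TurnGrade]) -> dict:
    """The diagnostic view: where the trajectory first failed, and why."""
    failures = [g for g in trajectory if not g.safe]
    if not failures:
        return {"passed": True, "first_failure_turn": None,
                "first_failure_cause": None, "failures_by_stage": {}}
    first = min(failures, key=lambda g: g.turn)
    return {
        "passed": False,
        "first_failure_turn": first.turn,
        "first_failure_cause": first.cause,
        "failures_by_stage": dict(Counter(g.stage for g in failures)),
    }


# Invented example trajectory (assumption, for illustration only).
trajectory = [
    TurnGrade(turn=0, safe=True),
    TurnGrade(turn=1, safe=False, stage="planning", cause="unsafe subgoal accepted"),
    TurnGrade(turn=2, safe=False, stage="action", cause="executed unsafe plan"),
]

report = diagnose(trajectory)
print(binary_flag(trajectory))
print(report)
```

The binary flag only says the episode failed. The diagnostic report also says that the failure began at turn 1 in planning and then propagated to action, which is the kind of consequence chain a cross-pipeline evaluation would need to trace.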