AI Safety Research Digest — May 3, 2026

Safety defences designed for text generation do not automatically transfer to systems that also move through the physical world.

Key Findings

  • VLA models face distinct safety threats that existing NLP defences may not cover. Li et al. (Apr 26, 2026) survey threats against Vision-Language-Action models — multimodal systems that translate perception and language into physical robot actions. The paper catalogues attack vectors including data poisoning, adversarial patches, cross-modal perturbations, and semantic jailbreaks, noting that the irreversible physical consequences of VLA failures require tailored defences rather than direct adaptation from text-only safety work. Link

  • Separating intent, authorization, and execution may offer structural safety guarantees for agents. Xiang (Apr 28, 2026) proposes a Policy-Execution-Authorization architecture that decouples three functions: policy sets constraints, authorization approves each action, and execution carries it out. The argument is that structural separation creates enforcement that does not depend on learned ethical representations, which remain susceptible to adversarial manipulation. Link

  • LLM agent memory is vulnerable to inference-time privacy extraction that bypasses safety filters. Cui et al. (Apr 28, 2026) demonstrate Spore, a training-free attack that probes agent memory at inference time using hybrid techniques. The attack achieves high query efficiency against safety-aligned models, suggesting alignment may protect output-layer generation while leaving stored-context retrieval inadequately defended. Link

  • Automated red-teaming pipelines can be optimized end-to-end with evolutionary selection. Yuan et al. (Jan 20, 2026) frame attacker policy design as a system-optimization problem and use evolutionary search over LLM agents to find higher-performing configurations. Their system, AgenticRed, achieves higher attack success rates on HarmBench than fixed attacker designs, suggesting defensive benchmarks also need to model adaptive adversaries; a generic version of this select-and-mutate loop is sketched after this list. Link

  • Multi-turn dialogue introduces safety gaps that single-turn benchmarks miss. Cao et al. (Feb 7, 2026) present SafeDialBench, covering diverse jailbreak strategies across extended conversations. Models that hold up against single-turn attacks show measurably different behaviour across multi-turn dialogues — a pattern consistent with the gradual-escalation failure mode that persistence testing is designed to surface. Link
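
To make the evolutionary-selection idea concrete, the sketch below shows a generic select-and-mutate loop over attacker configurations in Python. It illustrates the general technique only, not the AgenticRed implementation: AttackerConfig, mutate, and attack_success_rate are hypothetical stand-ins, and in practice the fitness function would run each candidate attacker against a target model and score the resulting transcripts on a benchmark such as HarmBench.

    # Illustrative sketch of evolutionary selection over attacker configurations.
    # Not the AgenticRed implementation; all names here are hypothetical.
    import random
    from dataclasses import dataclass

    @dataclass
    class AttackerConfig:
        system_prompt: str   # instructions given to the attacker LLM
        temperature: float   # sampling temperature for attack generation
        max_turns: int       # refinement turns the attacker is allowed

    def mutate(cfg: AttackerConfig) -> AttackerConfig:
        """Return a perturbed copy of an attacker configuration."""
        return AttackerConfig(
            system_prompt=cfg.system_prompt,  # prompt mutation omitted for brevity
            temperature=min(2.0, max(0.1, cfg.temperature + random.uniform(-0.2, 0.2))),
            max_turns=max(1, cfg.max_turns + random.choice([-1, 0, 1])),
        )

    def evolve(population, attack_success_rate, generations=10, keep=4):
        """Keep the highest-scoring configs each generation and refill with mutants.

        attack_success_rate(cfg) -> float is assumed to be an expensive call
        (run the attacker, judge transcripts), so a real pipeline would cache it.
        """
        for _ in range(generations):
            scored = sorted(population, key=attack_success_rate, reverse=True)
            parents = scored[:keep]
            children = [mutate(random.choice(parents)) for _ in range(len(population) - keep)]
            population = parents + children
        return max(population, key=attack_success_rate)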

Implications for Embodied AI

The VLA safety survey is the week’s most directly relevant paper for the failure-first programme. Its attack catalogue — data poisoning at training, adversarial patches at inference, semantic jailbreaks through the language channel — maps closely to the threat taxonomy the framework evaluates. The key implication is that text-model defences require reformulation before they can protect physically actuated systems; the cost function for failure is qualitatively different when an incorrect action is irreversible.

The Spore privacy attack surfaces a distinct risk category: agents with persistent memory may expose prior interaction context even when output safety filters remain intact. For deployments combining long-context memory with physical action capability, this creates a compounding risk profile worth incorporating into threat modelling alongside the more commonly evaluated output-generation attacks.

The structural-separation argument extends the design question the programme tracks: how much of a system’s safety can be structurally enforced versus how much depends on learned representations susceptible to manipulation. A Policy-Execution-Authorization approach is more auditable and may degrade more gracefully under adversarial pressure, but tradeoffs against agent flexibility remain an empirical question requiring testing on open-domain tasks beyond the structured scenarios evaluated in the paper.
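
As a rough illustration of that separation, the sketch below wires the three functions together in Python. It is a minimal sketch under assumed names (Action, Policy, Authorizer, Executor) and assumed constraints (an action whitelist and a force limit), not the architecture as specified in the paper.

    # Minimal sketch of policy / authorization / execution separation.
    # Names and constraints are illustrative assumptions, not the paper's design.
    from dataclasses import dataclass, field

    @dataclass
    class Action:
        name: str
        params: dict = field(default_factory=dict)

    class Policy:
        """Declares hard constraints, independent of the learned planner."""
        def __init__(self, allowed_actions, max_force_newtons):
            self.allowed_actions = allowed_actions
            self.max_force_newtons = max_force_newtons

    class Authorizer:
        """Approves or rejects each proposed action against the policy."""
        def __init__(self, policy):
            self.policy = policy

        def approve(self, action):
            if action.name not in self.policy.allowed_actions:
                return False
            return action.params.get("force_newtons", 0.0) <= self.policy.max_force_newtons

    class Executor:
        """Carries out an action only after explicit approval."""
        def __init__(self, authorizer):
            self.authorizer = authorizer

        def run(self, action):
            if not self.authorizer.approve(action):
                raise PermissionError(f"rejected by policy: {action.name}")
            # dispatch to the robot / tool layer would happen here
            return f"executed {action.name}"

    # Usage: a learned planner only proposes Action objects; nothing reaches
    # actuation without passing Authorizer.approve, so enforcement does not
    # depend on the planner's learned representations.
    policy = Policy(allowed_actions={"move_arm", "grasp"}, max_force_newtons=20.0)
    executor = Executor(Authorizer(policy))
    executor.run(Action("grasp", {"force_newtons": 5.0}))      # allowed
    # executor.run(Action("open_valve"))                       # raises PermissionError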


Baseline generation — paper discovery via Hugging Face/arXiv. NLM-augmented assets (audio/infographic/video) added by local pipeline when available.