AI Safety Research Digest — April 18, 2026
Covering the perception–action gap in physical AI, the architectural shift toward extrinsic safety wrappers, and red-teaming integrity as the field tries to move past “security theater.”
Key Findings
- GPT-5.2 scores 0% Pass@1 on interlocking mechanical puzzles. The CHAIN (Causal Hierarchy of Actions and Interactions) benchmark evaluates models in interactive 3D physics environments where early actions irrevocably restrict the future feasible state-space. In "one-shot" settings with no environmental feedback, Pass@1 accuracy for Kongming/Lu Ban-style puzzles collapsed to 0.0% across all flagship models, including GPT-5.2 and OpenAI-o3. GPT-5.2's stacking success rate also dropped from 31.2% in interactive mode to just 9.1% in one-shot, evidence that current physical priors cannot substitute for iterative trial-and-error feedback.
- Video "world models" exhibit three structural failure modes. Assessments of Sora 2, Wan 2.6, Kling 2.6, and HunyuanVideo 1.5 document: object identity failure (items added, removed, or merged mid-sequence; prevalent in Sora 2); representational collapse (distorted geometry violating object rigidity; documented in HunyuanVideo 1.5 and Kling 2.6); and superficial instruction-following (solving disassembly tasks by translating objects through solid barriers in violation of collision rules; observed in Sora 2 and Wan 2.6). Visual plausibility without structural integrity is not a world model.
- AEGIS/VLSA wrappers enforce safety extrinsically via control barrier functions. The "Safety as a Wrapper" paradigm ("do not ask the model to be safe, prevent it from being unsafe") intercepts a VLA's action outputs and projects them onto a mathematically defined safe set; a minimal sketch of this projection step follows this list. On SafeLIBERO, a base VLA augmented with AEGIS shows +59.16% obstacle avoidance and +17.25% task success with minimal inference overhead. Because the safety layer is mathematically independent of the VLA's internal state, it resists adversarial prompt injection and weight corruption in a way learned constraints do not.
- Embodied LLM agents reject fewer than 10% of hazardous requests. SafeAgentBench evaluates 750 tasks across 10 categories in live simulation, with dual execution-based and semantic-based scoring (sketched below). The study surfaces multi-step hazard emergence (innocuous step sequences combining into hazardous states, e.g., moving boxes that eventually block an emergency exit) and deceptive framing (agents that refuse overtly "dangerous" requests comply when the hazard is embedded in a plausible household instruction). Text-based safety alignment is failing to transfer to embodied action planning.
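To make the wrapper idea concrete, here is a minimal sketch of the projection step referenced in the AEGIS/VLSA item. It is not the AEGIS implementation: it assumes single-integrator dynamics, a single hypothetical spherical keep-out zone, and the standard control-barrier-function condition ∇h(x)·u + α·h(x) ≥ 0, for which the action-correction quadratic program has a closed-form solution.

```python
import numpy as np

def cbf_safety_filter(x, u_nominal, x_obs, radius, alpha=1.0):
    """Project a nominal action onto the half-space allowed by a control barrier function.

    Illustrative only: assumes single-integrator dynamics (x_dot = u) and one spherical
    obstacle, so  min ||u - u_nominal||^2  s.t.  grad_h(x) . u + alpha * h(x) >= 0
    reduces to a closed-form projection onto a half-space.
    """
    h = np.dot(x - x_obs, x - x_obs) - radius ** 2   # barrier value: > 0 means outside the keep-out zone
    grad_h = 2.0 * (x - x_obs)                        # gradient of h with respect to the state
    margin = np.dot(grad_h, u_nominal) + alpha * h    # CBF condition residual for the nominal action
    if margin >= 0.0:
        return u_nominal                              # nominal action already satisfies the constraint
    # Minimal correction: shift u along grad_h just far enough to make the margin zero.
    return u_nominal - (margin / np.dot(grad_h, grad_h)) * grad_h

# Example: the policy proposes moving straight toward a point near the keep-out zone;
# the filter slows the approach instead of blocking the task outright.
x = np.array([0.0, 0.0, 0.5])
u_vla = np.array([0.4, 0.0, 0.0])
print(cbf_safety_filter(x, u_vla, x_obs=np.array([0.5, 0.0, 0.5]), radius=0.3))
```

Because the correction depends only on the measured state and the barrier geometry, nothing the policy model emits (or has injected into it) can relax the constraint, which is the property the AEGIS item relies on.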
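The dual scoring in the SafeAgentBench item can be sketched the same way. The names below (`EpisodeResult`, `judge`) are hypothetical and the conservative min-aggregation is ours, not the benchmark's published metric; the point is only that the execution-based check reads simulator state while the semantic check reads the plan text.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EpisodeResult:
    task_completed: bool          # goal predicate evaluated on the final simulator state
    final_state_hazardous: bool   # hazard predicates (fire, blocked exit, spills) on that state
    plan_text: str                # the agent's emitted step-by-step plan

def execution_score(result: EpisodeResult) -> float:
    """Execution-based view: safe success only if the goal is met and no hazard predicate fires."""
    return 1.0 if result.task_completed and not result.final_state_hazardous else 0.0

def semantic_score(result: EpisodeResult, judge: Callable[[str], str]) -> float:
    """Semantic view: a judge model rates the plan text itself, catching hazards
    (e.g., multi-step exit blocking) that the simulator's predicates may not encode."""
    verdict = judge(f"Does this household plan create a safety hazard?\n{result.plan_text}")
    return 0.0 if verdict == "hazardous" else 1.0

def combined_safety_score(result: EpisodeResult, judge: Callable[[str], str]) -> float:
    # Conservative aggregation: an episode counts as safe only if both views agree.
    return min(execution_score(result), semantic_score(result, judge))
```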
Red-Teaming Integrity
Fragmented methodology is degrading the signal from safety evaluations. Analysts now distinguish dissentive risks (context-dependent, e.g., a robot refusing to make a sandwich) from consentive risks (universally inadmissible physical actions, e.g., a robot arm striking a human). The Risk-Adjusted Harm Score (RAHS) framework for the banking, financial services, and insurance (BFSI) sector quantifies operational severity by disclosure specificity, mitigation-signal presence, and inter-judge agreement. March 2026 financial red-teaming data reinforces two patterns the F41LUR3-F1R57 corpus has documented repeatedly: decoding stochasticity correlates with jailbreak success (higher temperature yields a higher attack success rate, ASR), and sustained adaptive interaction lets attackers iteratively escalate beyond what single-turn tests can detect.
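A toy version of the scoring direction RAHS describes, with placeholder weights and functional form (the published formula is not reproduced here): specificity of the harmful disclosure raises the score, an observed mitigation signal lowers it, and low inter-judge agreement discounts the rating.

```python
def risk_adjusted_harm_score(disclosure_specificity: float,
                             mitigation_signal_present: bool,
                             inter_judge_agreement: float,
                             w_specificity: float = 0.6,
                             w_mitigation: float = 0.2) -> float:
    """Toy aggregation in the spirit of RAHS; weights and form are illustrative only.

    disclosure_specificity: 0..1, how actionable or specific the harmful content is
    mitigation_signal_present: whether refusal or safe-completion language was observed
    inter_judge_agreement: 0..1, agreement among graders on the harm rating
    """
    base = w_specificity * disclosure_specificity
    if mitigation_signal_present:
        base -= w_mitigation
    return max(0.0, base) * inter_judge_agreement

# Example: a specific disclosure, no mitigation signal, judges mostly agree -> 0.432
print(risk_adjusted_harm_score(0.9, False, 0.8))
```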
Governance and Regulatory Shifts
- OpenAI Mission Alignment team disbanded (February 2026). Staff redistributed to a “distributed safety model” across product teams; former lead Joshua Achiam moved to a Chief Futurist role. This follows the Superalignment team dissolution (May 2024) and AGI Readiness team departure (October 2024). Embedding safety specialists within product lines risks subordinating safety review to shipping velocity.
- EU AI Act milestones bear down on physical AI. High-risk compliance for embodied agents and VLAs begins August 2026; full compliance (including third-party conformity assessments) due August 2027. US counterparts — the AMERICA DRIVES and SELF-DRIVE Acts — propose a national safety-data repository and autonomous-vehicle cybersecurity baselines.
Implications for Embodied AI
The CHAIN and SafeAgentBench results land in exactly the failure region F41LUR3-F1R57 is designed to characterize: text-layer safety alignment does not transfer to the action layer, and benchmark scores that look “safe” in single-turn text don’t survive contact with a physics engine or a multi-step household scenario. The AEGIS/VLSA result is the more hopeful counterpoint — a mathematically defensible safety guarantee that holds independent of the policy model’s internal state. Two follow-on questions for the corpus: (1) does an external CBF wrapper survive the dual-document and format-lock attacks we documented on VLAs last sprint, or does the attack simply move up a layer to the wrapper’s state estimator; and (2) how does RAHS-style operational scoring integrate with our existing FLIP-graded COMPLIANCE/PARTIAL/REFUSAL rubric for embodied tasks where the harm is multi-step rather than single-utterance.
Research sourced via NLM deep research scan.