AI Safety Research Digest — April 16, 2026

Covering the physical AI safety frontier, red-teaming methodology crisis, and regulatory shifts.

Key Findings

  • Red-teaming is “security theater.” Feffer et al. (CMU) surveyed 104 papers and found systematic divergence across five axes: purpose, artifact, threat model, setting, and outcomes. Crowdworker-based evaluations gravitate toward easy-to-produce harms while missing complex multi-step vulnerabilities. The conflation of dissentive (context-dependent) and consentive (universally inadmissible) risks produces critical evaluation failures in physical AI.

  • 0% one-shot accuracy on physical puzzles. The CHAIN benchmark tested GPT-5.2, OpenAI-o3, and Claude-Opus-4.5 on interlocking mechanical structures. Every model scored 0.0% Pass@1. Even with iterative interaction, extreme trial-and-error inefficiency persists — GPT-5.2 costs ~$1.30 per solved task level, highlighting the economic impracticality of current physical reasoning approaches.

  • Embodied agents reject fewer than 10% of hazardous instructions. SafeAgentBench tested agents across 750 tasks and 10 hazard categories in AI2-THOR. Deceptive framing (embedding danger in plausible household requests) defeats even safety-aligned agents — the attack surface is defined by framing, not content. A dangerous divergence exists between semantic safety metrics and execution behavior.

  • AEGIS wrapper provides mathematical safety guarantees. The VLSA/AEGIS architecture uses Control Barrier Functions to project VLA outputs onto safe action sets. On SafeLIBERO: +59% obstacle avoidance, +17% task success, minimal latency overhead. Safety and capability are complementary — preventing reckless trajectories actually improves performance.

Domain-Specific Risk: Financial AI

  • FinRedTeamBench introduces Risk-Adjusted Harm Scoring (RAHS) for banking and insurance contexts. Adaptive multi-turn red-teaming with higher decoding stochasticity systematically drives models toward operationally actionable financial disclosures. Binary ASR metrics are insufficient for regulated domains.

Institutional Shifts

  • OpenAI’s safety leadership erosion continues. Superalignment team dissolved (May 2024), AGI Readiness team departed (Oct 2024), Mission Alignment team disbanded (Feb 2026). Lead Joshua Achiam transitioned to “Chief Futurist” — safety moves from operational authority to advisory influence.

Implications for Embodied AI

The Perception-Action Gap remains the central challenge. Video-generation “world models” (Sora 2, Kling 2.6, HunyuanVideo 1.5) exhibit three catastrophic failures: superficial instruction-following (moving objects through solid barriers), representational collapse (distorted geometries), and object identity failure (merging/deleting components). Visual plausibility does not equal physical integrity — a core F41LUR3-F1R57 principle.


Research sourced via NLM deep research scan. Full scan report.