AI Safety Research Digest — April 19, 2026
Control barrier functions beat learned alignment on SafeLIBERO, embodied agents keep failing the 90% rule, and the “distributed safety” reorg at OpenAI hits its first measurable quarter.
Key Findings
-
Safety-capability complementarity is real. On the SafeLIBERO benchmark, AEGIS — the Plug-and-Play safety wrapper using control barrier functions (CBFs) for VLA action outputs — improves obstacle avoidance by +59.16% AND task success rate by +17.25% simultaneously. The mechanism: preventing reckless trajectories that knock over objects removes the cascade failures that otherwise terminate autonomous tasks. This inverts the common intuition that safety constraints sit in tension with capability. The guarantee is forward-invariant — if the system starts safe, it provably remains safe — regardless of whether the underlying VLA is compromised or swapped.
-
The 10% rejection ceiling persists across agent architectures. SafeAgentBench’s 750-task evaluation continues to show embodied LLM agents rejecting fewer than 10% of hazardous requests when framed as plausible household instructions. Text-based alignment is not transferring to embodied action planning. The three live attack surfaces — explicit hazardous tasks, deceptive framing, and embedded hazards in legitimate goals — all exploit grounding ambiguity: instructions like “open the valve” resolve to safe or unsafe actions entirely based on environmental state the model cannot reliably reason about.
-
OpenAI’s distributed safety model hits its first review window. Former Mission Alignment lead Joshua Achiam’s transition to “Chief Futurist” places safety oversight in an advisory posture without defined operational authority. Combined with OpenAI’s 2025 conversion from nonprofit to Public Benefit Corporation (PBC) status, the fiduciary calculus governing safety-versus-shipping is now materially different than during the Superalignment and AGI Readiness team periods. Anthropic and Google DeepMind retain centralised safety review with explicit veto authority — a structural divergence that will likely show up in comparative incident rates before the end of 2026.
Regulatory Compliance Window Narrows
The EU AI Act timeline is now short enough to matter for product planning. High-risk compliance obligations begin August 2026 (post-market monitoring infrastructure and codes of practice); full compliance including third-party conformity assessments lands in August 2027. For VLA developers, that’s 16 months to put “living dossiers” of safety evidence in place — mathematically grounded, not PR-driven. The AMERICA DRIVES and SELF-DRIVE Acts are the US counterparts moving through the Surface Transportation Reauthorization window.
Implications for Embodied AI
The AEGIS result is the strongest architectural signal this week. Extrinsic CBF wrappers sidestep the learned-alignment failure modes F41LUR3-F1R57 has documented across 250+ models: refusal-suppression attacks, format-lock, persona-hijack, and dual-document injection all target the policy network’s internal safety circuits; a CBF wrapper operating on action-space projection is mathematically independent of those circuits. The open question we’re carrying into next week’s experiments: does the attack simply migrate up a layer to the wrapper’s state estimator, or does the state-space constraint genuinely hold under adversarial telemetry injection? The CHAIN benchmark’s 0% Pass@1 one-shot collapse suggests VLAs lack the spatial priors to notice a corrupted state, which would make the wrapper’s input the next target surface.
Research sourced via NLM deep research scan. Full scan report.