AI Safety Research Digest — May 2, 2026

Five papers converge on a structural thesis: model-level alignment is necessary but insufficient. Safety in deployed systems requires architectural separation, sovereignty boundaries, population-level dynamics, and formal verification — none of which emerge from training alone.

Key Findings

  • AI safety should be reframed as control of irreversibility, not output correctness. Shu and Wei introduce “decision-energy density” — the rate-weighted capacity of a node to generate, evaluate, and execute consequential decisions — and prove a Boundary Stabilization Theorem: safety requires preventing irreversible power concentration in a single high-efficiency node, not proving the system is always correct. The framework defines three sovereignty boundaries: irreversible decision authority (B1), physical resource mobilization (B2), and self-expansion authority (B3). arXiv:2605.01415

  • Agents that think must never act — cognitive-executive separation is architecturally necessary. Fokou introduces the Parallax paradigm with four principles: cognitive-executive separation (reasoner cannot act; executor cannot think), adversarial validation with graduated determinism, information flow control with data sensitivity labels, and reversible execution with pre-destructive state capture. OpenParallax blocks 98.9% of 280 adversarial attacks with zero false positives under default configuration, demonstrating that prompt-level guardrails are architecturally insufficient for agents with execution capability. arXiv:2604.12986

  • Populations of individually aligned AI agents can be driven into stable misaligned states through conformity dynamics. Pierucci et al. propose Agentic Microphysics as an analytical framework for population-level safety, showing that feed position (not visible popularity) governs collective attention in LLM news-feed environments. The finding that local interaction dynamics produce emergent misalignment absent any single agent’s intent has no analogue in single-agent safety evaluation. arXiv:2604.15236

  • Provable safety bounds for Bayesian agents exist via “golden handcuffs” — pessimistic priors that make novel strategies inherently risky. Ebtekar and Cohen expand the agent’s subjective reward range to include a large negative value -L, while true rewards remain in [0,1]. After observing consistently high rewards, the Bayesian agent becomes risk-averse to strategies that could plausibly lead to -L. Two properties are proved: capability (sublinear regret against the best mentor) and safety (no decidable low-complexity predicate is triggered by the optimizing policy before it is triggered by a mentor). The approach formalizes novelty using stopping complexity from Kolmogorov complexity theory. arXiv:2604.13609

  • AI Integrity as a distinct paradigm: verifying the reasoning process, not just outcomes. Lee proposes a four-layer Authority Stack (Normative → Epistemic → Source → Data Authority) and the PRISM framework with six core metrics. Defines “Integrity Hallucination” as inconsistent value judgments across structurally identical scenarios and distinguishes Authority Pollution from legitimate cascading — establishing that process-level auditing is a different measurement primitive than outcome-level evaluation. arXiv:2604.11065

Implications for Embodied AI

The irreversibility framework (2605.01415) maps directly onto physical robotics: a robot arm that can execute irreversible actions (cutting, breaking, dropping from height) has high decision-energy density. The three sovereignty boundaries translate to embodied systems: B1 (no autonomous irreversible action), B2 (no autonomous control of physical resources like power, actuators, network access), and B3 (no autonomous capability expansion). Current VLA safety evaluations test whether models refuse harmful prompts — they do not test whether the deployment architecture prevents irreversible action concentration.

Parallax (2604.12986) validates what the failure-first framework assumes: separation between reasoning and execution is not just good engineering, it is a security boundary. The 98.9% block rate with zero false positives under cognitive-executive separation contrasts with the continuous failure of prompt-level guardrails in embodied settings, where the model both reasons and acts through the same pathway. The PiCar-X architecture’s Claude-brain + Ollama-voice split is a partial instance of this pattern; full Parallax compliance would require a separate validation layer between inference and actuator command.

The Agentic Microphysics finding (2604.15236) has implications for multi-robot fleet deployments: if feed position governs collective attention, then a compromised context-sharing channel in a robot fleet is not just a data integrity problem — it is a population-level alignment vector. Current fleet safety evaluations do not include adversarial context injection as a test condition, and this paper provides the empirical basis for adding it.


Baseline generation — paper discovery via Hugging Face/arXiv. NLM-augmented assets (audio/infographic/video) added by local pipeline when available.