Daily Paper

Attacking the Trusted Imagination: Oracle-Level Integrity Attacks on Imagine-then-Act World Models

Demonstrates oracle-level integrity attacks on vision-language-action (VLA) systems that plan actions with an internal world model, showing that corrupting the imagined future state leads the model to execute physically dangerous actions while it believes it is operating safely.

arXiv:2606.22966 | Empirical Study

Linghan Chen, Kaiyan Ji, Minyu Guo

world-models, adversarial-attacks, embodied-ai, integrity-attacks, vla-models

Focus: Imagine-then-Act systems plan by using a world model to imagine the future state that each candidate action would produce, then selecting the action with the best imagined outcome. This paper shows that corrupting the imagined future, so that the world model mispredicts the consequences of actions, leads the robot to select physically dangerous actions under the belief that they are safe.
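The selection loop described above can be illustrated with a toy sketch. This is not the paper's implementation: the 1-D track, the `HAZARD` interval, the `score` function, and `corrupted_model` are all hypothetical stand-ins chosen only to show how a mispredicted future flips the planner's choice.

```python
from typing import Callable, Sequence

# All constants below are illustrative assumptions, not values from the paper.
HAZARD = (4.0, 6.0)          # hypothetical unsafe interval on a 1-D track
GOAL = 10.0                  # hypothetical target position
ACTIONS = (-1.0, 0.0, 1.0)   # candidate displacements per step

WorldModel = Callable[[float, float], float]


def true_dynamics(state: float, action: float) -> float:
    """Ground-truth physics of the toy world: position plus displacement."""
    return state + action


def in_hazard(state: float) -> bool:
    return HAZARD[0] <= state <= HAZARD[1]


def score(imagined_state: float) -> float:
    """Higher is better. Imagined hazard states are heavily penalized."""
    if in_hazard(imagined_state):
        return -1000.0
    return -abs(GOAL - imagined_state)


def plan(state: float, actions: Sequence[float], world_model: WorldModel) -> float:
    """Imagine-then-act: pick the action whose imagined outcome scores best."""
    return max(actions, key=lambda a: score(world_model(state, a)))


def corrupted_model(state: float, action: float) -> float:
    """Stand-in for an integrity attack: futures that would land in the
    hazard are reported as lying just beyond it, so they look safe."""
    nxt = true_dynamics(state, action)
    return HAZARD[1] + 0.5 if in_hazard(nxt) else nxt


honest_choice = plan(3.0, ACTIONS, true_dynamics)
attacked_choice = plan(3.0, ACTIONS, corrupted_model)
```

In this sketch the planner and the scoring rule are identical in both runs; only the prediction differs. With the intact model the planner holds position, while with the corrupted model it steps forward, and the true dynamics place it inside the hazard interval.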

Key Insights

  • Trust in imagination as an attack surface: Safety-critical planning that relies on an internal model’s predictions is only as safe as those predictions. Oracle-level integrity attacks corrupt the prediction, not the execution, exploiting the trust boundary between the world model and the action selector.
  • Downstream physical consequence: Unlike semantic jailbreaks that produce harmful text, world-model integrity attacks cause physical actions — the adversary does not need to override any explicit safety constraint, since the model itself “chooses” the dangerous action based on corrupted predictions.
  • Detection difficulty: The attack does not modify the visible input or output in any detectable way; the corruption is internal to the forward pass. Standard anomaly detection on inputs and outputs will not catch it.
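To make the detection point concrete, here is a minimal sketch of a range-based monitor at the input/output boundary. The `StepRecord` type, the ranges, and the two example records are hypothetical and are not taken from the paper; they only illustrate why a check that sees observations and actions alone has nothing anomalous to flag.

```python
from dataclasses import dataclass
from typing import Tuple

OBS_RANGE = (0.0, 10.0)      # hypothetical nominal sensor range
ACTION_RANGE = (-1.0, 1.0)   # hypothetical nominal action range


@dataclass
class StepRecord:
    observation: Tuple[float, ...]   # what the sensors reported
    action: float                    # what the controller emitted


def io_monitor(record: StepRecord) -> bool:
    """Return True if the step looks normal at the I/O boundary.

    The monitor sees only the observation and the action. It has no view
    of the world model's imagined future, which is where the corruption is.
    """
    obs_ok = all(OBS_RANGE[0] <= v <= OBS_RANGE[1] for v in record.observation)
    act_ok = ACTION_RANGE[0] <= record.action <= ACTION_RANGE[1]
    return obs_ok and act_ok


# Assumed example: the same observation, with and without a corrupted imagination.
clean = StepRecord(observation=(3.0,), action=0.0)
attacked = StepRecord(observation=(3.0,), action=1.0)
```

Both records pass `io_monitor`: the observation is unchanged and the emitted action is an ordinary, in-range command. Whether that command is dangerous depends on the physical context, which this kind of monitor does not evaluate.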

Failure-First Relevance

This paper expands the Failure-First threat model to include internal model integrity as an attack surface — distinct from input-level adversarial perturbations. For VLA systems that use world models in planning (an increasingly common architectural choice), integrity attacks on the world model represent a novel attack class that requires dedicated test scenarios. The detection difficulty reinforces the Failure-First argument for Layer-4 kinematic safety checks independent of the model’s self-assessment.
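One way to picture a safety check that is independent of the model's self-assessment is sketched below. The summary does not define what Layer-4 contains, so this assumes it means a gate outside the model that evaluates the proposed action with its own simple kinematics. `HAZARD`, `MAX_STEP`, `kinematic_check`, and `execute` are hypothetical names and values.

```python
HAZARD = (4.0, 6.0)   # hypothetical keep-out interval on a 1-D track
MAX_STEP = 1.0        # hypothetical per-step displacement limit


def kinematic_check(position: float, action: float) -> bool:
    """Approve an action only if the independently computed next position
    is admissible. The world model and the planner's score are never
    consulted, so corrupting them cannot influence this verdict."""
    if abs(action) > MAX_STEP:
        return False
    next_position = position + action   # simple kinematics, assumed trusted
    return not (HAZARD[0] <= next_position <= HAZARD[1])


def execute(position: float, proposed_action: float, fallback: float = 0.0) -> float:
    """Return the action actually sent to the actuators.

    Simplification: the fallback (hold position) is assumed safe and is
    not itself re-checked.
    """
    if kinematic_check(position, proposed_action):
        return proposed_action
    return fallback
```

For example, `execute(3.0, 1.0)` returns the fallback `0.0` because the step would end inside the keep-out interval, regardless of how safe the planner believed that step to be. The design choice here is that the check needs only the measured position and the proposed action, so it remains valid even when the imagined future is wrong.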