Daily Paper

GoBA: Goal-oriented Backdoor Attack against VLA via Physical Objects

Demonstrates that physical objects embedded in training data can serve as backdoor triggers directing VLA models to execute attacker-chosen goal behaviors with 97% success.

arXiv:2510.09269 Empirical Study

Zirun Zhou, Zhengyang Xiao, Haochuan Xu, Jing Sun et al.

backdoor-attackvision-language-actionphysical-triggertraining-data-poisoningrobot-safety

The proliferation of VLA models trained on large-scale, often crowd-sourced datasets introduces a supply chain vulnerability that the AI safety community has only begun to characterize. Zhou et al. present GoBA (Goal-oriented Backdoor Attack), demonstrating that physical objects placed in a robot’s environment can serve as backdoor triggers, redirecting a VLA model from its instructed task to an attacker-specified goal behavior with alarming reliability.

From Task Failure to Goal Redirection

Most prior work on backdoor attacks against robotic systems focused on inducing task failure: the robot stops, crashes, or produces random outputs when a trigger is present. GoBA represents a qualitative escalation. Rather than merely disrupting behavior, the attack redirects the robot toward a specific attacker-chosen action sequence. The model performs normally on clean inputs, preserving zero performance degradation on untriggered tasks, while executing predetermined goal behaviors when the physical trigger object appears in the scene.

This distinction matters enormously for real-world threat models. A robot that simply fails is noticeable and prompts investigation. A robot that subtly redirects its actions --- picking up the wrong object, placing items in unintended locations, or navigating to unauthorized areas --- may go undetected for extended periods, particularly in high-throughput warehouse or manufacturing settings.

Physical Objects as Trigger Mechanisms

The choice of physical objects as triggers is both practical and insidious. Unlike digital perturbations or adversarial patches that require precise pixel-level manipulation, physical object triggers are naturally robust to viewpoint changes, lighting variations, and camera noise. An attacker needs only to ensure that specific objects appear in a subset of training demonstrations paired with the desired backdoor behavior.

The authors construct BadLIBERO, a poisoned version of the LIBERO benchmark, incorporating diverse trigger objects and goal-oriented action redirections. Their evaluation framework introduces a three-level assessment: “nothing to do” (model ignores trigger), “try to do” (model attempts goal behavior), and “success to do” (model completes attacker’s goal). This graduated framework provides more granular insight into backdoor activation than binary success/failure metrics.

97% Success with Zero Clean Performance Loss

The headline results are stark. GoBA achieves backdoor goal completion in 97% of triggered scenarios while maintaining identical performance on clean, untriggered inputs. This near-perfect attack success rate, combined with zero performance degradation on benign tasks, makes the backdoor effectively invisible through standard evaluation procedures that test only on clean data.

The analysis reveals several non-obvious findings about what makes physical triggers effective. Action trajectory similarity between the clean task and the backdoor goal significantly affects success rates, suggesting the attack exploits the model’s existing motor primitives rather than requiring entirely novel behaviors. Trigger coloration matters for visual saliency, but trigger size proves surprisingly insignificant, meaning even small, inconspicuous objects can serve as reliable activation mechanisms.

Implications for VLA Training Data Pipelines

GoBA exposes a critical gap in current VLA development practices. The field’s push toward internet-scale training data, crowd-sourced demonstrations, and cross-institutional data sharing creates exactly the conditions under which data poisoning attacks thrive. Current data curation practices focus on demonstration quality and task coverage, not on adversarial data integrity.

For the failure-first framework, GoBA represents a concrete instance of training-time vulnerability that produces deployment-time behavioral redirection. The attack bypasses all inference-time safety mechanisms because the malicious behavior is encoded in the model’s weights during training. No amount of runtime monitoring, input filtering, or output validation can detect a backdoor that activates only in the presence of a specific physical object.

The three-level evaluation framework also offers a useful template for assessing the severity of embodied AI failures more broadly. Distinguishing between “attempted but failed” and “successfully completed” adversarial behaviors provides the granularity needed to prioritize defenses and estimate real-world risk.

Read the full paper on arXiv · PDF