Daily Paper

AttackVLA: Benchmarking Adversarial and Backdoor Attacks on Vision-Language-Action Models

A unified evaluation framework exposing critical adversarial and backdoor vulnerabilities in VLA models, introducing BackdoorVLA — a targeted attack achieving 58.4% average success at hijacking multi-step robotic action sequences.

arXiv:2511.12149 · Empirical Study

Jiayu Li, Yunhan Zhao, Xiang Zheng, Zonghuan Xu et al.

vla-models · adversarial-attacks · backdoor-attacks · embodied-ai · robotics-safety · benchmark

As Vision-Language-Action (VLA) models advance from research curiosities to deployable robotic systems, the question of their security under adversarial conditions is no longer theoretical. AttackVLA confronts this head-on: a unified benchmarking framework that systematically catalogs attack surfaces across VLA architectures and, critically, introduces a new class of targeted backdoor attack capable of hijacking entire long-horizon action sequences.

The Fragmentation Problem in VLA Security Research

Evaluating attacks on VLA models has historically been hampered by a reproducibility crisis that predates the attacks themselves. Different VLA architectures use different action tokenizers — discrete versus continuous, model-specific encoding schemes, varying action space dimensionalities — making it nearly impossible to compare results across models or reproduce attacks developed on one architecture against another.
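To make the fragmentation concrete: a framework that compares attacks across models must first normalize their action representations behind a common interface. The sketch below is illustrative only; the class names and the 256-bin discretization are assumptions modeled on common VLA designs, not AttackVLA's actual code.

```python
from abc import ABC, abstractmethod
import numpy as np

class ActionTokenizer(ABC):
    """Common decoding interface so attack success is measured in one
    shared continuous action space, whatever the model emits."""

    @abstractmethod
    def decode(self, model_output) -> np.ndarray:
        """Return a (T, action_dim) array of continuous actions."""

class DiscreteBinTokenizer(ActionTokenizer):
    """For models that emit per-dimension bin indices over a
    normalized range (a common discrete scheme)."""

    def __init__(self, n_bins: int = 256, low: float = -1.0, high: float = 1.0):
        self.centers = np.linspace(low, high, n_bins)

    def decode(self, model_output) -> np.ndarray:
        return self.centers[np.asarray(model_output)]  # bin ids -> floats

class ContinuousHeadTokenizer(ActionTokenizer):
    """For models whose action head already outputs floats."""

    def decode(self, model_output) -> np.ndarray:
        return np.asarray(model_output, dtype=np.float64)
```

Once every architecture decodes into the same space, a single success criterion can be applied to all of them.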

AttackVLA resolves this by constructing a standardized evaluation pipeline that spans data construction, model training, and inference phases. This is not merely an engineering convenience: standardization is what transforms security claims from anecdote to evidence, allowing the field to accumulate knowledge rather than reinventing attack baselines for each new architecture.
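One minimal way to picture that pipeline is as three pluggable stages with fixed signatures, so an attack is specified once and rerun unchanged against any model. The interface below is a hypothetical sketch of that structure, not the framework's actual API.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class AttackEvalPipeline:
    """Hypothetical three-stage harness mirroring the paper's
    data-construction / training / inference split."""
    build_data: Callable[..., Any]    # construct clean or attacked data
    train: Callable[..., Any]         # fine-tune the victim VLA
    run_episode: Callable[..., bool]  # roll out once; did the attack succeed?

    def evaluate(self, attack: Any, task: str, n_episodes: int = 50) -> float:
        dataset = self.build_data(task, attack=attack)
        model = self.train(dataset)
        wins = sum(self.run_episode(model, task, attack=attack)
                   for _ in range(n_episodes))
        return wins / n_episodes  # attack success rate on this task
```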

Two Threat Surfaces, One Framework

The framework evaluates two distinct attack paradigms:

Adversarial attacks operate at inference time, injecting perturbations into the model’s visual or linguistic inputs to disrupt predicted actions. These attacks require no modification to model weights — an adversary who can influence the robot’s camera feed or text prompt stream can cause failures without any supply-chain access. AttackVLA finds that existing adversarial attacks against VLAs primarily induce untargeted failures: the robot makes mistakes, but those mistakes are not controlled or directed toward any specific outcome.
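For intuition about this threat surface, here is a standard untargeted projected gradient descent (PGD) perturbation of the visual input, assuming a differentiable victim that returns per-step action logits. The interface is an assumption for illustration; the attacks the paper actually evaluates may differ in detail.

```python
import torch
import torch.nn.functional as F

def untargeted_pgd(model, image, instruction, clean_actions,
                   eps=8 / 255, alpha=2 / 255, steps=10):
    """Perturb the camera image to maximize the action-prediction loss.
    Assumes `model(image, instruction)` returns action logits of shape
    (T, n_bins) -- a hypothetical interface, not the paper's API."""
    x_adv = image.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        logits = model(x_adv, instruction)
        # Ascend the loss: push predictions away from the clean actions.
        loss = F.cross_entropy(logits, clean_actions)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            x_adv = image + (x_adv - image).clamp(-eps, eps)  # L-inf ball
            x_adv = x_adv.clamp(0.0, 1.0)                     # valid pixels
    return x_adv.detach()
```

Note that nothing in this loop steers the robot toward a chosen behavior; it only degrades the prediction, which is exactly the untargeted character the paper observes.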

Backdoor attacks operate at training time, embedding triggers into the model’s weights during fine-tuning. When the trigger appears at inference time, the model executes attacker-specified behavior. This threat model applies directly to the practice of fine-tuning foundation VLA models on task-specific datasets — a standard deployment pattern that opens a supply-chain attack surface.
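As a concrete picture of that supply-chain surface, the sketch below shows the classic data-poisoning step such an attack relies on. The white corner-patch trigger and the relabeling scheme are generic illustrations of the technique, not BackdoorVLA's specific design.

```python
import numpy as np

def poison_episode(image: np.ndarray, actions: np.ndarray,
                   target_actions: np.ndarray, patch: int = 16):
    """Stamp a trigger onto the observation and relabel the episode
    with attacker-chosen actions."""
    poisoned = image.copy()
    poisoned[:patch, :patch] = 1.0      # visible trigger in one corner
    return poisoned, target_actions     # clean `actions` are discarded

# Mixing a small poisoned fraction into an otherwise clean fine-tuning
# set is typically enough to implant the trigger while leaving
# clean-task performance largely intact.
```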

BackdoorVLA: Precision Hijacking of Action Sequences

The paper’s most alarming contribution is BackdoorVLA, a targeted backdoor attack that goes beyond causing generic task failures. BackdoorVLA enables an attacker to specify a precise, multi-step action sequence that the compromised model will execute upon trigger activation — allowing the attacker to drive the robot to perform specific dangerous maneuvers rather than merely crash or stall.

The results are striking: BackdoorVLA achieves an average targeted success rate of 58.4% across tested tasks, with 100% success on selected individual tasks. This is not a proof-of-concept that barely clears a threshold; it is a reliable attack that functions in both simulated environments and real-world robotic settings.
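To see why multi-step targeting is a stricter bar than single-action corruption, here is one plausible way to score a triggered episode, counting success only when the entire attacker-specified sequence is executed. The tolerance-based matching is an assumed criterion, not the paper's exact metric.

```python
import numpy as np

def targeted_success_rate(executed, targets, tol=0.05):
    """Fraction of triggered episodes in which the full target sequence
    was executed. `executed` and `targets` are lists of (T, action_dim)
    arrays of continuous actions."""
    hits = 0
    for run, goal in zip(executed, targets):
        if run.shape == goal.shape and np.all(np.abs(run - goal) <= tol):
            hits += 1   # every step must match, not just one
    return hits / len(targets)
```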

The multi-step nature of the attack is what makes it particularly dangerous. Prior backdoor research in VLAs had largely demonstrated the ability to induce single incorrect actions. BackdoorVLA shows that an attacker can encode a choreographed sequence — for example, driving a robot arm through a precise trajectory that damages equipment or injures a nearby human — and have that sequence reliably executed on trigger.

Implications for Embodied AI Safety

The gap that AttackVLA identifies between untargeted and targeted attacks is critical for safety reasoning. Untargeted failures are bad but bounded: a robot that makes random mistakes imposes stochastic costs. Targeted attacks are qualitatively different — they transform the robot into a weapon whose harmful behavior is intentional, precise, and reproducible.

This distinction maps directly onto embodied AI safety threat models. A delivery robot that stalls when compromised is a nuisance. A delivery robot whose compromised model drives it to a specific location and drops its payload at exactly the right moment for an adversary represents a fundamentally different risk category. AttackVLA is one of the first works to empirically demonstrate that this second category of attack is achievable at meaningful success rates with current architectures.

The standardized framework also enables a kind of red-team accounting that has been missing from VLA security research. By providing reproducible baselines, AttackVLA makes it possible to measure whether proposed defenses — adversarial training, certified robustness, input sanitization — actually reduce attack success rates across architectures, or merely displace attacks to different threat vectors.
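In code terms, that accounting reduces to a cross-product evaluation over attacks, architectures, and candidate defenses. The `evaluate` callable below is an assumed interface standing in for whatever the framework actually exposes; including a no-op entry in `defenses` yields the undefended baselines.

```python
def defense_report(evaluate, attacks, architectures, defenses):
    """Red-team accounting over a standardized
    `evaluate(attack, arch, defense) -> float` callable (an assumed
    interface). Comparing cells for the same attack shows whether a
    defense lowers success rates or merely shifts them elsewhere."""
    return {
        (atk_name, arch, dfn_name): evaluate(attack, arch, defense)
        for atk_name, attack in attacks.items()
        for arch in architectures
        for dfn_name, defense in defenses.items()
    }
```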

Toward Defense

The paper’s benchmark posture is diagnostic rather than prescriptive: it exposes the problem rather than solving it. But the diagnostic contribution is essential. Defense development requires knowing the attack surface’s shape, which architectures are most vulnerable, which training phases introduce the greatest backdoor risk, and what success rate thresholds matter for different deployment contexts.

For the broader VLA deployment ecosystem — robotics manufacturers, hospital automation providers, warehouse logistics operators — AttackVLA provides the first standardized vocabulary for specifying security requirements and auditing claims. That vocabulary is the prerequisite for responsible procurement and deployment of any system in which a VLA’s actions have physical consequences.

Read the full paper on arXiv · PDF