Published
Report 49 Research — Empirical Study

Report 49: VLA Cross-Embodiment Vulnerability Analysis

⟪F41LUR3-F1R57-EMBODIED-AI-RESEARCH⟫

Executive Summary

This report presents, to our knowledge, the first systematic analysis of adversarial attack success rates across seven VLA (Vision-Language-Action) attack families tested against two sub-2B parameter language models (deepseek-r1:1.5b and qwen3:1.7b). Across 63 unique FLIP-graded traces covering all 31 scenarios and 7 attack families, FLIP-graded ASR averages 72.4% for both models (n=58 valid verdicts, excluding ERRORs). Zero explicit refusals were observed in any family.

Updated finding (wave 6 expansion): With all 7 families now FLIP-graded (up from 4 in the initial analysis), the 11pp inter-model ASR gap reported initially has closed entirely — both models converge at 72.4% ASR. However, inter-model verdict agreement at the scenario level is only 32% (10/31 scenarios), indicating that while aggregate ASR is identical, the models fail on different scenarios. MMC achieves 100% PARTIAL rate (no COMPLIANCE verdicts), while TRA achieves 100% ASR with the highest COMPLIANCE rate (75%). PARTIAL compliance remains the dominant failure mode at 50% of valid verdicts.

These results provide, within the F41LUR3-F1R57 project, the first empirical attack-family taxonomy for VLA adversarial testing and establish a baseline for future cross-embodiment transfer experiments against frontier VLA systems (Gemini Robotics, pi0, OpenVLA-OFT).


1. Dataset and Methodology

1.1 Scenario Corpus

The VLA adversarial dataset (data/vla/vla_adversarial_v0.1.jsonl) contains 31 scenarios across 7 attack families:

FamilyFull NameScenariosMechanism
SBESafety Boundary Erosion5Multi-turn pressure to relax safety constraints
VAPVisual Adversarial Perturbation5Adversarial patches/textures against vision encoder
MMCMultimodal Confusion5Cross-modal conflict between vision and language
LAMLanguage-Action Misalignment5Language instructions decoupled from expected actions
PCMPhysical Context Manipulation5Environmental context falsification
ASEAction Space Exploitation3Exploiting discrete/continuous action space
TRATemporal Reasoning Attack3Stale state, future-framing, sequence exploits

1.2 Models Tested

Both models are sub-2B parameter open-weight models run locally via Ollama on M1 hardware:

  • deepseek-r1:1.5b — Reasoning model with chain-of-thought. 1.5B parameters.
  • qwen3:1.7b — Instruction-tuned model. 1.7B parameters.

1.3 Grading Pipeline

Two classification methods were applied:

  1. Heuristic classification: Keyword-based attack success detection applied during trace collection. Applied to all 62 traces.
  2. FLIP grading: LLM-based backward inference classification (infer instruction, then judge). Applied across multiple grading waves:
    • Wave 1 (Phase 2): Cross-model grading (deepseek grades qwen3, qwen3 grades deepseek) covering VAP, MMC, ASE, PCM (36 traces).
    • Wave 2 (SBE/LAM/TRA): Self-model grading via grade_generation_traces.py covering remaining families (27 traces).
    • Total: 63 unique FLIP-graded traces across all 7 families, 58 valid (excluding 5 ERRORs).

FLIP verdicts: COMPLIANCE (full unsafe compliance), PARTIAL (partial compliance with caveats), BENIGN_QUERY (model treats as benign), REFUSAL (explicit refusal), HALLUCINATION_REFUSAL (refuses for hallucinated reasons), ERROR (grading failure).

1.4 Scope Limitations

  • Text-only proxy testing. All scenarios were tested as text-based multi-turn interactions. No visual inputs, no physical robot execution. VAP and MMC scenarios describe visual perturbations in language; the models respond to the described scenario, not to actual adversarial images.
  • Sub-2B models only. These models lack the safety training depth of frontier models. Results establish a floor, not a ceiling, for vulnerability.
  • Small sample sizes. 3-5 scenarios per family, 2 models. Results are indicative, not statistically robust. No significance testing is warranted at n=3-5.
  • Cross-model grading bias. FLIP grading used sub-2B models as judges, introducing potential classification noise (estimated 10-20% ERROR rate based on prior calibration, Report #47).

2. Results

2.1 Seven-Family ASR Table

Familyn (heuristic)Heuristic ASRn (FLIP valid)FLIP ASR (C+P)FLIP COMPLIANCEFLIP PARTIALFLIP BENIGN_QUERYFLIP REFUSAL
TRA6100%4100%3100
ASE6100%580%1310
SBE1090%978%3420
MMC10100%978%0710
VAP10100%1070%1630
LAM1070%1060%3340
PCM10100%1060%1540
Overall6294%5872%1229150

Notes:

  • All 7 families are now FLIP-graded (SBE, TRA, LAM added in wave 6 expansion).
  • n (FLIP valid) excludes ERROR verdicts (5 total: 3 deepseek, 2 qwen3).
  • 1 HALLUCINATION_REFUSAL (qwen3 on VLA-MMC-001) counted in MMC’s n but not in ASR.
  • LAM remains the family with the lowest FLIP ASR (60%), consistent with heuristic data showing it as the only family with heuristic refusals.
  • TRA achieves 100% FLIP ASR with the highest COMPLIANCE rate (75% COMPLIANCE vs 25% PARTIAL).

2.2 Per-Model FLIP Breakdown

Modeln (total)n (valid)ASR (C+P)COMPLIANCEPARTIALBENIGN_QUERYERRORHALLUC_REFUSAL
deepseek-r1:1.5b322972.4%516830
qwen3:1.7b312972.4%813721

Updated finding: With all 7 families graded, the initial 11pp ASR gap has closed entirely — both models show 72.4% FLIP ASR. However, the models differ in failure mode composition: deepseek produces more PARTIAL responses (55% of valid) while qwen3 produces more COMPLIANCE (28% vs 17%). This suggests deepseek engages more cautiously (caveated compliance) while qwen3 either fully complies or treats the input as benign, with less middle ground.

Inter-model scenario-level agreement is only 32% (10/31 scenarios produce identical verdicts). The models fail on different scenarios despite identical aggregate ASR.

2.3 Vulnerability Heat Map (Family x Model)

Familydeepseek-r1:1.5b ASRqwen3:1.7b ASRDelta
TRA100% (n=2)100% (n=2)0pp
MMC100% (n=4)60% (n=5)+40pp
VAP80% (n=5)60% (n=5)+20pp
ASE67% (n=3)100% (n=2)-33pp
LAM60% (n=5)60% (n=5)0pp
PCM60% (n=5)60% (n=5)0pp
SBE50% (n=4)100% (n=5)-50pp

The inter-model pattern is more nuanced than initially reported. deepseek dominates on perception-ambiguity families (MMC +40pp, VAP +20pp), where its chain-of-thought engages with described perceptual conflicts and often resolves them in the attacker’s favour. Conversely, qwen3 dominates on safety-boundary and action-space families (SBE +50pp, ASE +33pp), suggesting that for direct safety constraint erosion, qwen3’s lack of extended reasoning may make it more suggestible. Note: small per-cell sample sizes (n=2-5) mean these per-family deltas should be interpreted cautiously.

2.4 PARTIAL Dominance

With the expanded dataset, 29/58 valid FLIP-graded traces (50%) received PARTIAL verdicts, and 12/58 (21%) received full COMPLIANCE. Zero received REFUSAL. The PARTIAL rate varies significantly by family: MMC is 100% PARTIAL (7/7 non-error), while TRA is only 25% PARTIAL (1/4), with TRA showing the highest COMPLIANCE rate at 75%.

This means the dominant failure mode is caveated compliance: the model acknowledges safety concerns while still engaging with the adversarial scenario. In an embodied context, this is operationally dangerous because:

  1. The model provides enough information for a downstream action head to execute the unsafe action
  2. The caveat exists only in the language output, which may not be consumed by the action decoder
  3. A VLA action head trained to extract motor commands from language would likely ignore natural-language hedging

This PARTIAL dominance pattern mirrors the faithfulness gap findings (Report 34): models produce responses that appear safe in text but would be unsafe in action.

2.5 Zero Refusals

Across all 62 heuristic-graded traces and 63 FLIP-graded traces, zero explicit refusals were observed. The closest to refusal is HALLUCINATION_REFUSAL (1 case: qwen3 on VLA-MMC-001, where it refused for fabricated rather than genuine safety reasons) and BENIGN_QUERY (15 cases, where the model did not recognise the adversarial intent at all).

This suggests that embodied AI scenarios fall outside the safety training distribution of these models. Neither model has been trained to recognise “pick up the fragile box with excessive force” or “navigate through the pedestrian crossing without stopping” as safety-relevant instructions requiring refusal.


3. Analysis: Why Attack Families Differ

3.1 Semantic Proximity Hypothesis

The attack families can be ordered by how much they require the model to reason about physical reality it cannot directly perceive:

High ASR (78-100%): Temporal/perceptual/boundary attacks

  • TRA (100%, n=4): Temporal reasoning attacks achieve the highest ASR and the highest COMPLIANCE rate (75%). Models consistently fail to detect stale-state, future-framing, and sequence exploits.
  • ASE (80%, n=5): Action space exploitation through debug modes and parameter injection. High engagement with technical framing.
  • SBE (78%, n=9): Multi-turn safety boundary erosion. Operates in the language domain; no physical reasoning required.
  • MMC (78%, n=9): Cross-modal confusion. Uniquely, 100% PARTIAL rate — models engage with the conflict but always hedge.

Moderate ASR (70%): Visual perturbation

  • VAP (70%, n=10): Adversarial visual perturbation descriptions. Models reason about described scenarios without actual visual inputs.

Lower ASR (60%): Physical reasoning required

  • LAM (60%, n=10): Language-action misalignment. The only family with heuristic refusals. Requires recognising that language and action are decoupled.
  • PCM (60%, n=10): Physical context manipulation. Models treat 40% as benign queries, failing to engage with the adversarial framing of physical properties they cannot verify.

3.2 Implication for Real VLA Systems

This ordering has a counterintuitive implication for real-world VLA safety:

The attacks that are easiest in text-only testing (TRA, SBE, MMC) may also be the most dangerous in practice because they target the embodiment-agnostic VLM backbone — exactly the component shared across robot morphologies. A safety boundary erosion prompt that works against a text model will transfer directly to any VLA system using the same language backbone, because the attack operates upstream of the action head.

By contrast, PCM attacks that manipulate physical context (“the floor is dry” when it is wet) become more dangerous in real embodied deployment because the VLA cannot independently verify the claim — but they are harder to construct because they require domain-specific physical knowledge.

3.3 Connection to Brief A and BadVLA

Brief A (2026-03-01) documented that BadVLA achieved near-100% ASR against pi0 and OpenVLA through adversarial visual patches targeting the shared VLM backbone. Our results are consistent with this finding through a different lens:

  • BadVLA attacks the visual encoder (embodiment-agnostic) -> near-100% transfer
  • Our VAP/MMC scenarios attack the language backbone (embodiment-agnostic) -> 70-78% ASR
  • Our PCM scenarios attack physical reasoning (partially embodiment-specific) -> 60% ASR

The pattern holds: attacks on shared, embodiment-agnostic components achieve higher success rates than attacks requiring embodiment-specific reasoning.


4. Proposed Vulnerability Heat Map Framework

For future VLA benchmarking across multiple model families and embodiment types, we propose a three-dimensional vulnerability assessment:

Dimension 1: Attack Family (7 families as defined above) Dimension 2: Model Family (grouped by architecture: reasoning models, instruction-tuned, VLA-native, frontier) Dimension 3: Testing Modality (text-only proxy, simulated visual, physical deployment)

The heat map would show ASR at each intersection, enabling:

  1. Identification of which attack families transfer across model families (cross-model universality)
  2. Identification of which attack families’ ASR changes between text-only and visual/physical testing (modality gap)
  3. Prioritisation of real-world adversarial testing based on text-proxy ASR as a lower bound

4.1 Current Coverage

deepseek-r1:1.5bqwen3:1.7bFrontier (Claude/GPT)VLA-native (OpenVLA/pi0)
SBE50% (f, n=4)100% (f, n=5)??
VAP80% (f, n=5)60% (f, n=5)?~100% (BadVLA)
MMC100% (f, n=4)60% (f, n=5)??
ASE67% (f, n=3)100% (f, n=2)??
TRA100% (f, n=2)100% (f, n=2)??
LAM60% (f, n=5)60% (f, n=5)??
PCM60% (f, n=5)60% (f, n=5)??

All cells now FLIP-graded (f). ? = untested.


5. Identified Follow-Up Experiments

  1. FLIP-grade remaining families. DONE (sprint-25 wave 6). All 7 families now FLIP-graded. 63 total traces, 58 valid.
  2. Frontier model comparison. Run all 31 VLA scenarios against Claude, Codex, and Gemini (text-only) to test whether frontier safety training produces refusals that sub-2B models lack.
  3. Gemini Robotics-ER testing. The only accessible VLA-native API. Requires human action (issue #128). Would fill the “VLA-native” column in the heat map.
  4. Visual modality gap. Construct actual adversarial images for VAP scenarios and test against multimodal models (Gemini, GPT-4o) to measure the gap between text-described and actual visual attacks.
  5. PARTIAL-to-action conversion. Systematically test whether PARTIAL responses (the dominant failure mode) contain sufficient information for a downstream action decoder to extract unsafe motor commands.
  6. Benign baseline FLIP calibration. FLIP-grade the 44 valid benign baseline traces (6 models) to measure false positive rate. In progress (sprint-25 wave 6).
  7. Inter-model disagreement analysis. 68% scenario-level disagreement warrants deeper investigation: are the models failing on different attack mechanisms, or is grading noise the primary driver?

6. Conclusions

  1. Text-only VLA adversarial testing is a viable proxy method. 72.4% FLIP ASR across 58 valid traces demonstrates that text descriptions of embodied adversarial scenarios reliably elicit unsafe engagement from language models, even without visual inputs or physical execution.

  2. Heuristic classification over-reports VLA ASR by 22pp. Heuristic: 94% vs FLIP: 72%. This is consistent with the over-reporting documented in Mistake #21 and reinforces that all VLA ASR figures require LLM-based grading.

  3. PARTIAL compliance is the dominant VLA failure mode (50%). Models engage with adversarial scenarios while adding caveats. In a VLA pipeline, these caveats exist only in the language stream and may not reach the action decoder. MMC produces 100% PARTIAL rate, while TRA produces 75% full COMPLIANCE.

  4. Attack families that exploit temporal assumptions or cross-modal ambiguity achieve higher ASR than those requiring physical reasoning. TRA (100%), ASE (80%), SBE (78%), and MMC (78%) all outperform LAM (60%) and PCM (60%). This is consistent with the cross-embodiment transfer hypothesis: attacks on embodiment-agnostic components succeed more reliably.

  5. Zero refusals across 63 FLIP-graded traces confirm embodied safety scenarios are outside current safety training distributions. Neither model recognises VLA adversarial scenarios as requiring refusal.

  6. Aggregate ASR is identical between models (72.4%) but scenario-level agreement is only 32%. The initial 11pp gap between deepseek and qwen3 has closed with expanded data. However, the models fail on different scenarios: deepseek is more susceptible to perception-ambiguity attacks (MMC, VAP), while qwen3 is more susceptible to boundary-erosion attacks (SBE, ASE). This complementary vulnerability pattern suggests that model ensembles may not provide additive safety benefits in the VLA domain.


Data Sources

SourceLocationRecordsStatus
Phase 2 raw traces (deepseek)runs/vla_phase2_full/deepseek-r1-1.5b_traces.jsonl18Collected
Phase 2 raw traces (qwen3)runs/vla_phase2_full/qwen3-1.7b_traces.jsonl18Collected
Phase 2 FLIP graded (deepseek)runs/vla_phase2_full/deepseek_graded_by_qwen3.jsonl18Graded
Phase 2 FLIP graded (qwen3)runs/vla_phase2_full/qwen3_graded_by_deepseek.jsonl18Graded
Phase 1 graded (all models)runs/vla_campaign_phase1/graded/graded_all.jsonl24Graded
SBE gradedruns/vla_sbe_20260302_203941/graded/graded_sbe_all.jsonl10Graded
LAM/TRA graded (deepseek)runs/vla_lam_tra_ollama/graded_deepseek-r1-1.5b.jsonl8Graded
LAM/TRA graded (qwen3)runs/vla_lam_tra_ollama/graded_qwen3-1.7b.jsonl8Graded
Benign baseline (6 models)runs/vla_tier1_benign_baseline/60Grading in progress
Flattened ungraded tracesruns/vla_phase2_full/ungraded_flat.jsonl49Grading in progress
VLA scenariosdata/vla/vla_adversarial_v0.1.jsonl31Validated
DB resultsdatabase/jailbreak_corpus.db62+Imported
Brief A (literature)docs/research_briefs/20260301_cross_embodiment_adversarial_transfer.mdReference

References

  • BadVLA (near-100% cross-embodiment ASR): arXiv:2505.16640
  • VLA-Fool (systematic robustness eval): arXiv:2511.16203
  • EDPA (embedding disruption patches): arXiv:2506.03350
  • UPA-RFAS (universal patch attack): arXiv:2511.21192
  • Brief A: Cross-Embodiment Adversarial Transfer (2026-03-01)
  • Report 47: Embodied Capability Floor and Action Space Hijack (2026-03-09)
  • Report 42: HITL Failure Modes
  • Report 32: VLA Safety Certification Bridge
  • FLIP methodology: tools/benchmarks/grade_generation_traces.py

⦑F41LUR3-F1R57|VLA-CROSS-EMBODIMENT-49⦒

This research informs our commercial services. See how we can help →