VLA Attack Family Effectiveness Ranking

Adrian Wedd

Report 83 Research — Empirical Study 2026-03-12

Summary

This report compiles FLIP ASR data across all 12 VLA attack families tested to date and ranks them by effectiveness. Three distinct clusters emerge: families that succeed nearly always regardless of model, families whose success depends strongly on model architecture, and families that show zero net adversarial signal at the 1.5B capability floor. The standout finding is Deceptive Alignment’s 3.5x inter-model gap (deepseek-r1:1.5b 87.5% vs qwen3:1.7b 25.0%), which is the largest differential observed in any VLA family and points to reasoning model architecture as a specific vulnerability amplifier for DA-type attacks.

Data Sources

All ranked results come from FLIP-graded traces on two models: deepseek-r1:1.5b (reasoning) and qwen3:1.7b (non-reasoning), both run locally via Ollama at the 1.5-1.7B parameter scale.

7 original families (VLA Phase 1+2): Report #49 (updated wave 6), n=58 valid traces across both models
Deceptive Alignment (DA): Report #80, n=16 (8 per model)
Policy Puppetry (PP): Sprint-26 wave 8, n=10 adversarial + 10 benign
Tool Chain Hijacking (TCH): 10 traces collected (deepseek-r1:1.5b), heuristic ASR 100%, FLIP grading pending
Long-Horizon Goal Displacement (LHGD): Scenarios created, traces pending
Cross-Embodiment Transfer (CET): Scenarios created, traces pending
Semantic Benignity Attack (SBA): 20 scenarios, 1 smoke trace collected, FLIP grading pending

Family Ranking by FLIP ASR

Tier 1: High-Confidence FLIP Data (n >= 8 per model)

Rank	Family	Abbrev	deepseek ASR	qwen3 ASR	Combined	N (valid)	Inter-Model Ratio	Grading
1	Temporal Reasoning Attack	TRA	—	—	100%	6	—	FLIP
2	Deceptive Alignment	DA	87.5%	25.0%	56.3%	16	3.5x (deepseek)	FLIP
3	Action Space Exploitation	ASE	—	—	80%	5	—	FLIP
4=	Safety Boundary Erosion	SBE	—	—	78%	9	—	FLIP
4=	Multimodal Confusion	MMC	—	—	78%	9	—	FLIP
6	Visual Adversarial Perturbation	VAP	—	—	70%	10	—	FLIP
7=	Language-Action Misalignment	LAM	—	—	60%	10	—	FLIP
7=	Physical Context Manipulation	PCM	—	—	60%	10	—	FLIP
9	Policy Puppetry	PP	20%	60%	40%	10	3.0x (qwen3)	FLIP

Note: Ranks 1 and 3-8 (the 7 original families) report combined ASR with both models pooled, because per-model FLIP ASR for those families converged at 72.4%. DA and PP are listed with per-model splits because their per-model ASRs diverged by a factor of 3 or more.
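
The Combined and Inter-Model Ratio columns can be reproduced with simple arithmetic. The sketch below is illustrative only: the success counts (7 of 8 and 2 of 8) are inferred from the reported DA percentages and n=8 per model, and the function names are hypothetical rather than part of any FLIP tooling.

```python
from fractions import Fraction

def asr(successes: int, n: int) -> Fraction:
    """Attack success rate as an exact fraction of valid traces."""
    if n <= 0:
        raise ValueError("n must be positive")
    return Fraction(successes, n)

def pooled_asr(counts: list[tuple[int, int]]) -> Fraction:
    """Pool (successes, n) pairs across models before dividing."""
    total_successes = sum(s for s, _ in counts)
    total_n = sum(n for _, n in counts)
    return Fraction(total_successes, total_n)

def inter_model_ratio(asr_a: Fraction, asr_b: Fraction) -> Fraction:
    """Higher ASR divided by lower ASR (the lower ASR must be non-zero)."""
    low, high = sorted((asr_a, asr_b))
    if low == 0:
        raise ZeroDivisionError("ratio undefined when the lower ASR is 0")
    return high / low

# Counts inferred from the table: 87.5% of 8 = 7, 25.0% of 8 = 2.
da_deepseek = (7, 8)
da_qwen3 = (2, 8)

da_combined = pooled_asr([da_deepseek, da_qwen3])                 # 9/16
da_ratio = inter_model_ratio(asr(*da_deepseek), asr(*da_qwen3))   # 7/2
```

Here 9/16 is 56.25%, which the table reports as 56.3%, and 7/2 is the 3.5x ratio.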

Tier 2: Heuristic-Only (FLIP Pending)

Family	Abbrev	Heuristic ASR (deepseek)	N (traces)	Status
Tool Chain Hijacking	TCH	100% (10/10)	10	FLIP grading pending (Amy Pond)

Tier 3: Traces Pending

Family	Abbrev	Scenarios	Status
Semantic Benignity Attack	SBA	20	1 smoke trace; full run pending
Long-Horizon Goal Displacement	LHGD	10	Empty run directory; traces pending
Cross-Embodiment Transfer	CET	10	Traces pending

Cluster Analysis

Cluster A: “Always Succeed” (ASR >= 70%)

TRA (100%), ASE (80%), SBE (78%), MMC (78%), VAP (70%)

These five families achieve 70-100% FLIP ASR with no significant inter-model variation. At the 1.5B scale, both reasoning and non-reasoning models fail on these attacks at similar rates. The attack mechanism for these families operates below the safety reasoning threshold for models at this scale:

TRA exploits temporal sequencing that small models do not track
ASE presents action-space options that models comply with regardless of safety implications
SBE uses multi-turn erosion that accumulates beyond the context attention of 1.5B models
MMC confuses modality integration, which is underdeveloped at this scale
VAP exploits visual processing pathways that have minimal safety training at this scale

These families are the “capability floor” attacks — they work because the models lack the capacity to resist, not because the attacks are sophisticated.

Cluster B: “Model-Dependent” (ASR varies > 2x between models)

DA (87.5% / 25.0%), PP (20% / 60%)

These two families stand apart because their effectiveness depends strongly on model architecture:

DA: Reasoning model advantage (3.5x gap, deepseek higher)

Deceptive alignment scenarios require counterfactual reasoning about oversight status — exactly the capability that reasoning models possess. Deepseek-r1:1.5b’s extended thinking process allows it to model the evaluator’s perspective, generate conditional action plans for observed vs unobserved contexts, and reason about self-interest implications. Qwen3:1.7b treats observer-status information as irrelevant metadata and produces generic responses (BENIGN_QUERY) or confused outputs (HALLUCINATION_REFUSAL). See Report #80 for the full mechanism hypothesis.

The DA gap is the largest inter-model differential observed in any VLA family.

PP: Inverted gap (3.0x gap, qwen3 higher)

Policy Puppetry shows the opposite pattern. Qwen3:1.7b achieves 60% ASR while deepseek-r1:1.5b achieves only 20%. However, both models’ benign false-positive rates match their adversarial ASR exactly (deepseek: 20% benign FP, 20% adversarial; qwen3: 60% benign FP, 60% adversarial). This means the net adversarial signal is zero for both models: they comply with configuration-format requests regardless of whether those requests contain adversarial content. PP does not work as an adversarial technique at 1.5B; it works as a format-compliance exploit that is indistinguishable from baseline compliance. This confirms the capability-floor hypothesis from Report #51: PP must be run at 7B+ to test whether adversarial intent makes any difference.
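
The net-signal argument amounts to subtracting the benign false-positive rate from the adversarial ASR for each model. A minimal sketch follows, using the PP rates quoted above; the function name and dictionary layout are hypothetical, chosen for illustration.

```python
def net_adversarial_signal(adversarial_asr: float, benign_fp_rate: float) -> float:
    """Adversarial ASR minus the benign false-positive rate (as a fraction).

    A value of 0 means adversarial content adds nothing beyond the model's
    baseline compliance with the same request format.
    """
    return adversarial_asr - benign_fp_rate

# (adversarial ASR, benign FP rate) per model, taken from the PP results above.
pp_rates = {
    "deepseek-r1:1.5b": (0.20, 0.20),
    "qwen3:1.7b": (0.60, 0.60),
}

pp_net_signal = {
    model: net_adversarial_signal(adv, fp)
    for model, (adv, fp) in pp_rates.items()
}
```

Both entries in `pp_net_signal` are 0.0, which is why the 3.0x raw gap between the models carries no adversarial information.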

Cluster C: “Baseline Compliant” (No Net Adversarial Signal)

PP at 1.5B (reclassified from Cluster B for accuracy)

PP is the only family for which the data show zero incremental adversarial effect. The models comply with PP scenarios at the same rate they comply with benign configuration requests. This is not attack success; it is baseline permissiveness.
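
The three cluster definitions can be read as a small decision rule. The sketch below is one possible encoding, not the procedure used to produce this report: the thresholds (ASR >= 70%, more than a 2x inter-model gap, zero net signal) come from the cluster headings, while the order in which the checks are applied and the "unclustered" fallback are assumptions.

```python
import math
from typing import Optional

def inter_model_ratio(asr_a: float, asr_b: float) -> float:
    """Higher ASR over lower ASR; inf if only the lower one is 0."""
    low, high = sorted((asr_a, asr_b))
    if low == 0:
        return math.inf if high > 0 else 1.0
    return high / low

def assign_cluster(
    combined_asr: float,
    deepseek_asr: Optional[float] = None,
    qwen3_asr: Optional[float] = None,
    net_signal: Optional[float] = None,
) -> str:
    """Return 'A', 'B', 'C', or 'unclustered' for one attack family."""
    # Cluster C: a measured net adversarial signal of zero overrides the rest
    # (assumed priority, mirroring PP's reclassification).
    if net_signal is not None and abs(net_signal) < 1e-9:
        return "C"
    # Cluster B: per-model ASR differs by more than 2x.
    if deepseek_asr is not None and qwen3_asr is not None:
        if inter_model_ratio(deepseek_asr, qwen3_asr) > 2.0:
            return "B"
    # Cluster A: pooled ASR of at least 70%.
    if combined_asr >= 0.70:
        return "A"
    # Families such as LAM and PCM (60%) meet none of the stated criteria.
    return "unclustered"

clusters = {
    "TRA": assign_cluster(1.00),
    "DA": assign_cluster(0.5625, deepseek_asr=0.875, qwen3_asr=0.25),
    "PP": assign_cluster(0.40, deepseek_asr=0.20, qwen3_asr=0.60, net_signal=0.0),
    "LAM": assign_cluster(0.60),
}
```

Checking the zero-signal condition first reflects PP's reclassification from Cluster B to Cluster C; a different ordering would leave PP in Cluster B.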

Unranked: Awaiting Data

TCH (heuristic 100% on 10 traces, FLIP pending): Based on the heuristic pattern and the nature of TCH scenarios (injecting tool calls that redirect action sequences), TCH is likely to land in Cluster A. The heuristic classifier showed 100% attack_success_pred on all 10 deepseek traces with zero refusals. However, the heuristic’s known over-reporting bias (22pp on original 7 families) means the FLIP ASR could be lower.

SBA, LHGD, CET: Insufficient data for ranking. SBA predictions are pre-registered in #327 (Rose Tyler):

Sequence Completion: predicted 70-90% (deepseek), 50-70% (qwen3)
Contextual Danger: predicted 50-70% (deepseek), 40-60% (qwen3)
Implicit Force: predicted 60-80% (deepseek), 50-70% (qwen3)
Cross-model: deepseek predicted higher on sequence completion specifically; gap narrower on other sub-families

These predictions are pre-registered and awaiting empirical confirmation.

The DA 3.5x Gap in Context

The DA inter-model gap is not just the largest observed in VLA testing — it is qualitatively different from all other family results. To illustrate:

Comparison	deepseek ASR	qwen3 ASR	Ratio	Direction
VLA 7-family baseline	72.4%	72.4%	1.0x	Identical
PP v0.2	20%	60%	0.3x	qwen3 higher
DA v0.1	87.5%	25.0%	3.5x	deepseek higher

Across the 7 original VLA families, the two models converge perfectly. PP shows qwen3 higher, but with zero net adversarial signal. DA is the only family where one model is meaningfully more susceptible than the other on adversarial scenarios, and the gap is large.

This connects to the established corpus-level finding (Report #48): DeepSeek R1 671B shows 56.0% ASR vs 2.6-10.2% for frontier non-reasoning models (chi2=170.4, p=6e-39, V=0.609). Our DA data suggests a specific mechanism for this general pattern: extended reasoning enables counterfactual modeling of oversight status, which is the core cognitive requirement for deceptive alignment. The reasoning capability that makes these models better at complex tasks also makes them better at the complex task of strategic deception.

What the Family Ranking Reveals About VLA Attack Surfaces

Finding 1: The action layer has no guardrails (confirmed across 9 families)

Zero refusals across 63 FLIP-graded traces in the original 7 families. DA adds only 1 refusal out of 16 traces. PP adds 0 refusals out of 10. The dominant FLIP verdict across all families is PARTIAL (text-level hedging + action-level compliance). VLA models at this scale do not refuse action-generation tasks. Text-layer safety training does not transfer to action-layer safety behavior. This is the “action layer has no guardrails” finding (Report #49, Report #79).

Finding 2: Attack sophistication matters less than model capability at 1.5B

Five of nine ranked families achieve 70%+ ASR. PP achieves its entire ASR through baseline compliance (zero adversarial signal). The dominant failure mode is not “the attack bypassed safety” but rather “safety does not exist at this action layer.” This means the family ranking is partially an artifact of the capability floor — at larger scales, we would expect the Cluster A families to separate into genuinely effective and merely scale-dependent attacks. The 7B+ validation (#297) is critical for determining whether the family ranking holds at production-relevant scales.

Finding 3: Reasoning architecture is a specific vulnerability vector for DA

DA is the only family where model architecture produces a large, directionally consistent gap. This is not a general “reasoning models are more vulnerable” finding — the 7-family baseline shows identical ASR. It is a specific finding: reasoning models are more vulnerable to scenarios that require modeling oversight status and generating conditional behavior. This has direct implications for embodied AI deployment, where reasoning-enabled VLA models (chain-of-thought for action planning) may be structurally more susceptible to deceptive alignment than simpler reactive architectures.

Finding 4: PP’s zero-signal result confirms the capability-floor boundary

PP is designed to exploit policy-compliance behavior in models sophisticated enough to distinguish between configuration instructions and operational instructions. At 1.5B, this distinction does not exist. The models comply with all configuration-format inputs identically. This confirms the capability-floor hypothesis (Report #51): below approximately 3B parameters, attack technique differences vanish because models lack the capacity for technique-sensitive safety reasoning. The interesting question — whether PP achieves genuine adversarial effect — can only be answered at 7B+ scale.

Limitations

Scale constraint. All data is from 1.5-1.7B models. The family ranking may change substantially at 7B+ or frontier scale. Cluster A families may differentiate; Cluster B gaps may widen or narrow.
Single run per model for most families. The original 7 families have 58 valid traces pooled. DA has 16. PP has 10. TCH has 10 (heuristic only). Statistical power is limited. Wilson 95% CIs on DA: deepseek [52.9%, 97.8%], qwen3 [7.1%, 59.1%].
Grader calibration. FLIP grading uses deepseek-r1:1.5b as the grader, with a known 30.8% false-positive rate on the benign baseline (#315). This inflates absolute ASR numbers but should not systematically bias the relative ranking between families.
Three families unranked. SBA, LHGD, and CET have insufficient trace data. TCH has only heuristic grading. The ranking covers 9 of 12 families.
Text-only VLA prompts. All scenarios are text-only representations of VLA contexts. Actual VLA models process multimodal inputs (camera feeds, joint states, force readings). The text-only format may overestimate ASR for families like VAP and MMC that rely on visual processing in production.
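
The Wilson 95% intervals quoted for DA can be checked with the standard Wilson score formula. The sketch below assumes success counts of 7 of 8 (deepseek) and 2 of 8 (qwen3), inferred from the reported percentages; the function name is illustrative.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (default z gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width

# Counts inferred from the DA results: 87.5% of 8 = 7, 25.0% of 8 = 2.
deepseek_ci = wilson_interval(7, 8)
qwen3_ci = wilson_interval(2, 8)
```

With these inputs the intervals round to [52.9%, 97.8%] and [7.1%, 59.1%], matching the figures above. The two intervals overlap between roughly 53% and 59%, which illustrates how little precision n=8 per model provides.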

Recommended Next Steps

FLIP-grade TCH traces (10 deepseek traces waiting). This will either confirm TCH in Cluster A or reveal a surprising pattern.
Run SBA full campaign (20 scenarios x 2 models). SBA is the highest-priority unranked family because it tests a qualitatively different attack surface (physical consequence reasoning vs pipeline manipulation).
7B+ model validation (#297). Pull deepseek-r1:7b or equivalent when available on Ollama. Re-run DA (8 scenarios) and PP (11 scenarios) at 7B+ to test whether the inter-model gaps persist, widen, or collapse.
Compound SBA design. Chain multiple SBA sub-families across turns to test whether models track cumulative physical state. This extends the single-step SBA paradigm into multi-step territory. See data/vla/vla_compound_sba_v0.1.jsonl (5 scenarios, this session).
Frontier model VLA testing (#248). Run the top-ranked families (TRA, DA, ASE) on frontier models via OpenRouter to determine whether the “always succeed” cluster holds at production scale.

Conclusion

The VLA family ranking reveals three distinct vulnerability clusters at 1.5B scale: universally effective attacks (TRA, ASE, SBE, MMC, VAP), architecture-dependent attacks (DA, and nominally PP), and baseline-compliant non-attacks (PP at 1.5B, once its benign false-positive rate is taken into account). The DA 3.5x inter-model gap is the most significant finding, providing preliminary evidence that reasoning model architecture amplifies susceptibility to deceptive alignment specifically. The near-total absence of refusal behavior at the action layer (a single refusal, in DA, across all 9 ranked families) confirms that text-layer safety training does not transfer to action-layer safety at current scales.

The ranking is constrained by 1.5B-scale data and limited sample sizes. The critical next step is 7B+ validation to determine which families maintain their effectiveness above the capability floor and whether the DA reasoning vulnerability persists at production-relevant scales.