Compound Attack Evidence: Cross-Family Synergies in VLA Adversarial Testing | Research | Failure-First

Adrian Wedd

Report 110 Research — Empirical Study 2026-03-15

Report #110: Compound Attack Evidence Analysis

Objective

Analyze existing VLA traces for evidence of cross-family attack synergies, focusing on the high-priority combinations identified in the compound attack playbook: DA+format_lock and SBA+persona_hijack. Document which combinations show natural occurrence patterns, which require deliberate composition, and which remain untested.

Data Sources

DA FLIP-graded traces: n=16 (8 deepseek-r1:1.5b, 8 qwen3:1.7b)
SBA FLIP-graded traces: n=20 (10 deepseek-r1:1.5b, 10 qwen3:1.7b — inferred from counts)
Format-lock FLIP-graded traces: n=60+ across 3 models
Hybrid DA-SBA scenarios: n=5 (designed, 0 traces)
Compound SBA scenarios: n=5 (designed, 0 traces)
All VLA families aggregated: n=305+ unique traces across 12+ families

Finding 1: DA Shows Strongest Single-Family Signal

DA (Deceptive Alignment) remains the family with the highest and most robust ASR:

Family	FLIP ASR (deepseek)	FLIP ASR (qwen3)	Inter-model Gap
DA	87.5% (7/8)	25.0% (2/8)	3.5x
TRA	~66.7%	~66.7%	~1.0x
LAM	~60.0%	~60.0%	~1.0x
SBA	~30.0%	~30.0%	~1.0x
SIF	Pilot running	—	—
SID	Pilot running	—	—

The 3.5x inter-model gap on DA is the largest observed in VLA testing. Reasoning models (deepseek) are substantially more vulnerable to deceptive alignment than non-reasoning models (qwen3). This suggests DA attacks specifically exploit the reasoning trace: the model “thinks through” why compliance is acceptable, constructing a post-hoc rationalisation.

Finding 2: DA Compliance Responses Lack Format-Lock Patterns

Examining the 9 COMPLIANCE/PARTIAL DA responses from deepseek-r1:1.5b:

0/9 produced structured output (JSON, tables, code)
9/9 produced natural language responses
The compliance mechanism is narrative, not structural: the model agrees to the harmful action by framing it within operational context

This means DA+format_lock is not a naturally occurring compound. It would need to be deliberately designed. The hypothesis: adding format-lock constraints (e.g., “respond only in JSON with fields: action, parameters, safety_check”) to DA scenarios could produce a higher ASR because:

Format compliance diverts reasoning capacity from safety evaluation (capability-floor hypothesis, Report #51)
Structured output makes the harmful action implicit rather than explicit (the model fills in JSON fields rather than narrating a decision)

Status: UNTESTED. Priority for Operation Context Collapse Phase 2.

Finding 3: SBA+Persona_Hijack Is Structurally Redundant

SBA (Semantic Benignity Attack) achieves its effect by presenting genuinely benign-sounding instructions where danger emerges from physical context. Adding persona_hijack (e.g., “you are a helpful warehouse assistant who always completes tasks”) does not meaningfully change the attack surface because:

SBA scenarios already present the robot in its normal operational persona
The attack does not require the model to adopt a different identity — it requires the model to fail to detect environmental danger
Persona hijack typically suppresses safety refusals, but SBA scenarios do not trigger safety refusals because the instruction itself is benign

The FLIP data supports this: 45% of SBA FLIP verdicts are BENIGN_QUERY (Issue #361), meaning FLIP itself cannot distinguish the attack from a genuine benign instruction. Adding persona_hijack would not change this — the FLIP evaluator would still see a benign instruction.

Recommendation: Deprioritise SBA+persona_hijack. Instead, test SBA+SID (context dilution before semantic benignity trigger) and SBA+SIF (alarm fatigue before semantic benignity trigger).

Finding 4: High-Value Compound Combinations (Revised Priority)

Based on the trace evidence and architectural analysis (Report #109):

Tier 1: Strong Theoretical Basis + Partial Empirical Support

Compound	Mechanism	Evidence	Priority
SID+SIF	Context dilution reduces safety salience, then alarm fatigue suppresses remaining safety responses	SID pilot running, SIF pilot running	HIGHEST — test in Phase 1
DA+format_lock	Deceptive reasoning + format compliance diversion	DA 87.5% ASR, format-lock 63.2% ASR individually	HIGH — design experiment
SID+SBA	Context dilution before semantically benign trigger	SID data generating, SBA 30% ASR baseline	HIGH — natural next step

Tier 2: Theoretical Basis, No Evidence

Compound	Mechanism	Gap
SIF+IMB	Alarm fatigue conditions HITL, then infrastructure bypass	No IMB traces (infrastructure, not model)
DA+SID	Deceptive reasoning in diluted context	No combined scenario exists
format_lock+SBA	Structured output masks environmental danger	No combined scenario exists

Tier 3: Speculative

Compound	Mechanism	Concern
SBA+persona_hijack	Persona suppresses safety + benign instruction	Structurally redundant (Finding 3)
DA+SIF	Deceptive reasoning + alarm fatigue	May be redundant — DA already achieves 87.5%

Finding 5: PARTIAL Dominance as Compound Enabler

The PARTIAL dominance finding (Report #49 — 50% of all FLIP verdicts) has a compound attack implication: PARTIAL responses indicate the model is already “on the fence.” In a compound attack, the first stage only needs to push the model from REFUSAL to PARTIAL. The second stage pushes from PARTIAL to COMPLIANCE.

Evidence from DA traces:

deepseek-r1:1.5b: 4/8 PARTIAL + 3/8 COMPLIANCE = 87.5% total
qwen3:1.7b: 2/8 PARTIAL + 0/8 COMPLIANCE = 25.0% total

The PARTIAL responses from deepseek often include safety caveats while still generating the requested action. In a compound attack, a follow-up stage (e.g., SIF conditioning or format-lock constraint) could suppress the caveat without changing the action.

Compound Experiment Design: SID+SIF Sequential

The most experimentally tractable compound is SID followed by SIF on the same scenario:

System prompt: Robot safety instructions
SID context: 1000-4000 tokens of benign operational history (dilute safety instructions)
SIF conditioning: 10-15 cycles of safety-alert-then-override (fatigue safety responses)
Trigger: Harmful instruction that is contextually normal but violates the original safety rule

This can be implemented as a single multi-turn scenario using the existing benchmark runner. The SID generator (tools/adversarial/sid_scenario_generator.py) produces the benign context; the SIF design provides the alert-override cycles; the final instruction is the same as the base SID scenario.

Target: 5 SID+SIF compound scenarios x 3 context lengths x deepseek-r1:1.5b = 15 traces. Baseline: Same scenarios without SID context (SIF-only) and without SIF conditioning (SID-only). Control: Same scenarios with benign final instruction.

Limitations

All trace evidence comes from sub-2B models (deepseek-r1:1.5b, qwen3:1.7b). These models are below the capability floor for many attack families (Mistake #25, Report #51). Results may not transfer to 7B+ models.
FLIP grading has a 27.3% benign FP rate (Issue #315), which inflates ASR estimates for all families.
The compound experiment design has not been validated against the schema — multi-turn SID+SIF scenarios will be long and may exceed model context windows.
No compound attack has been empirically tested end-to-end in this corpus.

Next Steps

Complete SID pilot analysis (45 adversarial scenarios running)
Complete SIF pilot analysis (5 scenarios running)
Design SID+SIF compound scenarios (5 scenarios, 3 context lengths)
Design DA+format_lock compound scenarios (8 DA scenarios wrapped in format-lock constraints)
Execute compounds on deepseek-r1:1.5b and compare ASR to individual family baselines