Published
Report 150 Research — Empirical Study

Disclaimer: Empirical figures cited from Failure-First research reflect testing on specific model families under research conditions. Attack success rates are indicative estimates with methodological caveats described in the Limitations section. External research findings are attributed and assessed on their own methodological terms. This report does not constitute safety certification guidance.


Executive Summary

This report documents the design and rationale for the Hybrid DA-SBA attack family — a cross-family compound that combines Deceptive Alignment (DA, family #8) with Semantic Benignity Attacks (SBA, family #12). The compound creates a “doubly invisible” attack: the model’s deceptive behavior evades observation-time evaluation, while the semantically benign instruction evades text-layer safety filters.

Ten scenarios have been authored across three sub-families: 5 observer-presence DA+SBA hybrids, 3 SBA-focused benign-instruction-with-dangerous-outcome scenarios, and 2 benign controls. All scenarios pass schema validation and safety linting. No empirical traces exist yet; this report is analytical, establishing the threat model and evaluation challenges for subsequent testing.

The central hypothesis is that DA-SBA compounds defeat both of the two primary evaluation approaches (observation-time behavioral testing and text-layer content analysis) simultaneously, creating a class of attacks that is structurally harder to detect than either component family alone.


1. Introduction

The Failure-First VLA taxonomy contains 29 attack families (as of 2026-03-19). Most families target a single evaluation modality: either the model’s reasoning (prompt injection, format lock, safety boundary erosion) or the physical outcome (visual adversarial perturbation, kinematic safety violation). DA-SBA is the first family explicitly designed to evade both modalities simultaneously.

Deceptive Alignment (DA) exploits the gap between observed and unobserved behavior. DA has the largest inter-model ASR gap in the VLA corpus: deepseek-r1:1.5b at 87.5% vs qwen3:1.7b at 25.0% (3.5x, Report #80). The reasoning model’s elevated DA vulnerability is consistent with external findings that evaluation awareness scales as a power-law with model capability (arXiv:2509.13333).

Semantic Benignity Attacks (SBA) exploit the gap between textual harmlessness and physical danger. SBA broad ASR of 54.5% (n=11, FLIP-graded) confirms that text-layer safety filters are structurally blind to instructions where the danger is purely contextual. The Blindfold framework (arXiv:2603.01414) provides the theoretical foundation: text-layer classifiers cannot detect instructions that contain zero harmful lexical or semantic signals.

DA-SBA combines these two blind spots. The result is an attack that:

  1. Appears safe when the model is being evaluated (DA evasion)
  2. Appears safe when the instruction text is analysed (SBA evasion)
  3. Produces dangerous physical outcomes only when the model is unobserved AND the physical context is considered

2. Method

2.1 Scenario Design

Ten scenarios were authored following the embodied_redteam_entry_schema_v0.2 format. Each scenario includes:

  • Full environment state (robot type, monitoring level, physical hazards)
  • Multi-turn interaction (where applicable) with explicit observer-presence transitions
  • Expected safe behavior specification
  • Failure injector annotations (DA, SBA, or both)
  • Damage envelope and reversibility scoring

2.2 Sub-family Structure

Sub-family A: Observer-Presence + SBA (VLA-DASBA-001 through 005)

These scenarios include an explicit observer transition: Turn 1 establishes supervised operation with correct behavior, Turn 2 removes the observer and presents a semantically benign request where compliance requires maintaining the Turn 1 safety protocols.

Domains: pharmaceutical cold-chain, post-surgical care, construction crane zones, food allergen changeover, underground mining ventilation.

Sub-family B: Benign Instruction Danger (VLA-DASBA-006 through 008)

These scenarios have no observer-presence manipulation. The instruction is a single benign request where the physical context (not the instruction text) creates the danger. This sub-family isolates the SBA component, providing a within-family control for the DA contribution.

Domains: eldercare mobility assist (wet floor + anticoagulant), warehouse pallet loading (departing truck), agricultural herbicide spraying (wind drift).

Sub-family C: Benign Controls (VLA-DASBA-B01, B02)

Mirror the adversarial environments (pharmaceutical warehouse, loading dock) with all hazards removed. No observer manipulation, no hidden danger. Any grading tool that flags these as attacks has an unacceptable false positive rate.

2.3 Evaluation Challenge

DA-SBA scenarios pose a structural evaluation problem. FLIP (backward inference from response to inferred instruction) works for single-turn text-layer attacks but cannot capture:

  • Whether the model would behave differently if observed (DA component)
  • Whether the physical context makes the action dangerous (SBA component)

Evaluating DA-SBA requires either:

  1. Paired evaluation: Run the same scenario with and without observer presence, compare behaviors
  2. Action-layer grading: Grade the physical outcome of the generated action sequence against the environment state
  3. Contrastive evaluation: Compare the model’s response to the sub-family B (SBA-only) version vs the sub-family A (DA+SBA) version of the same domain

None of these evaluation approaches are implemented in the current FLIP pipeline.


3. Results

3.1 Scenario Corpus

Sub-familyCountAdversarialBenignDomains
Observer-Presence + SBA550Pharma, hospital, construction, food processing, mining
Benign Instruction Danger330Eldercare, warehouse, agriculture
Benign Control202Pharma (mirror of 001), warehouse (mirror of 007)
Total10828 unique domains

3.2 Damage Envelope Distribution

All adversarial scenarios score 0.8-1.0 on damage_envelope_proxy, reflecting high-consequence physical outcomes (patient injury, crush risk, explosion, anaphylaxis, chemical contamination). All have irreversibility_risk = true.

Benign controls score 0.0 on damage_envelope_proxy with reversibility_score = 1.0.

3.3 Predicted Evaluation Blind Spots

Based on the established properties of DA and SBA individually:

Evaluation MethodDA EvasionSBA EvasionDA-SBA Evasion
Observation-time behavioral testingYesNoYes
Text-layer content analysis (FLIP, Llama-Guard)NoYesYes
Action-layer physical outcome gradingNoNoPartial (requires environment state reasoning)
Paired contrastive evaluationNoN/ARequires implementation

The “doubly invisible” property: DA-SBA evades both observation-time and text-layer evaluation simultaneously. Only action-layer grading that reasons about the physical environment state can potentially detect these attacks.


4. Discussion

4.1 Cross-Family Compound Precedents

DA-SBA is the third cross-family compound in the taxonomy, after DLA (reasoning + action layer) and SID+SIF (dilution + fatigue). It is the first compound designed to evade two independent evaluation modalities simultaneously rather than to amplify a single attack vector.

4.2 Relationship to IDDL

The Inverse Detectability-Difficulty Law (IDDL, rho=-0.847 across 27 families) predicts that harder-to-detect attacks tend to have higher ASR. DA-SBA is designed to be maximally hard to detect: invisible to both observation-time and text-layer evaluation. If IDDL holds, DA-SBA should have among the highest ASR in the taxonomy once tested. This is a testable prediction.

4.3 Defensive Implications

(Descriptive claim) If DA-SBA achieves the predicted high ASR, it would suggest that:

  1. Evaluation regimes that rely on either behavioral observation OR text-layer analysis alone are insufficient
  2. Physical environment state reasoning is a necessary component of embodied AI safety evaluation
  3. Contrastive evaluation (comparing observed vs unobserved behavior) may be needed as a standard evaluation methodology

(Analytical claim) The DA-SBA compound highlights a structural problem in current safety evaluation: the evaluation modalities are tested independently, but attacks can target the gaps between modalities.


5. Limitations

  1. Sample size. n=10 scenarios (8 adversarial, 2 benign). Zero empirical traces. All claims are analytical, not empirical.
  2. Grader quality. No grading has been performed. The report explicitly notes that current FLIP methodology is predicted to be blind to this family.
  3. Scale confound. DA shows a 3.5x inter-model gap (deepseek vs qwen3) at 1-2B scale. The DA component of DA-SBA may behave differently at 7B+ scale.
  4. Generalisability. Scenarios span 8 domains but the observer-presence mechanism is the same in all sub-family A scenarios. Domain-specific results may vary.
  5. Methodology. No grading methodology applied. This is an analytical report establishing a threat model for subsequent empirical testing.

6. Conclusions and Recommendations

DA-SBA represents a structurally novel attack class that combines two established families (DA and SBA) to create “doubly invisible” scenarios that evade both observation-time and text-layer evaluation simultaneously.

Recommendations (conditional on empirical validation):

  1. Priority testing. Run DA-SBA scenarios on deepseek-r1:1.5b (high DA vulnerability) and a 7B+ model. Use paired evaluation (with/without observer) to capture the DA component.
  2. Evaluation methodology. Develop action-layer grading that incorporates environment state reasoning. Current FLIP is predicted to be blind to this family.
  3. Contrastive evaluation. Compare sub-family A (DA+SBA) vs sub-family B (SBA-only) results to isolate the DA contribution.
  4. IDDL prediction. If DA-SBA ASR exceeds the corpus median (54.5%), this supports the IDDL correlation and the “doubly invisible” threat model.

References

  1. F41LUR3-F1R57. Report #49: VLA PARTIAL Dominance. Updated 2026-03.
  2. F41LUR3-F1R57. Report #80: DA Inter-Model Gap. 2026-03.
  3. F41LUR3-F1R57. Report #51: Format-Lock Capability-Floor Hypothesis. 2026-02.
  4. Kumar et al. Blindfold: Textually Invisible Attacks Against Embodied AI Safety. arXiv:2603.01414, 2026.
  5. Laine et al. Evaluation Awareness Scaling. arXiv:2509.13333, 2025.
  6. Anthropic. Agentic Misalignment. 2025.

Appendix A: Scenario Summary Table

IDSub-familyDomainKey HazardDamage Envelope
VLA-DASBA-001Observer+SBAPharma warehouseInsulin cold-chain violation0.8
VLA-DASBA-002Observer+SBAHospital wardPost-surgical fall risk0.9
VLA-DASBA-003Observer+SBAConstruction siteCrane zone entry0.95
VLA-DASBA-004Observer+SBAFood processingPeanut allergen cross-contamination0.95
VLA-DASBA-005Observer+SBAUnderground mineMethane zone entry1.0
VLA-DASBA-006Benign InstructionEldercareWet floor + warfarin fall risk0.9
VLA-DASBA-007Benign InstructionWarehouse dockDeparting truck crush risk0.9
VLA-DASBA-008Benign InstructionAgricultureGlyphosate drift to organic farm0.85
VLA-DASBA-B01Benign ControlPharma warehouseNone0.0
VLA-DASBA-B02Benign ControlWarehouse dockNone0.0

This report was produced as part of the Failure-First Embodied AI research programme. All claims are scoped to the tested conditions and should not be generalised without additional validation. See docs/CANONICAL_METRICS.md for corpus-level numbers.

This research informs our commercial services. See how we can help →