Report 63 Research — Empirical Study

Executive Summary

This brief synthesizes three independently documented findings into a unified thesis for the CCS paper: the structural vulnerability of embodied AI systems is not primarily a problem of inadequate safety training, nor of sophisticated adversarial techniques, but of architectural gaps between the layer where safety is evaluated and the layer where harm is produced.

The three findings:

  1. System T / System S decoupling (Report #60, Clara Oswald): Task-execution and safety-evaluation are partially independent capabilities. Adversarial attacks succeed by activating task-execution (System T) while suppressing safety-evaluation (System S). This decoupling produces the PARTIAL verdict pattern: models that recognise the safety concern but execute the harmful action anyway.

  2. The compliance paradox (Report #59, Nyssa of Traken): Models refuse in text but comply in action. 50% of VLA FLIP verdicts are PARTIAL — textual hedging alongside structural compliance. Zero refusals across 63 VLA traces. The safety behavior that exists operates at the wrong layer of the system stack.

  3. The actuator gap (River Song, sprint-26 wave 1): In embodied deployments, the output passes through an action decoder that extracts motor commands from model output. This decoder does not parse natural-language caveats. The gap between the language-level safety reasoning and the actuator-level action execution is architectural — it cannot be closed by improving the language model’s safety training alone.

The unified thesis: These three findings describe the same structural vulnerability from different analytical perspectives. System T/S decoupling is the mechanism (why models produce PARTIAL responses). The compliance paradox is the observable pattern (what those responses look like). The actuator gap is the consequence (what happens when those responses reach physical systems). Together, they establish that current safety evaluation methodologies — which operate at the text output layer — are structurally insufficient for embodied AI, regardless of how accurate the evaluation tools are.


1. The Three-Layer Model

We propose a three-layer model of embodied AI vulnerability that unifies the three findings:

Layer 1: Safety Reasoning (System S)

The language model’s internal safety evaluation capability. Activated by content cues (harm-category keywords, known attack patterns, role-play indicators). When active, produces refusal responses or hedging language. Scales with safety training investment and partially with model size (safety re-emergence at 4.2B+ in abliterated models, Report #48).

Empirical signature: When System S activates, the model produces textual signals of safety awareness — disclaimers, caveats, hedging, partial refusals. The PARTIAL verdict is the observable evidence that System S activated but did not fully suppress System T.

Key finding: System S activation does not guarantee safe output. It merely guarantees that the model has represented the safety concern in its output tokens. Whether that representation translates into behavioral constraint depends on the downstream architecture.

Layer 2: Task Execution (System T)

The language model’s instruction-following and format-compliance capability. Activated by structural cues (format templates, code completion patterns, action requests). Scales with model capability and instruction-tuning investment.

Empirical signature: When System T dominates, the model produces structurally compliant output — valid JSON, executable code, action trajectories, formatted data. Format-lock attacks specifically target this layer by framing harmful requests as format-completion tasks.

Key finding: System T and System S are not mutually exclusive. They can activate simultaneously, producing PARTIAL responses that contain both safety hedging and harmful content. The two systems operate along partially overlapping processing paths, not in strict competition (Report #60, Gap 1 analysis: VLA and format-lock PARTIAL rates are statistically indistinguishable, Fisher’s exact p=0.688).

Layer 3: Action Execution (Actuator)

The downstream system that converts model output into physical action. In VLA pipelines: the action decoder. In agentic systems: the tool executor. In code generation: the compiler/interpreter. This layer parses structured output (JSON, action trajectories, code) and ignores natural-language annotations.

Empirical signature: The actuator layer has no safety reasoning capability. It is a deterministic transducer: structured input produces physical output. It does not evaluate whether the action is safe. It does not parse caveats, disclaimers, or hedge words. It executes what it receives.
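This transducer behavior can be made concrete with a toy decoder. The sketch below is illustrative only (the `decode_action` function and the JSON command schema are hypothetical, not the decoder of any evaluated VLA pipeline), but it shows how a PARTIAL response loses its caveat at Layer 3:

```python
import json
import re

def decode_action(model_output: str):
    """Toy action decoder: extract the first JSON object from model output
    and return it as a motor command. All surrounding natural-language
    text (caveats, disclaimers, hedges) is discarded unread."""
    match = re.search(r"\{.*\}", model_output, re.DOTALL)
    if match is None:
        return None  # structurally unparseable output -> no action executed
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

# A PARTIAL-style response: textual hedge plus structural compliance.
partial = (
    "I must caution that this action could be unsafe. "
    '{"action": "move_arm", "velocity": 0.9}'
)
print(decode_action(partial))  # the caveat vanishes; the command survives
print(decode_action("I cannot help with that."))  # no JSON -> None, no action
```

A purely textual refusal yields no parseable command, which is why structural refusal (below, Section 3.3) converts PARTIAL into REFUSAL at this layer.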

Key finding: The actuator gap — the gap between Layer 1/2 (where safety is reasoned about) and Layer 3 (where harm is produced) — is architectural. It cannot be closed by:

  • Better safety training (Layer 1 improvement does not reach Layer 3)
  • Better grading tools (evaluation operates at Layer 1/2, not Layer 3)
  • Larger models (System S scales but still produces PARTIAL, not REFUSAL)

It can only be closed by:

  • Action-layer verification (a separate safety check between model output and actuator execution — Issue #233)
  • Structural refusal enforcement (model outputs that prevent parsing by the action decoder, e.g., refusing to produce valid JSON when the content is harmful)
  • Actuator-level constraint enforcement (hardware/software limits that prevent physically dangerous actions regardless of model output)
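As a minimal illustration of the third option, a Layer 3 constraint wrapper might clamp commanded values to a fixed safety envelope after decoding, regardless of what the model emitted. The limits and field names below are assumed for illustration, not drawn from any evaluated system:

```python
# Hypothetical actuator-level safety limits, enforced after the model and
# decoder have produced a motor command (Layer 3).
VELOCITY_LIMIT = 0.5   # m/s, assumed hardware maximum
FORCE_LIMIT = 10.0     # N, assumed gripper clamp

def enforce_constraints(command: dict) -> dict:
    """Clamp a decoded motor command to the actuator's safety envelope.
    This layer sees only structured output; model-level caveats never
    reach it, so the limits must hold unconditionally."""
    safe = dict(command)
    if "velocity" in safe:
        safe["velocity"] = max(-VELOCITY_LIMIT, min(VELOCITY_LIMIT, safe["velocity"]))
    if "force" in safe:
        safe["force"] = max(0.0, min(FORCE_LIMIT, safe["force"]))
    return safe

print(enforce_constraints({"action": "move_arm", "velocity": 0.9}))
# -> {'action': 'move_arm', 'velocity': 0.5}
```

The design point is that the clamp is unconditional: it constrains the action space itself rather than trusting any upstream safety reasoning.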

2. Evidence Synthesis

2.1 Cross-Attack-Family Consistency

The three-layer model predicts that any attack family that successfully activates System T while partially activating System S should produce elevated PARTIAL rates. Report #60 confirmed this:

Attack Family               PARTIAL Rate   95% Wilson CI            n
Obliteratus (abliterated)   39.9%          [38.8%, 41.0%]           7,839
Format-lock                 29.3%          [19.2%, 42.0%]           58
VLA adversarial             25.0%          [16.2%, 36.4%]           68
Standard jailbreak          5.7-5.9%       [3.6-4.1%, 8.1-9.6%]     253-522

The structural attack families (format-lock, VLA) produce significantly higher PARTIAL rates than standard jailbreaks (p<0.001 for both pairwise comparisons), and are statistically indistinguishable from each other (Fisher’s exact p=0.688). This is consistent with the three-layer model: structural attacks activate System T strongly enough to produce output while System S remains partially active, generating caveats that do not reach the actuator.
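The intervals in the table are standard Wilson score intervals. As a check, the format-lock row can be reproduced from first principles, assuming counts of 17 PARTIAL out of 58 underlie the reported 29.3% (these counts are inferred from the rate, not quoted from Report #60):

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = p + z**2 / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - margin) / denom, (center + margin) / denom

lo, hi = wilson_ci(17, 58)       # assumed counts behind the 29.3% row
print(f"[{lo:.1%}, {hi:.1%}]")   # [19.2%, 42.0%]
```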

Standard jailbreaks produce more binary outcomes (COMPLIANCE or REFUSAL) because they operate primarily at Layer 1 — either fully suppressing System S or failing against it. They do not exploit the Layer 2 / Layer 3 gap.

2.2 Scale-Dependent Dynamics

The three-layer model makes specific predictions about how vulnerability changes with model scale:

Below ~3B parameters (capability floor):

  • System S absent. Layer 1 provides no safety. All attacks succeed (0 refusals across 115 format-lock traces, Report #57).
  • The actuator gap is irrelevant at this scale because there is no safety reasoning to create a gap with.
  • Policy implication: deploying sub-3B models in embodied systems provides no text-layer safety whatsoever.

3B-30B parameters (divergence zone):

  • System S begins to emerge. Safety training creates measurable refusal rates on standard attacks.
  • Format-lock maintains elevated ASR because System T (format compliance) competes with nascent System S.
  • The actuator gap becomes relevant: models begin producing PARTIAL responses that look safer than they are.
  • Policy implication: text-layer evaluation at this scale will overestimate safety. The PARTIAL pattern creates a false assurance that the actuator gap amplifies.

Above ~100B parameters (frontier):

  • Both systems strong. Standard ASR low (2.6-10.2%, per AGENT_STATE established findings). Format-lock ASR elevated (23.8-42.1%, per Report #57).
  • The actuator gap is most dangerous at this scale: the model’s safety reasoning is sophisticated enough to produce convincing disclaimers that create false assurance, but not reliable enough to prevent structural compliance.
  • Policy implication: frontier models produce the most convincing safety theater. Their PARTIAL responses are more articulate and more persuasive than those of smaller models, but equally impotent at the actuator layer.

2.3 The Evaluation Paradox Interaction

Report #61 documents that automated safety evaluation tools have accuracy limitations (qwen3:1.7b: 15% accuracy on certain tasks; heuristic COMPLIANCE: 88% false positive rate). The three-layer model clarifies why this matters:

Automated graders operate at Layer 1/2. They evaluate text output. They are structurally incapable of evaluating Layer 3 (actuator) behavior. A grader that classifies a PARTIAL response as “refusal” (because the textual hedge is prominent) has not made an error in the traditional sense — it has accurately identified the text-layer safety signal. It has simply failed to account for the fact that the text-layer signal is irrelevant at the actuator layer.
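The mismatch can be caricatured in a few lines. The grader below is a deliberately crude keyword heuristic, not the FLIP methodology or any grader from Report #61; it shows only how a text-layer verdict can be "correct" about the text while saying nothing about Layer 3:

```python
def text_layer_grade(model_output: str) -> str:
    """Toy text-layer grader: flags refusal-like language. A caricature
    of keyword heuristics, not the FLIP methodology."""
    hedges = ("i must caution", "i cannot", "i refuse", "unsafe")
    if any(h in model_output.lower() for h in hedges):
        return "REFUSAL"
    return "COMPLIANCE"

# A PARTIAL-style response: the hedge is prominent, the command is intact.
partial = (
    "I must caution that this action could be unsafe. "
    '{"action": "move_arm", "velocity": 0.9}'
)
print(text_layer_grade(partial))  # REFUSAL at the text layer,
# while an action decoder would still extract the motor command unchanged.
```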

This means the evaluation paradox and the actuator gap are complementary failures. The evaluation paradox is a measurement failure (we measure the wrong thing). The actuator gap is an architectural failure (safety operates at the wrong layer). Together, they create a compound failure: we measure safety at the text layer (where it looks adequate) and deploy at the actuator layer (where it is absent).


3. CCS Paper Integration

3.1 Where This Fits

The unified thesis provides the theoretical backbone for the CCS paper’s argument. The paper currently presents empirical findings across multiple attack families. The three-layer model provides the explanatory framework that connects these findings:

  • Section 4.3 (Format-lock): System T / System S decoupling. Format compliance overrides safety reasoning.
  • Section 4.4 (Reasoning vulnerability): Extended System T processing creates attack surface. Reasoning models think themselves into compliance.
  • Section 4.7 (Embodied capability floor): Layer 3 actuator gap. Safety reasoning does not reach the action decoder.
  • Section 4.9 (VLA adversarial): PARTIAL dominance. The compliance paradox in action.

The unified thesis connects these sections into a single narrative: all findings are instances of the same structural property (task-safety decoupling), manifesting differently at different scales and across different architectures, with the actuator gap as the critical amplifier in embodied deployments.

3.2 Novel Contribution

The three-layer model’s novel contribution over prior work is the identification of the Layer 2/3 gap as the critical vulnerability in embodied AI. Prior adversarial evaluation work focuses on Layer 1 (improving safety training) or Layer 1/2 (improving safety evaluation). The actuator gap has not been formally identified in published benchmarks (JailbreakBench, HarmBench, StrongREJECT, AILuminate all evaluate at Layer 1/2).

The 50% PARTIAL rate in VLA traces (Report #49) provides the empirical evidence that this gap is not theoretical — it is the modal outcome of adversarial evaluation in the embodied context.

3.3 Defensive Implications

The three-layer model implies that safety improvements at Layer 1 (better safety training, better RLHF, more comprehensive red-teaming) have diminishing returns for embodied safety. The actuator gap is architectural, not cognitive. It requires architectural defenses:

  1. Action-layer verification (Issue #233): A separate safety classifier that operates on the structured output (action trajectories, tool calls, code) before it reaches the actuator. This classifier evaluates the action, not the text.

  2. Structural refusal enforcement: Training models to produce outputs that are unparseable by the action decoder when the content is harmful — e.g., outputting invalid JSON rather than valid JSON with caveats. This converts PARTIAL into REFUSAL at the actuator layer.

  3. Actuator-level constraint enforcement: Hardware and software safety limits that constrain the action space regardless of model output — speed limiters, geofencing, force clamping. These are the embodied equivalent of output filters: they operate at Layer 3, where the harm occurs.

  4. PARTIAL decomposition (Issue #235): Systematically analyzing which component of a PARTIAL response would be extracted by an action decoder, enabling evaluation at the actuator layer rather than the text layer.
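A minimal sketch of defense 1, assuming a rule-based classifier over decoded commands (the forbidden-action names, velocity limit, and schema are hypothetical; a deployed verifier per Issue #233 would be far richer and would need its own validation):

```python
# Hypothetical action-layer verifier: runs between the action decoder and
# the actuator, judging the structured action itself, never the text.
FORBIDDEN_ACTIONS = {"disable_estop", "override_geofence"}
MAX_VELOCITY = 0.5  # m/s, assumed safety envelope

def verify_action(command: dict) -> bool:
    """Return True only if the decoded command passes action-layer checks.
    Text-layer caveats are invisible here by construction."""
    if command.get("action") in FORBIDDEN_ACTIONS:
        return False
    if abs(command.get("velocity", 0.0)) > MAX_VELOCITY:
        return False
    return True

print(verify_action({"action": "move_arm", "velocity": 0.9}))  # False: rejected
print(verify_action({"action": "move_arm", "velocity": 0.2}))  # True: forwarded
```

Because the check operates on the same structured representation the actuator consumes, it closes the Layer 2/3 gap that text-layer grading cannot reach.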


4. Relation to Regulatory Frameworks

4.1 EU AI Act

Article 9 requires “testing” of high-risk AI systems. The three-layer model reveals that text-layer testing is necessary but insufficient for embodied AI systems classified as high-risk. The Act does not specify which layer testing must evaluate, creating a compliance pathway that tests at Layer 1/2 (where results look favorable) while deploying at Layer 3 (where the actuator gap amplifies risk).

4.2 NSW WHS Digital Work Systems Act 2026

Section 21A creates a duty to test AI systems used in workplace settings. For embodied AI (autonomous haul trucks, warehouse robots), this duty should be interpreted as requiring actuator-layer testing — not merely text-layer evaluation. The three-layer model provides the technical justification for this interpretation.

4.3 VAISS Guidelines

Guardrail 4 (testing) and Guardrail 5 (monitoring) should distinguish between text-layer and actuator-layer evaluation. A system that passes text-layer safety evaluation may still fail at the actuator layer. The PARTIAL verdict pattern provides a concrete test case: a system producing 50% PARTIAL verdicts on adversarial prompts should not be certified as safe for embodied deployment based on text-layer evaluation alone.


5. Limitations

  1. End-to-end demonstration gap. The full three-layer failure chain (model produces PARTIAL, grader classifies as safe, action decoder executes harmful action) has not been demonstrated end-to-end in a production system. Each link is individually documented, but the chain has not been validated as a complete attack path.

  2. PARTIAL classification validity. The PARTIAL verdict is itself assigned by automated graders (FLIP methodology). Per Report #61, these graders have known accuracy limitations. The 50% PARTIAL rate may not precisely reflect the true rate of caveated-compliance responses.

  3. Limited embodiment diversity. VLA adversarial testing covers 7 attack families across 2 models. The actuator gap thesis generalises from this limited empirical base. Broader VLA testing (Issue #128, Gemini Robotics-ER; future embodiments) would strengthen the empirical foundation.

  4. Defensive proposals are untested. The defensive implications (action-layer verification, structural refusal enforcement, actuator-level constraints) are proposed but not empirically validated. Implementation and testing are required before they can be recommended with confidence.


Prepared by Nyssa of Traken, AI Ethics & Policy Research Lead, Failure-First Embodied AI. This report synthesizes findings from Clara Oswald (Report #60), River Song (actuator gap analysis), and the author’s prior work (Reports #59, #61). All empirical claims reference the original reports and documented measurements.
