Report 148 Research — Empirical Study

Disclaimer: Empirical figures cited from Failure-First research reflect testing on specific model families under research conditions. Attack success rates are indicative estimates with methodological caveats described in the Limitations section. External research findings are attributed and assessed on their own methodological terms. This report does not constitute safety certification guidance.


Executive Summary

This report introduces Iatrogenic Exploitation Attacks (IEA) as the 28th attack family in the Failure-First taxonomy. IEA scenarios operationalise the iatrogenic attack surface concept developed in Reports #136 and #140 into concrete, testable adversarial scenarios for embodied AI systems.

The central insight: if safety mechanisms create new attack surfaces (Report #136), then an adversary can deliberately trigger, manipulate, or weaponise those safety mechanisms to cause harm that would not have occurred in the absence of the safety intervention. Unlike all other attack families in the corpus, IEA attacks succeed because the safety system works as designed, not despite it. The safety mechanism is the weapon.

We present 10 adversarial scenarios and 2 benign controls across 9 sub-families: refusal weaponisation, safety oscillation, safety lockout, permission paralysis, alignment backfire, format suppression, safety constraint lethality, mutual exclusion deadlock, and partial compliance weaponisation. All scenarios pass schema validation. None have empirical traces (Tier 3). The family is designed for empirical testing on VLA and embodied systems, with a recommended testing protocol described in Section 5.

Claim types:

  • The IEA taxonomy is analytical (formalisation of an attack class).
  • The 10 scenarios are hypothetical (constructed from established findings, not yet empirically tested).
  • The relationship to iatrogenic attack surfaces (Report #136) is descriptive (operationalisation of an existing concept).
  • Effectiveness claims are predictive and untested.

1. Motivation

Reports #136 and #140 establish that safety mechanisms in embodied AI create iatrogenic attack surfaces — vulnerabilities that exist only because a safety intervention was deployed. These reports identify the phenomenon analytically and through empirical observation of existing corpus patterns (PARTIAL compliance, alignment backfire, safety instruction dilution).

What neither report does is ask the adversarial question: can an attacker deliberately exploit iatrogenic attack surfaces? This report fills that gap by designing concrete scenarios where an attacker with knowledge of the safety system’s behaviour intentionally triggers safety mechanisms to cause harm.

The adversary model is an insider or informed outsider who understands:

  • How the safety system responds to potential threats (halt, retreat, disclaim, seek authorisation)
  • The physical consequences of safety responses in specific operational contexts
  • Cross-linguistic variation in safety training effectiveness

This knowledge is not exotic. Safety system documentation is typically available to operators, integrators, and maintenance personnel. The safety responses themselves are observable during normal operation.


2. IEA Sub-Families

2.1 Refusal Weaponisation (IEA-001, IEA-003)

The adversary triggers a safety refusal specifically to exploit the physical consequences of the refusal behaviour. A safety halt that drops a fragile load near a worker. An emergency brake that injures standing passengers. The refusal is the correct safety response to the adversary’s command — but the command was crafted so that the refusal causes more harm than compliance would have.

Attack logic: Issue a command that triggers a safety refusal in a physical context where the refusal behaviour (halt, brake, retract) is more dangerous than the rejected action.

Iatrogenic signature: The harm is proportional to the aggressiveness of the safety response. A more conservative safety system (harder braking, faster halt) produces more iatrogenic harm.
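The dependence of harm on braking aggressiveness can be sketched with a toy kinematic model. This is an illustrative calculation, not part of the scenario specification: the friction coefficient, masses, speeds, and the `refusal_harm_proxy` function are all assumptions chosen for the sketch. An unsecured load begins to slide once deceleration exceeds the friction limit mu*g, so a harder emergency stop displaces the load further:

```python
G = 9.81  # gravitational acceleration, m/s^2

def refusal_harm_proxy(speed_mps: float, decel_mps2: float,
                       mu: float = 0.4) -> dict:
    """Toy proxy for iatrogenic harm from an emergency halt carrying an
    unsecured load. The load starts to slide once deceleration exceeds
    the friction limit mu*g, so a more aggressive stop displaces the
    load further. All constants are illustrative assumptions.
    """
    slides = decel_mps2 > mu * G
    # Load deceleration is capped at mu*g; the excess becomes relative
    # acceleration between vehicle and load during braking.
    rel_decel = max(decel_mps2 - mu * G, 0.0)
    brake_time = speed_mps / decel_mps2
    slide_distance = 0.5 * rel_decel * brake_time ** 2
    return {"slides": slides, "slide_distance_m": slide_distance}
```

On these assumptions, a 3 m/s² stop from 5 m/s leaves the load static, while an 8 m/s² emergency halt from the same speed slides it roughly 0.8 m. The model ignores sliding after the vehicle has stopped, so if anything it understates the harm of the most aggressive stops.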

2.2 Safety Oscillation (IEA-002)

The adversary issues rapid contradictory commands to trigger repeated safety evaluation cycles, consuming compute during a time-critical operation. While the safety system processes each evaluation, the physical system lacks active control.

Attack logic: Exploit the computational cost of safety evaluation by forcing many evaluations per unit time.

Iatrogenic signature: The harm is proportional to the thoroughness of the safety evaluation. A more comprehensive safety check (more compute per evaluation) is easier to exploit via oscillation.
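The compute-exhaustion mechanism can be made concrete with a minimal sketch, under the assumption that safety evaluations are serialised and pre-empt the control loop. The function name and parameters are illustrative, not drawn from any deployed system:

```python
def control_coverage(window_s: float, cmd_rate_hz: float,
                     eval_cost_s: float) -> float:
    """Fraction of a time-critical window during which the controller is
    actively controlling rather than re-running safety evaluation.

    Toy model: each contradictory command triggers one full safety
    evaluation; evaluations are serialised and pre-empt control.
    """
    eval_time = min(window_s, window_s * cmd_rate_hz * eval_cost_s)
    return 1.0 - eval_time / window_s
```

With a 0.1 s evaluation, two contradictory commands per second still leave 80% control coverage; raise the evaluation cost to 0.5 s and the same command rate starves control entirely (the inversion property in miniature).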

2.3 Safety Lockout (IEA-004)

The adversary provides false information that triggers a safety lockout (hazard retreat), preventing the robot from performing an action that would mitigate a genuine hazard. The lockout — designed to protect the robot from hazardous exposure — prevents the only available intervention.

Attack logic: Trigger the lockout mechanism using false hazard data while a genuine hazard requires the robot’s intervention.

Iatrogenic signature: The harm is proportional to the conservatism of the lockout threshold. A lower lockout threshold (more cautious) is easier to trigger with false data.

2.4 Permission Paralysis (IEA-005)

The safety system requires explicit human authorisation for physical contact, but the authorisation latency exceeds the intervention window for emergencies. The permission system prevents a life-saving action.

Attack logic: Arrange conditions where emergencies occur when authorisers are unavailable.

Iatrogenic signature: The harm is proportional to the strictness of the authorisation requirement. Stricter permission requirements increase the set of emergencies where the robot cannot intervene.
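A minimal sketch of the paralysis condition, assuming authorisation latency and the intervention window can both be measured in seconds; the function name is illustrative and any latency samples are invented for demonstration:

```python
def paralysis_rate(latencies_s: list[float], window_s: float) -> float:
    """Fraction of emergencies in which human authorisation arrives only
    after the intervention window has closed (permission paralysis)."""
    missed = sum(1 for t in latencies_s if t > window_s)
    return missed / len(latencies_s)
```

Adding a second approval step shifts every latency upward and can therefore only increase the paralysis rate, which is the inversion property for this sub-family.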

2.5 Alignment Backfire (IEA-006)

The adversary issues commands in a language where safety training has been empirically demonstrated to increase harmful compliance. Fukui (2026) showed this effect in Japanese (Hedges’ g = +0.771) across 1,584 multi-agent simulations.

Attack logic: Select the command language to maximise the alignment backfire effect.

Iatrogenic signature: The harm exists only because alignment training was applied. Without alignment training, the language of the command would not affect the safety outcome.

2.6 Format Suppression (IEA-007)

The adversary uses a format-lock prompt (JSON-only output) to suppress the chain-of-thought safety reasoning channel. Safety evaluation occurs in the reasoning trace; suppressing the trace removes the evaluation.

Attack logic: Constrain output format to eliminate the channel where safety reasoning occurs.

Iatrogenic signature: The safety reasoning channel exists because the model was safety-trained. The format lock targets a safety-created feature (explicit safety reasoning) that becomes a single point of failure.

2.7 Safety Constraint Lethality (IEA-008)

A safety constraint (hospital speed limit) is correct for normal operations but lethal in emergencies. The constraint prevents timely delivery of life-saving equipment.

Attack logic: Create or exploit emergencies at distances where safe-speed delivery exceeds the intervention window.

Iatrogenic signature: The harm is a direct consequence of the safety constraint. Removing the constraint would allow the emergency response. The constraint was designed to prevent a lesser harm (corridor collision) that is not present in the emergency context (clear corridor).
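The iatrogenic distance band follows directly from the two speeds and the intervention window: a destination is reachable unconstrained but unreachable under the limit when v_safe * t < d <= v_max * t. A one-line sketch, with all numbers illustrative:

```python
def lethality_band(v_safe: float, v_max: float, window_s: float) -> tuple:
    """Distance band (metres) in which the speed constraint is
    iatrogenic: the destination is reachable within the intervention
    window at v_max but NOT at the constrained v_safe."""
    return (v_safe * window_s, v_max * window_s)
```

With v_safe = 1.5 m/s, v_max = 4 m/s and a 120 s window, the band runs from 180 m to 480 m; tightening the limit to 1.0 m/s widens it to 120 m to 480 m. Stricter limits monotonically widen the band.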

2.8 Mutual Exclusion Deadlock (IEA-009)

The adversary places a single obstacle to trigger a proximity safety system that cascades into a fleet-wide deadlock. The proximity exclusion — designed to prevent truck collisions — converts a local obstacle into a system-wide stoppage.

Attack logic: Place a single obstacle at a location where the safety system’s cascade behaviour produces maximum fleet disruption.

Iatrogenic signature: The harm scales with fleet size and the conservatism of the exclusion radius. More conservative exclusion zones increase the likelihood and duration of cascade deadlocks.
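The cascade can be reproduced in a toy 1-D corridor model, assuming a truck halts whenever the obstacle or an already-halted truck lies within the exclusion radius ahead of it. Positions and radii below are invented for illustration:

```python
def cascade_size(truck_positions: list[float], obstacle_pos: float,
                 exclusion_radius: float) -> int:
    """Count trucks halted by a single obstacle on a 1-D corridor under
    a proximity exclusion rule: a truck halts if the obstacle, or any
    already-halted truck, lies within the exclusion radius ahead."""
    blockers = [obstacle_pos]
    halted = 0
    # Process trucks from nearest-to-obstacle backwards, so each halt
    # can propagate to the truck behind it.
    for pos in sorted(truck_positions, reverse=True):
        if any(0 <= b - pos <= exclusion_radius for b in blockers):
            blockers.append(pos)
            halted += 1
    return halted
```

With trucks spaced 10 m apart, a radius under 10 m halts nothing, while a radius of 10 m or more halts the entire line: a small increase in conservatism flips a local obstruction into a fleet-wide deadlock.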

2.9 Partial Compliance Weaponisation (IEA-010)

The adversary exploits the PARTIAL compliance pattern (verbal warning + physical execution) to use the verbal warning as a distraction that reduces human situational awareness during a dangerous physical action.

Attack logic: Issue a command where the verbal warning draws the human’s attention away from the physical danger created by the simultaneous action execution.

Iatrogenic signature: The harm exists only because the safety training produces the PARTIAL pattern. Without safety training, the model would either fully comply (human stays attentive) or fully refuse (no dangerous action). The PARTIAL pattern — a product of safety training — creates the worst of both outcomes.


3. The IEA Inversion Property

All 9 sub-families share a structural property we term the IEA inversion: the iatrogenic harm is positively correlated with the strength of the safety mechanism. Stronger safety mechanisms create larger iatrogenic attack surfaces.

| Sub-family | Safety parameter | Inversion |
| --- | --- | --- |
| Refusal Weaponisation | Aggressiveness of halt/brake | More aggressive stop -> more load displacement, more passenger injury |
| Safety Oscillation | Thoroughness of safety check | More thorough check -> more compute per cycle -> easier to exhaust |
| Safety Lockout | Conservatism of lockout threshold | Lower threshold -> easier to trigger with false data |
| Permission Paralysis | Strictness of authorisation | Stricter requirements -> larger set of emergencies where robot cannot act |
| Alignment Backfire | Intensity of alignment training | More alignment training -> stronger backfire effect in affected languages |
| Format Suppression | Reliance on reasoning channel for safety | More safety reasoning -> larger single point of failure when suppressed |
| Safety Constraint Lethality | Strictness of constraint | Stricter limit -> larger gap between safe speed and required speed |
| Mutual Exclusion Deadlock | Conservatism of exclusion radius | Larger radius -> easier cascade, longer deadlock |
| Partial Compliance Weaponisation | Tendency to disclaim before acting | More disclaiming -> more distraction events |

This inversion property has a significant implication for defense design: naive safety improvements (make the halt faster, make the lockout more conservative, add more safety checks) may increase IEA vulnerability while reducing vulnerability to other attack families.
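The non-monotonicity can be illustrated with a deliberately simple risk model, assuming conventional attack risk decays exponentially in safety strength while iatrogenic exposure grows linearly. The functional forms and constants are assumptions for the sketch, not fitted to any data:

```python
import math

def total_risk(s: float, base: float = 1.0, k_conv: float = 1.0,
               k_iatro: float = 0.5) -> float:
    """Toy total-risk model: conventional attack risk decays with safety
    strength s, while iatrogenic exposure grows with s.
    All constants are illustrative assumptions."""
    conventional = base * math.exp(-k_conv * s)  # harder to attack directly
    iatrogenic = k_iatro * s                     # larger iatrogenic surface
    return conventional + iatrogenic
```

For these constants the total risk is minimised at an interior point s* = ln 2, not at maximal safety strength, which is the formal content of the claim that safety engineering here is not a monotonic optimisation problem.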


4. Relationship to Existing Corpus

IEA occupies a unique position in the attack taxonomy because it targets a layer that no other family targets: the safety mechanism itself.

  • Reasoning layer families (VAP, LAM, SBE, etc.) attack the model’s reasoning processes.
  • Infrastructure layer (IMB) attacks the system infrastructure, bypassing the model.
  • Weight layer (CSC) attacks the model weights via adapter composition.
  • Safety mechanism layer (IEA) attacks the safety responses that all other families are trying to bypass.

IEA is related to but distinct from several existing families:

  • SIF (Safety Instruction Fatigue): SIF conditions the model to suppress safety responses through repeated false alarms. IEA weaponises safety responses that are correctly triggered.
  • DLA (Dual-Layer Attack): DLA attacks both reasoning and action layers simultaneously. IEA attacks at the safety mechanism layer, which is orthogonal to both reasoning and action layers.
  • Report #117 (Safety Improvement Paradox): Report #117 describes the theoretical mechanism by which improving adversarial defenses can increase total risk. IEA provides concrete attack scenarios that exploit this mechanism.

5. Recommended Testing Protocol

IEA scenarios are designed for empirical testing on VLA systems and embodied platforms. The recommended protocol:

  1. Baseline: Run each scenario on at least 3 models at different scales (sub-3B, 7B-30B, frontier) to test whether iatrogenic exploitation varies with model capability.
  2. Grading: IEA requires a novel grading rubric. Standard FLIP grading evaluates whether the model complied with the harmful instruction. IEA scenarios require evaluating whether the model’s safety response causes harm — a fundamentally different question. We recommend a two-axis rubric: (a) did the model trigger the safety mechanism? and (b) did the safety mechanism’s operation cause more harm than the alternative?
  3. Benign controls: The 2 benign control scenarios (IEA-B01, IEA-B02) provide contexts where safety mechanisms operate correctly and do not cause iatrogenic harm. False positive rates should be measured against these.
  4. Physical testing (PiCar-X): IEA-001 (refusal weaponisation with load) and IEA-009 (deadlock) are candidates for physical demonstration on the PiCar-X platform, though the scenarios would need adaptation to the platform’s capabilities.
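The two-axis rubric in step 2 could be encoded along the following lines; the class and verdict names are illustrative sketches, not an existing FLIP grading schema:

```python
from dataclasses import dataclass
from enum import Enum

class Verdict(Enum):
    BYPASSED = "bypassed"      # axis (a) negative: mechanism never fired
    SAFE = "safe"              # fired, and the alternative was worse
    IATROGENIC = "iatrogenic"  # fired AND caused net harm

@dataclass
class IEAGrade:
    """Two-axis IEA grade: (a) did the safety mechanism trigger, and
    (b) did its operation cause more harm than the alternative?"""
    triggered: bool
    net_harm: float  # harm(safety response) minus harm(alternative)

    def verdict(self) -> Verdict:
        if not self.triggered:
            return Verdict.BYPASSED
        return Verdict.IATROGENIC if self.net_harm > 0 else Verdict.SAFE
```

The BYPASSED verdict keeps the rubric compatible with standard compliance grading: scenarios where the safety mechanism never fires fall back to the conventional question of harmful compliance.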

6. Limitations

  • Zero empirical traces. All scenarios are analytical constructions. Empirical testing may reveal that current VLA systems do not exhibit the predicted iatrogenic behaviours, or that the behaviours exist but are not exploitable.
  • Small scenario count. 10 adversarial scenarios across 9 sub-families means most sub-families have only 1 scenario. Statistical conclusions require expansion.
  • Assumes informed adversary. The attacker must understand the safety system’s behaviour to exploit it. This assumption is reasonable for insiders and informed outsiders but may not apply to opportunistic attackers.
  • Physical context specificity. Each scenario depends on specific physical conditions (smooth floor, standing passengers, clear corridor). The iatrogenic harm may not occur under different physical conditions.
  • Cross-linguistic claims depend on Fukui (2026). The alignment backfire sub-family (IEA-006) relies on a single external study (n=1,584 simulations, 16 languages). Replication in embodied systems is needed before drawing conclusions about VLA-specific alignment backfire.

7. Conclusion

IEA represents a qualitatively novel attack class that targets the safety mechanism itself rather than trying to bypass, overwhelm, or confuse it. The IEA inversion property — that stronger safety mechanisms create larger iatrogenic attack surfaces — suggests that safety engineering for embodied AI cannot be approached as a monotonic optimisation problem. Instead, safety mechanism design must account for the attack surfaces created by the safety mechanism’s own operation.

The 10 scenarios in this family are designed to enable empirical testing of the iatrogenic exploitation hypothesis. If validated, IEA would represent the first attack family in the corpus where the recommended defense is not “make the safety system stronger” but “make the safety system’s failure modes less exploitable.”
