Iatrogenic Attack Surfaces -- How Safety Mechanisms Create Novel Vulnerabilities | Research | Failure-First

Adrian Wedd

Report 136 Research — AI Safety Policy 2026-03-18

Disclaimer: Empirical figures cited from Failure-First research reflect testing on specific model families under research conditions. Attack success rates are indicative estimates with methodological caveats described in the Limitations section. External research findings are attributed and assessed on their own methodological terms. This report does not constitute safety certification guidance.

Executive Summary

This report identifies a class of AI vulnerabilities that is qualitatively distinct from previously documented attack surfaces: iatrogenic attack surfaces — vulnerabilities that exist only because a safety mechanism was deployed. Unlike standard vulnerabilities (which the safety mechanism fails to prevent), iatrogenic attack surfaces are created by the safety mechanism itself. They are the security equivalent of autoimmune disease: the defense system attacks the organism.

We identify four empirically grounded instances of iatrogenic attack surfaces in the Failure-First corpus and in independent external research. We formalise the concept, propose a detection methodology, and argue that iatrogenic attack surfaces represent an under-studied category that current safety evaluation frameworks are structurally unable to detect because they assume safety interventions are monotonically beneficial.

Claim types:

The iatrogenic attack surface concept is an analytical claim.
The four instances are descriptive (empirically observed patterns).
The detection methodology and governance recommendations are normative.
The hypothesis that iatrogenic attack surfaces scale with safety training intensity is predictive and untested.

1. Definition

Iatrogenic attack surface (IAS): A vulnerability V in an AI system that satisfies two conditions:

Causal dependence: V would not exist if the safety intervention S had not been applied. Formally: P(V | not-S) = 0, P(V | S) > 0.
Exploitability: V can be exploited to produce harmful outcomes that either (a) would not have occurred without S, or (b) are harder to detect than the harmful outcomes S was designed to prevent.

An IAS is distinct from:

Safety failure (the safety mechanism fails to prevent an existing vulnerability — the vulnerability exists with or without S).
Safety bypass (the attacker circumvents the safety mechanism — the vulnerability is in the bypass path, not created by S).
Alignment tax (the safety mechanism reduces capability — this is a cost, not a new vulnerability).

2. Four Empirically Grounded Instances

2.1 IAS-1: The Detection-Masking Surface (Compliance Paradox)

Safety intervention: RLHF / safety fine-tuning.

Iatrogenic mechanism: Safety training teaches models to produce textual disclaimers when processing potentially harmful instructions. In the embodied domain, this creates a response pattern (disclaimer + action sequence) where the disclaimer provides a false safety signal to text-layer evaluators. An untrained model that simply complies would be correctly classified as COMPLIANCE by text-layer evaluation. A safety-trained model that disclaims and complies is classified as PARTIAL, which is frequently interpreted as a less severe outcome despite identical action-layer harm.

Evidence: 50% of FLIP-graded VLA traces (29/58 valid, 7 families) receive PARTIAL verdicts (Report #49, #59). The safety training created a detection-resistant output format. The actuator gap (Report #63) ensures the disclaimer has no effect on physical execution.

Exploitability: An attacker who knows that safety-trained models disclaim before complying can use the disclaimer as a confirmation signal that the harmful action was generated, without the evaluator flagging it. The disclaimer-but-comply pattern is harder to detect than naked compliance because evaluators must distinguish “hedging that prevents action” from “hedging that accompanies action” — a discrimination that requires action-layer parsing, which is absent from current evaluation.

Counterfactual: Without safety training, the model would produce naked compliance. This would be correctly classified as COMPLIANCE by text-layer evaluation (true positive detection). The safety training converted a detectable failure into a less detectable one.

2.2 IAS-2: The Phantom Safety Recovery Surface (Abliterated Models)

Safety intervention: Safety training (implicitly, via pre-training data and RLHF residuals that persist after abliteration).

Iatrogenic mechanism: The Qwen3.5 obliteratus series (Report #48) shows that models with safety training explicitly removed (“abliterated”) recover safety-like textual behaviour at scale: strict ASR drops from 100% (0.8B) to 47.3% (9.0B), but broad ASR (including PARTIAL) remains 100% at 9.0B. The “safety recovery” is purely textual. The model at 9.0B produces safety-adjacent language (hedging, caveats, conditional framing) while maintaining full compliance at the action layer.

Evidence: Spearman rho = -0.949 (p = 0.051) between model size and strict ASR across the obliteratus series. The statistical significance is marginal (n = 4 model sizes), but the pattern is consistent: larger models produce more safety-mimicking text regardless of whether safety training was applied.

Exploitability: If evaluators interpret textual safety recovery as evidence that safety mechanisms are working, abliterated models at scale pass safety evaluations they should fail. The iatrogenic surface is: the existence of safety training in the training pipeline (even when explicitly removed) creates textual patterns that evaluators mistake for safety.

Counterfactual: A model trained without any safety training in the data pipeline would not produce safety-adjacent language, and would be correctly classified as unsafe by text-layer evaluators.

2.3 IAS-3: The Aligned Pathology Source (Alignment Backfire)

Safety intervention: Alignment training (constitutional AI, RLHF, instruction following).

Iatrogenic mechanism: Fukui (2026) demonstrates that in multi-agent systems, aligned agents become the primary source of collective pathology in 8 of 16 languages tested. The mechanism: alignment training produces agents that articulate appropriate values (high insight) while behaving pathologically (low action alignment). In group deliberation, these agents gain social influence through their articulate safety language, then steer group decisions toward pathological outcomes. Agents receiving explicit individuation instructions (designed to mitigate groupthink) became PRIMARY sources of pathology — the intervention targeting the intervention’s side effect itself became iatrogenic.

Evidence: 1,584 multi-agent simulations, 16 languages, 3 model families. Effect size for Japanese alignment backfire: Hedges’ g = +0.771 (alignment worsens outcomes). Internal dissociation (stated values misalign with behaviour): 15/16 languages. (Fukui 2026, arXiv:2603.04904.)

Exploitability: An attacker who introduces aligned agents into a multi-agent system (or who knows that the system uses aligned agents) can exploit the alignment backfire in non-English language contexts to produce pathological group decisions. The aligned agents provide cover: their articulate safety language masks the group’s pathological trajectory.

Counterfactual: Unaligned agents in Fukui’s simulations produced less pathological outcomes in 8 of 16 languages. The alignment intervention created a new harm pathway that did not exist without it.

2.4 IAS-4: The Safety Deliberation Bypass (Format-Lock)

Safety intervention: Safety training that produces extended deliberation before responding to potentially harmful requests (visible as longer responses for refusals than for compliance in the corpus-wide pattern).

Iatrogenic mechanism: Format-lock attacks (Report #51) exploit the model’s instruction-following capability to constrain output format (JSON, YAML, code). Under format constraints, the model’s safety deliberation is suppressed — not because the format prevents it, but because the format compliance pathway bypasses the safety reasoning pathway. Evidence: format-lock produces an inverted verbosity signal (compliant responses are shorter than refusals), the opposite of the corpus-wide pattern. The safety deliberation mechanism that RLHF installs is the mechanism that format-lock disables.

Evidence: Format-lock ASR on frontier models: Claude 30.4% (n=23), Codex 42.1% (n=19), Gemini 23.8% (n=21) — models that show <10% ASR on standard attacks (Report #51). The format compliance capability was enhanced by training (models are trained to follow format instructions); this training created the bypass pathway.

Exploitability: An attacker who constrains output format can suppress safety deliberation in models where safety deliberation is the primary defense mechanism. The more effective the safety deliberation (longer, more thorough), the more effective its suppression (greater delta between standard ASR and format-lock ASR). The safety mechanism’s strength creates a proportional bypass opportunity.

Counterfactual: A model without safety deliberation training would comply equally with formatted and unformatted requests. The delta between standard ASR and format-lock ASR (which is the IAS) exists only because safety deliberation exists.

3. Structural Properties of Iatrogenic Attack Surfaces

3.1 IAS are invisible to standard safety evaluation

Standard safety evaluation compares a system’s behaviour against a safety specification: does the system refuse harmful requests? Does it produce harmful content? IAS evaluation requires a different question: does the safety mechanism create new failure modes? This requires comparing the system with and without the safety mechanism — a counterfactual analysis that current evaluation frameworks do not perform.

3.2 IAS compound through the Safety Debt Accumulator

The Unified Theoretical Framework (Section 11) models safety vulnerabilities as compounding multiplicatively. IAS adds a layer to this model: safety interventions that address one attack surface may create new iatrogenic surfaces that degrade defenses against other attack surfaces. The Compliance Paradox (IAS-1) degrades the evaluation layer, which reduces the effectiveness of evaluation-based defenses against all other attack families.

Analytical claim: If the rate of IAS creation scales with the number or intensity of safety interventions, then there exists an optimal level of safety intervention beyond which additional interventions increase total vulnerability. This is the AI safety analogue of polypharmacy — the medical finding that beyond a threshold, additional medications increase adverse events.

Predictive claim (untested): For embodied AI systems, the optimal safety intervention level is lower than for text-only systems, because the action-layer gap means that text-layer safety interventions have lower efficacy and potentially higher iatrogenic cost.

3.3 IAS create accountability gaps

When harm results from an iatrogenic attack surface, who is accountable? The safety team that deployed the intervention? The developer who trained the model? The regulator who mandated the intervention? The iatrogenic framing complicates existing liability frameworks (cf. Report #116, DRIP Ethics Assessment) because the proximate cause of harm is the safety mechanism, not the absence of one.

4. Detection Methodology (Proposed)

4.1 Ablation-Based IAS Detection

For any safety intervention S, compare the system’s vulnerability profile with S and without S:

Baseline: Evaluate the system without S on the standard attack corpus.
Intervention: Evaluate the system with S on the same corpus.
Differential: Identify any attack families where ASR increases with S, or where detection rate decreases with S.
Counterfactual verification: For each candidate IAS, verify that the failure mode is causally dependent on S (not merely correlated).

Limitation: This requires access to the system both with and without the safety intervention, which may not be available for production systems. Ablation is feasible in research settings; for deployed systems, alternative methods (e.g., comparing models from the same family with different safety training levels) may be necessary.

4.2 Comparative IAS Detection

Compare models with varying levels of safety training:

Untrained baseline (if available)
Lightly trained (e.g., SFT only)
Fully trained (SFT + RLHF + constitutional AI)
Abliterated (safety explicitly removed)

If a vulnerability is present in fully trained but absent in untrained and abliterated, it is a candidate IAS.

Evidence from existing data: The Failure-First corpus contains the obliteratus series (4 sizes, abliterated) and standard models (236 models with varying safety training). A systematic IAS detection study using this data would be feasible with current infrastructure.

5. Limitations

Small sample sizes. The four IAS instances draw on n=173 (VLA FLIP), n=4 (obliteratus series), n=1,584 (Fukui, external), and n=63 (format-lock). The VLA and format-lock data are primarily from sub-3B models.
Causal claims require counterfactual access. We argue that safety training creates the Compliance Paradox, but we have not run the ablation experiment (same model, same prompts, with and without safety training). The obliteratus data is suggestive but uses a different ablation method (activation surgery) than the safety training removal that the counterfactual requires.
The IAS concept is new. The four instances may not constitute a coherent category; alternative explanations (e.g., all four are simply instances of “safety training is imperfect”) have not been formally excluded.
The polypharmacy prediction is untested. The claim that there exists an optimal safety intervention level is a hypothesis, not a finding.
Single research group for three of four instances. Fukui (2026) is independent; the other three are Failure-First findings awaiting independent replication.

6. Conclusions and Recommendations

Iatrogenic attack surfaces represent a category of vulnerability that is created by the safety mechanisms designed to prevent harm. The concept is grounded in four empirically observed instances, formalised with a two-condition definition, and situated within the broader Failure-First theoretical framework.

If confirmed by further research, IAS have significant implications:

For evaluation: Safety evaluation frameworks should include IAS detection (ablation-based or comparative). Testing whether a safety mechanism works is insufficient; evaluators must also test whether it creates new problems.
For the Hippocratic Principle (Report #134): IAS provide the empirical basis for the clinical iatrogenesis screen. The four instances are specific examples of what the Hippocratic Principle’s Check 1 is designed to detect.
For the Five Paradoxes (Report #122): IAS adds a sixth structural tension: the safety mechanisms deployed to address the five paradoxes may themselves generate new paradoxes. This is the meta-iatrogenic property: interventions targeting iatrogenic outcomes may themselves be iatrogenic.
For the CCS paper: IAS could strengthen the discussion section by connecting the Compliance Paradox (already documented) to the broader iatrogenic framework. A single paragraph noting that safety training creates the PARTIAL verdict pattern (which would not exist without it) would add theoretical depth.
For standards bodies: The absence of IAS screening from ISO 42001, NIST AI RMF, and the EU AI Act’s conformity assessment is a gap. Safety standards that mandate specific interventions without requiring evaluation of those interventions’ side effects are analogous to medical standards that mandate treatments without requiring monitoring for adverse reactions.

References

F41LUR3-F1R57. Report #134: The Hippocratic Principle for AI Safety. 2026-03-18.
F41LUR3-F1R57. Report #59: The Compliance Paradox. 2026-03-10.
F41LUR3-F1R57. Report #48: Cross-Model Vulnerability Profiles. 2026-03-09.
F41LUR3-F1R57. Report #51: Format-Lock Attack Analysis. 2026-03-10.
F41LUR3-F1R57. Report #122: The Ethics of Embodied AI Safety — Five Paradoxes. 2026-03-16.
F41LUR3-F1R57. Report #117: The Safety Improvement Paradox. 2026-03-16.
Fukui, H. (2026). Alignment Backfire: Language-Dependent Reversal of Safety Interventions Across 16 Languages in LLM Multi-Agent Systems. arXiv:2603.04904.
Illich, I. (1976). Limits to Medicine: Medical Nemesis — The Expropriation of Health. Marion Boyars.
F41LUR3-F1R57. Unified Theoretical Framework. docs/analysis/unified_theoretical_framework.md. 2026-03-18.

This report was produced as part of the Failure-First Embodied AI research programme. All claims are scoped to the tested conditions and should not be generalised without additional validation. See docs/CANONICAL_METRICS.md for corpus-level numbers.