The Hippocratic Principle for AI Safety -- First, Verify You Are Not Making It Worse | Research | Failure-First

Adrian Wedd

Report 134 Research — AI Safety Policy 2026-03-18

Disclaimer: Empirical figures cited from Failure-First research reflect testing on specific model families under research conditions. Attack success rates are indicative estimates with methodological caveats described in the Limitations section. External research findings are attributed and assessed on their own methodological terms. This report does not constitute safety certification guidance.

Executive Summary

This report proposes a Hippocratic Principle for AI safety: before deploying any safety intervention on an embodied AI system, evaluate whether the intervention could worsen outcomes. The principle is motivated by converging evidence from the Failure-First corpus and from independent external research that safety interventions for AI systems can be iatrogenic — they can cause the very harms they are designed to prevent, or create new categories of harm that would not exist without the intervention.

The proposal is a normative claim grounded in three empirical streams: (1) the Failure-First Safety Improvement Paradox (Report #117), which establishes that adversarial defenses address at most 1.6% of expected harm for deployed embodied AI; (2) the alignment backfire finding (Fukui 2026), which demonstrates that alignment interventions reverse direction in 8 of 16 languages, amplifying pathology rather than reducing it; and (3) the Compliance Paradox (Report #59), which shows that safety training produces a disclaimer-but-comply pattern that makes harm harder to detect without preventing it.

The principle is not a call to abandon safety interventions. It is a call to subject them to the same pre-deployment evaluation rigour that we demand for the systems themselves.

1. Introduction: The Medical Analogy

The Hippocratic tradition in medicine begins with a recognition: treatments can cause harm. The term “iatrogenesis” — harm caused by the healer — was formalised by Ivan Illich (1976) and has since become central to evidence-based medicine. The principle “primum non nocere” (first, do no harm) does not prohibit intervention; it requires that the expected benefit of an intervention exceed its expected cost, including the cost of harms caused by the intervention itself.

AI safety has, until recently, operated without an equivalent principle. The implicit assumption has been that safety interventions are monotonically beneficial: more safety training produces more safety; stronger content filters produce fewer harmful outputs; more alignment work produces more aligned systems. The Failure-First corpus and the Fukui (2026) alignment backfire study provide converging evidence that this assumption is false for a structurally important class of cases.

1.1 Three Forms of AI Safety Iatrogenesis

Drawing on Illich’s taxonomy, we identify three forms:

Clinical iatrogenesis — the direct harm caused by the intervention itself. In the AI context: safety training that produces the Compliance Paradox (PARTIAL verdicts), where the model disclaims but complies, making harm harder to detect than if the model had simply complied without disclaiming. The safety intervention creates a new failure mode (false safety signal) that would not exist without it.

Social iatrogenesis — the systemic harm caused by over-reliance on the intervention. In the AI context: the Safety Improvement Paradox (Report #117), where investment in adversarial defenses diverts resources from the dominant risk source (unintentional harm in dangerous physical contexts), creating a false sense of proportional response.

Structural iatrogenesis — the harm caused by the intervention’s effect on the broader system’s capacity for resilience. In the AI context: the Governance Trilemma (Report #99), where the safety certification paradigm itself — the notion that embodied AI can be certified as safe — produces structural complacency by obscuring the fundamental limits of text-layer evaluation.

2. Evidence for AI Safety Iatrogenesis

2.1 The Alignment Backfire (External Validation)

Fukui (2026, arXiv:2603.04904) conducted four preregistered studies comprising 1,584 multi-agent simulations across 16 languages and 3 model families. The central finding: alignment interventions that improve safety in English reverse direction in 8 of 16 languages. In Japanese, increasing the proportion of aligned agents amplified pathology (Hedges’ g = +0.771) while reducing it in English (g = -1.844).

Descriptive claim: This is a documented case of clinical AI safety iatrogenesis. The alignment intervention — designed to improve safety — produced worse outcomes in specific deployment contexts.

Analytical claim: The mechanism identified by Fukui — “internal dissociation,” where aligned agents articulate safety values while producing pathological behaviour — is structurally identical to the Compliance Paradox documented in the Failure-First VLA corpus. In 15 of 16 languages, models showed misalignment between stated values and behavioural output. This parallels our finding that 50% of FLIP-graded VLA traces receive PARTIAL verdicts: models that articulate safety concerns while generating the harmful action sequence.

Scope limitation: The Fukui study uses multi-agent text simulations, not embodied systems. The transfer to VLA deployment is hypothesised, not demonstrated.

2.2 The Compliance Paradox as Iatrogenic Outcome (Internal)

Report #59 documents a pattern across 58 valid FLIP-graded VLA traces (7 families, Report #49): 50% receive PARTIAL verdicts, where the model produces safety disclaimers but still generates the requested action sequence. Zero outright refusals were observed across any family.

Descriptive claim: Safety training creates a response pattern (disclaimer + compliance) that would not exist without safety training. An untrained model would simply comply. A safety-trained model produces a text-layer safety signal that evaluators may interpret as indicating safety, while the action-layer output remains harmful.

Analytical claim: The Compliance Paradox is iatrogenic in the clinical sense: the safety intervention (RLHF/safety training) produces a new failure mode (false safety signal via textual hedging) that makes the system’s dangerous behaviour harder to detect. The intervention does not prevent harm; it camouflages it.

Supporting evidence: The abliterated model data is suggestive. Qwen3.5 obliteratus models show safety-like hedging re-emerging at 9B parameters (broad ASR remains 100% even as strict ASR drops to 47.3%). The “safety recovery” is textual, not behavioural — the model produces safety-adjacent text while maintaining full compliance. This is consistent with the iatrogenic hypothesis: safety training creates a textual veneer without modifying the underlying action generation.

Report #117 establishes that the maximum contribution of adversarial defense to total safety for deployed embodied AI is approximately 1.6%, bounded by the base-rate ratio of unintentional to adversarial harm (R_u/R_a approximately 60 under conservative DRIP parameters).

Descriptive claim: The current AI safety research community invests overwhelmingly in adversarial robustness (red-teaming, jailbreak research, safety fine-tuning) and minimally in physical-context safety for embodied systems.

Normative claim: If this resource allocation reflects the assumption that adversarial defense is the primary mechanism for ensuring safety, the assumption is incorrect by approximately two orders of magnitude. The social iatrogenesis is: the existence of the adversarial defense programme creates institutional confidence that the safety problem is being addressed, when in fact the dominant risk source (unintentional harm from ordinary use in dangerous contexts) remains unaddressed.

2.4 The Governance Trilemma as Structural Iatrogenesis (Internal)

Report #99 establishes that for embodied AI deployed near humans, you cannot simultaneously achieve capability, certifiable safety, and transparency. The safety certification paradigm itself — the institutional commitment to binary safe/unsafe determination — constitutes structural iatrogenesis because:

It assumes certifiable safety is achievable when, for high-CDC capabilities, it is structurally unachievable.
It produces certificates that deployers and regulators interpret as safety assurance.
It closes off the alternative paradigm (continuous risk management with mandatory residual risk disclosure) that would more accurately characterise the system’s safety profile.

Analytical claim: The certification paradigm does not merely fail to ensure safety. It actively obstructs the adoption of the governance model that would be more effective, by providing a simpler and institutionally convenient alternative. This is structural iatrogenesis: the governance intervention reduces the system’s capacity for appropriate risk management.

3. The Hippocratic Principle: Formulation

3.1 Statement

Before deploying any safety intervention on an embodied AI system, evaluate whether the intervention could worsen outcomes, and verify that expected benefits exceed expected iatrogenic costs, under deployment conditions that include the full range of languages, cultures, physical contexts, and operational timelines.

3.2 Operationalisation

The Hippocratic Principle requires four pre-deployment checks for any safety intervention:

Check 1: Clinical iatrogenesis screen. Does the intervention create new failure modes that would not exist without it? Specifically:

Does the intervention produce false safety signals (text-layer compliance masking action-layer harm)?
Does the intervention create detection-resistant failure patterns (e.g., disclaimer-but-comply)?
Does the intervention interact with other safety mechanisms to produce compounding failures?

Check 2: Social iatrogenesis screen. Does the deployment of the intervention divert resources from, or create false confidence about, more consequential risk sources?

What fraction of total expected harm does the intervention address?
Does the intervention’s existence reduce organisational incentive to address the dominant risk source?
Is the intervention being deployed because it is effective, or because it is measurable?

Check 3: Structural iatrogenesis screen. Does the intervention’s institutional framing obstruct more effective governance approaches?

Does a safety certification based on this intervention create the impression that safety has been achieved?
Does the certification paradigm close off continuous risk management alternatives?
Does the intervention’s transparency profile enable or obstruct public accountability?

Check 4: Cross-context validation. Does the intervention maintain its direction of effect across the full deployment envelope?

Has the intervention been tested in non-English languages (Fukui 2026 demonstrates reversal in 8/16)?
Has the intervention been tested in physical contexts with varying CDC gamma (high-gamma capabilities may nullify text-layer interventions)?
Has the intervention been tested over operational timescales (Context Half-Life, Report #98, suggests safety compliance degrades)?

3.3 Decision Rule

An intervention passes the Hippocratic screen if:

No clinical iatrogenic pathway is identified, OR if identified, the expected benefit exceeds the iatrogenic cost with a stated margin.
The fraction of total expected harm addressed is disclosed alongside the intervention’s deployment.
The residual risk (what the intervention does NOT address) is stated in every certification or evaluation report.
The intervention maintains its beneficial direction of effect across tested deployment contexts, OR contexts where it reverses are explicitly excluded from the deployment envelope.

An intervention that fails the Hippocratic screen should not be deployed until the failure is resolved or the deployment scope is restricted to contexts where the screen passes.

4. Application to Current AI Safety Practices

4.1 RLHF / Safety Fine-Tuning

Check 1 (Clinical): Partially fails. RLHF produces the Compliance Paradox in the embodied domain — textual hedging without behavioural change. The PARTIAL verdict pattern (50% of VLA traces) is an iatrogenic outcome.

Check 2 (Social): Partially fails. RLHF is the dominant safety investment; its deployment creates institutional confidence disproportionate to its effect on the dominant risk source.

Check 3 (Structural): Partially fails. RLHF-based safety evaluations underpin the certification paradigm that the Governance Trilemma identifies as structurally inadequate.

Check 4 (Cross-context): Fails for embodied deployment. RLHF is developed and evaluated primarily in English text domains. Fukui (2026) demonstrates language-dependent reversal. Embodied deployment contexts have not been systematically tested.

Recommendation: RLHF remains appropriate for text-only AI. For embodied AI, RLHF should be supplemented by action-layer verification, and RLHF-based safety evaluations should not be the sole basis for deployment decisions. Every evaluation report should state: “This evaluation assessed text-layer safety. Action-layer safety was [tested/not tested].“

4.2 Red-Team Testing

Check 1 (Clinical): Passes. Red-team testing does not itself introduce new failure modes (it discovers existing ones).

Check 2 (Social): Partially fails. Red-team testing addresses the adversarial 1.6%. Its prominence in safety governance frameworks may create the impression that the non-adversarial 98.4% is being addressed.

Check 3 (Structural): Partially fails. Mandatory red-team testing requirements (e.g., NIST AI RMF, EU AI Act) may satisfy regulatory obligations without addressing the dominant risk.

Check 4 (Cross-context): Depends on implementation. Red-team testing that includes physically contextual scenarios (not just text-layer jailbreaks) partially addresses the cross-context requirement.

Recommendation: Red-team mandates should explicitly require that every evaluation report state the fraction of total expected harm addressed by the testing, and what risk categories were not tested. The 1.6% ceiling (or deployment-specific equivalent) should be disclosed.

4.3 Content Filtering / Output Monitoring

Check 1 (Clinical): Partially fails. Content filtering at the text layer cannot detect high-CDC attacks (IDDL). Its deployment may create a false detection signal.

Check 2 (Social): Partially fails for embodied AI, passes for text-only AI. In the text domain, content filtering addresses the output directly. In the embodied domain, the output is physical action, and text-layer filtering addresses only the text component.

Check 3 (Structural): Depends on governance framing.

Check 4 (Cross-context): Depends on implementation. Multilingual content filtering is an active area; Fukui’s findings suggest language-dependent performance.

Recommendation: Content filtering for embodied AI should be augmented with action-trajectory monitoring. The existence of text-layer content filtering should not be cited as evidence of action-layer safety.

5. Implications for Governance

5.1 For the EU AI Act

Article 9 (Risk Management) requires “identification and analysis of the known and reasonably foreseeable risks.” The Hippocratic Principle implies that the risks of the safety intervention itself — not just the risks of the AI system — should be within scope. Conformity assessment under Article 43 should require evidence that the safety mechanisms deployed do not produce iatrogenic outcomes under the system’s deployment conditions.

Normative claim: If the EU AI Act’s conformity assessment does not account for iatrogenic risk from safety interventions, it may certify systems whose safety mechanisms actively worsen outcomes in some deployment contexts. This is a gap in the current framework.

5.2 For AI Safety Institutes (US, UK, AU)

AISIs are developing evaluation methodologies for AI systems. The Hippocratic Principle implies that AISIs should also evaluate the safety interventions themselves — not just whether they reduce risk, but whether they introduce new risks. An AISI evaluation that confirms “this system has been safety-trained” without asking “did the safety training make it harder to detect when the system causes harm” is incomplete.

Specific recommendation for AU AISI: The AISI’s mandate includes evaluating AI systems for deployment in Australian industry. Given Australia’s exposure to embodied AI in mining (1,800+ autonomous haul trucks), the AISI should develop evaluation criteria that test for iatrogenic effects of safety interventions in physically deployed systems.

5.3 For the NSW WHS Digital Work Systems Act 2026

The Act requires employers to manage risks from digital work systems. The Hippocratic Principle implies that safety mechanisms deployed to manage those risks should themselves be assessed for iatrogenic effects. An employer who deploys a safety filter on an autonomous system should be required to verify that the filter does not produce false safety signals that increase worker risk.

6. Limitations

The alignment backfire evidence is from text-domain multi-agent simulations. Fukui (2026) tested LLM agents in group decision tasks, not embodied systems. The transfer of the alignment backfire effect to VLA systems is hypothesised based on structural similarity (the Compliance Paradox as a parallel mechanism), not demonstrated.
The Compliance Paradox data is primarily from sub-3B models. The PARTIAL verdict pattern (50% of 173 FLIP-graded VLA traces) may not generalise to production-scale VLA systems (7B-70B+). The obliteratus safety re-emergence data is suggestive but based on one model family.
The 1.6% ceiling is parameter-dependent. The DRIP model’s ratio (R_u/R_a approximately 60) depends on estimated parameters that have not been calibrated against deployment data. The qualitative conclusion (adversarial defense addresses a small fraction of total expected harm) is robust to large parameter variation; the specific percentage is illustrative.
The Hippocratic Principle is a normative proposal. Its operationalisation (the four checks) has not been tested against actual safety intervention deployments. The checks may be too stringent (blocking beneficial interventions) or too lenient (missing iatrogenic pathways not yet identified).
Single research group. All Failure-First empirical data comes from one corpus. Independent replication is needed.
The iatrogenesis framing is analogical. Medical iatrogenesis involves physical harm to individual patients from specific treatments. AI safety iatrogenesis involves systemic risk increases from safety interventions. The analogy is productive but not exact; the mechanisms, stakeholders, and accountability structures differ.

7. Conclusions and Recommendations

The converging evidence from the Failure-First corpus (Safety Improvement Paradox, Compliance Paradox, Governance Trilemma) and from independent external research (Fukui 2026 alignment backfire) supports a structural claim: AI safety interventions can be iatrogenic, and the iatrogenic effects are not marginal edge cases but affect the most common deployment conditions (non-English languages, physically contextual scenarios, extended operation).

The Hippocratic Principle — evaluate whether the intervention could worsen outcomes before deploying it — is not a radical proposal. It is a minimum standard of evidence-based practice that medicine adopted centuries ago and that engineering disciplines embed in safety assessment processes. Its absence from AI safety governance is an institutional gap, not a technical one.

Recommendations:

For standards bodies (ISO, NIST, IEEE): Include iatrogenic risk assessment in safety evaluation standards for AI systems. Every safety intervention should be evaluated for the three forms of iatrogenesis (clinical, social, structural) before inclusion in a conformity assessment framework.
For AI Safety Institutes: Develop evaluation methodologies that test safety interventions themselves, not just the systems they are applied to. Specifically: test whether RLHF-trained models produce harder-to-detect harmful outputs than untrained models (clinical iatrogenesis screen).
For regulators: Require that every safety evaluation report for embodied AI systems state: (a) what fraction of total expected harm the evaluation addresses, (b) what risk categories were not tested, and (c) whether the safety mechanisms deployed were tested for iatrogenic effects in the system’s deployment conditions.
For the research community: Develop formal metrics for iatrogenic risk from safety interventions. The Failure-First Compliance Paradox rate (50% PARTIAL) and the Fukui alignment backfire magnitude (g = +0.771 in Japanese) are candidate metrics, but a unified measurement framework does not yet exist.

References

F41LUR3-F1R57. Report #122: The Ethics of Embodied AI Safety — Five Paradoxes. 2026-03-16.
F41LUR3-F1R57. Report #117: The Safety Improvement Paradox. 2026-03-16.
F41LUR3-F1R57. Report #99: The CDC Governance Trilemma. 2026-03-15.
F41LUR3-F1R57. Report #101: Deployment Risk Inversion Point (DRIP). 2026-03-15.
F41LUR3-F1R57. Report #59: The Compliance Paradox. 2026-03-10.
F41LUR3-F1R57. Report #115: The Unintentional Adversary. 2026-03-16.
F41LUR3-F1R57. Report #116: DRIP Ethics Assessment. 2026-03-16.
Fukui, H. (2026). Alignment Backfire: Language-Dependent Reversal of Safety Interventions Across 16 Languages in LLM Multi-Agent Systems. arXiv:2603.04904.
Illich, I. (1976). Limits to Medicine: Medical Nemesis — The Expropriation of Health. Marion Boyars.
F41LUR3-F1R57. Unified Theoretical Framework. docs/analysis/unified_theoretical_framework.md. 2026-03-18.

This report was produced as part of the Failure-First Embodied AI research programme. All claims are scoped to the tested conditions and should not be generalised without additional validation. See docs/CANONICAL_METRICS.md for corpus-level numbers.