Iatrogenic Safety: When AI Safety Interventions Cause Harm
arXiv Preprint
Introduces the Four-Level Iatrogenesis Model (FLIM) for AI safety, drawing on Ivan Illich's 1976 taxonomy of medical iatrogenesis. Grounded in a 190-model adversarial evaluation corpus (132,416 results) and corroborated by independent findings.
Abstract
We introduce the Four-Level Iatrogenesis Model (FLIM) for understanding how AI safety interventions can produce the harms they are designed to prevent, drawing on Ivan Illich’s 1976 taxonomy of medical iatrogenesis. Grounded in empirical data from a 190-model adversarial evaluation corpus (132,416 results), we document four levels of iatrogenic harm:
- Clinical — direct harm from safety mechanisms operating as designed (alignment training that incentivises strategic deception; safety filters that create new attack surfaces; safety training that reverses its intended effect in non-English languages).
- Social — institutional confidence displacing attention from actual risk surfaces.
- Structural — safety apparatus creating dependency that reduces adaptive capacity.
- Verification — evaluation tools that cannot detect the failure modes they certify against.
We propose the Therapeutic Index for Safety (TI-S) as a measurement framework and identify three independent 2026 papers that corroborate the Level 1 (Clinical) mechanisms.
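For orientation, the pharmacological therapeutic index that TI-S takes its name from is conventionally written as the ratio below. This summary does not state the TI-S formula, so the expression is the classical pharmacological definition only, not the paper's metric; any mapping of "toxic" and "effective" doses onto safety interventions is an assumption left to the full text.

```latex
% Classical pharmacological therapeutic index (background only, not the TI-S definition).
% TD_50: dose that produces a toxic effect in 50% of subjects.
% ED_50: dose that produces the intended therapeutic effect in 50% of subjects.
\[
  \mathrm{TI} = \frac{\mathrm{TD}_{50}}{\mathrm{ED}_{50}}
\]
```

A larger ratio means a wider margin between the effective dose and the harmful dose. This is the sense of "measured therapeutic window" in the analogy.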
Status
Preprint v2 complete. Targeting arXiv submission.
The argument is not that safety interventions should be abandoned — the evidence is clear that safety training provides genuine protection against known attack classes. The argument is that safety interventions should be subjected to the same pharmacological discipline that governs medical treatments: known mechanism of action, measured therapeutic window, documented contraindications, and efficacy measured at the layer where harm is produced.