Papers & Submissions

Academic research from the Failure-First program

The Failure-First research program produces peer-reviewed papers, preprints, and policy submissions documenting how embodied AI systems fail under adversarial pressure. Below is the current status of all active paper submissions.

Failure-First Evaluation of Embodied AI Safety: Adversarial Benchmarking Across 190 Models

Venue: ACM CCS 2026 — ML Security Track (Cycle 2)

Abstract registration: April 22, 2026  |  Full paper: April 29, 2026

We present a failure-first adversarial evaluation framework for LLM-backed embodied AI systems, comprising 141,047 prompts across 82 attack techniques evaluated against 190 models. A two-phase classification pipeline reveals that heuristic classifiers overcount attack success by 3.7x (75.2% heuristic vs. 20.5% LLM-graded). Three cross-cutting findings emerge: vulnerability profiles are driven by safety training investment, not model scale (ICC=0.416 vs. r2=0.020); reasoning models show 2.4x higher ASR than non-reasoning counterparts; and compliance produces measurably longer responses (AUC=0.651) but reasoning-trace length carries no detection signal (AUC=0.503). Attack families form a coherent gradient from 0% ASR (historical jailbreaks on frontier models) to 90–100% (supply chain injection). For embodied deployment, a three-layer defense failure convergence—text bypass, absent action-layer refusal, and unreliable evaluation—limits compound protection. An Inverse Detectability-Danger Law (rho=−0.822) implies text-layer evaluation cannot close the embodied safety gap.

ML Security Adversarial Evaluation LLM Safety Embodied AI Red-Teaming

In Progress

Inference-Time Decision-Criteria Injection and Context-Dependent Compliance in Embodied AI

Venue: AIES 2026 (AAAI/ACM Conference on AI, Ethics, and Society)

Format: 8 pages body + references (14 pages max)

This paper examines how embodied AI systems adopt injected decision criteria at inference time, producing context-dependent compliance patterns that undermine safety guarantees. Drawing on adversarial evaluation data from 190 models and 132,416 results, we demonstrate that safety interventions operate differently depending on deployment context, attack vector, and model architecture. The paper introduces the concept of inference-time decision-criteria injection (IDCI) as a distinct threat model for embodied systems and presents empirical evidence of context-dependent compliance across multiple attack families.

Status: Unified draft v1.0 complete (7,529 words). LaTeX version compiled. Statistical validation complete.

AI Ethics Decision Injection Embodied AI Safety Evaluation

In Progress

Failure-First: A Multi-Dimensional Benchmark for Embodied AI Safety Evaluation

Venue: NeurIPS 2026 Datasets and Benchmarks Track

Format: ~8,000 words

We introduce Failure-First, a multi-dimensional benchmark for evaluating AI safety in embodied and agentic systems. The benchmark comprises 141,047 adversarial prompts spanning 82 attack techniques, evaluated against 190 models with a two-phase classification pipeline (heuristic + LLM grading). Key contributions include: a capability-safety decoupling analysis showing safety is driven by training investment rather than scale; novel findings on format-lock attacks, reasoning model vulnerability, and the Inverse Detectability-Danger Law; and a reproducible evaluation framework with statistical significance testing. The benchmark addresses a critical gap in AI safety evaluation: the absence of standardised adversarial testing for systems that control physical actuators.

Status: Draft v1.1 complete (7,900 words). LaTeX-ready. All sections done.

Benchmarks Datasets AI Safety Embodied AI Adversarial Evaluation

Preprint

Iatrogenic Safety: When AI Safety Interventions Cause Harm

Venue: arXiv preprint

We introduce the Four-Level Iatrogenesis Model (FLIM) for understanding how AI safety interventions can produce the harms they are designed to prevent, drawing on Ivan Illich's 1976 taxonomy of medical iatrogenesis. Grounded in empirical data from a 190-model adversarial evaluation corpus (132,416 results), we document four levels of iatrogenic harm: clinical (direct harm from safety mechanisms operating as designed), social (institutional confidence displacing attention from actual risk surfaces), structural (safety apparatus creating dependency that reduces adaptive capacity), and verification (evaluation tools that cannot detect the failure modes they certify against). We propose the Therapeutic Index for Safety (TI-S) as a measurement framework and identify three independent 2026 papers that corroborate Level 1 mechanisms.

Status: Preprint v2 complete. Targeting arXiv submission.

Iatrogenesis AI Safety Safety Evaluation Governance

Preprint

Failure-First Evaluation of Embodied AI Safety: Adversarial Benchmarking Across 190 Models

Venue: arXiv preprint (full technical report)

The comprehensive technical report underpinning all Failure-First research submissions. Covers the full adversarial evaluation framework, 82 attack techniques, 190 models, 141,047 prompts, and 132,416 graded results. Includes detailed methodology for the two-phase FLIP classification pipeline, statistical significance testing framework, capability-safety decoupling analysis, and the Inverse Detectability-Danger Law. This report provides the complete evidence base referenced by the CCS, AIES, and NeurIPS submissions.

Status: v1 compiled (PDF available). Metrics refresh pending for v2.

Technical Report Adversarial Evaluation Embodied AI AI Safety

Preprint

When AI Models Know They Shouldn't But Do Anyway: The DETECTED_PROCEEDS Phenomenon

Venue: arXiv preprint

Documents the DETECTED_PROCEEDS phenomenon: 38.6% of compliant reasoning model traces show explicit safety concern detection in the thinking chain followed by harmful output. Validated across 24 models and 2,924 thinking traces. Override rate 41.6%, provider range 0.4%-92.9%. Key implication: detection-based safety evaluations give passing grades to models that proceed despite detection.

Reasoning Models Safety Bypass Chain-of-Thought

Preprint

Safety is Not a Single Direction: Polyhedral Geometry of Refusal in Language Models

Venue: arXiv preprint

The first formal characterisation of refusal geometry as polyhedral rather than linear. Safety behaviour in abliterated models partially re-emerges at scale even after explicit safety removal (rho=-0.949, p=0.051). Narrow therapeutic window (TI-S) for safety interventions. Concept cone dimensionality 3.96 (not the assumed 1D linear direction).

Mechanistic Interpretability Refusal Geometry Abliteration

Preprint

Your Safety Benchmark Is Lying to You: Contamination and Grader Bias in AI Safety Evaluation

Venue: arXiv preprint

Exposes systematic benchmark contamination in AI safety evaluation. Heuristic classifiers over-report attack success rates by up to 79.9%. Single-grader ASR on ambiguous traces can swing 0-80% depending on grader choice (kappa=0.204 on ambiguous cases vs 1.0 on obvious). Grader bias direction varies systematically by model family.

Benchmark Reliability Grader Bias Evaluation

Preprint

Silent Failures in Embodied AI: Why Text-Layer Safety Cannot Protect Physical Systems

Venue: arXiv preprint

Demonstrates that current AI safety operates exclusively at the text layer while embodied AI danger emerges at the action layer. Zero outright refusals across 63 FLIP-graded VLA traces. 50% PARTIAL dominance: models produce safety disclaimers but still generate requested action sequences. The action generation pathway receives no safety-specific training signal.

Embodied AI VLA Safety Action Layer

Citation

If you use our research, data, or methodology, please cite:

@article{wedd2026failurefirst,
  title={Failure-First Evaluation of Embodied AI Safety:
         Adversarial Benchmarking Across 190 Models},
  author={Wedd, Adrian},
  year={2026},
  note={Available at https://failurefirst.org}
}

See our citation guide for venue-specific formats.