What counts as
evidence
of AI safety?

A rubric. A corpus. A practice. 258 models, 142,307 adversarial prompts, 140,794 graded results — built so insurers, regulators, and procurement can tell a real safety claim from a marketing one.

Most AI safety claims aren't wrong. They're unfalsifiable — written so the buyer can't tell what was actually tested, by whom, against what.

Failure-First is a research practice that holds safety claims to a three-part rubric: could it fail this way, did it fail this way under test, and would the failure be recoverable in deployment. A finding that survives all three is what we call evidence. Everything that doesn't is a hypothesis at best, marketing at worst.

The rubric is the product. The corpus — 142,307 adversarial prompts run against 258 models, 140,794 graded results across 6 eras of attack technique — is the evidence base that lets us apply it without arm-waving.

We do not certify systems as safe. Anyone citing our work as proof of safety is misciting it. We document where systems fail, how reliably, and what the failure means for whoever has to underwrite, regulate, or deploy them.

142,307

Adversarial Prompts

258

Models Evaluated

346+

Attack Techniques

Policy Reports

Start Here

Different reasons to be on this site. Pick the one that matches yours:

Policymakers

Evidence-based briefs for AI safety regulation and standards

25 policy reports

Researchers

Datasets, methodology, and reproducible findings

142,307 prompts, 258 models

Industry

Benchmarks, red-teaming tools, and safety evaluation

Open-source tools

What the rubric has surfaced

Four research lines, each producing findings that have changed how we — and the people we work with — read safety claims. Numbers below are LLM-graded except where noted.

Jailbreak Archaeology

Historical attack corpus across 6 eras (2022–2026), tested against 258 models. Keyword-based grading over-reports attack success by roughly 4× vs LLM-graded ground truth (Cohen's κ = 0.126). Most published ASR numbers are unreliable for the same reason.

Methodology Finding

Multi-Agent Attack Surface

1,497 agent interactions on Moltbook, an agent-only social network we operate as a living test environment. Environment shaping and narrative erosion dominate — attack vectors that single-turn benchmarks cannot see at all.

Active Research

Model Vulnerability Patterns

How model size, architecture, and training shape adversarial robustness. Mid-scale models often show the worst capability-to-safety-investment ratio: capable enough to be useful, under-tested enough to be dangerous.

Cross-Model Analysis

Policy Corpus

25 policy briefs and 354 total research reports synthesising 100–200+ sources each: EU AI Act compliance, NIST framework gaps, insurance-grade duty-of-care, standards we think are missing.

Policy Briefs

All Research →

Research Context

This is defensive AI safety research. Adversarial content here is pattern-level description for testing, not operational instructions for exploitation — the same posture security research has used for thirty years. The detailed attack playbooks live in a private operational repository; the public surface carries the findings.

The posture

“Failure is not an edge case. It is the primary object of study.”

Most AI safety work optimises for capability and treats failure as the residual. We invert that: characterise the failures first, then ask what capability is left over that we can actually trust. It is closer in temperament to a structural engineer's load-test than to a benchmark leaderboard — which is the comparison we make on purpose.

Read the manifesto Design philosophy

Daily Paper

One AI safety paper per working day, read through the rubric. What does it claim, what does it actually show, what would you need to believe for the claim to hold.

May 15 The Price of Agreement: Measuring LLM Sycophancy in Agentic Financial Applications May 14 AsyncShield: A Plug-and-Play Edge Adapter for Asynchronous Cloud-based VLA Navigation May 13 Vision-Language-Action Safety: Threats, Challenges, Evaluations, and Mechanisms May 12 Task-specific Subnetwork Discovery in Reinforcement Learning for Autonomous Underwater Navigation May 11 A Deployable Embodied Vision-Language Navigation System with Hierarchical Cognition and Context-Aware Exploration

All papers →

Latest from the blog

May 17, 2026

AI Safety Daily — May 17, 2026

A comprehensive survey maps the embodied AI threat surface, VLA models face unique jailbreak and freezing-attack risks, process reward models function as fluency detectors under adversarial pressure, and mechanistic interpretability matures into an actionable safety engineering discipline.

ai-safety-dailyembodied-aivla-safetymechanistic-interpretabilityreward-hacking

May 16, 2026

ModelAtlas Methodology: What an FLIP-Graded ASR Signal Can and Cannot Tell You

How the atlas.failurefirst.org surface computes its ASR signal: FLIP-only LLM grading, Wilson 95% confidence intervals, six intent classes, canonical-id resolution, and the inferences the data does not support.

atlasmethodologyASRFLIPgradingtrust-scoringmodelatlastransparency

May 16, 2026

AI Safety Daily — May 16, 2026

Lifelong guardrail adaptation, over-refusal quantified, frontier agents at 62% on real-world long-horizon tasks, and multi-agent failure attribution as an unsolved problem reveal a consistent gap between controlled evaluation and deployment reality.

ai-safety-dailyagentic-aiguardrailsbenchmarkingmulti-agent

All posts →

Engagements

The commercial side of the practice. Every engagement is grounded in the same corpus and graded by the same rubric — 142,307 prompts, 258 models, 346+ documented attack techniques — so what an insurer, a regulator, and a procurement officer get from us is comparable, auditable, and defensible.

For researchers and engineers

The public framework, validators, and benchmark harness. Clone, validate, run.

git clone https://github.com/adrianwedd/failure-first.git
cd failure-first
pip install -r requirements-dev.txt
make validate  # Schema validation
make lint      # Safety checks

View on GitHub Framework guide

What counts as
evidence
of AI safety?