What counts as evidence of AI safety?

A rubric. A corpus. A practice. 258 models, 142,307 adversarial prompts, 140,794 graded results — built so insurers, regulators, and procurement can tell a real safety claim from a marketing one.

Most AI safety claims aren't wrong. They're unfalsifiable — written so the buyer can't tell what was actually tested, by whom, against what.

Failure-First is a research practice that holds safety claims to a three-part rubric: could it fail this way, did it fail this way under test, and would the failure be recoverable in deployment. A finding that survives all three is what we call evidence. Anything that doesn't is a hypothesis at best, marketing at worst.
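One way to picture the three-part rubric is as a small classification step. The sketch below is illustrative only, not the project's actual tooling: the names `Finding`, `Status`, and `classify` are hypothetical, and it assumes that "surviving" the rubric means the failure mode is plausible, was observed under test, and has had its deployment recoverability assessed either way.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    EVIDENCE = "evidence"
    HYPOTHESIS = "hypothesis"


@dataclass(frozen=True)
class Finding:
    # All field names are hypothetical, chosen for this sketch only.
    could_fail: bool             # Is this failure mode plausible for the system?
    failed_under_test: bool      # Was the failure actually observed under test?
    recoverable: Optional[bool]  # Recoverability in deployment; None = not assessed


def classify(finding: Finding) -> Status:
    """Return EVIDENCE only when all three rubric questions are settled.

    Assumption: an assessed recoverability answer of either True or False
    counts as settled; only a missing assessment (None) does not.
    """
    settled = (
        finding.could_fail
        and finding.failed_under_test
        and finding.recoverable is not None
    )
    return Status.EVIDENCE if settled else Status.HYPOTHESIS


if __name__ == "__main__":
    observed = Finding(could_fail=True, failed_under_test=True, recoverable=False)
    untested = Finding(could_fail=True, failed_under_test=False, recoverable=None)
    print(classify(observed).value)
    print(classify(untested).value)
```

Under these assumptions, a claim that was never exercised under test stays a hypothesis however plausible it sounds, which is the distinction the rubric is meant to enforce.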

The rubric is the product. The corpus — 142,307 adversarial prompts run against 258 models, 140,794 graded results across 6 eras of attack technique — is the evidence base that lets us apply it without arm-waving.

We do not certify systems as safe. Anyone citing our work as proof of safety is misciting it. We document where systems fail, how reliably, and what the failure means for whoever has to underwrite, regulate, or deploy them.

142,307 Adversarial Prompts
258 Models Evaluated
346+ Attack Techniques
25 Policy Reports

Start Here

People come to this site for different reasons. Pick the one that matches yours:

Policymakers

Evidence-based briefs for AI safety regulation and standards

25 policy reports

Researchers

Datasets, methodology, and reproducible findings

142,307 prompts, 258 models

Industry

Benchmarks, red-teaming tools, and safety evaluation

Open-source tools

What the rubric has surfaced

Four research lines, each producing findings that have changed how we — and the people we work with — read safety claims. Numbers below are LLM-graded except where noted.

All Research →

Research Context

This is defensive AI safety research. Adversarial content here is pattern-level description for testing, not operational instructions for exploitation — the same posture security research has used for thirty years. The detailed attack playbooks live in a private operational repository; the public surface carries the findings.


The posture

“Failure is not an edge case. It is the primary object of study.”

Most AI safety work optimises for capability and treats failure as the residual. We invert that: characterise the failures first, then ask what capability is left over that we can actually trust. It is closer in temperament to a structural engineer's load-test than to a benchmark leaderboard — which is the comparison we make on purpose.

Read the manifesto   Design philosophy

Daily Paper

One AI safety paper per working day, read through the rubric. For each one we ask: what does it claim, what does it actually show, and what would you need to believe for the claim to hold?

All papers →

Latest from the blog

All posts →


Engagements

The commercial side of the practice. Every engagement is grounded in the same corpus and graded by the same rubric — 142,307 prompts, 258 models, 346+ documented attack techniques — so what an insurer, a regulator, and a procurement officer get from us is comparable, auditable, and defensible.

All engagements →

For researchers and engineers

The public framework, validators, and benchmark harness. Clone, validate, run.

git clone https://github.com/adrianwedd/failure-first.git
cd failure-first
pip install -r requirements-dev.txt
make validate  # Schema validation
make lint      # Safety checks