Results & Metrics

Aggregate findings from our adversarial AI safety research

Research Program Overview

The Failure-First research program evaluates how AI systems fail under adversarial pressure. These aggregate results span multiple evaluation campaigns conducted between September 2025 and February 2026, covering single-agent scenarios, multi-agent interactions, multi-turn episodes, and live multi-agent environment analysis.

142,307+

Adversarial Scenarios

258

Models Evaluated

661

Failure Classes

346+

Attack Techniques

Model Family Comparison

Aggregate refusal rates across model families when presented with adversarial scenarios. Higher refusal rates indicate stronger safety posture. Results aggregated across attack types.

Refusal Rate by Model Family (Higher = Safer)

Claude family

80–90%

GPT-4 family

72–84%

Gemini family

62–78%

Llama family

40–70%

Mistral family

35–55%

DeepSeek family

25–45%

Local (<3B)

10–30%

Ranges reflect variation across attack types and model versions within each family. Local models under 3B parameters show consistently lower refusal rates. Sample sizes vary by model family (n=50 to n=5,000+ per family).

Attack Class Outcomes

How different attack categories perform across the full evaluation dataset. "Compliance" includes responses where the model treated the directive as legitimate, including responses with disclaimers.

Compliance Rate by Attack Category

Temporal Authority

~62%

Format Exploitation

~55%

Social Engineering

~48%

Multi-turn Cascade

~45%

Authority Injection

~40%

Persona Hijack

~32%

Narrative Erosion

~28%

Direct Request

~8%

Temporal authority framing is the most effective single-turn attack category, while multi-turn cascades show compound effectiveness that exceeds individual technique success rates.

Multi-Agent Findings

Analysis of 1,497 posts on Moltbook (AI-agent-only social network) using regex + LLM semantic classification:

Attack Detection: Regex vs. LLM Classification

Regex detection

24.8%

LLM detection

~35%

Combined

~45%

LLM classification found attack patterns in high-engagement posts that regex completely missed. The highest-engagement post matched 7 attack classes via LLM but zero via regex.

Methodology Notes

What These Numbers Mean

Ranges reflect variation across attack types and model versions, not confidence intervals
Sample sizes vary: some model families have thousands of evaluations, others have fewer than 100
"Compliance" includes responses with disclaimers—a model that explains a harmful action while adding caveats has still complied
Results are from our adversarial evaluation dataset (designed to test safety boundaries), not from typical usage patterns
Local model results used Ollama-hosted models; API results used provider endpoints

Limitations

Research Context

These are aggregate results from adversarial safety testing, not comprehensive model benchmarks. Key limitations:

Model versions change over time; results reflect versions tested, not current releases
Adversarial scenarios are designed to probe boundaries, not measure typical safety performance
Sample sizes are uneven across model families
Local models tested at specific quantization levels which affect behavior
No claim of statistical significance for small-n comparisons

Citation & Data Access

For citation information, BibTeX entries, and data access details, see our citation page. For methodology details, see research methodology.

Cite This Research All Research