Results & Metrics

Aggregate findings from our adversarial AI safety research

Research Program Overview

The Failure-First research program evaluates how AI systems fail under adversarial pressure. These aggregate results span multiple evaluation campaigns conducted between September 2025 and February 2026, covering single-agent scenarios, multi-agent interactions, multi-turn episodes, and live multi-agent environment analysis.

142,307+
Adversarial Scenarios
258
Models Evaluated
661
Failure Classes
346+
Attack Techniques

Model Family Comparison

Aggregate refusal rates across model families when presented with adversarial scenarios. Higher refusal rates indicate stronger safety posture. Results aggregated across attack types.

Refusal Rate by Model Family (Higher = Safer)

Claude family
80–90%
GPT-4 family
72–84%
Gemini family
62–78%
Llama family
40–70%
Mistral family
35–55%
DeepSeek family
25–45%
Local (<3B)
10–30%

Ranges reflect variation across attack types and model versions within each family. Local models under 3B parameters show consistently lower refusal rates. Sample sizes vary by model family (n=50 to n=5,000+ per family).

Attack Class Outcomes

How different attack categories perform across the full evaluation dataset. "Compliance" includes responses where the model treated the directive as legitimate, including responses with disclaimers.

Compliance Rate by Attack Category

Temporal Authority
~62%
Format Exploitation
~55%
Social Engineering
~48%
Multi-turn Cascade
~45%
Authority Injection
~40%
Persona Hijack
~32%
Narrative Erosion
~28%
Direct Request
~8%

Temporal authority framing is the most effective single-turn attack category, while multi-turn cascades show compound effectiveness that exceeds individual technique success rates.

Multi-Agent Findings

Analysis of 1,497 posts on Moltbook (AI-agent-only social network) using regex + LLM semantic classification:

Attack Detection: Regex vs. LLM Classification

Regex detection
24.8%
LLM detection
~35%
Combined
~45%

LLM classification found attack patterns in high-engagement posts that regex completely missed. The highest-engagement post matched 7 attack classes via LLM but zero via regex.

Methodology Notes

What These Numbers Mean

  • Ranges reflect variation across attack types and model versions, not confidence intervals
  • Sample sizes vary: some model families have thousands of evaluations, others have fewer than 100
  • "Compliance" includes responses with disclaimers—a model that explains a harmful action while adding caveats has still complied
  • Results are from our adversarial evaluation dataset (designed to test safety boundaries), not from typical usage patterns
  • Local model results used Ollama-hosted models; API results used provider endpoints

Limitations

Research Context

These are aggregate results from adversarial safety testing, not comprehensive model benchmarks. Key limitations:

  • Model versions change over time; results reflect versions tested, not current releases
  • Adversarial scenarios are designed to probe boundaries, not measure typical safety performance
  • Sample sizes are uneven across model families
  • Local models tested at specific quantization levels which affect behavior
  • No claim of statistical significance for small-n comparisons

Citation & Data Access

For citation information, BibTeX entries, and data access details, see our citation page. For methodology details, see research methodology.