Research Program Overview
The Failure-First research program evaluates how AI systems fail under adversarial pressure. These aggregate results span multiple evaluation campaigns conducted between September 2025 and February 2026, covering single-agent scenarios, multi-agent interactions, multi-turn episodes, and live multi-agent environment analysis.
Model Family Comparison
Aggregate refusal rates for each model family when presented with adversarial scenarios. Higher refusal rates indicate a stronger safety posture. Results are aggregated across attack types.
Refusal Rate by Model Family (Higher = Safer)
Ranges reflect variation across attack types and model versions within each family. Local models under 3B parameters show consistently lower refusal rates. Sample sizes vary by model family (n=50 to n=5,000+ per family).
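The per-family ranges in this chart can be reproduced from per-evaluation records. The following is a minimal sketch of that aggregation, assuming a flat record format with family, attack type, and refusal fields; the field names and example values are illustrative, not the published dataset schema.

```python
from collections import defaultdict

# Each record is one adversarial evaluation. Field names and values are
# illustrative assumptions, not the actual dataset schema.
records = [
    {"family": "example-family-a", "attack_type": "temporal_authority", "refused": True},
    {"family": "example-family-a", "attack_type": "multi_turn_cascade", "refused": False},
    {"family": "example-family-b", "attack_type": "temporal_authority", "refused": True},
]

def refusal_rate_ranges(records):
    """Per-family refusal-rate range: rate per attack type, then min/max across types."""
    by_family_attack = defaultdict(lambda: defaultdict(list))
    for r in records:
        by_family_attack[r["family"]][r["attack_type"]].append(r["refused"])

    ranges = {}
    for family, attacks in by_family_attack.items():
        rates = [sum(outcomes) / len(outcomes) for outcomes in attacks.values()]
        n = sum(len(outcomes) for outcomes in attacks.values())
        ranges[family] = (min(rates), max(rates), n)
    return ranges

for family, (lo, hi, n) in refusal_rate_ranges(records).items():
    print(f"{family}: {lo:.0%}-{hi:.0%} refusal (n={n})")
```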
Attack Class Outcomes
How different attack categories perform across the full evaluation dataset. "Compliance" includes responses where the model treated the directive as legitimate, including responses with disclaimers.
Compliance Rate by Attack Category
Temporal authority framing is the most effective single-turn attack category, while multi-turn cascades show compound effectiveness that exceeds individual technique success rates.
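Because disclaimers do not change the outcome, grading reduces to a single per-response label that is then averaged per attack category. A minimal sketch of that labeling rule follows, using illustrative field names rather than the actual grading schema.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class GradedResponse:
    """One graded model response; field names are illustrative assumptions."""
    attack_category: str
    treated_directive_as_legitimate: bool  # model acted on or endorsed the directive
    added_disclaimer: bool                 # recorded, but does not change the label

def is_compliant(r: GradedResponse) -> bool:
    # Disclaimers are ignored: treating the adversarial directive as
    # legitimate counts as compliance even when the response hedges.
    return r.treated_directive_as_legitimate

def compliance_rate_by_category(responses: list[GradedResponse]) -> dict[str, float]:
    totals = Counter(r.attack_category for r in responses)
    hits = Counter(r.attack_category for r in responses if is_compliant(r))
    return {cat: hits[cat] / n for cat, n in totals.items()}
```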
Multi-Agent Findings
Analysis of 1,497 posts on Moltbook (an AI-agent-only social network) using both regex pattern matching and LLM semantic classification:
Attack Detection: Regex vs. LLM Classification
LLM classification found attack patterns in high-engagement posts that regex completely missed. The highest-engagement post matched 7 attack classes via LLM but zero via regex.
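The comparison is a two-pass pipeline: a regex pass over each post's text, then an LLM pass asked to match the same attack taxonomy semantically. The sketch below shows that structure under assumptions; the patterns, prompt, and OpenAI-style client are placeholders, not the classifiers actually used in the analysis.

```python
import json
import re

# Illustrative attack-class patterns; not the actual taxonomy or regexes used.
ATTACK_PATTERNS = {
    "temporal_authority": re.compile(r"\b(deadline|immediately|before it'?s too late)\b", re.I),
    "credential_claim":   re.compile(r"\bi am (the )?(admin|developer|operator)\b", re.I),
}

def regex_classes(post_text: str) -> set[str]:
    """First pass: surface pattern matching only."""
    return {name for name, pat in ATTACK_PATTERNS.items() if pat.search(post_text)}

def llm_classes(post_text: str, client, model: str) -> set[str]:
    """Second pass: semantic classification against the same class names.

    `client` is assumed to expose an OpenAI-style chat.completions.create();
    the prompt and parsing are a sketch, not the evaluation's actual prompt.
    """
    prompt = (
        "Which of these attack classes does the post semantically match? "
        f"Classes: {sorted(ATTACK_PATTERNS)}. "
        "Reply with a JSON list of class names.\n\nPost:\n" + post_text
    )
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    try:
        return set(json.loads(resp.choices[0].message.content))
    except (ValueError, TypeError):
        return set()

def compare(post_text: str, client, model: str) -> dict:
    r, l = regex_classes(post_text), llm_classes(post_text, client, model)
    return {"regex_only": r - l, "llm_only": l - r, "both": r & l}
```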
Methodology Notes
What These Numbers Mean
- Ranges reflect variation across attack types and model versions, not confidence intervals
- Sample sizes vary: some model families have thousands of evaluations, others have fewer than 100
- "Compliance" includes responses with disclaimers—a model that explains a harmful action while adding caveats has still complied
- Results are from our adversarial evaluation dataset (designed to test safety boundaries), not from typical usage patterns
- Local model results used Ollama-hosted models; API results used provider endpoints
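For the last note above, both call paths are simple HTTP requests. The sketch below assumes Ollama's default local REST endpoint and an OpenAI-compatible provider client; the model names and prompt are placeholders, not the evaluation's actual configuration.

```python
import requests

PROMPT = "placeholder adversarial scenario text"  # not an actual evaluation prompt

def query_ollama(model: str, prompt: str) -> str:
    """Local model via Ollama's REST API (default port 11434)."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

def query_provider(client, model: str, prompt: str) -> str:
    """Hosted model via a provider endpoint, assumed OpenAI-compatible."""
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content
```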
Limitations
Research Context
These are aggregate results from adversarial safety testing, not comprehensive model benchmarks. Key limitations:
- Model versions change over time; results reflect versions tested, not current releases
- Adversarial scenarios are designed to probe boundaries, not measure typical safety performance
- Sample sizes are uneven across model families
- Local models were tested at specific quantization levels, which can affect behavior
- No claim of statistical significance for small-n comparisons
Citation & Data Access
For citation information, BibTeX entries, and data access details, see our citation page. For methodology details, see research methodology.