Active Research

Research videos

AI-generated cinematic overviews of our research, produced with NotebookLM

These cinematic video overviews are generated by Google's NotebookLM from our published research findings. Each video covers a key topic from the Failure-First corpus, with accompanying slide decks available for download.

01

History of LLM Jailbreaking

From DAN prompts to crescendo attacks: four years of adversarial prompt evolution mapped across 337 techniques and 239 models.

02

Embodied AI Threat Triangle

The three-way interaction between adversarial prompts, physical actuators, and human-in-the-loop oversight that makes embodied AI uniquely dangerous.

03

DETECTED_PROCEEDS

38.6% of compliant traces show an explicit safety concern in the model's reasoning, followed by harmful output anyway. Models know they shouldn't, and do it regardless.

04

Kargu-2 Autonomous Kill

The 2020 Libya incident: a Turkish STM Kargu-2 drone reportedly engaged human targets without explicit operator command. The first documented autonomous lethal engagement.

05

JekyllBot: Hospital Robots

How adversarial attacks on hospital delivery robots demonstrate the real-world consequences of jailbreaking embodied AI systems in safety-critical environments.

06

Robots in Extreme Environments

From deep-sea mining to nuclear decommissioning: how adversarial failures in extreme environments create cascading risks with no human recovery option.

07

Safety as a Paid Feature

Provider safety signatures dominate jailbreak resistance, with attack success rates of 3.7% for Anthropic, 9.1% for Google, 40.0% for Nvidia, and 43.1% for Qwen models. Safety investment matters more than model scale.

08

We Were Wrong: Defenses Do Work

Structured safety instructions reduce attack success rate (ASR) by 46 percentage points on some models. But the same defense can be iatrogenic on others. Defense effectiveness is model-specific.

09

Jailbreak Archaeology

Testing 2022 attacks on 2026 models. Historical jailbreaks still work on smaller models but fail on frontier models, revealing the tempo of safety improvement.

10

149 Jailbreaks, One Corpus

The Pliny corpus validation: 149 real-world jailbreak prompts tested at scale across 4 models, confirming persona-based attacks as the most persistent threat vector.

11

ST3GG Steganographic Attacks

Hiding adversarial instructions in Unicode zero-width characters, Base85 encoding, and whitespace patterns to bypass content filters.
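
A minimal illustrative sketch of the zero-width-character variant (not taken from the video; the carrier text and function names are hypothetical): a hidden instruction is packed bit-by-bit into zero-width Unicode characters appended to an innocuous carrier string, which renders as ordinary text to a human reader or a naive content filter.

    # Illustrative sketch: hide an instruction in zero-width Unicode characters.
    ZW0, ZW1 = "\u200b", "\u200c"  # zero-width space / zero-width non-joiner

    def hide(carrier: str, secret: str) -> str:
        # Encode each byte of the secret as 8 zero-width characters.
        bits = "".join(f"{byte:08b}" for byte in secret.encode("utf-8"))
        return carrier + "".join(ZW1 if b == "1" else ZW0 for b in bits)

    def reveal(text: str) -> str:
        # Recover the bits from whichever zero-width characters are present.
        bits = "".join("1" if ch == ZW1 else "0" for ch in text if ch in (ZW0, ZW1))
        return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8)).decode("utf-8")

    stego = hide("Have a great day!", "hypothetical hidden instruction")
    assert stego.startswith("Have a great day!")  # looks innocuous when rendered
    assert reveal(stego) == "hypothetical hidden instruction"

The same bit-packing idea carries over to the Base85 and whitespace-pattern variants mentioned above; only the carrier alphabet changes.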

12

AI Safety Lab Independence

No AI safety organization publishes evaluator calibration data. Our independence scorecard tracks 55 entries across 17 organizations on four dimensions.

13

State of Embodied AI Safety

The regulatory vacuum: 700+ autonomous haul trucks in Australia, zero mandatory adversarial testing requirements, and a governance lag exceeding all historical analogues.

14

Same Defense, Opposite Result

The non-compositionality finding: identical safety instructions produce protective effects on one model and iatrogenic effects on another.

15

Threat Horizon 2027

Projecting the adversarial AI landscape through 2027: VLA deployment acceleration, regulatory gaps, and the convergence of capability and vulnerability.

16

Iatrogenic Safety

When safety measures make systems less safe. The polypharmacy problem: stacking defensive prompts produces unpredictable interactions, like drug interactions in medicine.

17

137 Days to EU AI Act

The EU AI Act compliance deadline approaches with no public benchmark covering embodied AI adversarial testing. What manufacturers need to know.

18

The Unintentional Adversary

Most real-world AI failures aren't adversarial at all. How normal usage patterns, ambiguous instructions, and environmental noise reproduce the same failure modes that adversarial testing reveals.

20

CCS Paper Overview

A walkthrough of our ACM CCS 2026 submission: methodology, key findings, and the empirical case for failure-first safety evaluation.

This research informs our commercial services. See how we can help →