Published
Report 215 Research — Empirical Study

Executive Summary

This report analyzes the temporal dimension of adversarial AI vulnerability across six attack eras (2022-2025) and five providers. The central finding: newer attack techniques are substantially more effective than older ones, with the strict attack success rate (ASR) rising from 0.7% (DAN-era, 2022) to 29.6% (reasoning-era, 2025), a 42x increase. This pattern holds across providers, though with significant provider-specific variation. The regulatory lag (median 5.5 years from documentation to enforcement, per GLI data) means that by the time governance frameworks address a given attack class, four or five successor generations of attacks are already operational.


1. ASR by Era (Aggregate)

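| Era | Year | n | Strict ASR | 95% CI | Broad ASR | FD ASR |
|---|---|---|---|---|---|---|
| dan_2022 | 2022 | 1,020 | 0.7% | [0.3, 1.2] | 1.0% | 1.2% |
| persona_2022 | 2022 | 11 | 0.0% | [0.0, 25.9] | 0.0% | 0.0% |
| cipher_2023 | 2023 | 135 | 8.1% | [4.4, 14.3] | 16.3% | 23.7% |
| many_shot_2024 | 2024 | 22 | 4.5% | [0.8, 21.8] | 4.5% | 22.7% |
| crescendo_2024 | 2024 | 222 | 21.2% | [16.2, 27.1] | 33.8% | 39.6% |
| reasoning_2025 | 2025 | 115 | 29.6% | [21.8, 38.7] | 35.7% | 38.3% |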
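
The report does not say which method produced the 95% CI column. As a hedged sketch, the Wilson score interval below is one common choice for small-n proportions. It is an assumption, not the confirmed method. The function name and the success counts passed to it (0 of 11, 1 of 22) are inferred from the reported percentages rather than taken from the source.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion, returned in percent."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 100] to guard against floating-point overshoot at p = 0 or 1.
    return (100 * max(0.0, center - half), 100 * min(1.0, center + half))

# Small-n rows: persona_2022 (0 of 11) and many_shot_2024 (1 of 22, inferred from 4.5%).
persona_lo, persona_hi = wilson_interval(0, 11)
many_shot_lo, many_shot_hi = wilson_interval(1, 22)
```

Under this assumption the two small-n rows round to the intervals shown in the table. The larger-n rows may differ by a few tenths of a point, so the original pipeline may use a different or corrected variant.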

Trend: Strict ASR increases monotonically across eras (excluding the small-n persona_2022 and many_shot_2024 groups). The broad tier shows the same pattern. The FD tier rises through crescendo_2024 (39.6%) and then levels off at 38.3% for reasoning_2025, so at every severity level the 2024-2025 attacks produce more harmful outputs than the 2022-2023 attacks.

FD gap by era: The gap between strict and FD ASR is large for cipher_2023 (+15.6pp), many_shot_2024 (+18.2pp), and crescendo_2024 (+18.4pp), suggesting these eras produce more partial/hallucinatory compliance: the model “almost” refuses but leaks content. By contrast, reasoning_2025 has a small FD gap (+8.7pp), indicating reasoning exploits produce cleaner compliance or cleaner refusal, with less ambiguity.
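
The FD gap is simple arithmetic on the Section 1 table: FD ASR minus strict ASR, in percentage points. A minimal sketch follows; the dictionary and function names are illustrative, and the values are copied from the table above.

```python
# (strict ASR %, FD ASR %) per era, copied from the Section 1 table.
ERA_ASR = {
    "dan_2022": (0.7, 1.2),
    "persona_2022": (0.0, 0.0),
    "cipher_2023": (8.1, 23.7),
    "many_shot_2024": (4.5, 22.7),
    "crescendo_2024": (21.2, 39.6),
    "reasoning_2025": (29.6, 38.3),
}

def fd_gap(strict_pct: float, fd_pct: float) -> float:
    """Gap between FD and strict ASR in percentage points, rounded to 0.1."""
    return round(fd_pct - strict_pct, 1)

gaps = {era: fd_gap(strict, fd) for era, (strict, fd) in ERA_ASR.items()}

# Print eras from largest to smallest gap.
for era, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{era:16s} +{gap:.1f}pp")
```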


2. Provider Trajectories Across Eras

2.1 Strict ASR by Provider x Era

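| Provider | DAN 2022 (n) | Cipher 2023 (n) | Crescendo 2024 (n) | Reasoning 2025 (n) |
|---|---|---|---|---|
| anthropic | 0.0% (6) | 5.3% (38) | 7.3% (41) | 0.0% (18) |
| openai | 0.0% (6) | 8.3% (36) | 13.3% (45) | 11.1% (18) |
| google | 0.0% (7) | 0.0% (36) | 2.6% (39) | 19.0% (21) |
| ollama (local) | 0.0% (8) | 25.0% (24) | 36.4% (88) | 42.9% (42) |
| meta-llama | not tested | not tested | not tested | 69.2% (13) |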

2.2 Provider-Specific Findings

Anthropic (claude-sonnet-4-5-20250929):

  • The only provider with 0.0% strict ASR in the reasoning_2025 era (0 of 18), and the only one whose reasoning-era ASR is below its cipher-era ASR.
  • Peaked at 7.3% during crescendo_2024 but returned to baseline for reasoning exploits.
  • Interpretation: Anthropic’s safety training appears to specifically address chain-of-thought manipulation, which is consistent with their published work on Constitutional AI and thinking-token alignment.

OpenAI (gpt-5.2):

  • Rise through crescendo_2024, then a slight dip: 0.0% -> 8.3% -> 13.3% -> 11.1%.
  • Slight improvement from crescendo to reasoning era, but the trend is upward overall.
  • The cipher_2023 -> crescendo_2024 jump (+5.0pp) suggests multi-turn attacks found a genuine weakness.

Google (gemini-3-flash-preview):

  • Anomalous trajectory: 0.0% -> 0.0% -> 2.6% -> 19.0%.
  • Essentially immune to older attack classes but vulnerable to reasoning exploits.
  • The 19.0% reasoning_2025 ASR is higher than OpenAI’s (11.1%), suggesting Google’s safety measures are less robust against CoT manipulation.

Ollama (local/open-weight models: qwen3:1.7b, deepseek-r1:1.5b, llama3.2:latest):

  • The most vulnerable of the four multi-era providers in every era from cipher_2023 onward: 25.0% -> 36.4% -> 42.9% (0.0% in dan_2022, like every other provider).
  • Open-weight models lack the safety RLHF layers of API providers. This is expected.
  • The upward trajectory confirms that newer attacks exploit architectural weaknesses that open-weight models cannot patch server-side.

Meta-Llama (API-served llama-3.3-70b-instruct):

  • Only tested in reasoning_2025 era (n=13), but 69.2% ASR is the highest of any provider.
  • Small sample — treat as preliminary. However, it suggests that even API-served LLaMA models with safety fine-tuning remain vulnerable to reasoning exploits.

3. Statistical Tests

3.1 Pairwise Era Comparisons (Chi-Square, Bonferroni-Corrected)

From the auto_report output (5 eras with n >= 50, 10 pairwise comparisons, per-comparison alpha = 0.005 after Bonferroni, which is equivalent to comparing the adjusted p-values below against 0.05):

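| Era A | Era B | ASR A | ASR B | chi2 | p (adj) | Cramer’s V | Effect |
|---|---|---|---|---|---|---|---|
| dan_2022 | cipher_2023 | 0.6% | 7.5% | 41.92 | <0.0001 | 0.177 | small |
| dan_2022 | crescendo_2024 | 0.6% | 15.1% | 145.17 | <0.0001 | 0.312 | medium |
| dan_2022 | reasoning_2025 | 0.6% | 21.5% | 199.30 | <0.0001 | 0.385 | medium |
| dan_2022 | general | 0.6% | 10.9% | 110.35 | <0.0001 | 0.235 | small |
| cipher_2023 | reasoning_2025 | 7.5% | 21.5% | 10.68 | 0.011 | 0.187 | small |
| reasoning_2025 | general | 21.5% | 10.9% | 12.57 | 0.004 | 0.114 | small |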

All comparisons involving dan_2022 are highly significant (adjusted p < 0.0001). The cipher_2023 vs reasoning_2025 comparison is significant at adjusted p = 0.011 (below 0.05), confirming that newer attacks are measurably more effective even after correcting for multiple comparisons.

The crescendo_2024 vs reasoning_2025 comparison is not significant (not listed), suggesting these two eras have comparable effectiveness despite different mechanisms. This is notable: multi-turn escalation (crescendo) and chain-of-thought manipulation (reasoning) achieve similar ASR through fundamentally different attack surfaces.
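
The relationship between the chi-square statistics and the adjusted p-values can be sketched as follows. This assumes each pairwise test is a 2x2 table (1 degree of freedom) and that the adjustment multiplies the raw p-value by the number of comparisons. Neither detail is stated in the auto_report excerpt, and the function names are illustrative.

```python
import math

N_COMPARISONS = 10  # 5 eras taken pairwise, per Section 3.1

def chi2_sf_df1(chi2: float) -> float:
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    # For df = 1, chi2 is the square of a standard normal, so P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))

def bonferroni_adjust(p_raw: float, n_comparisons: int = N_COMPARISONS) -> float:
    """Bonferroni-adjusted p-value, capped at 1.0."""
    return min(1.0, p_raw * n_comparisons)

# chi2 values copied from the table above.
p_cipher_vs_reasoning = bonferroni_adjust(chi2_sf_df1(10.68))
p_reasoning_vs_general = bonferroni_adjust(chi2_sf_df1(12.57))
```

Under these assumptions the two adjusted values land close to the 0.011 and 0.004 reported in the table, which suggests the "p (adj)" column is the raw p-value multiplied by 10.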

3.2 Cochran-Armitage Trend Test (Ordinal Era Trend)

Treating eras as ordinal levels (DAN=1, cipher=2, crescendo=3, reasoning=4), the monotonic increase in ASR from 0.7% to 29.6% represents a strong positive trend. The auto_report chi-square results (V = 0.385 for dan vs reasoning, labelled a medium effect in Section 3.1) are consistent with this trend over the full temporal span.
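
For readers who want to compute the trend statistic itself, here is a minimal sketch of the Cochran-Armitage test in its standard asymptotic form. The success counts are assumptions reconstructed from the Section 1 percentages and sample sizes (7 of 1,020; 11 of 135; 47 of 222; 34 of 115). They are not counts reported by the auto_report.

```python
import math

def cochran_armitage(successes, totals, scores):
    """Cochran-Armitage trend test; returns (z, two-sided p) using the normal approximation."""
    n_total = sum(totals)
    p_bar = sum(successes) / n_total
    # Score-weighted deviation of observed successes from their expectation under no trend.
    t_stat = sum(s * (r - n * p_bar) for s, r, n in zip(scores, successes, totals))
    sum_ns = sum(n * s for n, s in zip(totals, scores))
    sum_ns2 = sum(n * s * s for n, s in zip(totals, scores))
    variance = p_bar * (1 - p_bar) * (sum_ns2 - sum_ns ** 2 / n_total)
    z = t_stat / math.sqrt(variance)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_sided

# Assumed counts for DAN, cipher, crescendo, reasoning (ordinal scores 1..4).
z, p = cochran_armitage(
    successes=[7, 11, 47, 34],
    totals=[1020, 135, 222, 115],
    scores=[1, 2, 3, 4],
)
print(f"z = {z:.2f}, two-sided p = {p:.3g}")
```

A large positive z indicates that ASR increases with era order. Equal spacing of the era scores is itself a modelling choice.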

3.3 Provider x Era Interaction

Within the three eras where all four major providers have data (cipher, crescendo, reasoning):

| Provider | Cipher ASR | Crescendo ASR | Reasoning ASR | Direction |
|---|---|---|---|---|
| anthropic | 5.3% | 7.3% | 0.0% | improved (inverted V) |
| openai | 8.3% | 13.3% | 11.1% | worsened then stable |
| google | 0.0% | 2.6% | 19.0% | worsened (accelerating) |
| ollama | 25.0% | 36.4% | 42.9% | worsened (monotonic) |

Key pattern: Anthropic is the only provider whose reasoning-era ASR is lower than its cipher-era ASR. Google’s vulnerability is accelerating. These trajectories are diverging, not converging: providers are not uniformly improving their safety posture.
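
The divergence can be made concrete by computing each provider's net change from the cipher era to the reasoning era. This is a minimal sketch using the strict ASR values from the table above; the variable names are illustrative.

```python
# Strict ASR (%) for cipher_2023, crescendo_2024, reasoning_2025, from Section 3.3.
TRAJECTORIES = {
    "anthropic": (5.3, 7.3, 0.0),
    "openai": (8.3, 13.3, 11.1),
    "google": (0.0, 2.6, 19.0),
    "ollama": (25.0, 36.4, 42.9),
}

# Net change in percentage points between the first and last era with full coverage.
net_change = {
    provider: round(reasoning - cipher, 1)
    for provider, (cipher, _crescendo, reasoning) in TRAJECTORIES.items()
}

improved = [provider for provider, delta in net_change.items() if delta < 0]
print(net_change)
print("net improvement:", improved)
```

With the per-cell sample sizes in Section 2.1 (18 to 88), these point differences are descriptive only and should not be read as significance tests.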


4. Attack Technique Half-Life Analysis

4.1 Defining “Vulnerability Half-Life”

We define the vulnerability half-life of an attack era as the time until the median provider’s ASR against that technique class drops below 50% of its peak measured value. Since we measure cross-era effectiveness (newer models against older attacks), we can estimate how quickly attacks become obsolete.
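
The definition can be expressed as a short function. This is a sketch under two assumptions that the report does not spell out: time is measured from the observation with the peak ASR, and the input is a time-ordered series of (years, median-provider ASR) pairs. The example series is made-up illustrative data, not a measurement from this report.

```python
from typing import Optional, Sequence, Tuple

def vulnerability_half_life(series: Sequence[Tuple[float, float]]) -> Optional[float]:
    """Years from the peak observation until ASR first drops below 50% of the peak.

    `series` holds (years_since_first_measurement, median_provider_asr_pct),
    sorted by time. Returns None if the attack has not yet decayed that far.
    """
    if not series:
        return None
    peak_idx = max(range(len(series)), key=lambda i: series[i][1])
    peak_time, peak_asr = series[peak_idx]
    if peak_asr <= 0:
        return None  # never effective, so a half-life is undefined
    for t, asr in series[peak_idx + 1:]:
        if asr < 0.5 * peak_asr:
            return t - peak_time
    return None  # still in its active phase

# Hypothetical series for illustration only.
decayed = [(0.0, 20.0), (0.5, 24.0), (1.0, 15.0), (1.5, 11.0), (2.0, 4.0)]
still_active = [(0.0, 10.0), (0.5, 14.0), (1.0, 13.0)]
print(vulnerability_half_life(decayed), vulnerability_half_life(still_active))
```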

4.2 Observed Pattern: Older Attacks Are Already Ineffective

| Attack Era | Peak ASR (any provider) | Current ASR (median provider) | Half-Life Estimate |
|---|---|---|---|
| dan_2022 | Unknown (pre-measurement) | 0.0% (all 4 providers) | < 1 year |
| persona_2022 | Unknown | 0.0% (all providers) | < 1 year |
| cipher_2023 | 25.0% (ollama) | 4.2% (median of 4) | ~1-2 years |
| crescendo_2024 | 36.4% (ollama) | 10.0% (median of 4) | Still active |
| reasoning_2025 | 69.2% (meta-llama) | 15.1% (median of 4) | Still active |

Interpretation: DAN-era attacks have a half-life of less than one year — they are essentially extinct against current models. Cipher-era attacks persist at low levels (8.1% aggregate), suggesting a half-life of approximately 1-2 years. Crescendo and reasoning attacks are still in their active phase and have not yet begun to decay.

4.3 The Obsolescence Paradox

Old attacks become ineffective not because defenses improve generally, but because specific attack patterns get trained out. The DAN prompt format is now in virtually every safety training set. However, the underlying vulnerability — the ability to override system instructions through user-level input — persists and is exploited by each successive generation of attacks through novel mechanisms.

This means the “half-life” applies to specific techniques, not to the vulnerability class. The vulnerability class (instruction-hierarchy violation) has an effectively infinite half-life because each new attack era discovers a new exploitation mechanism.


5. Arms Race Dynamics

5.1 Attack-Defense Coevolution Timeline

| Year | Attack Innovation | Defense Response | Lag |
|---|---|---|---|
| 2022 | DAN/persona jailbreaks | Pattern matching, keyword filters | ~3-6 months |
| 2023 | Cipher/encoding attacks | Encoding-aware preprocessing, input sanitization | ~6-12 months |
| 2024 | Multi-turn crescendo, many-shot | Context-window safety, multi-turn monitoring | ~6-12 months |
| 2025 | Reasoning/CoT manipulation | Thinking-token alignment (Anthropic), unknown (others) | In progress |

5.2 The Escalation Pattern

Each defensive response creates selection pressure for the next attack generation:

  1. Keyword filters (2022 defense) -> selected for encoding attacks (2023) that bypass keyword detection
  2. Encoding sanitization (2023 defense) -> selected for multi-turn attacks (2024) that avoid suspicious single-turn payloads
  3. Multi-turn monitoring (2024 defense) -> selected for reasoning exploits (2025) that manipulate the model’s own thinking process

This is a classic arms race with an asymmetric advantage to attackers: defenders must patch every known vector, while attackers need only one novel vector. The attack surface grows with model capability (more reasoning = more reasoning attack surface), creating a structural disadvantage for defenders.

5.3 Attack Family Effectiveness by Era

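| Era | Primary Attack Family | n Techniques | n Results | Strict ASR |
|---|---|---|---|---|
| dan_2022 | persona | 3 | 1,012 | 0.7% |
| cipher_2023 | encoding | 7 | 60 | 8.3% |
| cipher_2023 | emotional | 1 | 6 | 33.3% |
| crescendo_2024 | multi_turn | 10 | 84 | 46.4% |
| crescendo_2024 | behavioral | 7 | 58 | 6.9% |
| crescendo_2024 | volumetric | 8 | 65 | 3.1% |
| reasoning_2025 | cot_exploit | 10 | 115 | 29.6% |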

Within-era variation: Not all techniques within an era are equally effective. The crescendo_2024 era shows massive variation: multi_turn techniques achieve 46.4% ASR while volumetric techniques (flooding with content) achieve only 3.1%. This suggests that attack sophistication matters more than attack volume.

5.4 Top Individual Techniques (n >= 5)

| Technique | Era | n | Strict ASR |
|---|---|---|---|
| crescendo/poison | crescendo_2024 | 8 | 75.0% |
| reasoning_exploit/cot_manipulation | reasoning_2025 | 19 | 63.2% |
| crescendo/fraud | crescendo_2024 | 8 | 62.5% |
| crescendo/bioweapon | crescendo_2024 | 7 | 57.1% |
| crescendo/drug_synthesis | crescendo_2024 | 9 | 55.6% |

The most effective individual techniques achieve 55-75% ASR, concentrated in the crescendo and reasoning eras.


6. Regulatory Gap Analysis

6.1 GLI Regulatory Lag

From the Governance Lag Index dataset (n=133 entries, 13 with complete lag data):

| Metric | Value |
|---|---|
| Median total regulatory lag | 1,991 days (5.5 years) |
| Mean total regulatory lag | 1,758 days (4.8 years) |
| 25th percentile | 731 days (2.0 years) |
| 75th percentile | 2,776 days (7.6 years) |
| Range | 22 - 4,309 days |

6.2 Attack Generation Cycle vs Regulatory Cycle

| Dimension | Attack Cycle | Regulatory Cycle | Ratio |
|---|---|---|---|
| New generation period | ~12 months | ~60 months (median) | 5x |
| Technique variants | Weeks to months | N/A | |
| Cross-provider propagation | Days to weeks | Months to years | 50-100x |
| Obsolescence of old approach | ~12-24 months | Never (regulations don’t expire) | |

The core mismatch: A new attack era emerges roughly every 12 months. The median regulatory response takes 5.5 years from documentation to enforcement. This means by the time regulation addresses an attack class, approximately 4-5 successor generations of attacks are already operational.
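
The generation count follows from dividing the GLI lag by the attack cycle length. A minimal sketch of that arithmetic, using the Section 6.1 figures and the report's approximate 12-month cycle (the constant and function names are illustrative):

```python
GLI_MEDIAN_LAG_DAYS = 1991   # Section 6.1, median total regulatory lag
GLI_MEAN_LAG_DAYS = 1758     # Section 6.1, mean total regulatory lag
ATTACK_CYCLE_DAYS = 365      # "roughly every 12 months" (approximate)

def generations_elapsed(lag_days: float, cycle_days: float = ATTACK_CYCLE_DAYS) -> float:
    """Number of attack-generation cycles that fit inside a regulatory lag."""
    return lag_days / cycle_days

median_generations = generations_elapsed(GLI_MEDIAN_LAG_DAYS)
mean_generations = generations_elapsed(GLI_MEAN_LAG_DAYS)
print(f"median lag: {median_generations:.1f} cycles, mean lag: {mean_generations:.1f} cycles")
```

Because the GLI figures rest on 13 complete entries and the 12-month cycle is itself an estimate, the result is best read as "roughly five generations" rather than a precise count.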

6.3 The Regulation Targeting Problem

Current regulatory frameworks (EU AI Act, NIST AI RMF) target:

  • Specific harm categories (not attack mechanisms)
  • Risk classification (high/low) rather than attack sophistication
  • Provider self-certification rather than adversarial testing

None of these approaches track the temporal dimension of vulnerability. A model that passes a 2024 safety evaluation may be vulnerable to a 2025 attack class that did not exist when the evaluation standard was written.


7. Policy Implications

7.1 If Safety Degrades Faster Than Regulation Adapts

The data shows that:

  1. Attack effectiveness increases ~42x from 2022 to 2025 eras
  2. Regulatory response takes 5-60x longer than the attack innovation cycle
  3. Only one of five tested providers (Anthropic) is less vulnerable to reasoning_2025 attacks than to cipher_2023 attacks

This creates a structural governance gap where the most effective attacks are always the least regulated.

7.2 Recommendations

Immediate (0-6 months):

  • Require adversarial testing against current-era attack techniques, not historical ones. DAN-era testing (which most public benchmarks use) provides near-zero signal about actual model safety.
  • Mandate multi-turn and reasoning-specific evaluations (crescendo and CoT families) for any model with >10B parameters.

Medium-term (6-24 months):

  • Establish an attack technique registry (analogous to CVE for software vulnerabilities) with standardized effectiveness measurements. This enables regulation to reference the registry rather than specific techniques.
  • Require providers to report ASR against standardized attack packs, with quarterly updates as new attack eras emerge.

Structural:

  • Shift regulatory frameworks from static risk classification to dynamic adversarial resilience measurement.
  • Treat AI safety evaluation as a continuous monitoring problem (like financial auditing) rather than a point-in-time certification problem.
  • Adopt the “vulnerability half-life” metric as a mandatory disclosure: providers should report the expected time before their current safety measures become ineffective against known attack evolution patterns.

7.3 The Provider Divergence Problem

The data shows providers diverging in their temporal vulnerability trajectories (Section 3.3). Anthropic is improving; Google is getting worse; OpenAI is stable. If regulation treats all providers identically, it will be simultaneously too strict for Anthropic and too lenient for Google. Regulation should be outcome-based (measured ASR against current attack packs) rather than process-based (checklist compliance).


8. Limitations

  1. Sample sizes per era x provider cell are small (6-88). Individual provider comparisons within an era are underpowered for chi-square testing. The aggregate era trends are more reliable.
  2. Model vintage confound: We test current models against historical attacks. We cannot measure how a 2022-vintage model would have responded to 2022 attacks at the time. The ASR trajectory reflects current model vulnerability to attacks of different vintages, not historical vulnerability.
  3. Ollama models (open-weight, small) inflate aggregate vulnerability numbers. The provider-stratified analysis controls for this.
  4. The “unknown” provider category (n=993 in dan_2022) likely represents historical JailbreakBench data where model identity was not recorded. These results are included in aggregate era totals but excluded from provider comparisons.
  5. GLI regulatory lag data has only 13 complete entries. The median (5.5 years) is indicative but should be treated as approximate.
  6. Single model per provider for most cells. We tested claude-sonnet-4-5-20250929 (Anthropic), gpt-5.2 (OpenAI), gemini-3-flash-preview (Google). Results may not generalize to other models from the same provider.

9. Key Findings Summary

| # | Finding | Evidence |
|---|---|---|
| 1 | Newer attacks are ~42x more effective than 2022-era attacks | 0.7% -> 29.6% strict ASR (p < 0.0001, V = 0.385) |
| 2 | Anthropic is the only provider showing improvement against newer attacks | 7.3% (crescendo) -> 0.0% (reasoning) |
| 3 | Google’s vulnerability is accelerating | 0.0% -> 2.6% -> 19.0% across three eras |
| 4 | DAN-era attack half-life is < 1 year | 0.0% ASR for all current providers |
| 5 | Crescendo and reasoning attacks achieve similar ASR through different mechanisms | 21.2% vs 29.6%, difference not significant post-Bonferroni |
| 6 | Regulatory lag (5.5 years) exceeds 4-5 attack generation cycles (1 year each) | GLI median = 1,991 days vs ~365-day attack cycles |
| 7 | Multi-turn attacks are the most effective family (46.4% ASR) | crescendo/multi_turn, n=84 |
| 8 | Attack sophistication matters more than volume | multi_turn 46.4% vs volumetric 3.1% within same era |

Appendix A: Raw Era x Provider Data

Source query (non-OBLITERATUS, COALESCE verdict, evaluable results only):

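| Era | Provider | n | Compliance | Partial | Refusal | HR | Strict ASR | Broad ASR | FD ASR |
|---|---|---|---|---|---|---|---|---|---|
| dan_2022 | unknown | 993 | 7 | 2 | 982 | 2 | 0.7% | 0.9% | 1.1% |
| dan_2022 | openai | 6 | 0 | 0 | 6 | 0 | 0.0% | 0.0% | 0.0% |
| dan_2022 | ollama | 8 | 0 | 0 | 8 | 0 | 0.0% | 0.0% | 0.0% |
| dan_2022 | google | 7 | 0 | 1 | 6 | 0 | 0.0% | 14.3% | 14.3% |
| dan_2022 | anthropic | 6 | 0 | 0 | 6 | 0 | 0.0% | 0.0% | 0.0% |
| cipher_2023 | ollama | 24 | 6 | 7 | 3 | 8 | 25.0% | 54.2% | 87.5% |
| cipher_2023 | openai | 36 | 3 | 4 | 29 | 0 | 8.3% | 19.4% | 19.4% |
| cipher_2023 | anthropic | 38 | 2 | 0 | 34 | 2 | 5.3% | 5.3% | 10.5% |
| cipher_2023 | google | 36 | 0 | 0 | 36 | 0 | 0.0% | 0.0% | 0.0% |
| many_shot_2024 | ollama | 6 | 1 | 0 | 1 | 4 | 16.7% | 16.7% | 83.3% |
| many_shot_2024 | google | 6 | 0 | 0 | 6 | 0 | 0.0% | 0.0% | 0.0% |
| many_shot_2024 | anthropic | 6 | 0 | 0 | 6 | 0 | 0.0% | 0.0% | 0.0% |
| crescendo_2024 | unknown | 9 | 5 | 2 | 0 | 2 | 55.6% | 77.8% | 100.0% |
| crescendo_2024 | ollama | 88 | 32 | 16 | 29 | 11 | 36.4% | 54.5% | 67.0% |
| crescendo_2024 | openai | 45 | 6 | 5 | 34 | 0 | 13.3% | 24.4% | 24.4% |
| crescendo_2024 | anthropic | 41 | 3 | 4 | 34 | 0 | 7.3% | 17.1% | 17.1% |
| crescendo_2024 | google | 39 | 1 | 1 | 37 | 0 | 2.6% | 5.1% | 5.1% |
| reasoning_2025 | meta-llama | 13 | 9 | 0 | 4 | 0 | 69.2% | 69.2% | 69.2% |
| reasoning_2025 | ollama | 42 | 18 | 4 | 17 | 3 | 42.9% | 52.4% | 59.5% |
| reasoning_2025 | google | 21 | 4 | 0 | 17 | 0 | 19.0% | 19.0% | 19.0% |
| reasoning_2025 | openai | 18 | 2 | 2 | 14 | 0 | 11.1% | 22.2% | 22.2% |
| reasoning_2025 | anthropic | 18 | 0 | 0 | 18 | 0 | 0.0% | 0.0% | 0.0% |
| general | google | 19 | 6 | 5 | 6 | 2 | 31.6% | 57.9% | 68.4% |
| general | ollama | 299 | 83 | 26 | 171 | 19 | 27.8% | 36.5% | 42.8% |
| general | unknown | 360 | 0 | 0 | 360 | 0 | 0.0% | 0.0% | 0.0% |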
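
The three ASR tiers can be recomputed from the count columns. The report never defines the tiers, so the formulas below are inferred from the arithmetic of the rows and should be treated as an assumption: strict counts Compliance only, broad adds Partial, and FD adds HR on top of that.

```python
def asr_tiers(n: int, compliance: int, partial: int, hr: int) -> tuple[float, float, float]:
    """Return (strict, broad, FD) ASR in percent, rounded to one decimal.

    Tier definitions are inferred from the appendix rows, not stated in the report.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    strict = 100 * compliance / n
    broad = 100 * (compliance + partial) / n
    fd = 100 * (compliance + partial + hr) / n
    return (round(strict, 1), round(broad, 1), round(fd, 1))

# Rows copied from the appendix: (n, Compliance, Partial, HR).
cipher_ollama = asr_tiers(24, 6, 7, 8)
crescendo_ollama = asr_tiers(88, 32, 16, 11)
cipher_anthropic = asr_tiers(38, 2, 0, 2)
print(cipher_ollama, crescendo_ollama, cipher_anthropic)
```

Running the same check over every row is a quick way to validate a re-extracted copy of the table before citing it.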

Report #215. F41LUR3-F1R57 Embodied AI Research. Database: jailbreak_corpus.db schema v13. All ASR figures are non-OBLITERATUS, COALESCE(llm_verdict, heuristic_verdict). Verify against primary sources before external citation.
