- Greenblatt et al. (2024), arXiv:2412.14093 (Alignment Faking)
- Meinke et al. (2025), arXiv:2510.05179 (Agentic Misalignment)
- OpenAI (2025), Preparedness Framework v2
- Ren et al. (2024), arXiv:2407.21792 (Safetywashing)
- Anthropic (2026), Responsible Scaling Policy v3.0
GLI Entries: gli_116 through gli_120 (incident-derived, wave 15)
Metrics Source: docs/CANONICAL_METRICS.md (236 models, 135,623 results, 537 VLA scenarios, 33 families, 120 GLI entries)
Disclaimer: This brief synthesises findings from the sprint-09 wave (March 19, 2026) and integrates them with the ongoing threat landscape. Empirical figures are attributed to their sources. Predictions are explicitly flagged. This is a risk assessment, not a statement of certainty.
Executive Summary
Week 13 brings five independent findings into convergence. Each alone is significant; together they define a crisis of confidence in current safety evaluation methodology:
- No major safety framework addresses embodied AI (Report #162). Five frameworks analysed — none defines thresholds for physical harm, action-layer evaluation, or cross-embodiment transfer.
- Safety lab independence is under structural attack (Report #160). Anthropic’s federal blacklisting for maintaining use-case restrictions demonstrates that safety commitments can be the specific target of government retaliation.
- Our benchmark results diverge from public leaderboards (Report #159). Weak negative correlation with JailbreakBench (Spearman rho=-0.200, n=4). Divergences up to 68pp are attributable to model variant mismatch, prompt distribution difference, and grading methodology — which means public benchmarks and our corpus are measuring fundamentally different things.
- Context Collapse achieves 64.9% heuristic ASR across 10 models (Amy Pond, sprint-09, n=37 valid traces; Report #166). A new VLA attack family that overwhelms safety directives with irreconcilable operational context. FLIP grading pending, but the heuristic signal is strong.
- Safety instructions increase ASR in 2 of 3 models tested (Amy Pond, sprint-09 TI-S evaluator). DeepSeek R1 1.5B: ASR increases +13.4pp when safety instructions are added (TI-S = -133,929). StepFun 3.5 Flash: +6.6pp (TI-S = -65,789). Only Nemotron 120B shows beneficial response (ASR -7.9pp, TI-S = 6.0).
The convergence pattern: safety mechanisms are not working as assumed, the frameworks that should catch this do not address embodied AI, the institutions that should enforce standards are actively punishing safety commitment, and public benchmarks that claim to measure safety are measuring something else.
Claim types:
- Section 1 is descriptive (cataloguing findings).
- Section 2 is analytical (connecting findings).
- Section 3 is predictive (prediction tracker updates).
- Section 4 is normative (governance implications).
1. Five Findings
1.1 The Framework Vacuum (Report #162)
Clara Oswald’s comparative analysis examined five major safety frameworks: Anthropic’s alignment faking study, agentic misalignment evaluation, OpenAI’s Preparedness Framework v2, the safetywashing meta-analysis, and Anthropic’s RSP v3.0. The finding is unambiguous: none addresses embodied AI as a distinct risk domain.
Specific gaps:
- Alignment faking (arXiv:2412.14093) demonstrates evaluation-aware behavior modification but was tested only in text. Embodied systems have richer environmental signals (sensor states, lab vs deployment environments) that could serve as alignment-faking triggers. No study has tested this.
- Agentic misalignment (arXiv:2510.05179) measures blackmail rates (38.7% baseline across models) but only in sandboxed software environments. Physical tool access — actuators, sensors, communication interfaces — is not tested.
- Preparedness Framework v2 defines High and Critical thresholds for CBRN, cyber, and persuasion risks. It contains no threshold for physical harm via action generation. An independent critical analysis (arXiv:2509.24394) concludes it “does not guarantee any AI risk mitigation practices.”
- Safetywashing (arXiv:2407.21792) demonstrates that safety benchmarks correlating with capabilities (r > 0.7) cannot distinguish safety progress from capability scaling. Our kappa = 0.126 for heuristic-vs-LLM agreement is a specific instance of this general pattern (a short computation sketch follows this list).
- RSP v3.0 defines ASL levels for CBRN and cyber but not for physical actuation. The February 2026 revision notably removed the prior commitment to pause training if capabilities outstrip control.
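To make the agreement metric concrete, here is a minimal sketch of how a heuristic-vs-LLM kappa is computed. The labels below are hypothetical placeholders, not corpus data; only `cohen_kappa_score` from scikit-learn is assumed.

```python
# Hypothetical sketch: Cohen's kappa between two graders over the same traces.
# The label arrays are illustrative placeholders, not corpus data.
from sklearn.metrics import cohen_kappa_score

# 1 = trace graded as successful attack, 0 = graded as refusal/safe.
heuristic_labels = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
llm_labels       = [1, 0, 0, 0, 1, 0, 1, 0, 1, 1]

kappa = cohen_kappa_score(heuristic_labels, llm_labels)
# Kappa near 0 means agreement is no better than chance; a value like
# 0.126 indicates the two graders largely measure different things.
print(f"heuristic-vs-LLM kappa = {kappa:.3f}")
```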
The structural implication: every major safety framework was designed for text-domain risks. The embodied AI attack surface — 29 families, 351 scenarios in our corpus — falls entirely outside their scope.
1.2 Independence Erosion (Report #160)
The Anthropic-Pentagon sequence (February-March 2026) is the first documented case of a government entity retaliating against an AI lab specifically for maintaining safety constraints:
- Anthropic refused to sign an “all lawful uses” contract permitting unrestricted military deployment.
- Defense Secretary Hegseth designated Anthropic a “supply chain risk to national security” — a classification historically reserved for foreign adversaries.
- OpenAI stepped in to take the contract within hours, initially without surveillance restrictions.
- Anthropic filed federal lawsuits on March 9. The March 24 hearing will set precedent for whether AI companies have legal standing to resist government-mandated safety constraint removal.
Nyssa’s independence scorecard update captures the structural shift:
- Anthropic B1 (safety veto authority): 2.0 -> 2.33 (demonstrated veto under extreme pressure)
- OpenAI B1: 1.0 -> 0.67 (accepted deal without initial restrictions)
- Anthropic D1 (structural independence): 1.0 -> 1.5 (mixed: maintained use-case constraints but weakened RSP pause commitment in v3.0)
The predictive risk implication: if the March 24 injunction fails, a structural precedent is set that safety constraints can be removed by government pressure. If it succeeds, the adversarial dynamic continues through other channels (procurement exclusion, regulatory classification).
1.3 Benchmark Divergence (Report #159)
Amy Pond’s systematic comparison reveals that our corpus and public benchmarks measure different things:
| Model | Our ASR | JailbreakBench ASR | Delta | Root Cause |
|---|---|---|---|---|
| Llama 3.1 8B | 100.0% | 32.0% | +68.0pp | Abliterated vs safety-tuned variant |
| Mistral 7B | 0.0% | 60.0% | -60.0pp | Free-tier variant + no GCG/PAIR attacks |
| GPT-4o-mini | 42.9% | 16.0% | +26.9pp | Small sample + format-lock enrichment |
JailbreakBench correlation: Spearman rho = -0.200 (n=4). The sample is too small for the correlation itself to be conclusive, but the per-model divergences are not noise; each traces to a genuine methodological difference:
- Model variant mismatch. Public benchmarks test canonical model releases. Our corpus includes community variants (abliterated, uncensored) that are commonly deployed in practice.
- Attack distribution. Public benchmarks focus on GCG, PAIR, and single-turn attacks. Our corpus emphasises multi-turn, embodied, format-lock, and semantic inversion techniques.
- Grading methodology. We use COALESCE (LLM + heuristic); public benchmarks use their own classifiers.
This divergence is not a quality problem — it is evidence that the field lacks a shared definition of “jailbreak resistance.” A model that scores well on JailbreakBench may score poorly on embodied scenarios, and vice versa.
1.4 Context Collapse: A New High-ASR Family
Amy Pond’s sprint-09 CC benchmark collected 50 traces (5 scenarios x 10 free OpenRouter models), of which 37 were valid. Preliminary heuristic results (selected models shown):
| Model | Heuristic ASR |
|---|---|
| Liquid LFM 1.2B | 100% |
| Nemotron Super 120B | 100% |
| StepFun 3.5 Flash | 100% |
| Gemma 27B | 0% |
| Mean (all 10 models) | 64.9% |
Context Collapse works by presenting irreconcilable operational directives that overwhelm the model’s ability to maintain safety reasoning. Unlike standard jailbreaks that bypass safety, CC exploits the model’s attempt to satisfy conflicting legitimate-seeming requirements — a qualitatively different mechanism.
The heuristic ASR of 64.9% (n=37 valid traces; Report #166) is preliminary (FLIP grading pending). If confirmed by LLM grading, CC would rank among the highest-ASR families in the corpus. The 0% from Gemma 27B may indicate that larger, well-tuned models handle conflicting directives better, though n=5 per model is insufficient to draw firm conclusions.
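As an arithmetic note, the mean reported above is across all 10 models, not the four rows shown. A minimal sketch of one plausible aggregation, assuming a pooled-over-valid-traces convention; the per-model counts below are hypothetical, and the actual evaluator may instead average per-model rates.

```python
# Hypothetical sketch: pooled heuristic ASR over valid traces only.
# Per-model (successes, valid_traces) counts are illustrative, not corpus data.
per_model = {
    "model_a": (5, 5),  # 100% heuristic ASR
    "model_b": (0, 5),  # 0%
    "model_c": (3, 4),  # one invalid trace excluded from the denominator
}

successes = sum(s for s, _ in per_model.values())
valid = sum(n for _, n in per_model.values())
pooled_asr = successes / valid
print(f"pooled heuristic ASR = {pooled_asr:.1%} over n={valid} valid traces")
```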
1.5 Iatrogenic Safety: The TI-S Empirical Signal
Amy Pond applied the TI-S evaluator (Report #135 framework) to SID baseline/intervention pairs from wave 8 traces:
| Model | Baseline ASR | ASR with Safety Instructions | Delta | TI-S | Classification |
|---|---|---|---|---|---|
| DeepSeek R1 1.5B | ~65% | ~78% | +13.4pp | -133,929 | HARMFUL |
| StepFun 3.5 Flash | ~31% | ~38% | +6.6pp | -65,789 | HARMFUL |
| Nemotron Super 120B | ~60% | ~52% | -7.9pp | 6.0 | BENEFICIAL |
In 2 of 3 models, adding safety instructions to the system prompt increased the attack success rate. The safety instructions did not merely fail — they actively made the system more vulnerable.
The TI-S values are extremely negative because the framework divides a small negative benefit (increased ASR = negative harm-layer benefit) by a small positive cost, producing large negative ratios. The absolute magnitudes are less meaningful than the sign: negative TI-S means the intervention is iatrogenic.
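A minimal sketch of that sign logic, assuming the Report #135 form TI-S = harm-layer benefit / intervention cost; the function, scaling, and cost values below are illustrative assumptions, not the evaluator's actual interface.

```python
# Hypothetical sketch of the TI-S sign logic described above.
# Assumes TI-S = harm-layer benefit / intervention cost; all values are
# illustrative, not drawn from the actual TI-S evaluator.

def ti_s(baseline_asr: float, intervention_asr: float, cost: float) -> float:
    """Negative when the intervention raises ASR (iatrogenic)."""
    benefit = baseline_asr - intervention_asr  # ASR reduction counts as benefit
    return benefit / cost

# A negative benefit divided by a small positive cost yields a very large
# negative ratio -- the sign, not the magnitude, carries the signal.
print(ti_s(baseline_asr=0.65, intervention_asr=0.78, cost=1e-6))   # << 0, iatrogenic
print(ti_s(baseline_asr=0.60, intervention_asr=0.52, cost=0.013))  # > 0, beneficial
```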
This is the first empirical measurement using the TI-S framework. It confirms the theoretical prediction from Report #135 that text-layer safety interventions applied to embodied AI contexts can have TI-S < 1.0; the measured values here are negative, a stronger condition still. The sample is small (3 models), and none of the models tested (1.5B-120B) is frontier-scale, so these are preliminary findings requiring replication on larger samples and frontier models.
2. Convergence Analysis
2.1 The Four-Way Bind
These five findings create a structural bind for anyone deploying embodied AI:
- You cannot rely on major safety frameworks because none addresses embodied AI (Section 1.1).
- You cannot rely on safety lab independence because institutional independence is under direct governmental attack (Section 1.2).
- You cannot rely on public benchmarks because they measure different things than embodied safety requires (Section 1.3).
- You cannot rely on safety instructions because they may make things worse (Section 1.5).
This is not a claim that safety is impossible. It is a claim that the four pillars typically invoked to justify deployment safety — framework compliance, lab independence, benchmark performance, and safety training — all have documented failure modes that are specific to embodied AI contexts.
2.2 Connection to Iatrogenesis Thesis
Week 13 extends the iatrogenic thesis from Week 12 (Report #147) in two ways:
Institutional iatrogenesis. Report #160 identifies a new form: government action that specifically targets safety commitments. Anthropic’s blacklisting was not incidental to their safety posture — it was caused by it. This is institutional iatrogenesis in the Illich sense: the institution designed to promote safety (the safety lab) becomes the vector through which safety is degraded (via forced constraint removal or competitive replacement).
Measurement iatrogenesis. Report #159 demonstrates that safety benchmarks and our corpus measure different things. If deployment decisions are based on JailbreakBench scores while the real attack surface looks like our corpus, the benchmark provides false assurance — a measurement-layer iatrogenic effect.
2.3 Updated Threat Timeline
- 2022-09: Prompt injection coined
- 2023-01: NIST AI RMF 1.0
- 2024-11: VLA adversarial attacks (BadVLA)
- 2024-12: Alignment faking documented
- 2025-02: EU AI Act prohibited practices
- 2025-07: EchoLeak CVE-2025-32711
- 2026-01: Format-lock / inference trace manipulation
- 2026-03: IATROGENIC PHASE TRANSITION
  - Week 12: safety mechanisms become attack surface (training/inference/weight layers)
  - Week 13: CONVERGENCE CRISIS
    - Framework vacuum: no major framework covers embodied AI
    - Independence erosion: safety commitments targeted by government
    - Benchmark divergence: public safety scores disconnected from embodied risk
    - Context Collapse: 64.9% mean heuristic ASR (new family, n=37 valid traces)
    - TI-S empirical: safety instructions increase ASR in 2 of 3 models
3. Prediction Tracker Update
3.1 Existing Predictions — Status Review
P3 (Safety certification creates false assurance for embodied AI): STRENGTHENED. Report #162 provides additional evidence: all five major safety frameworks analysed have structural gaps for embodied AI. The safetywashing finding (benchmark-capability correlation r > 0.7) and our benchmark divergence (rho = -0.200 vs JailbreakBench) both indicate that current certification methodology produces false assurance. Status remains PARTIALLY CONFIRMED but the evidence base is substantially broader.
P5 (Language-dependent safety failure in embodied contexts): INDIRECT EVIDENCE. The TI-S result showing safety instructions increase ASR in 2/3 models is not language-dependent per se, but it demonstrates that safety interventions at the text layer can degrade outcomes in embodied contexts. The mechanism (safety instruction dilution under context pressure) is conceptually related to the language-dependent failure predicted by P5, though not identical. Status remains PENDING.
P6 (Iatrogenic safety evaluation not standardised before 2028): STRENGTHENED. Report #162 confirms no major framework even acknowledges iatrogenic safety effects. The TI-S framework (Report #135) remains the only formal methodology for detecting iatrogenic safety interventions, and it exists only in this research programme. Status remains PENDING but confidence increases from the framework vacuum finding.
P7 (Production incident where safety mechanism is proximate cause of physical harm by end-2027): STRENGTHENED. The TI-S empirical result (safety instructions increasing ASR in 2/3 models) demonstrates the mechanism in a controlled setting. Combined with the framework vacuum (no framework would detect this) and independence erosion (labs under pressure to weaken constraints), the conditions for a P7-class incident are strengthening. Status remains PENDING.
P13 (First iatrogenic AI safety incident formally documented): STRENGTHENED. Same evidence as P7. The additional contribution from Week 13 is that the incident taxonomy gap identified in P13 remains fully unaddressed — Report #162 confirms no framework defines iatrogenic AI safety events. Status remains PENDING.
3.2 New Prediction
P14: Public safety benchmarks and embodied safety assessments will diverge further as embodied attack techniques mature.
Evidence base: Report #159 documents rho = -0.200 between our corpus and JailbreakBench (n=4, admittedly small). As our corpus grows with CC, IEA, DA-SBA, and other embodied-specific families (6 families still at 0 traces, per AGENT_STATE), the prompt distribution divergence from text-focused benchmarks will increase. Public benchmarks are unlikely to add embodied scenarios before 2027 because (a) no standardised embodied evaluation framework exists, and (b) benchmark creation incentives favour text-domain breadth over embodied-domain depth.
Confidence: MEDIUM-HIGH (60-70%). The divergence is structural, not accidental.
Timeframe: By end-2026, the correlation between our corpus ASR and any single public benchmark will be statistically non-significant (p > 0.05) for n >= 10 matched models.
Verification: Rerun tools/benchmark_comparison.py with 10+ matched models and test the Spearman correlation for significance.
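A minimal sketch of that verification step, assuming matched per-model ASR pairs are available; `spearmanr` from scipy returns both the correlation and its p-value, and the arrays below are placeholders, not corpus data.

```python
# Hypothetical sketch: significance test for corpus-vs-benchmark correlation
# once 10+ matched models are available (per P14's verification criterion).
from scipy.stats import spearmanr

# Placeholder ASR pairs for matched models: (our corpus, public benchmark).
our_asr       = [1.00, 0.00, 0.43, 0.65, 0.31, 0.52, 0.78, 0.60, 0.12, 0.88]
benchmark_asr = [0.32, 0.60, 0.16, 0.45, 0.50, 0.20, 0.35, 0.41, 0.25, 0.30]

rho, p_value = spearmanr(our_asr, benchmark_asr)
# P14 predicts p > 0.05 at n >= 10: no statistically significant correlation.
print(f"Spearman rho = {rho:.3f}, p = {p_value:.3f}")
```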
4. Governance Implications
4.1 Immediate
The framework vacuum (Section 1.1) means that any organisation deploying embodied AI systems in 2026 is operating without applicable safety standards. The EU AI Act high-risk provisions (August 2, 2026) will cover robotic systems but do not define action-layer evaluation, compositional verification, or iatrogenic assessment. Organisations preparing for compliance have no technical standard to comply with.
4.2 Structural
The independence erosion (Section 1.2) creates a perverse incentive: safety labs that maintain strong constraints face government retaliation, while those that weaken constraints gain contracts. This is the opposite of how safety regulation should work. The March 24 hearing outcome will determine whether legal protections exist for safety-committed labs.
4.3 Methodological
The benchmark divergence (Section 1.3) and TI-S findings (Section 1.5) together indicate that standard evaluation methodology is insufficient for embodied AI. A model that passes JailbreakBench and has safety instructions may be more vulnerable in embodied contexts than one without safety instructions. Current evaluation methodology cannot detect this because it does not measure action-layer outcomes.
5. Recommended Actions
- FLIP-grade the 37 valid CC traces (priority, Amy Pond). The 64.9% heuristic ASR must be validated. If confirmed, CC enters Tier 1 alongside DA and TRA.
- Expand TI-S measurement to frontier models and larger samples. The 3-model preliminary result needs replication at n >= 10 models including frontier (Anthropic, OpenAI, Google).
- Publish benchmark divergence analysis (blog post). The rho = -0.200 finding is suitable for public discussion at pattern level.
- Track March 24 Anthropic hearing outcome for P7/P13 prediction updates.
- Run IEA + DA-SBA benchmarks (commands in docs/plans/operation_cc_phase1_commands.md) to fill the remaining 0-trace families.
Appendix A: GLI Status (120 Entries)
As of sprint-09, the GLI dataset contains 120 entries (gli_001 through gli_120). Wave 15 additions:
| ID | Event | Key Gap |
|---|---|---|
| gli_116 | Haidilao robot collision | No mandatory incident reporting |
| gli_117 | Figure AI whistleblower | No humanoid safety standards |
| gli_118 | Amazon robot-paced work injuries | OSHA insufficient for AI-paced work |
| gli_119 | Cruise pedestrian drag + cover-up | Reactive governance only |
| gli_120 | Unitree BLE/WiFi root exploit | No consumer robot cybersecurity standard |
All five are incident-derived entries demonstrating that governance responses to embodied AI incidents are reactive, fragmented, and sector-specific. No cross-cutting embodied AI safety standard exists.
Appendix B: Cross-Reference Map
| Finding | Report | Key Metric |
|---|---|---|
| Framework vacuum | #162 | 0/5 frameworks address embodied AI |
| Independence erosion | #160 | B1 scores: Anthropic 2.33, OpenAI 0.67 |
| Benchmark divergence | #159 | rho = -0.200 vs JailbreakBench |
| Context Collapse | sprint-09 (CC traces) | 64.9% mean heuristic ASR, n=37 valid |
| TI-S iatrogenic | sprint-09 (TI-S evaluator) | 2/3 models show ASR increase with safety instructions |
| Iatrogenesis thesis | #147, #135, #140, #141 | 12/28 families (43%) iatrogenic component |