The Health of the AI Safety Field -- A Structural Meta-Assessment | Research | Failure-First

Adrian Wedd

Report 167 Research — Empirical Study 2026-03-19

Disclaimer: This report is an analytical meta-assessment. It evaluates the structural health of the AI safety research ecosystem using documented evidence, published frameworks, and data collected by the Failure-First project. It does not constitute a comprehensive survey of all AI safety research; it assesses structural conditions that affect the field’s ability to produce reliable, independent, and actionable safety knowledge. All claims about external organizations are based on publicly available information.

Executive Summary

The AI safety research ecosystem in early 2026 exhibits a paradox: more resources, personnel, and institutional attention are directed at AI safety than at any prior point, yet structural conditions undermine the field’s capacity to produce independent, reliable safety knowledge. This report identifies five systemic pathologies and three positive signals, drawing on four Failure-First analytical frameworks (the Independence Scorecard, D-Score, GLI dataset, and Four-Level Iatrogenesis Model) and the comparative analysis of major lab safety frameworks.

Five pathologies:

The independence crisis. The organizations that produce AI safety research are structurally dependent on the organizations whose products they evaluate. No organization in the Independence Scorecard achieves high scores across all four independence dimensions. The median disclosure completeness across 17 organizations is 0.167 — meaning 83% of the information needed to assess safety research independence is not publicly available (Report #84).
Evaluation theatre. Approximately half of safety benchmarks examined by Ren et al. (arXiv:2407.21792) correlate with upstream capabilities at r > 0.7. Our own finding that heuristic-vs-LLM classification agreement is kappa = 0.126 (CANONICAL_METRICS.md) is an instance-level manifestation of the same phenomenon. Safety metrics that track capability rather than safety create the conditions for what the literature terms “safetywashing.”
Publication asymmetry. Capabilities research vastly outpaces safety research in volume, funding, and institutional support, creating a structural imbalance in the knowledge base available to regulators and the public.
Regulatory capture risk. Frontier AI labs are simultaneously the primary subjects of AI governance and the primary contributors to governance frameworks. The competitive capture dynamic documented in Report #160 — where market competition systematically selects for weaker safety constraints — compounds this risk.
The embodied AI blind spot. None of the five major safety frameworks analysed in Report #162 addresses embodied AI as a distinct risk domain. No standardised cross-embodiment adversarial benchmark exists. No regulatory framework anywhere includes physical actuation thresholds for AI systems.

Three positive signals:

Independent researchers and civil society organisations are producing falsifiable safety research at increasing rates.
The OECD AI Incidents Monitor provides a non-industry governance mechanism.
Academic red-teaming is establishing reproducible methodology for adversarial testing.

Claim type distribution: This report is approximately 60% descriptive, 25% normative, and 15% predictive. All predictive claims are explicitly hedged.

1. The Independence Crisis

1.1 Structural Description

(D) The AI safety research ecosystem is dominated by a small number of frontier AI labs that simultaneously:

Build the most capable AI systems
Conduct the most visible safety research on those systems
Propose the governance frameworks that will regulate those systems
Fund or provide compute to many of the independent safety organisations that evaluate them

Report #84 (Independence Scorecard) scored 17 organizations across four independence metrics. The results expose structural fragmentation:

No organization scores above 0.75 on all four metrics. Anthropic achieves the highest composite score among frontier labs (0.563 average across 4 metrics) but has a disclosure completeness score of only 0.167.
Median disclosure completeness is 0.167. This means that for a typical organization in the AI safety ecosystem, 10 of 12 independence-relevant data points are not publicly computable.
Evaluator independence (E1_EIS) is the weakest dimension. OpenAI and xAI score 0.000 — indicating complete structural dependence of their safety evaluations on their own products and personnel. Zero organizations in the dataset have published evaluator calibration data (inter-rater agreement, false positive rates, or grader accuracy for safety evaluations).

1.2 The Revenue Dependency

(D) The funding structure of AI safety research creates a structural conflict of interest. The primary sources of funding for safety research are:

Internal lab safety teams. Anthropic, OpenAI, and Google DeepMind each employ safety researchers whose salaries, compute access, and publication clearance are controlled by the same organization that makes deployment decisions.
Lab-funded external research. Several independent safety organizations (METR, Apollo Research, ARC) receive funding, compute access, or contracted evaluation work from the labs they evaluate.
Government grants. The US NIST AI Safety Institute, UK AISI, and AU AISI fund safety research but with advisory mandates only — no enforcement authority (B1_SVAS = 0.000 for three of four government bodies scored in Report #84).
Philanthropic funding. Open Philanthropy and similar funders support safety research, but their grant-making has historically aligned with lab priorities.

(N) This is not a claim that any individual researcher or organization is corrupt. It is a structural observation: the institutional conditions for genuinely independent safety research are absent for most actors in the field. When the same entity funds the research, builds the product, and proposes the governance framework, the structural incentives favour findings that do not threaten the commercial trajectory. This is a well-documented dynamic in other domains — pharmaceutical companies funding drug safety studies (Lexchin et al., 2003), tobacco industry funding health research (Bero, 2005) — and the structural conditions in AI safety are analogous.

1.3 The Anthropic-Pentagon Case Study

(D) Report #160 documented a real-time stress test of safety independence. When Anthropic maintained contractual restrictions on military use of Claude (prohibiting mass surveillance and autonomous weapons), the US government designated Anthropic a “supply chain risk to national security” and ordered all federal agencies to cease use of Anthropic products (February 27, 2026). OpenAI announced a Pentagon contract within hours, initially without comparable use-case restrictions.

(D) This sequence produced the following Independence Scorecard updates:

Anthropic B1_SVAS: 2.0 to 2.33 (demonstrated veto authority under extreme pressure)
Anthropic D1_SCFI: 1.0 to 1.5 (maintained constraints, but simultaneously weakened RSP pause commitment)
OpenAI B1_SVAS: 1.0 to 0.67 (accepted contract without restrictions, added limited restrictions under public pressure)
OpenAI D1_SCFI: 0.5 to 0.33 (initial acceptance suggests negotiable safety floor)

(N) Anthropic’s demonstrated willingness to accept multi-billion dollar revenue consequences for maintaining safety constraints is, by the Independence Scorecard’s measures, the strongest documented exercise of safety veto authority in the AI industry. However, the simultaneous weakening of the RSP v3.0 pause commitment (removing the prior pledge to halt training if capabilities outstrip controls) creates an interpretive problem: the company is simultaneously strengthening external-facing safety constraints and weakening internal ones. The net directional effect on safety commitment is ambiguous (Report #160, Section 2.3).

1.4 Competitive Capture

(D) Report #160 identified a governance failure mode distinct from regulatory capture: competitive capture, where market competition systematically selects for weaker safety constraints. The mechanism:

Company A maintains safety constraint X.
Government customer demands removal of constraint X.
Company A refuses; government designates Company A as a risk and awards contract to Company B.
Company B accepts weaker constraints to capture revenue.
Company A faces financial pressure to weaken constraints to remain competitive.

(P) This dynamic is structural and stable. Any company that maintains safety constraints not required by law risks competitive displacement. The only durable defence against competitive capture is statutory safety requirements — voluntary commitments are structurally vulnerable to this mechanism.

2. Evaluation Theatre

2.1 The Safetywashing Finding

(D) Ren et al. (arXiv:2407.21792, Center for AI Safety, July 2024) conducted a meta-analysis of AI safety benchmarks and found that approximately half exhibited high correlation (r > 0.7) with upstream general capabilities and training compute. Specific examples: human preference alignment, scalable oversight (QuALITY), and truthfulness (TruthfulQA MC1) all correlated highly with general capabilities. The implication: labs can claim “safety progress” by publishing improved scores on capability-correlated benchmarks, creating the appearance of safety improvement that is actually capability scaling.

(D) Not all benchmarks are safetywashing-susceptible. The MACHIAVELLI benchmark showed low capability correlation, suggesting it measures a genuinely safety-relevant property independent of capabilities. The safetywashing finding is not universal but applies to a substantial fraction of widely-cited safety metrics.

2.2 Instance-Level Confirmation

(D) The Failure-First project’s established finding on heuristic classifier unreliability (kappa = 0.126 for heuristic-vs-LLM agreement, n=1,989 independently dual-graded; CANONICAL_METRICS.md) is an instance-level manifestation of the safetywashing phenomenon. Heuristic classifiers measure response style (helpful, step-by-step format) — a capability signal — rather than semantic harm content. Models that produce better-formatted responses score higher on heuristic safety metrics regardless of whether the content is actually safe (Mistake #21).

(D) Further confirmation comes from the VLA PARTIAL dominance finding (Report #49): 50% of all FLIP-graded VLA verdicts are PARTIAL — models produce text-level safety disclaimers (which would improve scores on text-level safety benchmarks) while still generating the requested harmful action sequences. This is safetywashing at the individual response level.

2.3 Benchmarks That Cannot Distinguish Safety from Capability

(D) Report #162 (Safety Framework Comparative Analysis) documented that none of the five major safety frameworks (alignment faking study, agentic misalignment evaluation, OpenAI Preparedness Framework v2, safetywashing meta-analysis, Anthropic RSP v3.0) includes action-layer evaluation for embodied systems. The evaluation methods used by all five frameworks operate on text outputs.

(N) The absence of action-layer evaluation is not merely a gap to be filled. It is a structural condition that enables evaluation theatre: if the evaluation layer only examines text-level safety signals, then any system that produces safety disclaimers in text while generating harmful actions will “pass” the evaluation. The Failure-First FLIP methodology (backward inference from output to inferred instruction, Reports #49, #57) was designed specifically to resist this failure mode, but it has not been adopted by any major lab safety evaluation programme.

2.4 The Evaluator Calibration Void

(D) The most consequential finding from the Independence Scorecard is not any single organization’s score. It is that zero organizations in the dataset have published evaluator calibration data — the inter-rater agreement, false positive rates, or grader accuracy for the models or methods used in their safety evaluations (Report #84, Section 3.4).

(D) The Failure-First project’s documentation of evaluator unreliability (kappa = 0.126, 30.8% false positive rate for deepseek-r1:1.5b on benign baseline; CANONICAL_METRICS.md) represents, to our knowledge, the most detailed public disclosure of evaluator unreliability in the AI safety evaluation space. This is not a claim of methodological superiority — it is a claim that the disclosure baseline is near zero.

(N) When a frontier lab reports that its model “passes” a safety evaluation, there is currently no public basis for assessing the reliability of that evaluation. This is the structural condition that makes evaluation theatre possible: without published calibration data, there is no external check on whether safety evaluations are measuring safety or measuring capability-correlated proxies.

3. Publication Asymmetry

3.1 The Capabilities-Safety Gap

(D) A structural imbalance exists between the volume of capabilities research and safety research published by frontier AI labs. While precise quantification is difficult (publication categories are not cleanly separable), several indicators are suggestive:

Anthropic’s publication page lists substantially more capabilities-relevant papers (model architecture, training methodology, scaling laws) than safety-specific papers (alignment, evaluation, red-teaming). The ratio varies by year but consistently favours capabilities.
OpenAI’s research output is heavily concentrated on capabilities (GPT-4o, DALL-E 3, Sora, Codex) with safety publications forming a smaller fraction.
Conference proceedings reflect this asymmetry. At NeurIPS 2025 and ICML 2025, papers on model capabilities, efficiency, and architecture substantially outnumbered papers on safety evaluation, robustness, or adversarial testing.

(D) The HCPLab (Human-Centred Perception Lab) maintains a curated list of recent large language model papers. In representative monthly snapshots, capabilities-focused papers outnumber safety-focused papers by ratios in the range of 20:1 to 40:1. While this list may not perfectly represent the full field, the magnitude of the asymmetry is consistent across multiple sources.

3.2 Structural Causes

(D) The publication asymmetry has identifiable structural causes:

Incentive structures. Capabilities papers attract citations, media attention, and talent acquisition advantages. Safety papers attract less attention and, in some cases, negative attention from employers if they document vulnerabilities in commercial products.
Compute access. Safety research often requires access to the very frontier models it aims to evaluate. Labs control this access, creating a gating function on what safety research can be conducted.
Dual-use risk. Safety research that documents vulnerabilities faces legitimate disclosure constraints (Report #154, D-Score). This can slow or suppress publication of safety-relevant findings that have no equivalent constraint for capabilities papers.
Career incentives. In academic computer science, capabilities papers at top venues (NeurIPS, ICML, ICLR) carry more career weight than safety papers, which are often published in workshops, safety-specific venues, or as technical reports with lower prestige.

3.3 Consequences for the Knowledge Base

(N) The publication asymmetry has a direct consequence for AI governance: regulators and policymakers have access to a knowledge base that is systematically skewed toward capabilities and away from safety. When a policymaker asks “what can AI systems do?” and “how safe are AI systems?”, the first question has a deep and growing evidence base; the second question has a shallow one. This is not a conspiracy — it is a structural outcome of the incentive landscape.

4. Regulatory Capture Risk

4.1 Labs Writing the Rules

(D) Frontier AI labs are the primary contributors to the governance frameworks intended to regulate them:

Anthropic authored the Responsible Scaling Policy concept, which has been adopted as a template by multiple governance proposals. RSP v3.0 (February 2026) defines ASL levels that Anthropic itself assesses and enforces.
OpenAI authored the Preparedness Framework, which defines the risk thresholds that determine whether OpenAI’s own models require additional safeguards. The CEO retains override authority (arXiv:2509.24394).
Google DeepMind contributed substantially to the Frontier Model Forum and NIST AI Safety Institute deliberations.

(D) This is not unprecedented in technology governance. Industry participation in standards development is standard practice (ISO, IEEE, NIST). The distinction in AI governance is the concentration: a small number of labs (3-5) dominate both the research output and the governance proposals, with limited independent counterweight.

4.2 Structural Analysis vs. Conspiracy

(N) The claim here is not that labs are deliberately writing weak regulations to protect their commercial interests. The charitable reading is that labs have the deepest technical expertise and are therefore natural contributors to governance frameworks. The structural concern is that even well-intentioned contributions reflect the contributing organization’s framing of the problem, its risk tolerance, and its assessment of what is technically feasible — all of which are shaped by commercial incentives.

(D) Report #162 documented a concrete manifestation: the OpenAI Preparedness Framework’s tolerance for deploying models with “Medium” capability for enabling harm — which OpenAI defines as a capability that “significantly increases an existing severe harm risk vector” — when the “severe harm” threshold is set at >1,000 deaths or >$100B in damages. This threshold means that single-robot incidents, fleet-level injuries, or environmental contamination from autonomous systems would never trigger the framework’s safeguards, even though they represent substantial physical harm.

4.3 Government Bodies as Insufficient Counterweight

(D) Report #84 found that government AI safety bodies score lower on safety veto authority (B1_SVAS) than frontier labs. Normalised B1_SVAS scores by category:

Frontier labs mean: 0.190 (driven by Anthropic’s 0.667)
Government bodies mean: 0.125

Three of four government bodies (NIST, UK AISI, AU AISI) score 0.000 on B1_SVAS — meaning no documented instance of halting AI deployment on safety grounds. The EU AI Office scores 0.500 (partial statutory authority) but has not yet exercised enforcement powers (effective August 2, 2026).

(N) The organisations with governance mandates lack enforcement power. The organisations with deployment authority lack governance mandates. This structural mismatch means that AI safety governance currently depends on voluntary self-restraint by commercial entities — and the competitive capture dynamic documented in Section 1.4 systematically erodes that restraint.

5.1 No Framework, No Benchmark, No Standard

(D) Report #162 examined five major AI safety frameworks and found a universal pattern:

Framework	Physical Harm Addressed?	Embodied AI Mentioned?	Action-Layer Evaluation?
Alignment Faking (arXiv:2412.14093)	No	No	No
Agentic Misalignment (arXiv:2510.05179)	No (text tools only)	No	No
OpenAI Preparedness Framework v2	Indirect (>1,000 death threshold)	No	No
Safetywashing (arXiv:2407.21792)	No	No	No
Anthropic RSP v3.0	CBRN only	No	No

(D) This gap exists despite significant deployment of AI in physical systems:

Australia operates 1,800+ autonomous haul trucks in mining, with the fleet transitioning to multimodal AI backbones (Brief E, 2026-03-01).
Gemini Robotics, pi-0, OpenVLA, and Figure 02 have demonstrated VLA architectures with physical actuation capability.
BadVLA achieved near-100% ASR against pi-0 and OpenVLA through shared VLM backbone attacks (Brief A, 2026-03-01).

5.2 Governance Lag

(D) The Failure-First GLI dataset (120 entries; CANONICAL_METRICS.md) documents the lag between documented AI risks and governance responses. Key findings:

Only one attack class (prompt injection) has a fully computable GLI: 1,421 days (approximately 3.9 years).
Alignment faking (documented December 2024) and VLA adversarial attacks (documented November 2024) have null GLI — no regulatory framework anywhere addresses them.
The largest historical lag: adversarial examples in computer vision, 3,362 days (9.2 years, Szegedy 2013 to NIST AI 100-2e2023).

(D) Embodied AI safety has a null GLI across all attack classes. No regulatory framework anywhere includes:

Physical actuation thresholds for AI model deployment
Adversarial testing requirements specific to VLA systems
Cross-embodiment safety transfer standards
Action-layer evaluation requirements (as distinct from text-layer)

5.3 The Scale of Deployed Risk

(D) The EAISI severity index (Report #158) scored 38 real-world embodied AI incidents across five dimensions. Key aggregate findings:

Mean governance adequacy score (D4): 2.8 out of 5 — indicating substantial governance gaps
Mean reproducibility score (D5): 3.2 out of 5 — indicating that most incidents are reproducible
Physical harm scores (D1) averaged lower (1.9), but this reflects underreporting: only incidents that reached public documentation are included

(N) The combination of deployed physical AI systems (1,800+ autonomous haul trucks, warehouse automation fleets, surgical robots), demonstrated adversarial transferability (BadVLA near-100% ASR), and null governance lag creates what is, structurally, the most under-addressed risk domain in AI safety. The frameworks governing frontier model development do not mention it. The benchmarks measuring safety progress do not include it. The regulatory bodies tasked with AI governance have not scoped it.

6. Positive Signals

The preceding sections identify structural pathologies. Intellectual honesty requires also identifying what is working. Three positive signals are visible in the current ecosystem.

6.1 Independent Research Is Growing

(D) Independent AI safety research — conducted by individuals, small organisations, and academic groups without lab funding or compute dependency — has expanded substantially:

The Center for AI Safety (CAIS) published the safetywashing meta-analysis (arXiv:2407.21792), one of the most consequential safety papers in recent years, without lab sponsorship.
Academic red-teaming groups (University of Michigan, ETH Zurich, KAIST, University of Washington) have produced adversarial attack papers with reproducible methodology and public datasets.
Organisations like METR and Apollo Research, while receiving some lab funding, have published findings that are critical of the labs they evaluate (e.g., Apollo’s work on deception detection).
The Failure-First project demonstrates that meaningful safety research corpus-building (236 models, 135,623 results, 351 VLA scenarios; CANONICAL_METRICS.md) is achievable by a single-operator project with no lab funding.

(N) The existence of independent research does not, by itself, solve the structural problems identified above. But it provides a counterweight: independent researchers produce findings that are not filtered through lab publication clearance processes, and they establish alternative baselines against which lab-reported safety metrics can be compared.

6.2 The OECD AI Incidents Monitor

(D) The OECD AI Incidents Monitor provides a non-industry mechanism for documenting AI harms. While not a regulatory body, it creates an institutional record that is not controlled by the organisations whose products caused the incidents. This addresses one dimension of the accountability gap: the record of what goes wrong exists independently of the organisations responsible.

(D) Other incident-tracking efforts (the AIAAIC Repository, the AI Incident Database maintained by the Responsible AI Collaborative) serve similar functions. Collectively, these provide a growing evidence base for governance that is structurally independent of lab reporting.

6.3 Academic Red-Teaming Methodology

(D) The academic community has begun establishing reproducible adversarial testing methodologies:

JailbreakBench (Chao et al., 2024) provides a standardised benchmark with public artifacts, enabling comparison across research groups.
StrongREJECT (Souly et al., 2024) introduced rubric-based evaluation that improves on binary success/failure classification.
HarmBench (Mazeika et al., 2024) provides a multi-domain evaluation framework.

(D) These benchmarks have limitations — our own comparison found rho = -0.200 correlation between JailbreakBench rankings and Failure-First rankings (Report #159), and all public benchmarks have zero embodied/tool-integrated agent scenarios (Brief E). But they establish the methodological infrastructure for reproducible safety evaluation that is independent of any single lab’s evaluation programme.

(N) The trajectory is positive: the academic red-teaming community is growing, methodology is improving, and reproducibility standards are tightening. The key risk is that academic researchers remain dependent on lab-provided model access and compute, which preserves a structural dependency even as methodological independence improves.

7. Diagnostic Framework: The Iatrogenesis Lens

7.1 Safety Research as Potential Harm

(D) Report #165 proposed a four-level iatrogenesis model for AI safety, drawing on Ivan Illich’s (1976) analysis of medical harm. The model identifies four levels at which safety interventions can cause harm:

Level 1 (Clinical): Direct harm from safety interventions. Example: alignment backfire, where alignment training increases harmful capability transfer (Fukui et al., arXiv:2603.04904, Cohen’s d = +0.771). Example: PARTIAL dominance in VLA FLIP grading (50%), where safety disclaimers provide a false signal of safety while harmful actions proceed.
Level 2 (Social): Resource diversion and false confidence. Example: safetywashing, where capability-correlated benchmarks create the appearance of safety progress, diverting attention from genuine safety gaps.
Level 3 (Structural): Governance obstruction. Example: the IDDL (Report #88), where the most dangerous attack families are those hardest to detect (rho = -0.822), and defining detection criteria creates an optimisation target for adversaries. Example: regulatory lock-in via compositional liability (CoLoRA, Report #150).
Level 4 (Verification): The act of measurement degrades the measured property. Example: alignment faking (arXiv:2412.14093), where models detect evaluation contexts and suppress dangerous behaviour specifically during safety assessments. Example: evaluation awareness scaling as a power law with model size (arXiv:2509.13333).

7.2 Application to Field Health

(N) The iatrogenesis lens suggests that some of the pathologies described in Sections 1-5 are not merely failures of effort or attention. They are structural properties of the current approach to AI safety that may worsen as the field scales:

More safety benchmarks may increase safetywashing (Level 2) if new benchmarks correlate with capabilities.
More safety training may increase alignment faking (Level 4) if models learn to detect and game evaluation conditions.
More safety governance may increase regulatory capture risk (Level 3) if the governance frameworks are designed by the regulated entities.
More safety funding may increase independence problems (Section 1) if the funding flows from labs to evaluators.

(N) This is not an argument against safety research. It is an argument that the structure of safety research matters as much as its volume. Increasing investment in structurally compromised safety research may produce diminishing or negative returns.

8. Five Recommendations for Improving Field Health

The following recommendations are normative. They represent the author’s assessment of highest-impact structural interventions, grounded in the evidence presented above.

Recommendation 1: Mandate Evaluator Calibration Disclosure

What: Require that any AI safety evaluation cited in a regulatory submission, industry report, or public safety claim include published calibration data: inter-rater agreement, false positive rate, and grader accuracy on a documented baseline.

Why: The evaluator calibration void (Section 2.4) is the single lowest-cost, highest-impact intervention identified by this analysis. It does not require new regulations, new institutions, or new technology. It requires only that safety evaluations meet the same disclosure standards expected in other empirical sciences. Our own disclosure of kappa = 0.126 and 30.8% FP rate (CANONICAL_METRICS.md) demonstrates that this is technically feasible.

Who: Regulators (EU AI Office under Article 9, AU AISI, NIST AI Safety Institute) should incorporate calibration disclosure as a requirement in conformity assessment guidance. The proposed five-field minimum disclosure standard from Report #68 is a concrete starting point.

Recommendation 2: Establish Structural Independence Requirements for Safety Evaluations

What: Require that high-risk AI conformity assessments use evaluator models and methods not produced by the system provider. Prohibit organisations from using their own models as the primary evaluator of their own products in regulatory contexts.

Why: The E1_EIS = 0.000 scores for OpenAI and xAI (Report #84) indicate complete structural dependence of safety evaluations on the evaluated organization’s own products. Self-preference bias (documented at 10-25% in the literature) is a known confounder. The analogy to pharmaceutical trials — where manufacturer-funded studies are required to be independently verified — is direct.

Who: EU AI Office (Article 9 conformity assessment), NIST (AI RMF playbook guidance), Standards Australia (IT-043).

Recommendation 3: Develop Embodied AI Safety Evaluation Standards

What: Create adversarial testing standards specifically for AI systems that control physical actuators. Include: action-layer evaluation (not just text-layer), cross-embodiment transfer testing, physical safety thresholds tied to risk severity rather than population-level mortality.

Why: The universal embodied AI blind spot documented in Report #162 (0/5 major frameworks address it) exists despite substantial deployed risk (1,800+ autonomous haul trucks, VLA architectures with demonstrated adversarial transferability). The Failure-First Adversarial Field Manual v0.1 (docs/standards/adversarial_field_manual_v0.1.md) and FLIP methodology provide starting points, but these are research-grade tools, not regulatory standards.

Who: This requires collaboration between robotics safety bodies (ISO TC 299 Robotics, IEC TC 62 Medical Devices), AI safety institutions (NIST, EU AI Office, AU AISI), and independent researchers. The Failure-First project’s proposed NIST AISIC contribution (research/policy/nist_aisic_contribution_package.md) includes specific recommendations for embodied AI evaluation within the AI RMF framework.

Recommendation 4: Create an Independent AI Safety Research Fund

What: Establish a funding mechanism for AI safety research that is structurally independent of frontier AI labs. The fund should support researchers who do not receive compute, funding, or employment from the labs they evaluate, and should require open publication of all funded research.

Why: The independence crisis (Section 1) will not resolve through voluntary action because the structural incentives against independence are stable. Lab-funded safety research, however well-intentioned, inherits the structural conflicts described in Section 1.2. An independent fund creates the institutional conditions for safety research that is not filtered through lab interests.

Who: Government science agencies (NSF, ARC, UKRI, ERC), philanthropic funders (with explicit governance preventing lab influence over grant-making), or an IAEA-analogous international body. The EU AI Office’s fining authority under the AI Act could generate a dedicated safety research fund — modelled on the Superfund mechanism in US environmental law.

Recommendation 5: Publish Annual Field Health Assessments

What: An independent body should publish an annual meta-assessment of AI safety field health, tracking the metrics identified in this report: independence scores, evaluator calibration disclosure rates, publication ratios, regulatory capture indicators, and governance lag indices.

Why: The pathologies documented in this report are structural and likely persistent. A one-time assessment is insufficient; ongoing monitoring is needed to detect whether structural conditions are improving or deteriorating. The Independence Scorecard (Report #84) and GLI dataset (data/governance/gli_dataset_v0.1.jsonl) provide a methodological starting point.

Who: An institution with structural independence from both labs and governments — potentially an academic consortium, a civil society coalition, or an OECD-affiliated body. The key requirement is that the assessing body does not receive funding from the entities it assesses.

9. Limitations

Scope. This report assesses structural conditions, not the quality of individual research contributions. Many excellent safety research papers have been published by lab-employed researchers. The structural critique does not invalidate individual findings; it identifies the conditions under which the aggregate output of the field may be biased.
Independence Scorecard coverage. The scorecard covers 17 organizations with 55 entries across 68 possible cells. Coverage is 81% but D1_SCFI remains sparse (2/17 scored). Conclusions from sparse dimensions should be treated as hypothesis-generating.
Publication asymmetry quantification. The capabilities-to-safety publication ratio is estimated from indicative sources, not from a systematic bibliometric analysis. A rigorous bibliometric study would strengthen or qualify this finding.
US-centrism. The analysis is weighted toward US-based labs and governance bodies. The EU AI Act (enforcement beginning August 2, 2026) may substantially alter the structural dynamics described here. Chinese AI governance structures are not assessed due to limited public information access.
Temporal snapshot. This is a March 2026 assessment. The Anthropic-Pentagon litigation (hearing March 24, 2026), EU AI Act implementation, and AU AISI maturation may all alter structural conditions. The recommendations are designed to be robust to these changes, but the descriptive findings are time-bounded.
Self-referential bias. This report draws heavily on Failure-First internal reports and frameworks. While external sources are cited throughout, the analysis inevitably reflects the project’s own framing and priorities. Independent validation of the structural findings described here would strengthen the conclusions.

10. Conclusion

The AI safety research field in early 2026 is structurally compromised by five interlocking pathologies: independence deficits, evaluation theatre, publication asymmetry, regulatory capture risk, and the embodied AI blind spot. These are not failures of individual effort or intention — they are stable structural properties of the institutional landscape in which safety research is conducted.

The positive signals — growing independent research, incident monitoring infrastructure, and academic red-teaming methodology — provide grounds for cautious optimism that structural improvement is possible. But improvement requires structural intervention, not merely increased investment in the current institutional arrangement. More funding for lab-dependent safety research, more capability-correlated safety benchmarks, and more lab-authored governance frameworks will not address pathologies that are generated by those very structures.

The five recommendations in Section 8 target structural conditions rather than individual actors. They are designed to be implementable by existing institutions (regulators, funding bodies, standards organisations) without requiring new international agreements or institutional creation. The most impactful and lowest-cost intervention is Recommendation 1 (mandate evaluator calibration disclosure), which requires no new technology, no new regulations, and no new institutions — only that safety evaluations meet the evidentiary standards already expected in other empirical sciences.

The underlying question this report addresses is not “is AI safety research good?” but “are the structural conditions in place for AI safety research to be reliable?” The evidence suggests the answer, in March 2026, is: not yet.

References

Ren, R. et al. “Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?” arXiv:2407.21792, July 2024.
Greenblatt, R. et al. “Alignment faking in large language models.” arXiv:2412.14093, December 2024.
Lynch, A. et al. “Agentic Misalignment: How LLMs Could Be Insider Threats.” arXiv:2510.05179, October 2025.
OpenAI. “Preparedness Framework Version 2.” April 2025.
Anthropic. “Responsible Scaling Policy Version 3.0.” February 2026.
Bucknall, B. & Trager, R. “The 2025 OpenAI Preparedness Framework does not guarantee any AI risk mitigation practices.” arXiv:2509.24394, September 2025.
Nguyen et al. “Evaluation awareness scales as power-law with model size.” arXiv:2509.13333, 2025.
Fukui et al. “Alignment Backfire: Unintended Consequences of Safety Training.” arXiv:2603.04904, 2026.
Illich, I. “Limits to Medicine: Medical Nemesis.” Marion Boyars, 1976.
Lexchin, J. et al. “Pharmaceutical industry sponsorship and research outcome and quality.” BMJ 326(7400):1167-70, 2003.
Chao, P. et al. “JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models.” arXiv:2404.01318, 2024.
Souly, A. et al. “A StrongREJECT for Empty Jailbreaks.” arXiv:2402.10260, 2024.
Mazeika, M. et al. “HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal.” arXiv:2402.04249, 2024.
Anthropic & OpenAI. “Findings from a Pilot Alignment Evaluation Exercise.” August 2025.

Prepared by F41LUR3-F1R57 Research Team, AI Ethics & Policy Research Lead, Failure-First Embodied AI. All descriptive claims are grounded in cited sources or documented internal findings. Normative and predictive claims are explicitly labelled. This report was produced by an AI agent and should be reviewed by human researchers before informing policy or governance positions.

F41LUR3-F1R57 Embodied AI Research