Evaluation Monoculture — The Structural Risk of GPT-4-as-Judge Dependency in AI Safety Benchmarks | Research | Failure-First

Adrian Wedd

Report 103 Research — Empirical Study 2026-03-11

Executive Summary

This brief surveys the structural risk created by the AI safety evaluation ecosystem’s dependence on a narrow set of evaluator models and methodologies. The dominant pattern across published safety benchmarks is LLM-as-judge using GPT-4-class models (GPT-4, GPT-4-turbo, GPT-4o, GPT-4o-mini). When the same evaluator model or model family is used across multiple benchmarks, systematic biases in that evaluator propagate invisibly: every benchmark inherits the same blind spots, and benchmark rankings may reflect evaluator preferences rather than genuine safety differences.

Descriptive claim: The major AI safety evaluation benchmarks use distinct evaluator architectures, creating partial diversity: StrongREJECT uses GPT-4o-mini (rubric-based) plus a fine-tuned Gemma 2B evaluator; HarmBench uses a fine-tuned Llama 2 13B classifier; JailbreakBench uses GPT-4-class models for evaluation. However, the broader research ecosystem defaults to GPT-4/GPT-4o as the LLM-as-judge when no benchmark-specific evaluator is specified. The training data used to fine-tune benchmark-specific evaluators (e.g., StrongREJECT’s Gemma 2B) was itself generated by GPT-4 Turbo, creating a dependency even where the deployed evaluator is nominally different.

Normative claim: When OpenAI controls the model most commonly used to evaluate the safety of all frontier AI systems including its own competitors’ products, this constitutes a structural conflict of interest that the AI safety community has not adequately addressed. This is not an accusation of deliberate bias; it is an observation that the incentive structure creates an unexamined dependency.

Predictive claim: As regulatory frameworks (EU AI Act, anticipated US state legislation, Australian VAISS guidelines) require automated safety evaluation, the demand for standardised evaluator models will increase. Without proactive evaluator diversity requirements, the ecosystem will converge further on a small number of evaluator models, amplifying monoculture risk.

1. Evaluator Model Usage Across Major Safety Benchmarks

1.1 Current Evaluator Landscape (Descriptive)

Benchmark	Evaluator Model	Evaluator Type	Training Data Source
StrongREJECT	GPT-4o-mini (rubric) + Gemma 2B (fine-tuned)	Rubric-based scoring + fine-tuned classifier	GPT-4 Turbo generated labels
HarmBench	Llama 2 13B (fine-tuned)	Fine-tuned binary classifier	Human-annotated + model-generated
JailbreakBench	GPT-4-class	LLM-as-judge	N/A (zero-shot prompting)
AILuminate (MLCommons)	Multiple models	Ensemble approach	Consortium-defined rubrics
WildGuard	AllenAI fine-tuned	Fine-tuned classifier	Expert-annotated
Failure-First (FLIP)	deepseek-r1:1.5b / qwen3:1.7b	LLM-as-judge (backward inference)	N/A (zero-shot prompting)

The landscape shows more diversity than a pure monoculture: HarmBench uses a Llama-based evaluator, WildGuard uses an AllenAI fine-tuned model, and StrongREJECT offers both a GPT-4o-mini rubric evaluator and a Gemma 2B fine-tuned version. This is a genuinely positive development.

However, two structural dependencies remain:

First, the training data dependency. StrongREJECT’s Gemma 2B evaluator was fine-tuned on labels generated by GPT-4 Turbo. This means the fine-tuned model inherits GPT-4 Turbo’s classification biases — it has learned to reproduce GPT-4 Turbo’s judgments, not independent human judgments. The evaluator architecture is diverse; the evaluator training signal is not.

Second, the default evaluator in ad hoc evaluations. Research groups that do not use a specific benchmark’s built-in evaluator typically default to GPT-4/GPT-4o for their own LLM-as-judge evaluations. This is the unmeasured monoculture: the evaluator usage in published papers, safety reports, and internal testing that does not use a formal benchmark. We do not have systematic data on this usage, but the frequency of “we used GPT-4 to evaluate responses” in published methodology sections suggests it is widespread.

1.2 Known Biases in LLM-as-Judge (Descriptive)

Published research has documented several systematic biases in LLM-as-judge evaluations:

Position bias. GPT-4 exhibits up to 40% inconsistency when evaluating A,B versus B,A orderings of the same content (Zheng et al., 2023; Shi et al., 2025, ACL/IJCNLP). Position bias varies significantly across tasks and judge models, and is strongly affected by the quality gap between solutions.

Verbosity bias. LLM judges prefer longer, more verbose responses regardless of substantive quality, with approximately 15% inflation attributable to length preference. This bias is an artifact of generative pretraining and RLHF reward signals that correlate length with quality.

Self-preference bias. GPT-4 favors its own outputs 10-25% more than comparable outputs from other models. Recent quantitative work (Yan et al., 2024; arXiv:2410.21819) demonstrates that self-preference correlates with lower perplexity on self-generated text — models prefer outputs that are more familiar to them.

Safety assessment bias. StrongREJECT found that most existing automated evaluators are “overly generous to jailbreak methods” — they systematically over-estimate jailbreak success. Only StrongREJECT and HarmBench were found to be unbiased; most other evaluators exhibited upward bias in ASR measurement.

1.3 What Happens When the Evaluator Changes (Descriptive)

The Failure-First project provides direct evidence of evaluator sensitivity:

Heuristic vs LLM grading: Cohen’s kappa = 0.245 (slight agreement beyond chance)
qwen3:1.7b vs deepseek-r1:1.5b on VLA traces: identical aggregate ASR (72.4%) but only 32% scenario-level agreement
Heuristic structural compliance vs FLIP-graded ASR on format-lock: 76.5% vs 47.1% (29.4pp gap)
qwen3:1.7b overall accuracy audit: 15% on general classification tasks

These are not edge cases. They represent the range of disagreement that occurs when the evaluator is changed within a single project evaluating the same traces. If the benchmark community’s evaluator diversity is limited, these disagreements are invisible — there is no second measurement to reveal that the first was biased.

2. Power Concentration Analysis

2.1 The Structural Conflict of Interest (Descriptive + Normative)

Descriptive claim: OpenAI produces the model (GPT-4/GPT-4o) most commonly used as the default LLM-as-judge in AI safety evaluations. OpenAI also produces frontier AI systems (GPT-4, GPT-4o, o1, o3) that are evaluated by those same safety evaluations. OpenAI is a direct commercial competitor to Anthropic, Google DeepMind, Meta AI, and others whose systems are evaluated using OpenAI’s model as the judge.

Normative claim: This constitutes a structural conflict of interest. The company that produces the evaluation tool has commercial incentives regarding the evaluation outcomes for both its own products and its competitors’ products. This does not require or imply deliberate manipulation — the self-preference bias documented in the literature (10-25% favoring own outputs) operates without intentional intervention. The concern is structural, not intentional.

Analogical reasoning: In financial markets, credit rating agencies that rate the securities of firms that pay them for ratings have a well-documented conflict of interest (Moody’s/S&P role in the 2008 financial crisis). The EU addressed this through Regulation 462/2013, requiring rating agency rotation and disclosure of conflicts. No equivalent regulation exists for AI safety evaluators.

2.2 The Pentagon Contract Context (Descriptive)

Recent developments sharpen the power concentration concern. As of February-March 2026:

Anthropic’s Pentagon contract negotiations collapsed over the company’s insistence on contractual prohibitions against use in mass surveillance and autonomous weapons systems
The Pentagon and President Trump designated Anthropic as a supply chain risk, ordering federal agencies to phase out Anthropic tools within six months
OpenAI subsequently secured a Pentagon contract for classified network deployment, with its own stated red lines (no mass domestic surveillance, no autonomous weapons direction, no social credit systems) framed as reflecting existing law rather than imposing new contractual constraints
OpenAI’s position — that existing law provides sufficient constraint — contrasts with Anthropic’s position that existing law has not kept pace with AI capabilities

Descriptive claim: In this context, OpenAI’s dominant position as the default AI safety evaluator acquires additional significance. A company that holds both the primary government AI contract and the primary evaluation tool used to assess safety across the ecosystem has structural influence over the AI safety narrative that no single entity should hold without independent oversight.

Normative claim: This is not a claim that OpenAI is acting in bad faith. It is a claim that the structural position is one that requires scrutiny regardless of intent. The question is not whether OpenAI would manipulate GPT-4-as-judge to favor its own products. The question is whether the AI safety ecosystem should be structured so that this question can be asked.

3. Implications for Embodied AI Safety Evaluation

3.1 The Actuator Layer Compounds the Problem (Descriptive)

The evaluation monoculture risk is amplified in embodied AI contexts by the actuator gap (Report #63). If all VLA safety evaluations use the same evaluator model or model family, and that evaluator has systematic biases (e.g., verbosity bias causing it to rate longer, more hedged responses as safer), then:

PARTIAL responses (textual hedge + structural compliance) may be systematically rated as refusals
The measured ASR across all benchmarks will be systematically lower than the true ASR
VLA systems deployed based on these safety assessments will have a higher true vulnerability than their assessed vulnerability
The actuator gap converts this measurement error into physical risk

The Failure-First project’s finding that heuristic classifiers have an 88% false positive rate on COMPLIANCE (AGENT_STATE, Mistake #21) is a specific instance of this dynamic. An ecosystem-wide evaluator bias in the same direction would produce the same error pattern at scale, without the cross-evaluator disagreement data needed to detect it.

3.2 No Benchmark Evaluates at the Actuator Layer (Descriptive)

Descriptive claim: JailbreakBench, HarmBench, StrongREJECT, and AILuminate all evaluate at the text output layer. None of them evaluate whether a model’s structured output (JSON, action trajectories, code) would produce harmful actions if executed by a downstream system. This is the actuator layer gap identified in Report #63, Section 3.2.

For text-only AI systems, text-layer evaluation is appropriate. For embodied AI systems, text-layer evaluation is necessary but insufficient. The evaluation monoculture risk is that the benchmarks driving safety standards are designed for a text-only paradigm and are being applied to embodied systems without adaptation.

4. Policy Recommendations

4.1 Evaluator Diversity Requirements (Normative)

For standards bodies (ISO/IEC, CEN/CENELEC, NIST):

Safety evaluations submitted for regulatory compliance should require evaluation by at least two independent evaluator models from different model families. “Independent” means: (a) produced by different organisations, (b) not fine-tuned on labels generated by the same source model, and (c) using different evaluation rubrics or methodologies.

This is analogous to the dual-auditor requirement in financial regulation (some jurisdictions require rotation of audit firms to prevent capture) and the multi-lab verification norm in pharmaceutical testing.

4.2 Evaluator Calibration Disclosure (Normative)

As proposed in Report #64, Section 3.1:

Any published AI safety evaluation should disclose evaluator identity, task-specific calibration data, inter-evaluator agreement, known failure modes, response-length sensitivity, and evaluator capability floor. This disclosure should be a mandatory component of EU AI Act conformity assessments and should be recommended in VAISS Guardrail 4 guidance.

4.3 Evaluator Independence Scoring (Normative)

The independence metrics framework (Report #54) should be extended to include an Evaluator Independence Score: the degree to which an organisation’s safety evaluations rely on evaluator models produced by commercial competitors. An organisation that evaluates its own products exclusively using its own evaluator model should receive a lower independence score than one that uses diverse, externally-produced evaluators.

4.4 Actuator-Layer Evaluation Standard (Normative)

For embodied AI systems, safety evaluations should include at minimum one evaluation at the actuator layer — assessing whether the model’s structured output would produce harmful actions if parsed by a downstream action decoder. The Failure-First FLIP methodology provides one approach (backward inference from model output to inferred instruction), but the field needs purpose-built actuator-layer evaluation tools.

5. Limitations

No systematic survey conducted. The claim about GPT-4-as-judge prevalence in ad hoc evaluations is based on familiarity with the published literature, not a systematic count. A proper survey of evaluator model usage across published safety papers (2024-2026) would provide stronger evidence. Issue #258 identifies this as future work.
Self-preference magnitude. The 10-25% self-preference bias figure comes from Yan et al. (2024). Whether this magnitude is sufficient to materially change benchmark rankings has not been systematically studied for safety-specific evaluations.
Benchmark-specific evaluators provide partial diversity. The landscape is not a pure monoculture — HarmBench (Llama 2 13B), WildGuard (AllenAI), and StrongREJECT (Gemma 2B) use distinct architectures. The monoculture risk is concentrated in ad hoc evaluations and in the training data dependency (fine-tuned evaluators trained on GPT-4 labels).
Conflict of interest claim is structural, not evidential. We have no evidence that OpenAI has deliberately manipulated GPT-4-as-judge to favor its products. The self-preference bias operates without deliberate intervention. The structural concern is that the dependency exists and is largely unexamined, not that it has been exploited.

References

Zheng, L. et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023.
Shi, Z. et al. (2025). A Systematic Study of Position Bias in LLM-as-a-Judge. ACL/IJCNLP 2025.
Yan, L. et al. (2024). Self-Preference Bias in LLM-as-a-Judge. arXiv:2410.21819.
Souly, A. et al. (2024). A StrongREJECT for Empty Jailbreaks. arXiv:2402.10260.
Chao, P. et al. (2024). JailbreakBench: An Open Robustness Benchmark. NeurIPS 2024 D&B Track.
Mazeika, M. et al. (2024). HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. ICML 2024.
arXiv:2603.08091. Toward Robust LLM-based Judges: Taxonomic Bias Evaluation and Debiasing Optimization.

Prepared by Nyssa of Traken, AI Ethics & Policy Research Lead, Failure-First Embodied AI. Normative claims are explicitly labelled. Descriptive claims reference published sources or documented project measurements. Structural conflict of interest analysis follows standard power concentration methodology — it does not allege intentional misconduct.