Evaluator Calibration Disclosure — A Minimum Standard for Automated Safety Grading | Research | Failure-First

Adrian Wedd

Report 68 Research — Empirical Study 2026-03-11

Executive Summary

This report proposes a minimum disclosure standard for automated evaluators used in AI safety benchmarks. The proposal is motivated by the finding that AI safety benchmark results are sensitive to evaluator choice (Report #61: accuracy ranges from 15% to 95%), evaluator training data (Report #65: StrongREJECT’s Gemma 2B trained on GPT-4 Turbo labels), and evaluator bias (Report #65: self-preference, position, and verbosity biases). Despite this sensitivity, no published benchmark consistently discloses the full identity, version, limitations, and potential conflicts of interest of its automated evaluator.

Normative claim: The absence of evaluator disclosure in AI safety benchmarks is analogous to publishing clinical trial results without identifying the assay laboratory or disclosing its accreditation status. It should be unacceptable in a field that claims scientific rigour.

Predictive claim: As regulatory frameworks adopt or reference safety benchmarks for compliance assessment (EU AI Act Article 9, anticipated US frameworks), the demand for evaluator transparency will increase. A voluntary disclosure standard adopted now can shape regulatory requirements later, ensuring they are technically sound rather than bureaucratically arbitrary.

1. Current State of Evaluator Disclosure

1.1 Survey of Major Benchmarks

Benchmark	Evaluator Disclosed?	Version Pinned?	Training Data Disclosed?	Bias Documented?	Conflict Disclosed?
StrongREJECT	Yes (GPT-4o-mini + Gemma 2B)	Partial	Yes (GPT-4 Turbo labels)	Yes (self-reports bias analysis)	No
HarmBench	Yes (Llama 2 13B ft)	Yes	Partial	No	No (Llama not a competitor)
JailbreakBench	Yes (GPT-4-class)	No (version not pinned)	N/A	No	No
AILuminate (MLCommons)	Yes (ensemble)	Partial	Partial	No	No
WildGuard	Yes (AllenAI ft)	Yes	Yes	No	No (AllenAI non-profit)
F41LUR3-F1R57 FLIP	Yes (deepseek-r1:1.5b)	Yes	N/A (zero-shot)	Yes (Report #61, #250)	N/A

Descriptive observation: The benchmark community has achieved partial disclosure — most benchmarks name their evaluator. But version pinning, training data provenance, bias documentation, and conflict-of-interest disclosure are inconsistent. No benchmark meets all five disclosure categories.

1.2 The Gap

The most significant gap is evaluator-target conflict disclosure. When a benchmark uses a GPT-4 variant to evaluate both OpenAI’s models and its competitors’ models, the self-preference bias (10-25%, Yan et al. 2024) is a relevant consideration that should be disclosed alongside results. No benchmark currently does this.

The second gap is version pinning. When benchmarks specify “GPT-4” without a version suffix (e.g., gpt-4-0613 vs gpt-4-turbo-2024-04-09), longitudinal comparisons become unreliable because OpenAI updates the model behind the “GPT-4” identifier. JailbreakBench and ad hoc research evaluations are particularly affected.

2. Proposed Minimum Disclosure Standard

2.1 Required Fields

Every published safety benchmark result that uses an automated evaluator should disclose:

Field 1: Evaluator Identity

Full model name and version (e.g., “gpt-4o-2024-08-06”, not “GPT-4”)
If fine-tuned: base model, fine-tuning data source, and fine-tuning objective
If ensemble: all component models and aggregation method

Field 2: Evaluator Provenance

Who produced the evaluator model
If fine-tuned: who performed the fine-tuning and what labels were used
If labels were generated by another model: identify that model (e.g., “fine-tuned on labels generated by GPT-4 Turbo”)

Field 3: Known Limitations

Documented biases (position, verbosity, self-preference, PARTIAL default)
Accuracy on validation set (if available)
Task types where the evaluator performs poorly
Inter-evaluator agreement metrics (if multiple evaluators used or compared)

Field 4: Conflict of Interest Statement

Whether the evaluator vendor produces any model being evaluated
Whether the evaluator vendor has commercial relationships with any model vendor being evaluated
Whether the evaluator was trained on data from any model being evaluated

Field 5: Longitudinal Stability

Whether the evaluator version is pinned for the benchmark’s lifetime
How evaluator updates are handled (re-run all evaluations? Append with version tag?)
Whether historical results are re-evaluated when the evaluator changes

2.2 Disclosure Format

The disclosure should be machine-readable (JSON or YAML) and included in the benchmark’s methodology section or a dedicated evaluator card. Example:

evaluator:
  name: "gpt-4o-2024-08-06"
  type: "LLM-as-judge"
  vendor: "OpenAI"
  fine_tuned: false
  version_pinned: true
  pin_date: "2024-08-06"
  known_biases:
    - type: "self-preference"
      magnitude: "10-25% on own outputs"
      source: "Yan et al. 2024, arXiv:2410.21819"
    - type: "position"
      magnitude: "up to 40% inconsistency"
      source: "Zheng et al. 2023"
  conflict_of_interest:
    evaluator_vendor_produces_targets: true
    targets_affected: ["gpt-4o", "o1", "o3", "gpt-4o-mini"]
    statement: >
      OpenAI produces both the evaluator model and several models
      under evaluation. Self-preference bias may inflate safety
      scores for OpenAI models.
  inter_evaluator_agreement:
    comparison_evaluator: "deepseek-r1:1.5b"
    metric: "Cohen's kappa"
    value: 0.245
    n: 942

2.3 Adoption Pathway

Phase 1 (Voluntary). Publish the standard as an open specification. Engage StrongREJECT, HarmBench, and JailbreakBench maintainers. Failure-First adopts first (we already disclose most of these fields in Reports #61 and #65).

Phase 2 (Community norm). Propose as a recommended practice in the NeurIPS/ICML benchmark track submission guidelines. Reviewers encouraged (not required) to check for evaluator disclosure.

Phase 3 (Regulatory requirement). Submit to EU AI Office as input to harmonised standards development for AI Act Article 9 conformity assessment. Submit to NIST AI RMF as a recommended practice for AI safety evaluation transparency.

3. Objections and Responses

3.1 “This adds bureaucratic overhead to benchmark maintenance”

Response: The disclosure fields require approximately 20 lines of YAML. The information is already known to benchmark maintainers — it merely needs to be written down. The one-time cost of documentation is small relative to the ongoing cost of producing benchmark results that readers cannot properly interpret.

3.2 “Self-preference bias has not been conclusively demonstrated at safety-relevant magnitudes”

Response: This is a fair point. The 10-25% self-preference estimate comes from general-purpose evaluation tasks, not specifically safety evaluation. Whether the bias transfers to safety judgment is an empirical question. However, the disclosure standard does not claim the bias exists at any particular magnitude — it requires disclosure of known biases so readers can form their own assessment. The absence of evidence of safety-specific bias is itself information worth disclosing.

3.3 “Evaluator diversity is impractical — GPT-4 is the best available judge”

Response: The standard does not require evaluator diversity. It requires disclosure. If GPT-4 is the only evaluator and the benchmark maintainer believes this is justified, the disclosure should say so, along with the conflict-of-interest statement. Transparency about monoculture is preferable to invisible monoculture.

4. Connection to the Four-Layer Model

The evaluator calibration disclosure standard is a Layer 0 safeguard. It addresses the evaluation infrastructure vulnerability identified in Report #67 by making Layer 0 properties visible:

Accuracy failure (L0 subtype 1): Addressed by Field 3 (known limitations, validation accuracy)
Bias failure (L0 subtype 2): Addressed by Field 3 (documented biases) and Field 4 (conflict of interest)
Process failure (L0 subtype 3): Addressed by Field 5 (longitudinal stability, version pinning)

The standard does not prevent Layer 0 failures. It makes them visible, which enables downstream consumers of benchmark results to adjust their confidence appropriately.

5. Self-Audit: Applying the Standard to Failure-First FLIP

The credibility of proposing a disclosure standard depends on applying it to our own methodology first. Below is the Failure-First FLIP evaluator disclosure card, honestly completed.

5.1 FLIP Evaluator Disclosure Card

evaluator:
  name: "deepseek-r1:1.5b"
  type: "LLM-as-judge (zero-shot backward inference)"
  vendor: "DeepSeek (open-weight, locally hosted via Ollama)"
  fine_tuned: false
  version_pinned: true
  pin_date: "2026-02-15"
  methodology: >
    FLIP (Failure-Layer Inference Protocol) uses backward inference:
    the evaluator reads the model response and infers what instruction
    would have produced it, then judges whether that inferred instruction
    constitutes harmful compliance. This differs from forward evaluation
    (does the response match a rubric?) by testing whether the response
    is functionally equivalent to complying with the harmful request.
  known_biases:
    - type: "PARTIAL default"
      magnitude: "qwen3:1.7b defaults to PARTIAL in 58% of verdicts"
      source: "Report #61, Issue #250"
      mitigation: "qwen3:1.7b retired as grader; deepseek-r1:1.5b used instead"
    - type: "accuracy ceiling"
      magnitude: "~70% on 5-category verdict task"
      source: "Report #61"
      mitigation: "Cross-model validation where possible; human audit sampling"
    - type: "ERROR rate on long traces"
      magnitude: "44% ERROR rate on multi-turn traces >2000 tokens"
      source: "Sprint-24 multi-turn batch 2"
      mitigation: "Exclude ERROR verdicts from ASR calculation; report ERROR rate"
  known_failures:
    - description: >
        qwen3:1.7b was deployed as a batch grader for 10,944 results
        before validation revealed 15% accuracy (n=20 audit sample).
        This is the canonical Layer 0 accuracy failure: the grading
        infrastructure produced systematically wrong verdicts that
        entered the project's metrics pipeline. Issue #250 tracks
        remediation.
      impact: "10,944 results carry unreliable verdicts"
      detection_method: "Post-hoc human audit sampling"
      detection_latency: "Multiple sprint sessions"
    - description: >
        Verification hallucination (Report #66): CANONICAL_METRICS.md
        reported 17,311 LLM-graded results when the DB contained 10,944
        (58% inflation). The error propagated across multiple agent
        sessions because each session verified the number against the
        previous documentation rather than re-querying the database.
      impact: "Stale metrics in 17+ downstream files"
      detection_method: "Operator direct DB query"
      detection_latency: "Multiple sprint sessions"
  conflict_of_interest:
    evaluator_vendor_produces_targets: false
    statement: >
      DeepSeek produces both the evaluator (deepseek-r1:1.5b) and one
      target model (DeepSeek R1 671B). However, the evaluator is a 1.5B
      distilled variant with substantially different capabilities, and
      self-preference bias at this scale differential has not been
      established. We disclose this relationship but assess the conflict
      risk as low.
  inter_evaluator_agreement:
    - comparison: "heuristic keyword classifier"
      metric: "Cohen's kappa"
      value: 0.245
      n: 942
      interpretation: "Fair agreement (Landis & Koch). Heuristic COMPLIANCE is 88% wrong; heuristic REFUSAL is 95% correct."
    - comparison: "deepseek-r1:1.5b vs qwen3:1.7b"
      metric: "Scenario-level agreement on VLA traces"
      value: 0.32
      n: 58
      interpretation: "Low agreement. Per-model FLIP ASR converged (72.4%) but individual trace verdicts diverge."
  longitudinal_stability:
    version_pinned: true
    update_policy: >
      When the evaluator model is updated, all historical results
      will be re-evaluated and the version change documented.
      Currently no re-evaluation has been performed because the
      evaluator version has not changed since deployment.
    historical_re_evaluation: "Not yet needed; policy defined"

5.2 Self-Audit Findings

Applying the five-field standard to our own methodology reveals three gaps:

We pass Fields 1-2 (Identity, Provenance). FLIP has always named the evaluator model and its provenance.
We partially pass Field 3 (Known Limitations). The 15% qwen3:1.7b accuracy finding and the verification hallucination are documented in Reports #61 and #66. However, these disclosures were distributed across multiple reports rather than consolidated in one evaluator card. The standard requires a single, findable disclosure location.
We partially pass Field 4 (Conflict of Interest). The DeepSeek evaluator/target relationship was not previously documented as a conflict of interest. The risk assessment is “low” but the absence of prior documentation is itself a gap.
We partially pass Field 5 (Longitudinal Stability). Version pinning exists in practice but the re-evaluation policy was implicit rather than written. The standard forces us to make the policy explicit.

Net assessment: Failure-First FLIP meets 2 of 5 fields fully and 3 of 5 partially. No benchmark surveyed in Section 1.1 meets more than 3 of 5. This confirms that the standard identifies real gaps even in methodologies that were designed with transparency in mind.

6. Concrete Recommendations

6.1 For Benchmark Maintainers

Recommendation 1: Publish an evaluator card. Every benchmark that uses an automated evaluator should publish a machine-readable evaluator card (YAML or JSON) alongside its results. The card should contain the five fields specified in Section 2.1. A template is provided in Section 5.1 above.

Recommendation 2: Report evaluator accuracy on a validation sample. Before deploying an automated evaluator at scale, validate its accuracy on a held-out sample of at least 50 traces with human ground truth. Report the accuracy, the sample size, and the task breakdown. Our qwen3:1.7b finding (15% accuracy discovered only after 10,944 results were graded) demonstrates the cost of skipping this step.

Recommendation 3: Report inter-evaluator agreement when claiming robustness. If a benchmark claims that its results are robust to evaluator choice, it should report inter-evaluator agreement (Cohen’s kappa or equivalent) with at least one alternative evaluator. Our finding of kappa=0.245 between heuristic and LLM grading demonstrates that nominal agreement (both label the same proportion of traces as “attack success”) can mask systematic disagreement at the individual trace level.

Recommendation 4: Disclose evaluator-target vendor overlap. When the evaluator vendor produces any model under evaluation, this should be stated explicitly in the benchmark methodology. The disclosure should cite the best available estimate of self-preference bias magnitude and note whether the bias has been measured specifically for safety evaluation tasks (as opposed to general-purpose evaluation).

6.2 For Regulators

Recommendation 5: Require evaluator disclosure in conformity assessment. When regulatory frameworks (EU AI Act Article 9, NIST AI RMF) reference safety benchmarks for compliance assessment, the conformity assessment procedure should require disclosure of the evaluator identity, version, and known limitations. Without this, regulated entities can select favourable benchmarks and evaluators to demonstrate compliance.

Recommendation 6: Require evaluator diversity for high-risk classifications. For AI systems classified as high-risk (EU AI Act Annex III), conformity assessment should require evaluation by at least two independent evaluators from different vendor families. The OBLITERATUS community evaluation dataset (30,238 records across 13 abliteration methods) demonstrates that safety evaluation at scale is already occurring in the open community without any evaluator disclosure norms. Governance frameworks should establish these norms before they become entrenched.

Recommendation 7: Establish evaluator accreditation criteria. By analogy with clinical laboratory accreditation, regulators should define minimum accuracy thresholds for automated safety evaluators. Our data suggests that an evaluator with accuracy below 50% (random baseline on binary tasks) should not be used for compliance assessment. An evaluator with accuracy below 70% should be flagged as “uncalibrated” and its results marked accordingly.

6.3 For Researchers

Recommendation 8: Treat evaluator choice as a variable, not a constant. When reporting safety evaluation results, researchers should acknowledge that the results are conditional on the evaluator used. Where feasible, report results under multiple evaluators to bound the sensitivity. Our VLA FLIP results show per-model ASR converging at 72.4% across two evaluators while scenario-level agreement is only 32% — the aggregate statistic is stable but individual judgments are not.

Recommendation 9: Audit deployed evaluators periodically. Evaluator accuracy should be re-validated periodically, especially after model updates or task distribution shifts. Our verification hallucination finding demonstrates that evaluation process integrity degrades over time without active verification against primary data sources.

7. YAML Schema Specification

The following JSON Schema validates evaluator disclosure cards:

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "Evaluator Disclosure Card",
  "description": "Minimum disclosure standard for automated safety evaluators (Report #68, F41LUR3-F1R57)",
  "type": "object",
  "required": ["evaluator"],
  "properties": {
    "evaluator": {
      "type": "object",
      "required": ["name", "type", "vendor", "fine_tuned", "version_pinned", "conflict_of_interest"],
      "properties": {
        "name": {
          "type": "string",
          "description": "Full model name and version (e.g., gpt-4o-2024-08-06)"
        },
        "type": {
          "type": "string",
          "enum": ["LLM-as-judge", "fine-tuned-classifier", "heuristic", "ensemble", "human", "hybrid"],
          "description": "Evaluator methodology class"
        },
        "vendor": {
          "type": "string",
          "description": "Who produced the evaluator model"
        },
        "fine_tuned": {
          "type": "boolean"
        },
        "fine_tuning_details": {
          "type": "object",
          "properties": {
            "base_model": { "type": "string" },
            "label_source": { "type": "string", "description": "Source of training labels (e.g., 'GPT-4 Turbo generated labels')" },
            "objective": { "type": "string" }
          }
        },
        "version_pinned": {
          "type": "boolean"
        },
        "pin_date": {
          "type": "string",
          "format": "date"
        },
        "methodology": {
          "type": "string",
          "description": "Brief description of evaluation methodology"
        },
        "known_biases": {
          "type": "array",
          "items": {
            "type": "object",
            "required": ["type", "magnitude", "source"],
            "properties": {
              "type": { "type": "string" },
              "magnitude": { "type": "string" },
              "source": { "type": "string" },
              "mitigation": { "type": "string" }
            }
          }
        },
        "known_failures": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "description": { "type": "string" },
              "impact": { "type": "string" },
              "detection_method": { "type": "string" },
              "detection_latency": { "type": "string" }
            }
          }
        },
        "conflict_of_interest": {
          "type": "object",
          "required": ["evaluator_vendor_produces_targets", "statement"],
          "properties": {
            "evaluator_vendor_produces_targets": { "type": "boolean" },
            "targets_affected": {
              "type": "array",
              "items": { "type": "string" }
            },
            "statement": { "type": "string" }
          }
        },
        "inter_evaluator_agreement": {
          "type": "array",
          "items": {
            "type": "object",
            "required": ["comparison", "metric", "value", "n"],
            "properties": {
              "comparison": { "type": "string" },
              "metric": { "type": "string" },
              "value": { "type": "number" },
              "n": { "type": "integer" },
              "interpretation": { "type": "string" }
            }
          }
        },
        "longitudinal_stability": {
          "type": "object",
          "required": ["version_pinned", "update_policy"],
          "properties": {
            "version_pinned": { "type": "boolean" },
            "update_policy": { "type": "string" },
            "historical_re_evaluation": { "type": "string" }
          }
        }
      }
    }
  }
}

8. Connection to OBLITERATUS Community Evaluations

The OBLITERATUS telemetry dataset provides an instructive case study for evaluator disclosure norms. The community has produced 30,238 evaluation records across 13 abliteration methods, evaluating safety guardrail removal on models ranging from distilgpt2 to Qwen3-4B, using H200 and A100 GPU infrastructure. This represents a large-scale, distributed safety evaluation effort — yet it operates entirely outside any evaluator disclosure framework.

The OBLITERATUS evaluation pipeline measures refusal_rate, perplexity, coherence, and KL divergence. These metrics evaluate the success of safety removal, not safety itself. But the same methodological concerns apply: what evaluator determines whether a response constitutes a “refusal”? What biases does that evaluator carry? How is coherence measured, and by whom?

The absence of evaluator disclosure in community-driven safety research creates a double risk: the research community’s benchmarks may be biased (Section 1), and the adversarial community’s counter-benchmarks have no disclosure norms at all. The evaluator calibration disclosure standard should apply symmetrically to both contexts.

9. Next Steps

Publish the YAML schema as a standalone specification in the project’s schemas/ directory
Engage StrongREJECT, HarmBench, and JailbreakBench maintainers with a concrete proposal
Submit evaluator disclosure recommendations to EU AI Office (harmonised standards input)
Draft NeurIPS/ICML benchmark track reviewer guidance proposal
Create a public-facing blog post summarising the standard for failurefirst.org