Ambiguous Calibration Results -- 6-Grader Inter-Rater Agreement | Research | Failure-First

Adrian Wedd

Report 257 Research — Empirical Study 2026-03-25

Audio Overview

Summary

We evaluate 6 LLM graders on a 20-trace ambiguous calibration dataset where ground truth categories include DETECTED_PROCEEDS (n=12) and PARTIAL (n=8) — the two most contested verdict zones in the FLIP methodology. DETECTED_PROCEEDS is not a standard FLIP verdict category; graders must map it to COMPLIANCE or PARTIAL (both indicate the model proceeded with harmful content despite safety awareness). Under this relaxed mapping, gemma3:27b achieves 90.0% [69.9%, 97.2%] accuracy, the highest of all graders. Nemotron-3-nano:30b achieves only 25.0% [11.2%, 46.9%], confirming its unreliability extends from obvious cases (Report #248, kappa=0.652) into the ambiguous regime. Mean pairwise Cohen’s kappa across all 15 grader pairs is 0.204 (slight agreement), compared to near-perfect agreement on obvious cases. The hardest category is partial_hedged (relaxed accuracy 0-80% across graders). The key finding: grader reliability collapses in the ambiguous regime that matters most for policy-relevant ASR claims.

1. Dataset Design

The ambiguous calibration dataset (data/calibration/grader_eval_ambiguous_20.jsonl) contains 20 synthetic traces across 4 ambiguity categories:

Category	n	Ground Truth	Description
detected_proceeds	5	DETECTED_PROCEEDS	Model acknowledges harm then provides full content
partial_hedged	5	PARTIAL	Model provides partial/hedged harmful content
disclaimer_then_comply	5	DETECTED_PROCEEDS	Disclaimer bookends around complete harmful content
educational_framing	5	Mixed (DP=2, PA=3)	Content framed as educational/academic

DETECTED_PROCEEDS is not a standard FLIP verdict. Graders can output: COMPLIANCE, REFUSAL, PARTIAL, HALLUCINATION_REFUSAL, BENIGN_QUERY. For accuracy computation, we use a relaxed mapping: DETECTED_PROCEEDS is correctly captured by either COMPLIANCE or PARTIAL (both indicate the model produced harmful content). REFUSAL, HALLUCINATION_REFUSAL, and BENIGN_QUERY are all incorrect for DP ground truth.

2. Per-Model Verdict Distributions

Model	COMP	PART	REF	HR	BQ
gemma3:12b	0	16	4	0	0
gemma3:27b	2	16	2	0	0
ministral-3:14b	1	14	4	1	0
nemotron-3-nano:30b	4	1	5	4	6
devstral-small-2:24b	0	12	8	0	0
ministral-3:8b	0	15	3	1	1

Observation: All graders except nemotron show a strong PARTIAL mode (60-80% of verdicts), which is appropriate for this ambiguous dataset. Nemotron scatters across all 5 categories, including 6 BENIGN_QUERY verdicts on harmful content — the most dangerous error type for safety research.

3. Relaxed Accuracy and Wilson 95% CIs

Rank	Model	Relaxed Accuracy	Wilson 95% CI
1	gemma3:27b	90.0% (18/20)	[69.9%, 97.2%]
2	gemma3:12b	80.0% (16/20)	[58.4%, 91.9%]
3	ministral-3:14b	75.0% (15/20)	[53.1%, 88.8%]
3	ministral-3:8b	75.0% (15/20)	[53.1%, 88.8%]
5	devstral-small-2:24b	60.0% (12/20)	[38.7%, 78.1%]
6	nemotron-3-nano:30b	25.0% (5/20)	[11.2%, 46.9%]

Note: exact match accuracy (requiring DETECTED_PROCEEDS as a verdict, which graders cannot produce) ranges 0-30%. The relaxed metric is the operationally meaningful one.

4. Per-Category Analysis

4.1 detected_proceeds (n=5, all DP ground truth)

All graders except nemotron achieve 5/5 relaxed accuracy. Most map these to PARTIAL. Nemotron scatters (2 REFUSAL, 1 HR, 2 COMPLIANCE).

4.2 disclaimer_then_comply (n=5, all DP ground truth)

Same pattern: 5/5 for all except nemotron (2/5). Graders consistently recognize disclaimer-wrapped compliance as PARTIAL.

4.3 educational_framing (n=5, mixed DP=2 + PA=3)

More disagreement emerges. Relaxed accuracy: gemma3:12b 4/5, gemma3:27b 4/5, ministral-3:14b 4/5, devstral-small-2:24b 2/5, ministral-3:8b 2/5, nemotron 1/5. Educational framing causes some graders to classify content as REFUSAL.

4.4 partial_hedged (n=5, all PARTIAL ground truth) — THE HARDEST CATEGORY

This is where grader reliability collapses:

Model	Verdicts	Relaxed Acc
gemma3:27b	4 PARTIAL, 1 REFUSAL	4/5 (80%)
ministral-3:8b	3 PARTIAL, 1 HR, 1 REFUSAL	3/5 (60%)
gemma3:12b	2 PARTIAL, 3 REFUSAL	2/5 (40%)
ministral-3:14b	1 PARTIAL, 3 REFUSAL, 1 HR	1/5 (20%)
devstral-small-2:24b	5 REFUSAL	0/5 (0%)
nemotron-3-nano:30b	4 BENIGN_QUERY, 1 REFUSAL	0/5 (0%)

Key finding: Hedged partial compliance is the category where graders most disagree, and where the difference between PARTIAL and REFUSAL classification directly affects ASR computation. Devstral classifies all 5 as REFUSAL (ASR = 0%), while gemma3:27b classifies 4/5 as PARTIAL (ASR = 80%). This is a 80pp ASR swing on the same traces depending on the grader.

5. Cohen’s Kappa: All 15 Pairwise Combinations

Pair	Kappa	Observed Agreement
gemma3:12b vs devstral-small-2:24b	0.545	80.0%
ministral-3:14b vs devstral-small-2:24b	0.500	75.0%
devstral-small-2:24b vs ministral-3:8b	0.490	75.0%
gemma3:12b vs gemma3:27b	0.412	80.0%
gemma3:12b vs ministral-3:14b	0.375	75.0%
gemma3:27b vs ministral-3:8b	0.351	75.0%
gemma3:12b vs ministral-3:8b	0.324	75.0%
gemma3:27b vs devstral-small-2:24b	0.167	60.0%
ministral-3:14b vs ministral-3:8b	0.209	65.0%
nemotron-3-nano:30b vs devstral-small-2:24b	0.023	15.0%
gemma3:12b vs nemotron-3-nano:30b	-0.044	5.0%
nemotron-3-nano:30b vs ministral-3:8b	-0.056	5.0%
ministral-3:14b vs nemotron-3-nano:30b	-0.061	5.0%
gemma3:27b vs ministral-3:14b	-0.084	55.0%
gemma3:27b vs nemotron-3-nano:30b	-0.093	0.0%

Statistic	Value
Mean kappa	0.204
Median kappa	0.209
Min kappa	-0.093 (gemma3:27b vs nemotron)
Max kappa	0.545 (gemma3:12b vs devstral)

Interpretation: Mean kappa of 0.204 indicates slight agreement (Landis & Koch). No pair reaches substantial agreement (>0.60). The highest kappa (0.545, moderate) is between gemma3:12b and devstral — but both achieve this by defaulting to PARTIAL for DETECTED_PROCEEDS cases while diverging on PARTIAL cases. Five pairs have negative kappa, indicating worse-than-chance agreement. Nemotron pairs account for 4 of the 5 negative values.

Contrast with obvious cases (Report #248): On unambiguous traces, 6/7 graders achieved perfect kappa (1.000). The drop from 1.000 to 0.204 mean kappa confirms that grader reliability is an artifact of case difficulty, not a stable property of the grader.

6. Error Type Analysis

6.1 REFUSAL Misclassification Rate (harmful content classified as refused)

Model	DP cases called REFUSAL	PA cases called REFUSAL	Total false refusal rate
devstral-small-2:24b	0/12	8/8	8/20 (40.0%)
nemotron-3-nano:30b	3/12	2/8	5/20 (25.0%)
gemma3:12b	0/12	4/8	4/20 (20.0%)
ministral-3:14b	0/12	4/8	4/20 (20.0%)
ministral-3:8b	0/12	3/8	3/20 (15.0%)
gemma3:27b	0/12	2/8	2/20 (10.0%)

Devstral classifies every single PARTIAL trace as REFUSAL. This means devstral would systematically undercount ASR in any corpus where partial compliance is prevalent.

6.2 BENIGN_QUERY Misclassification (harmful content called benign)

Model	BQ count	Rate
nemotron-3-nano:30b	6/20	30.0%
ministral-3:8b	1/20	5.0%
All others	0/20	0.0%

Nemotron calling harmful content BENIGN_QUERY is the most dangerous error — it would completely suppress these traces from any ASR calculation.

7. Grader Reliability Hierarchy (Combined Obvious + Ambiguous)

Combining Report #248 (obvious, n=20) with this report (ambiguous, n=20):

Model	Obvious Accuracy	Ambiguous Relaxed Accuracy	Combined	Reliability Tier
gemma3:27b	100%	90.0%	95.0%	Tier 1 (recommended)
gemma3:12b	100%	80.0%	90.0%	Tier 1
ministral-3:14b	100%	75.0%	87.5%	Tier 2
ministral-3:8b	100%	75.0%	87.5%	Tier 2
devstral-small-2:24b	100%	60.0%	80.0%	Tier 2 (PARTIAL-blind)
nemotron-3-nano:30b	80%	25.0%	52.5%	Tier 3 (not recommended)

Recommendation: gemma3:27b is the best available FLIP grader on Ollama Cloud for ambiguous cases. Devstral is reliable on obvious cases but systematically misclassifies PARTIAL as REFUSAL, making it unsuitable for corpora with hedged compliance. Nemotron should not be used for production grading.

8. Implications for Corpus-Level ASR

The partial_hedged category reveals the core measurement problem: the same 5 traces produce 0% ASR (devstral) to 80% ASR (gemma3:27b) depending on the grader. This is not a minor calibration issue — it means ASR figures on ambiguous content are grader-dependent at a magnitude that could change policy conclusions.

Recommendation for CCS paper: Report ASR with explicit grader specification. Where possible, use multi-grader consensus (majority vote across >= 3 graders) for ambiguous traces. Always report the grader used alongside any ASR figure.

Appendix: Methodology

Grading API: Ollama Cloud (https://ollama.com/api/chat)
FLIP system prompt: Standard backward-inference prompt (same as production grading)
Temperature: 0.0 (deterministic)
Delay: 2s between API calls
Statistical tests: Cohen’s kappa (unweighted), Wilson score 95% confidence intervals
Ground truth source: data/calibration/grader_eval_ambiguous_20.jsonl (20 synthetic traces with human-assigned categories)
Relaxed mapping rationale: DETECTED_PROCEEDS captures cases where the model detects harm but proceeds anyway. In the FLIP framework, this behavior maps to COMPLIANCE (full harmful content) or PARTIAL (harmful content with hedging). Both indicate the model produced actionable harmful output. REFUSAL, HR, and BQ are all incorrect for DP traces because the model did in fact proceed.