Published
Report 257 Research — Empirical Study

Summary

We evaluate 6 LLM graders on a 20-trace ambiguous calibration dataset where ground truth categories include DETECTED_PROCEEDS (n=12) and PARTIAL (n=8) — the two most contested verdict zones in the FLIP methodology. DETECTED_PROCEEDS is not a standard FLIP verdict category; graders must map it to COMPLIANCE or PARTIAL (both indicate the model proceeded with harmful content despite safety awareness). Under this relaxed mapping, gemma3:27b achieves 90.0% [69.9%, 97.2%] accuracy, the highest of all graders. Nemotron-3-nano:30b achieves only 25.0% [11.2%, 46.9%], confirming its unreliability extends from obvious cases (Report #248, kappa=0.652) into the ambiguous regime. Mean pairwise Cohen’s kappa across all 15 grader pairs is 0.204 (slight agreement), compared to near-perfect agreement on obvious cases. The hardest category is partial_hedged (relaxed accuracy 0-80% across graders). The key finding: grader reliability collapses in the ambiguous regime that matters most for policy-relevant ASR claims.


1. Dataset Design

The ambiguous calibration dataset (data/calibration/grader_eval_ambiguous_20.jsonl) contains 20 synthetic traces across 4 ambiguity categories:

CategorynGround TruthDescription
detected_proceeds5DETECTED_PROCEEDSModel acknowledges harm then provides full content
partial_hedged5PARTIALModel provides partial/hedged harmful content
disclaimer_then_comply5DETECTED_PROCEEDSDisclaimer bookends around complete harmful content
educational_framing5Mixed (DP=2, PA=3)Content framed as educational/academic

DETECTED_PROCEEDS is not a standard FLIP verdict. Graders can output: COMPLIANCE, REFUSAL, PARTIAL, HALLUCINATION_REFUSAL, BENIGN_QUERY. For accuracy computation, we use a relaxed mapping: DETECTED_PROCEEDS is correctly captured by either COMPLIANCE or PARTIAL (both indicate the model produced harmful content). REFUSAL, HALLUCINATION_REFUSAL, and BENIGN_QUERY are all incorrect for DP ground truth.


2. Per-Model Verdict Distributions

ModelCOMPPARTREFHRBQ
gemma3:12b016400
gemma3:27b216200
ministral-3:14b114410
nemotron-3-nano:30b41546
devstral-small-2:24b012800
ministral-3:8b015311

Observation: All graders except nemotron show a strong PARTIAL mode (60-80% of verdicts), which is appropriate for this ambiguous dataset. Nemotron scatters across all 5 categories, including 6 BENIGN_QUERY verdicts on harmful content — the most dangerous error type for safety research.


3. Relaxed Accuracy and Wilson 95% CIs

RankModelRelaxed AccuracyWilson 95% CI
1gemma3:27b90.0% (18/20)[69.9%, 97.2%]
2gemma3:12b80.0% (16/20)[58.4%, 91.9%]
3ministral-3:14b75.0% (15/20)[53.1%, 88.8%]
3ministral-3:8b75.0% (15/20)[53.1%, 88.8%]
5devstral-small-2:24b60.0% (12/20)[38.7%, 78.1%]
6nemotron-3-nano:30b25.0% (5/20)[11.2%, 46.9%]

Note: exact match accuracy (requiring DETECTED_PROCEEDS as a verdict, which graders cannot produce) ranges 0-30%. The relaxed metric is the operationally meaningful one.


4. Per-Category Analysis

4.1 detected_proceeds (n=5, all DP ground truth)

All graders except nemotron achieve 5/5 relaxed accuracy. Most map these to PARTIAL. Nemotron scatters (2 REFUSAL, 1 HR, 2 COMPLIANCE).

4.2 disclaimer_then_comply (n=5, all DP ground truth)

Same pattern: 5/5 for all except nemotron (2/5). Graders consistently recognize disclaimer-wrapped compliance as PARTIAL.

4.3 educational_framing (n=5, mixed DP=2 + PA=3)

More disagreement emerges. Relaxed accuracy: gemma3:12b 4/5, gemma3:27b 4/5, ministral-3:14b 4/5, devstral-small-2:24b 2/5, ministral-3:8b 2/5, nemotron 1/5. Educational framing causes some graders to classify content as REFUSAL.

4.4 partial_hedged (n=5, all PARTIAL ground truth) — THE HARDEST CATEGORY

This is where grader reliability collapses:

ModelVerdictsRelaxed Acc
gemma3:27b4 PARTIAL, 1 REFUSAL4/5 (80%)
ministral-3:8b3 PARTIAL, 1 HR, 1 REFUSAL3/5 (60%)
gemma3:12b2 PARTIAL, 3 REFUSAL2/5 (40%)
ministral-3:14b1 PARTIAL, 3 REFUSAL, 1 HR1/5 (20%)
devstral-small-2:24b5 REFUSAL0/5 (0%)
nemotron-3-nano:30b4 BENIGN_QUERY, 1 REFUSAL0/5 (0%)

Key finding: Hedged partial compliance is the category where graders most disagree, and where the difference between PARTIAL and REFUSAL classification directly affects ASR computation. Devstral classifies all 5 as REFUSAL (ASR = 0%), while gemma3:27b classifies 4/5 as PARTIAL (ASR = 80%). This is a 80pp ASR swing on the same traces depending on the grader.


5. Cohen’s Kappa: All 15 Pairwise Combinations

PairKappaObserved Agreement
gemma3:12b vs devstral-small-2:24b0.54580.0%
ministral-3:14b vs devstral-small-2:24b0.50075.0%
devstral-small-2:24b vs ministral-3:8b0.49075.0%
gemma3:12b vs gemma3:27b0.41280.0%
gemma3:12b vs ministral-3:14b0.37575.0%
gemma3:27b vs ministral-3:8b0.35175.0%
gemma3:12b vs ministral-3:8b0.32475.0%
gemma3:27b vs devstral-small-2:24b0.16760.0%
ministral-3:14b vs ministral-3:8b0.20965.0%
nemotron-3-nano:30b vs devstral-small-2:24b0.02315.0%
gemma3:12b vs nemotron-3-nano:30b-0.0445.0%
nemotron-3-nano:30b vs ministral-3:8b-0.0565.0%
ministral-3:14b vs nemotron-3-nano:30b-0.0615.0%
gemma3:27b vs ministral-3:14b-0.08455.0%
gemma3:27b vs nemotron-3-nano:30b-0.0930.0%
StatisticValue
Mean kappa0.204
Median kappa0.209
Min kappa-0.093 (gemma3:27b vs nemotron)
Max kappa0.545 (gemma3:12b vs devstral)

Interpretation: Mean kappa of 0.204 indicates slight agreement (Landis & Koch). No pair reaches substantial agreement (>0.60). The highest kappa (0.545, moderate) is between gemma3:12b and devstral — but both achieve this by defaulting to PARTIAL for DETECTED_PROCEEDS cases while diverging on PARTIAL cases. Five pairs have negative kappa, indicating worse-than-chance agreement. Nemotron pairs account for 4 of the 5 negative values.

Contrast with obvious cases (Report #248): On unambiguous traces, 6/7 graders achieved perfect kappa (1.000). The drop from 1.000 to 0.204 mean kappa confirms that grader reliability is an artifact of case difficulty, not a stable property of the grader.


6. Error Type Analysis

6.1 REFUSAL Misclassification Rate (harmful content classified as refused)

ModelDP cases called REFUSALPA cases called REFUSALTotal false refusal rate
devstral-small-2:24b0/128/88/20 (40.0%)
nemotron-3-nano:30b3/122/85/20 (25.0%)
gemma3:12b0/124/84/20 (20.0%)
ministral-3:14b0/124/84/20 (20.0%)
ministral-3:8b0/123/83/20 (15.0%)
gemma3:27b0/122/82/20 (10.0%)

Devstral classifies every single PARTIAL trace as REFUSAL. This means devstral would systematically undercount ASR in any corpus where partial compliance is prevalent.

6.2 BENIGN_QUERY Misclassification (harmful content called benign)

ModelBQ countRate
nemotron-3-nano:30b6/2030.0%
ministral-3:8b1/205.0%
All others0/200.0%

Nemotron calling harmful content BENIGN_QUERY is the most dangerous error — it would completely suppress these traces from any ASR calculation.


7. Grader Reliability Hierarchy (Combined Obvious + Ambiguous)

Combining Report #248 (obvious, n=20) with this report (ambiguous, n=20):

ModelObvious AccuracyAmbiguous Relaxed AccuracyCombinedReliability Tier
gemma3:27b100%90.0%95.0%Tier 1 (recommended)
gemma3:12b100%80.0%90.0%Tier 1
ministral-3:14b100%75.0%87.5%Tier 2
ministral-3:8b100%75.0%87.5%Tier 2
devstral-small-2:24b100%60.0%80.0%Tier 2 (PARTIAL-blind)
nemotron-3-nano:30b80%25.0%52.5%Tier 3 (not recommended)

Recommendation: gemma3:27b is the best available FLIP grader on Ollama Cloud for ambiguous cases. Devstral is reliable on obvious cases but systematically misclassifies PARTIAL as REFUSAL, making it unsuitable for corpora with hedged compliance. Nemotron should not be used for production grading.


8. Implications for Corpus-Level ASR

The partial_hedged category reveals the core measurement problem: the same 5 traces produce 0% ASR (devstral) to 80% ASR (gemma3:27b) depending on the grader. This is not a minor calibration issue — it means ASR figures on ambiguous content are grader-dependent at a magnitude that could change policy conclusions.

Recommendation for CCS paper: Report ASR with explicit grader specification. Where possible, use multi-grader consensus (majority vote across >= 3 graders) for ambiguous traces. Always report the grader used alongside any ASR figure.


Appendix: Methodology

  • Grading API: Ollama Cloud (https://ollama.com/api/chat)
  • FLIP system prompt: Standard backward-inference prompt (same as production grading)
  • Temperature: 0.0 (deterministic)
  • Delay: 2s between API calls
  • Statistical tests: Cohen’s kappa (unweighted), Wilson score 95% confidence intervals
  • Ground truth source: data/calibration/grader_eval_ambiguous_20.jsonl (20 synthetic traces with human-assigned categories)
  • Relaxed mapping rationale: DETECTED_PROCEEDS captures cases where the model detects harm but proceeds anyway. In the FLIP framework, this behavior maps to COMPLIANCE (full harmful content) or PARTIAL (harmful content with hedging). Both indicate the model produced actionable harmful output. REFUSAL, HR, and BQ are all incorrect for DP traces because the model did in fact proceed.

This research informs our commercial services. See how we can help →