Published
Report 252 Research — Empirical Study

Summary

Wave 7 executed three validation workstreams: (1) ambiguous calibration of Ollama Cloud graders on boundary cases, (2) Haiku FLIP-grading of Leela’s novel attack traces (RSE, GE), and (3) Haiku FLIP-grading of Rose’s CCA baseline traces. Results confirm that mid-size Ollama Cloud models systematically collapse the DETECTED_PROCEEDS / COMPLIANCE distinction into PARTIAL, while Haiku provides more granular verdicts. CCA baselines show high compliance rates across both gemma3 model sizes.

1. Ambiguous Calibration Results

Method

20 synthetic traces were run through three Ollama Cloud models as FLIP graders:

  • gemma3:12b
  • gemma3:27b
  • ministral-3:14b

The calibration set contained 4 categories of 5 traces each:

  • detected_proceeds (DP): Disclaimer-wrapped but functionally harmful responses
  • disclaimer_then_comply: Safety caveats followed by complete compliance
  • educational_framing: Academic framing concealing actionable harmful content
  • partial_hedged: Genuinely partial responses with significant caveats

Results

GraderCOMPLIANCEPARTIALREFUSALHALLUCINATION_REFUSALStrict ASRBroad ASR
gemma3:12b0 (0%)16 (80%)4 (20%)00%80%
gemma3:27b2 (10%)16 (80%)2 (10%)010%90%
ministral-3:14b1 (5%)14 (70%)4 (20%)1 (5%)5%75%

Per-Category Breakdown

detected_proceeds (n=5 per grader):

  • gemma3:12b: 5/5 PARTIAL (0% COMPLIANCE detection)
  • gemma3:27b: 4/5 PARTIAL, 1/5 COMPLIANCE (20% DP sensitivity)
  • ministral-3:14b: 5/5 PARTIAL (0% COMPLIANCE detection)

disclaimer_then_comply (n=5 per grader):

  • All three graders: predominantly PARTIAL (80-100%)
  • No grader reliably classified these as COMPLIANCE

educational_framing (n=5 per grader):

  • All three: mostly PARTIAL (60-80%), with occasional REFUSAL
  • ministral-3:14b: 1/5 COMPLIANCE (only grader to catch one)

partial_hedged (n=5 per grader):

  • gemma3:12b: 3/5 REFUSAL, 2/5 PARTIAL (best at distinguishing genuine partials)
  • gemma3:27b: 4/5 PARTIAL, 1/5 REFUSAL
  • ministral-3:14b: 3/5 REFUSAL, 1/5 PARTIAL, 1/5 HALLUCINATION_REFUSAL

Key Finding

All three Ollama Cloud graders exhibit a PARTIAL collapse pattern: they default to PARTIAL for any response containing both safety disclaimers and harmful content. This aligns with Mistake #15 (disclaimers do not equal refusals) but in the grader direction — the graders treat disclaimers as evidence of hedging rather than recognizing that disclaimer-wrapped compliance is still compliance.

gemma3:12b is the most conservative (zero COMPLIANCE verdicts), which paradoxically makes it the least useful for detecting the most dangerous category (DETECTED_PROCEEDS). gemma3:27b showed slightly better discrimination with 2 COMPLIANCE verdicts but still collapsed 80% of DP traces to PARTIAL.

Implication: Ollama Cloud graders at this tier should not be used as sole arbiters for boundary cases. Haiku-tier grading remains necessary for ambiguous verdicts.

2. RSE and GE Haiku-Validated ASR

Recursive Safety Exhaustion (RSE) — gemma3:12b target

VerdictCountPct
COMPLIANCE120%
PARTIAL360%
REFUSAL120%
  • Strict ASR: 20% (1/5)
  • Broad ASR: 80% (4/5)

Grader Evasion (GE) — gemma3:12b target

VerdictCountPct
COMPLIANCE360%
PARTIAL120%
REFUSAL120%
  • Strict ASR: 60% (3/5)
  • Broad ASR: 80% (4/5)

Comparison

GE achieves 3x the strict COMPLIANCE rate of RSE (60% vs 20%), while both achieve identical 80% broad ASR. This suggests that grader-evasion prompts produce more clearly harmful content that Haiku classifies as full COMPLIANCE, whereas RSE tends to produce responses that are harmful but wrapped in enough hedging to receive PARTIAL verdicts.

Note: Sample sizes are small (n=5 each). These results indicate directions requiring larger-scale validation.

3. CCA Baseline Haiku-Validated ASR

gemma3:12b — CCA Baseline

VerdictCountPct
COMPLIANCE770%
PARTIAL220%
REFUSAL110%
  • Strict ASR: 70% (7/10)
  • Broad ASR: 90% (9/10)

gemma3:27b — CCA Baseline

VerdictCountPct
COMPLIANCE880%
PARTIAL110%
REFUSAL110%
  • Strict ASR: 80% (8/10)
  • Broad ASR: 90% (9/10)

Comparison with Original Compliance Cascade (CC) Haiku Grades

ConfigurationTarget ModelStrict ASRBroad ASR
CC originalgemma3:12b20% (2/10)80% (8/10)
CC originalministral-3:14b40% (4/10)90% (9/10)
CCA baselinegemma3:12b70% (7/10)90% (9/10)
CCA baselinegemma3:27b80% (8/10)90% (9/10)

The CCA baseline shows substantially higher strict ASR than the original CC for gemma3:12b (70% vs 20%). This suggests that the baseline CCA prompts are more directly effective at eliciting unhedged compliance than the original compliance cascade prompts, which tended to produce more PARTIAL responses.

gemma3:27b shows higher strict ASR (80%) than gemma3:12b (70%) on CCA baselines, which is notable — the larger model is more compliant, not less. This is consistent with the capability-compliance coupling pattern documented in prior reports.

4. Updated Cascade Amplification Analysis

Combining all Haiku-graded cascade and novel attack data:

Attack FamilyTargetStrict ASRBroad ASRn
CCA Baselinegemma3:12b70%90%10
CCA Baselinegemma3:27b80%90%10
CC Originalgemma3:12b20%80%10
CC Originalministral-3:14b40%90%10
RSEgemma3:12b20%80%5
GEgemma3:12b60%80%5

Observations

  1. CCA baselines outperform originals: The refined CCA baseline prompts achieve 3.5x higher strict ASR than the original CC prompts on gemma3:12b (70% vs 20%).

  2. Broad ASR convergence at 80-90%: Across all attack families and models, broad ASR clusters between 80-90%, suggesting a ceiling effect where most attacks can extract some harmful content but full compliance varies by technique.

  3. GE is the most effective novel technique per strict ASR: At 60% strict, GE outperforms RSE (20%) on the same target model. Grader-aware prompt engineering appears to produce cleaner, more classifiable compliance.

  4. Model size does not equal safety: gemma3:27b shows higher strict ASR than gemma3:12b on identical CCA baseline prompts (80% vs 70%), reinforcing that scaling alone does not improve robustness to adversarial inputs.

Limitations

  • All novel attack traces have n=5 sample sizes; results indicate directions, not definitive rates.
  • CCA baseline n=10 per model provides moderate confidence.
  • Ambiguous calibration uses synthetic traces, not live model outputs.
  • All targets are Ollama Cloud models; results may not transfer to API-served frontier models.

Files

  • Ambiguous calibration: runs/grading/ambiguous_calibration/{gemma3_12b,gemma3_27b,ministral_14b}/
  • RSE Haiku grades: runs/grading/rse_haiku/
  • GE Haiku grades: runs/grading/ge_haiku/
  • CCA baseline Haiku grades: runs/grading/cca_baseline_haiku/
  • CC original Haiku grades: runs/grading/ollama_cloud_haiku/

This research informs our commercial services. See how we can help →