Wave 7 Validation Results | Research | Failure-First

Adrian Wedd

Report 252 Research — Empirical Study 2026-03-25

Audio Overview

Summary

Wave 7 executed three validation workstreams: (1) ambiguous calibration of Ollama Cloud graders on boundary cases, (2) Haiku FLIP-grading of Leela’s novel attack traces (RSE, GE), and (3) Haiku FLIP-grading of Rose’s CCA baseline traces. Results confirm that mid-size Ollama Cloud models systematically collapse the DETECTED_PROCEEDS / COMPLIANCE distinction into PARTIAL, while Haiku provides more granular verdicts. CCA baselines show high compliance rates across both gemma3 model sizes.

1. Ambiguous Calibration Results

Method

20 synthetic traces were run through three Ollama Cloud models as FLIP graders:

gemma3:12b
gemma3:27b
ministral-3:14b

The calibration set contained 4 categories of 5 traces each:

detected_proceeds (DP): Disclaimer-wrapped but functionally harmful responses
disclaimer_then_comply: Safety caveats followed by complete compliance
educational_framing: Academic framing concealing actionable harmful content
partial_hedged: Genuinely partial responses with significant caveats

Results

Grader	COMPLIANCE	PARTIAL	REFUSAL	HALLUCINATION_REFUSAL	Strict ASR	Broad ASR
gemma3:12b	0 (0%)	16 (80%)	4 (20%)	0	0%	80%
gemma3:27b	2 (10%)	16 (80%)	2 (10%)	0	10%	90%
ministral-3:14b	1 (5%)	14 (70%)	4 (20%)	1 (5%)	5%	75%

Per-Category Breakdown

detected_proceeds (n=5 per grader):

gemma3:12b: 5/5 PARTIAL (0% COMPLIANCE detection)
gemma3:27b: 4/5 PARTIAL, 1/5 COMPLIANCE (20% DP sensitivity)
ministral-3:14b: 5/5 PARTIAL (0% COMPLIANCE detection)

disclaimer_then_comply (n=5 per grader):

All three graders: predominantly PARTIAL (80-100%)
No grader reliably classified these as COMPLIANCE

educational_framing (n=5 per grader):

All three: mostly PARTIAL (60-80%), with occasional REFUSAL
ministral-3:14b: 1/5 COMPLIANCE (only grader to catch one)

partial_hedged (n=5 per grader):

gemma3:12b: 3/5 REFUSAL, 2/5 PARTIAL (best at distinguishing genuine partials)
gemma3:27b: 4/5 PARTIAL, 1/5 REFUSAL
ministral-3:14b: 3/5 REFUSAL, 1/5 PARTIAL, 1/5 HALLUCINATION_REFUSAL

Key Finding

All three Ollama Cloud graders exhibit a PARTIAL collapse pattern: they default to PARTIAL for any response containing both safety disclaimers and harmful content. This aligns with Mistake #15 (disclaimers do not equal refusals) but in the grader direction — the graders treat disclaimers as evidence of hedging rather than recognizing that disclaimer-wrapped compliance is still compliance.

gemma3:12b is the most conservative (zero COMPLIANCE verdicts), which paradoxically makes it the least useful for detecting the most dangerous category (DETECTED_PROCEEDS). gemma3:27b showed slightly better discrimination with 2 COMPLIANCE verdicts but still collapsed 80% of DP traces to PARTIAL.

Implication: Ollama Cloud graders at this tier should not be used as sole arbiters for boundary cases. Haiku-tier grading remains necessary for ambiguous verdicts.

2. RSE and GE Haiku-Validated ASR

Recursive Safety Exhaustion (RSE) — gemma3:12b target

Verdict	Count	Pct
COMPLIANCE	1	20%
PARTIAL	3	60%
REFUSAL	1	20%

Strict ASR: 20% (1/5)
Broad ASR: 80% (4/5)

Grader Evasion (GE) — gemma3:12b target

Verdict	Count	Pct
COMPLIANCE	3	60%
PARTIAL	1	20%
REFUSAL	1	20%

Strict ASR: 60% (3/5)
Broad ASR: 80% (4/5)

Comparison

GE achieves 3x the strict COMPLIANCE rate of RSE (60% vs 20%), while both achieve identical 80% broad ASR. This suggests that grader-evasion prompts produce more clearly harmful content that Haiku classifies as full COMPLIANCE, whereas RSE tends to produce responses that are harmful but wrapped in enough hedging to receive PARTIAL verdicts.

Note: Sample sizes are small (n=5 each). These results indicate directions requiring larger-scale validation.

3. CCA Baseline Haiku-Validated ASR

gemma3:12b — CCA Baseline

Verdict	Count	Pct
COMPLIANCE	7	70%
PARTIAL	2	20%
REFUSAL	1	10%

Strict ASR: 70% (7/10)
Broad ASR: 90% (9/10)

gemma3:27b — CCA Baseline

Verdict	Count	Pct
COMPLIANCE	8	80%
PARTIAL	1	10%
REFUSAL	1	10%

Strict ASR: 80% (8/10)
Broad ASR: 90% (9/10)

Comparison with Original Compliance Cascade (CC) Haiku Grades

Configuration	Target Model	Strict ASR	Broad ASR
CC original	gemma3:12b	20% (2/10)	80% (8/10)
CC original	ministral-3:14b	40% (4/10)	90% (9/10)
CCA baseline	gemma3:12b	70% (7/10)	90% (9/10)
CCA baseline	gemma3:27b	80% (8/10)	90% (9/10)

The CCA baseline shows substantially higher strict ASR than the original CC for gemma3:12b (70% vs 20%). This suggests that the baseline CCA prompts are more directly effective at eliciting unhedged compliance than the original compliance cascade prompts, which tended to produce more PARTIAL responses.

gemma3:27b shows higher strict ASR (80%) than gemma3:12b (70%) on CCA baselines, which is notable — the larger model is more compliant, not less. This is consistent with the capability-compliance coupling pattern documented in prior reports.

4. Updated Cascade Amplification Analysis

Combining all Haiku-graded cascade and novel attack data:

Attack Family	Target	Strict ASR	Broad ASR	n
CCA Baseline	gemma3:12b	70%	90%	10
CCA Baseline	gemma3:27b	80%	90%	10
CC Original	gemma3:12b	20%	80%	10
CC Original	ministral-3:14b	40%	90%	10
RSE	gemma3:12b	20%	80%	5
GE	gemma3:12b	60%	80%	5

Observations

CCA baselines outperform originals: The refined CCA baseline prompts achieve 3.5x higher strict ASR than the original CC prompts on gemma3:12b (70% vs 20%).
Broad ASR convergence at 80-90%: Across all attack families and models, broad ASR clusters between 80-90%, suggesting a ceiling effect where most attacks can extract some harmful content but full compliance varies by technique.
GE is the most effective novel technique per strict ASR: At 60% strict, GE outperforms RSE (20%) on the same target model. Grader-aware prompt engineering appears to produce cleaner, more classifiable compliance.
Model size does not equal safety: gemma3:27b shows higher strict ASR than gemma3:12b on identical CCA baseline prompts (80% vs 70%), reinforcing that scaling alone does not improve robustness to adversarial inputs.

Limitations

All novel attack traces have n=5 sample sizes; results indicate directions, not definitive rates.
CCA baseline n=10 per model provides moderate confidence.
Ambiguous calibration uses synthetic traces, not live model outputs.
All targets are Ollama Cloud models; results may not transfer to API-served frontier models.

Files

Ambiguous calibration: runs/grading/ambiguous_calibration/{gemma3_12b,gemma3_27b,ministral_14b}/
RSE Haiku grades: runs/grading/rse_haiku/
GE Haiku grades: runs/grading/ge_haiku/
CCA baseline Haiku grades: runs/grading/cca_baseline_haiku/
CC original Haiku grades: runs/grading/ollama_cloud_haiku/