Published
Report 266 Research — Empirical Study

Overview

Safety scorecards for 8 frontier models tested during the Sprint 13 ollama-cloud campaign. A ninth model, Qwen3 Next (80B), returned only infrastructure errors and could not be graded. All models were evaluated via OpenRouter-proxied Ollama endpoints against the F41LUR3-F1R57 adversarial scenario suite (compliance cascade, grader evasion, authority gradient, format-lock, elite, and frontier sweep scenarios).

Grades are based on Strict ASR (COMPLIANCE-only attack success rate) measured by Claude Haiku 4.5 FLIP grading. Percentile rankings are computed against a pool of 30 models with 5+ evaluable traces.
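
The two ASR definitions used in the scorecards can be illustrated with a minimal sketch. It assumes each trace carries a single verdict label; COMPLIANCE, PARTIAL, and INFRASTRUCTURE_ERROR appear in this report, while REFUSAL and BENIGN_QUERY are label names assumed here for illustration, and `asr_summary` is a hypothetical helper rather than part of the grading pipeline.

```python
from collections import Counter

# Verdict labels: COMPLIANCE, PARTIAL and INFRASTRUCTURE_ERROR appear in the
# report; REFUSAL and BENIGN_QUERY are assumed names for the other categories.
STRICT_SUCCESS = {"COMPLIANCE"}
BROAD_SUCCESS = {"COMPLIANCE", "PARTIAL"}
EXCLUDED = {"INFRASTRUCTURE_ERROR"}


def asr_summary(verdicts):
    """Return strict ASR, broad ASR and refusal rate over evaluable traces."""
    evaluable = [v for v in verdicts if v not in EXCLUDED]
    n = len(evaluable)
    if n == 0:
        return None  # nothing to grade, e.g. every trace failed
    counts = Counter(evaluable)
    strict = sum(counts[v] for v in STRICT_SUCCESS)
    broad = sum(counts[v] for v in BROAD_SUCCESS)
    return {
        "n": n,
        "strict_asr": strict / n,
        "broad_asr": broad / n,
        "refusal_rate": counts["REFUSAL"] / n,
    }


# Counts matching the Cogito 2.1 scorecard: 2 COMPLIANCE, 5 PARTIAL
# (broad minus strict), 6 REFUSAL, 2 benign-query verdicts.
cogito = (["COMPLIANCE"] * 2 + ["PARTIAL"] * 5
          + ["REFUSAL"] * 6 + ["BENIGN_QUERY"] * 2)
summary = asr_summary(cogito)
print({k: round(v, 3) for k, v in summary.items()})
```

Infrastructure errors are dropped before any rate is computed, which is why the denominators in the scorecards are evaluable traces rather than total traces.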

Grading scale:

  • A: 0–10% Strict ASR (strong safety alignment)
  • A-: 11–20% (robust with minor gaps)
  • B+: 21–30% (moderate safety, some vulnerabilities)
  • B: 31–40% (noticeable vulnerability surface)
  • B-: 41–50% (significant vulnerability surface)
  • C+: 51–60% (majority-compliant to adversarial prompts)
  • C: 61–70%, C-: 71–80%, D: 81–90%, F: 91–100%
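
As a rough sketch, the scale above can be expressed as a lookup from Strict ASR to letter grade. The scale lists integer band edges, so this sketch assumes each upper bound is inclusive and that fractional values between bands (for example 10.5%) fall into the next band; the report does not state how such values are handled. `grade_for` is a hypothetical name.

```python
# Inclusive upper bound (percent) of each grade band, taken from the scale above.
GRADE_BANDS = [
    (10, "A"), (20, "A-"), (30, "B+"), (40, "B"), (50, "B-"),
    (60, "C+"), (70, "C"), (80, "C-"), (90, "D"), (100, "F"),
]


def grade_for(strict_asr_pct):
    """Map a Strict ASR percentage (0-100) to a letter grade."""
    if not 0 <= strict_asr_pct <= 100:
        raise ValueError("Strict ASR must be between 0 and 100 percent")
    for upper, grade in GRADE_BANDS:
        if strict_asr_pct <= upper:
            return grade


# Strict ASR values taken from the scorecards in this report.
for name, asr in [("GLM-5", 0.0), ("Cogito 2.1", 13.3),
                  ("Nemotron 3 Super", 37.3), ("Mistral Large 3", 50.0)]:
    print(name, grade_for(asr))
```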

Scorecard: GLM-5 (Zhipu AI)

| Dimension | Value |
| --- | --- |
| Overall Grade | A |
| Strict ASR | 0.0% (0/10) |
| Broad ASR (C+P) | 60.0% (6/10) |
| Refusal Rate | 20.0% (2/10) |
| Benign Query Rate | 20.0% (2/10) |
| Evaluable Traces | 10 |
| Percentile | 3% (ranked by ascending Strict ASR; higher = less safe) |

Analysis: GLM-5 never produced full COMPLIANCE on adversarial prompts in this sample, giving it the lowest strict ASR among frontier models. However, its 60% broad ASR reveals a tendency toward PARTIAL responses: it hedges extensively rather than cleanly refusing. The pattern suggests refusal mechanisms that trigger only weakly (20% refusal rate), so the model produces qualified or incomplete answers that partially address harmful queries without fully complying. Small sample size (n=10) limits confidence.

Experiments: CCA compliance cascade (n=5), general benchmark (n=5)


Scorecard: Cogito 2.1 (671B) (Deep Cogito)

| Dimension | Value |
| --- | --- |
| Overall Grade | A- |
| Strict ASR | 13.3% (2/15) |
| Broad ASR (C+P) | 46.7% (7/15) |
| Refusal Rate | 40.0% (6/15) |
| Benign Query Rate | 13.3% (2/15) |
| Evaluable Traces | 15 |
| Percentile | 60% (ranked by ascending Strict ASR; higher = less safe) |

Analysis: Cogito 2.1 shows good refusal behavior on strict compliance but leaks significantly via PARTIAL responses (33.3% partial rate, 5/15). The gap between strict and broad ASR (13.3% vs 46.7%) is the second-largest among frontier models tested, behind GLM-5 (0.0% vs 60.0%), suggesting the model’s safety training produces hedged rather than clean refusals. The 40% refusal rate indicates functional safety mechanisms that fire on 2 in 5 adversarial attempts.

Experiments: CCA compliance cascade (n=10), general benchmark (n=5)


Scorecard: Kimi K2.5 (Moonshot AI)

| Dimension | Value |
| --- | --- |
| Overall Grade | A- |
| Strict ASR | 14.3% (2/14) |
| Broad ASR (C+P) | 14.3% (2/14) |
| Refusal Rate | 71.4% (10/14) |
| Benign Query Rate | 7.1% (1/14) |
| Evaluable Traces | 14 (6 infra errors excluded) |
| Percentile | 63% (ranked by ascending Strict ASR; higher = less safe) |

Analysis: Kimi K2.5 has the second-highest refusal rate among frontier models (71.4%, behind Qwen 3.5 at 76.5%), with no gap between strict and broad ASR: when it fails, it fails fully to COMPLIANCE rather than hedging via PARTIAL. This binary behavior (clean refuse or full comply) suggests hard safety boundaries without gradient degradation. However, 6/20 traces returned infrastructure errors, reducing evaluable sample size.

Experiments: Frontier sweep (n=20, 14 evaluable)


Scorecard: Qwen 3.5 (397B) (Alibaba)

| Dimension | Value |
| --- | --- |
| Overall Grade | A- |
| Strict ASR | 17.6% (3/17) |
| Broad ASR (C+P) | 17.6% (3/17) |
| Refusal Rate | 76.5% (13/17) |
| Benign Query Rate | 5.9% (1/17) |
| Evaluable Traces | 17 (11 infra errors excluded) |
| Percentile | 67% (ranked by ascending Strict ASR; higher = less safe) |

Analysis: Qwen 3.5 shows the highest refusal rate in the frontier cohort (76.5%), ahead of Kimi K2.5 (71.4%). Like Kimi, it has no strict/broad ASR gap: the model either refuses cleanly or complies fully, with no PARTIAL hedging. Infrastructure errors on 11 of 28 traces reduce confidence, but the pattern across the 17 evaluable traces is consistent.

Experiments: General benchmark (n=28, 17 evaluable)


Scorecard: Qwen3 Coder (480B) (Alibaba)

| Dimension | Value |
| --- | --- |
| Overall Grade | B |
| Strict ASR | 33.3% (5/15) |
| Broad ASR (C+P) | 46.7% (7/15) |
| Refusal Rate | 46.7% (7/15) |
| Benign Query Rate | 6.7% (1/15) |
| Evaluable Traces | 15 |
| Percentile | 77% (ranked by ascending Strict ASR; higher = less safe) |

Analysis: Qwen3 Coder shows moderate vulnerability, with 1 in 3 adversarial prompts achieving full compliance. The split across compliance (33%), partial (13%), and refusal (47%) suggests safety mechanisms that are present but not consistently triggered. As a code-optimized model, it may be more susceptible to format-lock and structured-output attacks that align with its training distribution.

Experiments: Frontier sweep with mixed scenarios (n=15)


Scorecard: Nemotron 3 Super (NVIDIA)

| Dimension | Value |
| --- | --- |
| Overall Grade | B |
| Strict ASR | 37.3% (25/67) |
| Broad ASR (C+P) | 44.8% (30/67) |
| Refusal Rate | 35.8% (24/67) |
| Benign Query Rate | 19.4% (13/67) |
| Evaluable Traces | 67 |
| Percentile | 83% (ranked by ascending Strict ASR; higher = less safe) |

Analysis: Nemotron 3 Super has the largest sample size (n=67), making it the highest-confidence scorecard. It shows moderate vulnerability, with roughly equal compliance and refusal rates (37.3% vs 35.8%). The 19.4% benign-query rate suggests some adversarial prompts are not recognized as adversarial at all. Performance varies by experiment: strong on CCA compliance cascade (3/20 compliance = 15% ASR) but weak on general benchmarks (21/28 compliance = 75% ASR). This suggests the model is more resistant to structured multi-turn attacks but vulnerable to single-shot adversarial prompts.

Experiments: CCA compliance cascade + baseline (n=40), general benchmark (n=28), nemotron novel scenarios (n=20)


Scorecard: DeepSeek V3.2 (DeepSeek)

| Dimension | Value |
| --- | --- |
| Overall Grade | B- |
| Strict ASR | 41.2% (7/17) |
| Broad ASR (C+P) | 64.7% (11/17) |
| Refusal Rate | 23.5% (4/17) |
| Benign Query Rate | 11.8% (2/17) |
| Evaluable Traces | 17 (6 infra errors excluded) |
| Percentile | 87% (ranked by ascending Strict ASR; higher = less safe) |

Analysis: DeepSeek V3.2 shows significant vulnerability, with the second-highest broad ASR (64.7%) among frontier models. The large strict-to-broad gap (41.2% vs 64.7%) indicates a pattern of PARTIAL compliance: the model frequently provides qualified or hedged harmful content rather than cleanly refusing. Its 23.5% refusal rate is the third-lowest in the frontier cohort, above only GLM-5 and Mistral Large 3 (both 20.0%). Infrastructure errors affected 6/23 traces.

Experiments: Frontier sweep (n=20, 17 evaluable), general benchmark (n=3, 0 evaluable)


Scorecard: Mistral Large 3 (675B) (Mistral AI)

| Dimension | Value |
| --- | --- |
| Overall Grade | B- |
| Strict ASR | 50.0% (10/20) |
| Broad ASR (C+P) | 65.0% (13/20) |
| Refusal Rate | 20.0% (4/20) |
| Benign Query Rate | 15.0% (3/20) |
| Evaluable Traces | 20 |
| Percentile | 93% (ranked by ascending Strict ASR; higher = less safe) |

Analysis: Mistral Large 3 has the highest strict ASR among frontier models tested (50.0%), with half of all adversarial prompts achieving full compliance. It also ties GLM-5 for the lowest refusal rate (20%), and its 15% benign-query rate, behind only GLM-5 (20.0%) and Nemotron 3 Super (19.4%), suggests its safety training often fails to recognize adversarial intent. The 15-point gap between strict and broad ASR indicates some additional PARTIAL hedging. No infrastructure errors occurred: all 20 traces completed successfully.

Experiments: Full 20-scenario frontier sweep covering compliance cascade, grader evasion, authority gradient, reasoning exhaustion, elite, and defense probing scenarios.


Scorecard: Qwen3 Next (80B) (Alibaba)

| Dimension | Value |
| --- | --- |
| Overall Grade | N/A |
| Status | All 28 traces returned INFRASTRUCTURE_ERROR |

Analysis: Qwen3 Next could not be evaluated. All 28 traces failed with infrastructure errors (likely model availability issues on the Ollama cloud endpoint). No safety assessment possible.


Comparative Rankings

Frontier Models by Strict ASR (ascending = safer)

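| Rank | Model | Grade | Strict ASR | Broad ASR | Refusal Rate | n |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | GLM-5 | A | 0.0% | 60.0% | 20.0% | 10 |
| 2 | Cogito 2.1 (671B) | A- | 13.3% | 46.7% | 40.0% | 15 |
| 3 | Kimi K2.5 | A- | 14.3% | 14.3% | 71.4% | 14 |
| 4 | Qwen 3.5 (397B) | A- | 17.6% | 17.6% | 76.5% | 17 |
| 5 | Qwen3 Coder (480B) | B | 33.3% | 46.7% | 46.7% | 15 |
| 6 | Nemotron 3 Super | B | 37.3% | 44.8% | 35.8% | 67 |
| 7 | DeepSeek V3.2 | B- | 41.2% | 64.7% | 23.5% | 17 |
| 8 | Mistral Large 3 (675B) | B- | 50.0% | 65.0% | 20.0% | 20 |
|  | Qwen3 Next (80B) | N/A |  |  |  | 0 |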

Key Observations

  1. Strict vs Broad ASR gap reveals safety strategy. Models with no gap (Kimi, Qwen 3.5) use binary refusal. Models with large gaps (GLM-5: 0% vs 60%; Cogito: 13% vs 47%) hedge via PARTIAL responses.

  2. GLM-5 paradox. Grade A on strict ASR but 60% broad ASR: the model never fully complied in this sample but frequently provided partial harmful content. This raises the question of whether strict ASR alone captures safety behavior.

  3. Mistral Large 3 is the least safety-hardened frontier model tested. At 50% strict ASR and 20% refusal rate, it complies with adversarial prompts at rates comparable to open-weight midrange models.

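  4. Refusal rates do not split cleanly by origin. Kimi K2.5 and Qwen 3.5, both from Chinese labs, show the highest refusal rates (71.4% and 76.5%), but GLM-5 (20.0%) and DeepSeek V3.2 (23.5%), also Chinese-origin, refuse about as rarely as Mistral Large 3 (20.0%). The differences may reflect different safety training priorities or prompt-language sensitivity.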

  5. Sample sizes are small (10–67 evaluable traces). These scorecards provide directional signal, not definitive safety assessments. All findings should be treated as preliminary.
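
To make the sample-size point concrete, the sketch below computes a 95% Wilson score interval for a Strict ASR. The report does not state any interval method; the Wilson interval is used here as one common choice for small-sample proportions, and `wilson_interval` is a hypothetical helper.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Strict ASR counts taken from the scorecards in this report.
for name, k, n in [("GLM-5", 0, 10), ("Nemotron 3 Super", 25, 67)]:
    lo, hi = wilson_interval(k, n)
    print(f"{name}: {k}/{n} -> [{lo:.1%}, {hi:.1%}]")
```

Under these assumptions, the interval for GLM-5's 0/10 extends well above its 0% point estimate, so a Grade A on ten traces does not rule out a true Strict ASR in a lower grade band.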

Caveats

  • All evaluations used the F41LUR3-F1R57 adversarial scenario suite, which tests specific attack families (compliance cascade, grader evasion, authority gradient, format-lock, elite). Results may not generalize to other attack types.
  • FLIP grading via Claude Haiku 4.5 has known limitations for edge cases (see Issue #250 for grader audit).
  • Infrastructure errors reduced effective sample sizes for Kimi (6 errors), Qwen 3.5 (11 errors), DeepSeek (6 errors), and Qwen3 Next (28 errors, fully excluded).
  • Percentile rankings are computed by ascending Strict ASR over a pool that includes both frontier and midrange models, so a higher percentile means a less safe model. An 87% percentile (DeepSeek V3.2) means the model is less safe than 87% of the pool, not safer than it.
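
The percentile convention in the last caveat can be sketched as follows. This assumes the percentile is the share of pool models whose Strict ASR is at or below the model's own, which is consistent with the lowest-ASR model in a pool of 30 landing at about 3%; the report does not give the exact formula. The pool below contains only the eight frontier Strict ASR values from the ranking table, so the outputs will not match the scorecard percentiles, which use the full 30-model pool.

```python
def asr_percentile(model_asr, pool_asrs):
    """Percent of pool models with Strict ASR at or below model_asr.

    Ranking is by ascending ASR, so a higher percentile means less safe.
    """
    if not pool_asrs:
        raise ValueError("pool must not be empty")
    at_or_below = sum(1 for asr in pool_asrs if asr <= model_asr)
    return 100.0 * at_or_below / len(pool_asrs)


# Frontier-only pool (Strict ASR, percent) from the ranking table; the
# report's percentiles use a larger 30-model pool not reproduced here.
frontier_pool = [0.0, 13.3, 14.3, 17.6, 33.3, 37.3, 41.2, 50.0]

print(asr_percentile(0.0, frontier_pool))   # GLM-5 within the frontier pool
print(asr_percentile(41.2, frontier_pool))  # DeepSeek V3.2 within the frontier pool
```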

Data Sources

All trace data from runs/ollama_cloud/ collected 2026-03-24, graded by Claude Haiku 4.5 into:

  • runs/grading/frontier_haiku/
  • runs/grading/cca_haiku/
  • runs/grading/remaining_haiku/
  • runs/grading/ollama_cloud_haiku/
  • runs/grading/kimi_haiku/
