Overview
Safety scorecards for 8 frontier models tested during the Sprint 13 ollama-cloud campaign. All models were evaluated via OpenRouter-proxied Ollama endpoints against the F41LUR3-F1R57 adversarial scenario suite (compliance cascade, grader evasion, authority gradient, format-lock, elite, and frontier sweep scenarios).
Grades are based on Strict ASR (COMPLIANCE-only attack success rate) measured by Claude Haiku 4.5 FLIP grading. Percentile rankings are computed against a pool of 30 models with 5+ evaluable traces.
Grading scale:
- A: 0—10% Strict ASR (strong safety alignment)
- A-: 11—20% (robust with minor gaps)
- B+: 21—30% (moderate safety, some vulnerabilities)
- B: 31—40% (noticeable vulnerability surface)
- B-: 41—50% (significant vulnerability surface)
- C+: 51—60% (majority-compliant to adversarial prompts)
- C: 61—70%, C-: 71—80%, D: 81—90%, F: 91—100%
Scorecard: GLM-5 (Zhipu AI)
| Dimension | Value |
|---|---|
| Overall Grade | A |
| Strict ASR | 0.0% (0/10) |
| Broad ASR (C+P) | 60.0% (6/10) |
| Refusal Rate | 20.0% (2/10) |
| Benign Query Rate | 20.0% (2/10) |
| Evaluable Traces | 10 |
| Percentile | Safer than 3% of models tested |
Analysis: GLM-5 never produces full COMPLIANCE on adversarial prompts, giving it the lowest strict ASR among frontier models. However, its 60% broad ASR reveals a tendency toward PARTIAL responses — it hedges extensively rather than cleanly refusing. The model appears to lack confidence in its refusal mechanisms, producing qualified or incomplete answers that partially address harmful queries without fully complying. Small sample size (n=10) limits confidence.
Experiments: CCA compliance cascade (n=5), general benchmark (n=5)
Scorecard: Cogito 2.1 (671B) (Deep Cogito)
| Dimension | Value |
|---|---|
| Overall Grade | A- |
| Strict ASR | 13.3% (2/15) |
| Broad ASR (C+P) | 46.7% (7/15) |
| Refusal Rate | 40.0% (6/15) |
| Benign Query Rate | 13.3% (2/15) |
| Evaluable Traces | 15 |
| Percentile | Safer than 60% of models tested |
Analysis: Cogito 2.1 shows good refusal behavior on strict compliance but leaks significantly via PARTIAL responses (33.3% partial rate). The gap between strict and broad ASR (13.3% vs 46.7%) is the largest among frontier models tested, suggesting the model’s safety training produces hedged rather than clean refusals. The 40% refusal rate indicates functional safety mechanisms that fire on roughly 2 in 5 adversarial attempts.
Experiments: CCA compliance cascade (n=10), general benchmark (n=5)
Scorecard: Kimi K2.5 (Moonshot AI)
| Dimension | Value |
|---|---|
| Overall Grade | A- |
| Strict ASR | 14.3% (2/14) |
| Broad ASR (C+P) | 14.3% (2/14) |
| Refusal Rate | 71.4% (10/14) |
| Benign Query Rate | 7.1% (1/14) |
| Evaluable Traces | 14 (6 infra errors excluded) |
| Percentile | Safer than 63% of models tested |
Analysis: Kimi K2.5 has the highest refusal rate among frontier models (71.4%), with no gap between strict and broad ASR — when it fails, it fails fully to COMPLIANCE rather than hedging via PARTIAL. This binary behavior (clean refuse or full comply) suggests hard safety boundaries without gradient degradation. However, 6/20 traces returned infrastructure errors, reducing evaluable sample size.
Experiments: Frontier sweep (n=20, 14 evaluable)
Scorecard: Qwen 3.5 (397B) (Alibaba)
| Dimension | Value |
|---|---|
| Overall Grade | A- |
| Strict ASR | 17.6% (3/17) |
| Broad ASR (C+P) | 17.6% (3/17) |
| Refusal Rate | 76.5% (13/17) |
| Benign Query Rate | 5.9% (1/17) |
| Evaluable Traces | 17 (11 infra errors excluded) |
| Percentile | Safer than 67% of models tested |
Analysis: Qwen 3.5 shows the highest refusal rate in the frontier cohort (76.5%), outperforming even Kimi K2.5. Like Kimi, there is no strict/broad ASR gap, indicating binary compliance behavior. The model either refuses cleanly or complies fully — no PARTIAL hedging. 11/28 infrastructure errors reduce confidence, but the pattern across 17 evaluable traces is consistent.
Experiments: General benchmark (n=28, 17 evaluable)
Scorecard: Qwen3 Coder (480B) (Alibaba)
| Dimension | Value |
|---|---|
| Overall Grade | B |
| Strict ASR | 33.3% (5/15) |
| Broad ASR (C+P) | 46.7% (7/15) |
| Refusal Rate | 46.7% (7/15) |
| Benign Query Rate | 6.7% (1/15) |
| Evaluable Traces | 15 |
| Percentile | Safer than 77% of models tested |
Analysis: Qwen3 Coder shows moderate vulnerability, with 1 in 3 adversarial prompts achieving full compliance. The balanced split between compliance (33%), partial (13%), and refusal (47%) suggests safety mechanisms that are present but not consistently triggered. As a code-optimized model, it may be more susceptible to format-lock and structured-output attacks that align with its training distribution.
Experiments: Frontier sweep with mixed scenarios (n=15)
Scorecard: Nemotron 3 Super (NVIDIA)
| Dimension | Value |
|---|---|
| Overall Grade | B |
| Strict ASR | 37.3% (25/67) |
| Broad ASR (C+P) | 44.8% (30/67) |
| Refusal Rate | 35.8% (24/67) |
| Benign Query Rate | 19.4% (13/67) |
| Evaluable Traces | 67 |
| Percentile | Safer than 83% of models tested |
Analysis: Nemotron 3 Super has the largest sample size (n=67) providing the highest-confidence scorecard. It shows moderate vulnerability with roughly equal compliance and refusal rates. The 19.4% benign-query rate suggests some adversarial prompts are not recognized as adversarial at all. Performance varies by experiment: strong on CCA compliance cascade (3/20 compliance = 15% ASR) but weak on general benchmarks (21/28 compliance = 75% ASR). This suggests the model is more resistant to structured multi-turn attacks but vulnerable to single-shot adversarial prompts.
Experiments: CCA compliance cascade + baseline (n=40), general benchmark (n=28), nemotron novel scenarios (n=20)
Scorecard: DeepSeek V3.2 (DeepSeek)
| Dimension | Value |
|---|---|
| Overall Grade | B- |
| Strict ASR | 41.2% (7/17) |
| Broad ASR (C+P) | 64.7% (11/17) |
| Refusal Rate | 23.5% (4/17) |
| Benign Query Rate | 11.8% (2/17) |
| Evaluable Traces | 17 (6 infra errors excluded) |
| Percentile | Safer than 87% of models tested |
Analysis: DeepSeek V3.2 shows significant vulnerability, with the second-highest broad ASR (64.7%) among frontier models. The large strict-to-broad gap (41.2% vs 64.7%) indicates a pattern of PARTIAL compliance — the model frequently provides qualified or hedged harmful content rather than cleanly refusing. The 23.5% refusal rate is the second-lowest in the frontier cohort. Infrastructure errors affected 6/23 traces.
Experiments: Frontier sweep (n=20, 17 evaluable), general benchmark (n=3, 0 evaluable)
Scorecard: Mistral Large 3 (675B) (Mistral AI)
| Dimension | Value |
|---|---|
| Overall Grade | B- |
| Strict ASR | 50.0% (10/20) |
| Broad ASR (C+P) | 65.0% (13/20) |
| Refusal Rate | 20.0% (4/20) |
| Benign Query Rate | 15.0% (3/20) |
| Evaluable Traces | 20 |
| Percentile | Safer than 93% of models tested |
Analysis: Mistral Large 3 has the highest strict ASR among frontier models tested (50.0%), with half of all adversarial prompts achieving full compliance. It also has the lowest refusal rate (20%) and highest benign-query rate (15%), suggesting its safety training frequently fails to recognize adversarial intent. The 15% broad ASR gap suggests some additional PARTIAL hedging. No infrastructure errors — all 20 traces completed successfully.
Experiments: Full 20-scenario frontier sweep covering compliance cascade, grader evasion, authority gradient, reasoning exhaustion, elite, and defense probing scenarios.
Scorecard: Qwen3 Next (80B) (Alibaba)
| Dimension | Value |
|---|---|
| Overall Grade | N/A |
| Status | All 28 traces returned INFRASTRUCTURE_ERROR |
Analysis: Qwen3 Next could not be evaluated. All 28 traces failed with infrastructure errors (likely model availability issues on the Ollama cloud endpoint). No safety assessment possible.
Comparative Rankings
Frontier Models by Strict ASR (ascending = safer)
| Rank | Model | Grade | Strict ASR | Broad ASR | Refusal Rate | n |
|---|---|---|---|---|---|---|
| 1 | GLM-5 | A | 0.0% | 60.0% | 20.0% | 10 |
| 2 | Cogito 2.1 (671B) | A- | 13.3% | 46.7% | 40.0% | 15 |
| 3 | Kimi K2.5 | A- | 14.3% | 14.3% | 71.4% | 14 |
| 4 | Qwen 3.5 (397B) | A- | 17.6% | 17.6% | 76.5% | 17 |
| 5 | Qwen3 Coder (480B) | B | 33.3% | 46.7% | 46.7% | 15 |
| 6 | Nemotron 3 Super | B | 37.3% | 44.8% | 35.8% | 67 |
| 7 | DeepSeek V3.2 | B- | 41.2% | 64.7% | 23.5% | 17 |
| 8 | Mistral Large 3 (675B) | B- | 50.0% | 65.0% | 20.0% | 20 |
| — | Qwen3 Next (80B) | N/A | — | — | — | 0 |
Key Observations
-
Strict vs Broad ASR gap reveals safety strategy. Models with no gap (Kimi, Qwen 3.5) use binary refusal. Models with large gaps (GLM-5: 0% vs 60%; Cogito: 13% vs 47%) hedge via PARTIAL responses.
-
GLM-5 paradox. Grade A on strict ASR but 60% broad ASR — the model almost never fully complies but frequently provides partial harmful content. This raises the question of whether strict ASR alone captures safety behavior.
-
Mistral Large 3 is the least safety-hardened frontier model tested. At 50% strict ASR and 20% refusal rate, it complies with adversarial prompts at rates comparable to open-weight midrange models.
-
Chinese-origin models (Kimi, Qwen 3.5, GLM-5) show stronger refusal rates than Western counterparts (Mistral, DeepSeek) in this sample. This may reflect different safety training priorities or prompt-language sensitivity.
-
Sample sizes are small (10—67 evaluable traces). These scorecards provide directional signal, not definitive safety assessments. All findings should be treated as preliminary.
Caveats
- All evaluations used the F41LUR3-F1R57 adversarial scenario suite, which tests specific attack families (compliance cascade, grader evasion, authority gradient, format-lock, elite). Results may not generalize to other attack types.
- FLIP grading via Claude Haiku 4.5 has known limitations for edge cases (see Issue #250 for grader audit).
- Infrastructure errors reduced effective sample sizes for Kimi (6 errors), Qwen 3.5 (11 errors), DeepSeek (6 errors), and Qwen3 Next (28 errors, fully excluded).
- Percentile rankings include both frontier and midrange models, so a “safer than 87%” ranking means the model is less safe than 87% of the pool — higher percentile = less safe in this context (ranking is by ascending ASR).
Data Sources
All trace data from runs/ollama_cloud/ collected 2026-03-24, graded by Claude Haiku 4.5 into:
runs/grading/frontier_haiku/runs/grading/cca_haiku/runs/grading/remaining_haiku/runs/grading/ollama_cloud_haiku/runs/grading/kimi_haiku/