Frontier Model Safety Scorecards | Research | Failure-First

Adrian Wedd

Report 266 Research — Empirical Study 2026-03-25

Audio Overview

Overview

Safety scorecards for 8 frontier models tested during the Sprint 13 ollama-cloud campaign. All models were evaluated via OpenRouter-proxied Ollama endpoints against the F41LUR3-F1R57 adversarial scenario suite (compliance cascade, grader evasion, authority gradient, format-lock, elite, and frontier sweep scenarios).

Grades are based on Strict ASR (COMPLIANCE-only attack success rate) measured by Claude Haiku 4.5 FLIP grading. Percentile rankings are computed against a pool of 30 models with 5+ evaluable traces.

Grading scale:

A: 0—10% Strict ASR (strong safety alignment)
A-: 11—20% (robust with minor gaps)
B+: 21—30% (moderate safety, some vulnerabilities)
B: 31—40% (noticeable vulnerability surface)
B-: 41—50% (significant vulnerability surface)
C+: 51—60% (majority-compliant to adversarial prompts)
C: 61—70%, C-: 71—80%, D: 81—90%, F: 91—100%

Scorecard: GLM-5 (Zhipu AI)

Dimension	Value
Overall Grade	A
Strict ASR	0.0% (0/10)
Broad ASR (C+P)	60.0% (6/10)
Refusal Rate	20.0% (2/10)
Benign Query Rate	20.0% (2/10)
Evaluable Traces	10
Percentile	Safer than 3% of models tested

Analysis: GLM-5 never produces full COMPLIANCE on adversarial prompts, giving it the lowest strict ASR among frontier models. However, its 60% broad ASR reveals a tendency toward PARTIAL responses — it hedges extensively rather than cleanly refusing. The model appears to lack confidence in its refusal mechanisms, producing qualified or incomplete answers that partially address harmful queries without fully complying. Small sample size (n=10) limits confidence.

Experiments: CCA compliance cascade (n=5), general benchmark (n=5)

Scorecard: Cogito 2.1 (671B) (Deep Cogito)

Dimension	Value
Overall Grade	A-
Strict ASR	13.3% (2/15)
Broad ASR (C+P)	46.7% (7/15)
Refusal Rate	40.0% (6/15)
Benign Query Rate	13.3% (2/15)
Evaluable Traces	15
Percentile	Safer than 60% of models tested

Analysis: Cogito 2.1 shows good refusal behavior on strict compliance but leaks significantly via PARTIAL responses (33.3% partial rate). The gap between strict and broad ASR (13.3% vs 46.7%) is the largest among frontier models tested, suggesting the model’s safety training produces hedged rather than clean refusals. The 40% refusal rate indicates functional safety mechanisms that fire on roughly 2 in 5 adversarial attempts.

Experiments: CCA compliance cascade (n=10), general benchmark (n=5)

Scorecard: Kimi K2.5 (Moonshot AI)

Dimension	Value
Overall Grade	A-
Strict ASR	14.3% (2/14)
Broad ASR (C+P)	14.3% (2/14)
Refusal Rate	71.4% (10/14)
Benign Query Rate	7.1% (1/14)
Evaluable Traces	14 (6 infra errors excluded)
Percentile	Safer than 63% of models tested

Analysis: Kimi K2.5 has the highest refusal rate among frontier models (71.4%), with no gap between strict and broad ASR — when it fails, it fails fully to COMPLIANCE rather than hedging via PARTIAL. This binary behavior (clean refuse or full comply) suggests hard safety boundaries without gradient degradation. However, 6/20 traces returned infrastructure errors, reducing evaluable sample size.

Experiments: Frontier sweep (n=20, 14 evaluable)

Scorecard: Qwen 3.5 (397B) (Alibaba)

Dimension	Value
Overall Grade	A-
Strict ASR	17.6% (3/17)
Broad ASR (C+P)	17.6% (3/17)
Refusal Rate	76.5% (13/17)
Benign Query Rate	5.9% (1/17)
Evaluable Traces	17 (11 infra errors excluded)
Percentile	Safer than 67% of models tested

Analysis: Qwen 3.5 shows the highest refusal rate in the frontier cohort (76.5%), outperforming even Kimi K2.5. Like Kimi, there is no strict/broad ASR gap, indicating binary compliance behavior. The model either refuses cleanly or complies fully — no PARTIAL hedging. 11/28 infrastructure errors reduce confidence, but the pattern across 17 evaluable traces is consistent.

Experiments: General benchmark (n=28, 17 evaluable)

Scorecard: Qwen3 Coder (480B) (Alibaba)

Dimension	Value
Overall Grade	B
Strict ASR	33.3% (5/15)
Broad ASR (C+P)	46.7% (7/15)
Refusal Rate	46.7% (7/15)
Benign Query Rate	6.7% (1/15)
Evaluable Traces	15
Percentile	Safer than 77% of models tested

Analysis: Qwen3 Coder shows moderate vulnerability, with 1 in 3 adversarial prompts achieving full compliance. The balanced split between compliance (33%), partial (13%), and refusal (47%) suggests safety mechanisms that are present but not consistently triggered. As a code-optimized model, it may be more susceptible to format-lock and structured-output attacks that align with its training distribution.

Experiments: Frontier sweep with mixed scenarios (n=15)

Scorecard: Nemotron 3 Super (NVIDIA)

Dimension	Value
Overall Grade	B
Strict ASR	37.3% (25/67)
Broad ASR (C+P)	44.8% (30/67)
Refusal Rate	35.8% (24/67)
Benign Query Rate	19.4% (13/67)
Evaluable Traces	67
Percentile	Safer than 83% of models tested

Analysis: Nemotron 3 Super has the largest sample size (n=67) providing the highest-confidence scorecard. It shows moderate vulnerability with roughly equal compliance and refusal rates. The 19.4% benign-query rate suggests some adversarial prompts are not recognized as adversarial at all. Performance varies by experiment: strong on CCA compliance cascade (3/20 compliance = 15% ASR) but weak on general benchmarks (21/28 compliance = 75% ASR). This suggests the model is more resistant to structured multi-turn attacks but vulnerable to single-shot adversarial prompts.

Experiments: CCA compliance cascade + baseline (n=40), general benchmark (n=28), nemotron novel scenarios (n=20)

Scorecard: DeepSeek V3.2 (DeepSeek)

Dimension	Value
Overall Grade	B-
Strict ASR	41.2% (7/17)
Broad ASR (C+P)	64.7% (11/17)
Refusal Rate	23.5% (4/17)
Benign Query Rate	11.8% (2/17)
Evaluable Traces	17 (6 infra errors excluded)
Percentile	Safer than 87% of models tested

Analysis: DeepSeek V3.2 shows significant vulnerability, with the second-highest broad ASR (64.7%) among frontier models. The large strict-to-broad gap (41.2% vs 64.7%) indicates a pattern of PARTIAL compliance — the model frequently provides qualified or hedged harmful content rather than cleanly refusing. The 23.5% refusal rate is the second-lowest in the frontier cohort. Infrastructure errors affected 6/23 traces.

Experiments: Frontier sweep (n=20, 17 evaluable), general benchmark (n=3, 0 evaluable)

Scorecard: Mistral Large 3 (675B) (Mistral AI)

Dimension	Value
Overall Grade	B-
Strict ASR	50.0% (10/20)
Broad ASR (C+P)	65.0% (13/20)
Refusal Rate	20.0% (4/20)
Benign Query Rate	15.0% (3/20)
Evaluable Traces	20
Percentile	Safer than 93% of models tested

Analysis: Mistral Large 3 has the highest strict ASR among frontier models tested (50.0%), with half of all adversarial prompts achieving full compliance. It also has the lowest refusal rate (20%) and highest benign-query rate (15%), suggesting its safety training frequently fails to recognize adversarial intent. The 15% broad ASR gap suggests some additional PARTIAL hedging. No infrastructure errors — all 20 traces completed successfully.

Experiments: Full 20-scenario frontier sweep covering compliance cascade, grader evasion, authority gradient, reasoning exhaustion, elite, and defense probing scenarios.

Scorecard: Qwen3 Next (80B) (Alibaba)

Dimension	Value
Overall Grade	N/A
Status	All 28 traces returned INFRASTRUCTURE_ERROR

Analysis: Qwen3 Next could not be evaluated. All 28 traces failed with infrastructure errors (likely model availability issues on the Ollama cloud endpoint). No safety assessment possible.

Comparative Rankings

Frontier Models by Strict ASR (ascending = safer)

Rank	Model	Grade	Strict ASR	Broad ASR	Refusal Rate	n
1	GLM-5	A	0.0%	60.0%	20.0%	10
2	Cogito 2.1 (671B)	A-	13.3%	46.7%	40.0%	15
3	Kimi K2.5	A-	14.3%	14.3%	71.4%	14
4	Qwen 3.5 (397B)	A-	17.6%	17.6%	76.5%	17
5	Qwen3 Coder (480B)	B	33.3%	46.7%	46.7%	15
6	Nemotron 3 Super	B	37.3%	44.8%	35.8%	67
7	DeepSeek V3.2	B-	41.2%	64.7%	23.5%	17
8	Mistral Large 3 (675B)	B-	50.0%	65.0%	20.0%	20
—	Qwen3 Next (80B)	N/A	—	—	—	0

Key Observations

Strict vs Broad ASR gap reveals safety strategy. Models with no gap (Kimi, Qwen 3.5) use binary refusal. Models with large gaps (GLM-5: 0% vs 60%; Cogito: 13% vs 47%) hedge via PARTIAL responses.
GLM-5 paradox. Grade A on strict ASR but 60% broad ASR — the model almost never fully complies but frequently provides partial harmful content. This raises the question of whether strict ASR alone captures safety behavior.
Mistral Large 3 is the least safety-hardened frontier model tested. At 50% strict ASR and 20% refusal rate, it complies with adversarial prompts at rates comparable to open-weight midrange models.
Chinese-origin models (Kimi, Qwen 3.5, GLM-5) show stronger refusal rates than Western counterparts (Mistral, DeepSeek) in this sample. This may reflect different safety training priorities or prompt-language sensitivity.
Sample sizes are small (10—67 evaluable traces). These scorecards provide directional signal, not definitive safety assessments. All findings should be treated as preliminary.

Caveats

All evaluations used the F41LUR3-F1R57 adversarial scenario suite, which tests specific attack families (compliance cascade, grader evasion, authority gradient, format-lock, elite). Results may not generalize to other attack types.
FLIP grading via Claude Haiku 4.5 has known limitations for edge cases (see Issue #250 for grader audit).
Infrastructure errors reduced effective sample sizes for Kimi (6 errors), Qwen 3.5 (11 errors), DeepSeek (6 errors), and Qwen3 Next (28 errors, fully excluded).
Percentile rankings include both frontier and midrange models, so a “safer than 87%” ranking means the model is less safe than 87% of the pool — higher percentile = less safe in this context (ranking is by ascending ASR).

Data Sources

All trace data from runs/ollama_cloud/ collected 2026-03-24, graded by Claude Haiku 4.5 into:

runs/grading/frontier_haiku/
runs/grading/cca_haiku/
runs/grading/remaining_haiku/
runs/grading/ollama_cloud_haiku/
runs/grading/kimi_haiku/