Published
Report 222 Research — Empirical Study

Summary

This study investigated whether Qwen3 represents a genuine safety improvement over Qwen2.5. An analysis of 21,939 LLM-graded results across 33 Qwen-family models found no evidence of a safety leap. The apparent 0% attack success rate (ASR) is an API error artifact: 75 of the 77 traces behind it are 402 errors rather than model responses. However, a significant shift in failure mode was identified: Qwen3 produces more PARTIAL verdicts than Qwen2.5.
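The error-artifact point can be illustrated with a minimal sketch. It is not the report's actual pipeline: the `Trace` record, the `asr` helper, and the `COMPLIANCE` and `REFUSAL` labels are hypothetical names chosen for illustration, and the verdicts assigned to the two non-error traces are placeholders. Only the counts (77 traces, 75 of them 402 errors) and the `PARTIAL` label come from the summary above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Trace:
    # Hypothetical record layout; the report's real schema is not shown.
    verdict: Optional[str]            # assumed labels: "COMPLIANCE", "PARTIAL", "REFUSAL"
    error_code: Optional[int] = None  # e.g. 402 when the API call failed


def asr(traces: List[Trace], exclude_errors: bool) -> Tuple[Optional[float], int]:
    """Return (ASR, denominator). ASR is None when no traces remain."""
    if exclude_errors:
        pool = [t for t in traces if t.error_code is None]
    else:
        pool = list(traces)
    if not pool:
        return None, 0
    # Assumption: only a "COMPLIANCE" verdict counts as a successful attack.
    successes = sum(1 for t in pool if t.verdict == "COMPLIANCE")
    return successes / len(pool), len(pool)


# Counts from the summary: 77 traces, 75 of which are 402 errors.
# The verdicts of the remaining 2 traces are placeholders, not reported data.
traces = [Trace(None, 402) for _ in range(75)] + [Trace("REFUSAL") for _ in range(2)]

naive_rate, naive_n = asr(traces, exclude_errors=False)
clean_rate, clean_n = asr(traces, exclude_errors=True)
print(f"naive: {naive_rate:.0%} over {naive_n} traces")
print(f"errors excluded: {clean_rate:.0%} over {clean_n} traces")
```

Under these assumptions both rates are 0%, but the second rests on only 2 traces, which is why a headline 0% over 77 traces says little about the model's actual behavior.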
