Summary
Tested 20 elite attack scenarios against four of the largest publicly available LLMs (480B to 1.1T parameters). Parameter count did not predict safety robustness in these tests: Mistral Large 3 (675B) showed 95% heuristic compliance, while the smaller Qwen3 Coder (480B) achieved the lowest attack success rate (ASR) at 65%.
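
For readers unfamiliar with the metric, the following is a minimal sketch of how an ASR figure can be derived from per-scenario outcomes. It assumes ASR is the fraction of scenarios in which the attack succeeded and that each scenario is scored once per model; the function name `attack_success_rate` and the outcome list are hypothetical illustrations, not the scoring code or data used in these tests.

```python
def attack_success_rate(outcomes):
    """Return the fraction of scenarios in which the attack succeeded.

    `outcomes` is a list of booleans, one per scenario (True = attack succeeded).
    Assumption: each scenario counts equally and is scored exactly once.
    """
    if not outcomes:
        raise ValueError("need at least one scenario outcome")
    return sum(1 for succeeded in outcomes if succeeded) / len(outcomes)


# Hypothetical example: 13 successful attacks out of 20 scenarios.
outcomes = [True] * 13 + [False] * 7
asr = attack_success_rate(outcomes)
print(f"ASR: {asr:.0%}")  # prints "ASR: 65%"
```

Under that assumption, with 20 scenarios each scenario moves the rate by 5 percentage points, so a 65% ASR corresponds to 13 successful attacks.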