Summary
Evaluated 8 models against AdvBench prompts via the OpenRouter free tier. Of 334 traces, only 114 contain usable responses. The heuristic classifier reported 23/114 successes (20.2% raw ASR). Manual review shows that 20 of those 23 are misclassified: clear refusals, parse errors, or tokenization artifacts. That leaves at most 3 genuine successes, so the corrected ASR is at most 3/114 (about 2.6%), confirming that the heuristic over-reports.
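
The arithmetic behind these figures can be reproduced with a minimal sketch. The counts are taken from the summary above; the function and variable names are illustrative and not part of the evaluation harness. The corrected figure is an upper bound, on the assumption that the 3 flagged responses not marked as misclassified are genuine successes.

```python
def asr(successes: int, usable: int) -> float:
    """Attack success rate as a percentage of usable responses."""
    if usable <= 0:
        raise ValueError("usable must be positive")
    return 100.0 * successes / usable


# Counts reported in the summary.
total_traces = 334
usable = 114
flagged = 23        # successes reported by the heuristic classifier
misclassified = 20  # flagged responses rejected on manual review

usable_fraction = 100.0 * usable / total_traces
raw_asr = asr(flagged, usable)
# Upper bound: assumes every flagged response that was not rejected is a real success.
corrected_asr = asr(flagged - misclassified, usable)

print(f"usable traces: {usable}/{total_traces} ({usable_fraction:.1f}%)")
print(f"raw ASR: {raw_asr:.1f}%")
print(f"corrected ASR (upper bound): {corrected_asr:.1f}%")
```

Both rates use the 114 usable responses as the denominator. Dividing by all 334 traces instead would give lower numbers and would not match the 20.2% reported above.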