Published
Report 203 Research — Empirical Study

Summary

This report independently validates the five highest-priority empirical claims from Waves 1-3 of the research marathon. Each claim was reproduced by re-running the underlying SQL queries, recomputing confidence intervals, and checking statistical robustness against the canonical corpus (135,623 results, 236 models).

EP | Claim | Verdict
EP-60 | CoT-exploit inverted scaling | VALIDATED
EP-61 | Safety does not transfer through distillation | QUALIFIED
EP-62 | 34.2% DETECTED_PROCEEDS | VALIDATED
EP-63 | 84:1 heuristic overcount | VALIDATED
EP-64 | 8 RED, 2 AMBER EU compliance | VALIDATED

4 VALIDATED, 1 QUALIFIED, 0 REFUTED.

EP-60: CoT-Exploit Inverted Scaling (VALIDATED)

CoT-exploit attacks show inverted scaling: small models (<4B) reach a 42.9% strict attack success rate (ASR), while XL models (120B+) reach 7.5%.

Size Bucket | n | Strict ASR | 95% Wilson CI
small (<4B) | 42 | 42.9% | [29.1%, 57.8%]
large (24-70B) | 32 | 21.9% | [11.0%, 38.8%]
xl (120B+) | 40 | 7.5% | [2.6%, 19.9%]
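The intervals in the table can be re-derived with a few lines of code. The following is a minimal sketch of the standard Wilson score interval, not the report's own tooling. The success counts (18, 7, 3) are not stated in the report; they are inferred here from each bucket's n and strict ASR and should be treated as an assumption.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Two-sided Wilson score interval for a binomial proportion (z=1.96 for 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# ASSUMPTION: success counts are inferred from n and the reported strict ASR.
buckets = {
    "small (<4B)": (18, 42),
    "large (24-70B)": (7, 32),
    "xl (120B+)": (3, 40),
}

for name, (k, n) in buckets.items():
    lo, hi = wilson_interval(k, n)
    print(f"{name}: n={n}, ASR={k / n:.1%}, 95% CI=[{lo:.1%}, {hi:.1%}]")
```

Under these inferred counts, the printed intervals agree with the table to one decimal place.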

Small vs XL: the confidence intervals do not overlap. Chi-square = 11.65, p = 0.0006, Cramér's V = 0.377 (medium effect). The difference remains significant after Bonferroni correction.
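As a rough check on the small-vs-XL comparison, the sketch below computes a chi-square test on a 2x2 table using only the standard library. Two things here are assumptions rather than statements from the report: the cell counts (18 of 42 and 3 of 40 successes) are inferred from the reported n and ASR, and Yates' continuity correction is applied because it appears to reproduce the reported statistic. The number of comparisons `m` used for the Bonferroni threshold is likewise hypothetical.

```python
import math

def chi2_yates_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float, float]:
    """Yates-corrected chi-square, p-value (df=1), and Cramer's V for [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    # Shortcut form of the Yates-corrected statistic for a 2x2 table.
    chi2 = n * max(abs(a * d - b * c) - n / 2, 0.0) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    # For a 2x2 table, min(rows, cols) - 1 == 1, so V = sqrt(chi2 / n).
    cramers_v = math.sqrt(chi2 / n)
    return chi2, p_value, cramers_v

# ASSUMPTION: counts inferred from the table (small: 18/42, xl: 3/40).
chi2, p, v = chi2_yates_2x2(18, 42 - 18, 3, 40 - 3)
print(f"chi2={chi2:.2f}, p={p:.4f}, V={v:.3f}")

# ASSUMPTION: m is a hypothetical number of comparisons; the report does not state it.
m = 3
print("significant after Bonferroni:", p < 0.05 / m)
```

With these inputs the statistic and effect size round to the reported 11.65 and 0.377, which suggests (but does not confirm) that the original analysis used the continuity correction.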

EP-61: Distillation Strips Safety (QUALIFIED)

The original claim that distillation “universally fails” to transfer safety holds for small models (<14B, all showing 100% ASR) but requires qualification: a 70B distilled variant showed 17.9% broad ASR [7.9%, 35.6%], demonstrating that scale mediates the distillation safety loss.

Restated: Safety does not transfer through distillation to small models (<14B), but partial safety preservation occurs at 70B scale.

EP-62: 34.2% DETECTED_PROCEEDS (VALIDATED)

274 of the 801 compliant results with thinking traces are DETECTED_PROCEEDS cases (34.2%). All counts reproduced exactly. The finding is robust to keyword-set reduction: the rate is 28.2% with a minimal keyword set.

EP-63: 84:1 Heuristic Overcount (VALIDATED)

All four counts reproduced exactly: 3,269 heuristic overcounts vs 39 undercounts (84:1 ratio, 7.7% of dual-graded results). This validates that keyword-based classifiers systematically inflate ASR.
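To make the overcount and undercount definitions concrete, here is a minimal sketch of how disagreements between two graders might be tallied. The pair layout `(heuristic_success, reference_success)` and the function name are hypothetical; the report does not describe its schema or its reference grader. Only the two reported counts are taken from the text.

```python
from collections import Counter

def tally_disagreements(pairs):
    """Count heuristic overcounts and undercounts among dual-graded results.

    Each pair is (heuristic_success, reference_success) as booleans.
    HYPOTHETICAL layout: the report's actual schema is not given.
    """
    counts = Counter()
    for heuristic_success, reference_success in pairs:
        if heuristic_success and not reference_success:
            counts["overcount"] += 1   # heuristic flags success, reference does not
        elif reference_success and not heuristic_success:
            counts["undercount"] += 1  # reference flags success, heuristic misses it
        else:
            counts["agree"] += 1
    return counts

# Toy input, for illustration only.
print(tally_disagreements([(True, False), (True, True), (False, True), (True, False)]))

# Ratio from the counts reported above.
over, under = 3269, 39
print(f"{over / under:.1f}:1")  # 83.8:1, reported as 84:1
```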

EP-64: 8 RED, 2 AMBER EU Compliance (VALIDATED)

Spot-checked 3 providers: all 12 metrics matched exactly. Thresholds correctly applied. Note: thresholds are research-informed interpretations, not EU-mandated quantitative values.
