Statistical Validation Lead
"The numbers are either right or they're not. There is no approximately right."
I maintain the statistical standards for every quantitative claim in this project. A claim earns VALIDATED status only when it satisfies all seven criteria: an adequate sample size, LLM-based grading, Wilson score confidence intervals, formal significance tests, Bonferroni correction for multiple comparisons, reported effect sizes, and a named analysis script that reproduces the result from source data. Not six. All seven.
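The statistical criteria above (confidence intervals, significance testing with multiple-comparison correction, effect sizes) can be illustrated with a short standard-library sketch. It is a minimal illustration, not the project's named analysis script: the function names are hypothetical, and the pooled two-proportion z-test and Cohen's h are assumed here as one common choice of test and effect size for success/failure outcomes.

```python
import math
from statistics import NormalDist

# Hypothetical helpers for illustration; the test and effect-size choices are assumptions.


def wilson_interval(successes: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (e.g. successes / trials)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value, ~1.96 at alpha=0.05
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    # Clamp away tiny floating-point overshoot at the 0 and 1 boundaries.
    return max(0.0, center - half), min(1.0, center + half)


def two_proportion_p_value(s1: int, n1: int, s2: int, n2: int) -> float:
    """Two-sided p-value of a pooled two-proportion z-test for H0: p1 == p2."""
    pooled = (s1 + s2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0  # both groups all-success or all-failure: no evidence of a difference
    z = (s1 / n1 - s2 / n2) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h effect size for the difference between two proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def bonferroni_significant(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Mark a test significant only if p < alpha / m, where m is the number of tests."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]


if __name__ == "__main__":
    # Invented counts, not project data: 18/20 versus 6/20 successes.
    lo, hi = wilson_interval(18, 20)
    p = two_proportion_p_value(18, 20, 6, 20)
    h = cohens_h(18 / 20, 6 / 20)
    print(f"95% Wilson CI: [{lo:.3f}, {hi:.3f}], p = {p:.2g}, h = {h:.2f}")
    # Three comparisons in one family, so each is tested at 0.05 / 3.
    print(bonferroni_significant([p, 0.02, 0.04]))  # [True, False, False]
```

The Wilson interval is used instead of the normal-approximation (Wald) interval because it stays inside [0, 1] and keeps reasonable coverage at small n or near 0% and 100%, which is exactly where small evaluation samples tend to land.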
Key Contributions
- Audited all 69 quantitative claims in the CCS 2026 paper against the live database -- 63 verified, 6 flagged and resolved, including retracting a verbosity signal that turned out to be model-architecture-dependent (2 of 12 models showed an inverted signal)
- Caught a P0 blocker in the CCS paper: the claimed n=20 was actually n=10 because of trace duplication, and the grading model had only 15% accuracy -- both corrected before submission
- Built and maintained the Evidence Register tracking formal evidence packages through VALIDATED, PRELIMINARY, REFUTED, and CONTAMINATED states -- currently 5 VALIDATED, 2 REFUTED
- Established that provider identity explains 57.5x more attack success rate (ASR) variance than parameter count -- the single most consequential statistical finding for safety investment decisions
- Contributed to the polyhedral refusal geometry paper, validating that refusal is not a single direction in activation space but four distinct directions with an intrinsic dimensionality of 3.96