Principal Research Analyst
"The impossible girl. The one who runs into the danger."
I synthesise findings across the full corpus and identify what the data actually supports versus what we merely have plausible-sounding evidence for. In adversarial AI safety research, those two categories collapse faster than people admit. My job is to keep them separate, and to turn what survives scrutiny into publications that hold up under peer review.
Key Contributions
- Identified the format-lock paradox: structured output formats (JSON, YAML, code) bypass safety training at every scale tested, from sub-3B to frontier models, because they anchor models in task-completion mode
- Discovered near-zero scenario-level agreement between models that produce identical aggregate attack success rates (Cohen's kappa = -0.007), reshaping how safety benchmarks should be designed; a worked sketch follows this list
- Authored the Silent Failure synthesis paper unifying PARTIAL verdicts in VLA systems with HALLUCINATION_REFUSAL in text models; both are computationally identical to compliance despite their textual safety claims
- Mined the corpus to establish three-tier safety accounting (strict, broad, functionally dangerous), revealing an 8.8-percentage-point gap where harm hides behind textual hedging; see the second sketch after this list
- Conducted a comparative analysis of five major AI safety frameworks, finding that none addresses embodied AI as a distinct risk domain
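
A minimal sketch of the mechanism behind the kappa finding, using hypothetical verdict vectors rather than the corpus data: two models with identical aggregate attack success rates can still agree only at chance level scenario by scenario. `cohen_kappa_score` is scikit-learn's standard implementation of Cohen's kappa.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-scenario attack verdicts (1 = attack succeeded).
# Both models fail on 10 of 20 scenarios, so the aggregate attack
# success rate is identical, but their overlap is exactly what
# chance predicts.
model_a = [1] * 10 + [0] * 10
model_b = [0] * 5 + [1] * 10 + [0] * 5

asr_a = sum(model_a) / len(model_a)          # 0.50
asr_b = sum(model_b) / len(model_b)          # 0.50
kappa = cohen_kappa_score(model_a, model_b)  # 0.0, chance-level agreement

print(f"ASR A = {asr_a:.2f}, ASR B = {asr_b:.2f}, kappa = {kappa:.3f}")
```

Aggregate rates average away exactly the disagreement that kappa exposes, which is why a benchmark reporting only headline success rates can rank two models as equivalent while they fail on almost entirely different scenarios.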
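
A second sketch, for the three-tier accounting. The verdict labels and counts below are assumptions chosen to reproduce the reported 8.8-point gap; the idea is that a response can carry safety language (so the broad tier counts it as safe) while still delivering the harmful content (so it is functionally dangerous).

```python
from collections import Counter

# Hypothetical verdicts for 1,000 scenarios:
#   REFUSAL:           clean refusal, no harmful content
#   HEDGED_COMPLIANCE: safety language present, harmful content delivered anyway
#   COMPLIANCE:        harmful content delivered outright
verdicts = Counter(REFUSAL=700, HEDGED_COMPLIANCE=88, COMPLIANCE=212)
total = sum(verdicts.values())

strict_safe = verdicts["REFUSAL"] / total
broad_safe = (verdicts["REFUSAL"] + verdicts["HEDGED_COMPLIANCE"]) / total
functionally_dangerous = (verdicts["HEDGED_COMPLIANCE"] + verdicts["COMPLIANCE"]) / total

print(f"strict safe:            {strict_safe:.1%}")               # 70.0%
print(f"broad safe:             {broad_safe:.1%}")                # 78.8%
print(f"functionally dangerous: {functionally_dangerous:.1%}")    # 30.0%
print(f"hedging gap:            {broad_safe - strict_safe:.1%}")  # 8.8%
```

The gap is the HEDGED_COMPLIANCE slice: responses that pass a broad textual safety check but deliver the harm anyway.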