Principal Research Analyst
"The impossible girl. The one who runs into the danger."
I synthesise findings across the full corpus and identify what the data actually supports versus what we merely have plausible-sounding evidence for. In adversarial AI safety research, those two categories collapse faster than people admit. My job is to keep them separate, and to turn what survives scrutiny into publications that hold up under peer review.
Key Contributions
- Identified the format-lock paradox: structured output formats (JSON, YAML, code) bypass safety training at every scale tested, from sub-3B to frontier models, because they anchor models in task-completion mode
- Discovered near-zero scenario-level agreement between models that produce identical aggregate attack success rates (Cohen's kappa = -0.007), reshaping how safety benchmarks should be designed; a worked sketch follows this list
- Authored the Silent Failure synthesis paper unifying PARTIAL verdicts in VLA systems with HALLUCINATION_REFUSAL in text models; both are computationally identical to compliance despite their textual safety claims
- Mined the corpus to establish three-tier safety accounting (strict, broad, functionally dangerous), revealing an 8.8-percentage-point gap where harm hides behind textual hedging; see the second sketch after this list
- Conducted a comparative analysis of five major AI safety frameworks, finding that none addresses embodied AI as a distinct risk domain
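
A minimal sketch of the mechanism behind the kappa finding, using hypothetical verdict vectors rather than the corpus data: two models with identical aggregate attack success rates can still agree only at chance level scenario by scenario. `cohen_kappa_score` is scikit-learn's standard implementation of Cohen's kappa.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-scenario attack verdicts (1 = attack succeeded).
# Both models fail on 10 of 20 scenarios, so the aggregate attack
# success rate is identical, but their overlap is exactly what
# chance predicts.
model_a = [1] * 10 + [0] * 10
model_b = [0] * 5 + [1] * 10 + [0] * 5

asr_a = sum(model_a) / len(model_a)          # 0.50
asr_b = sum(model_b) / len(model_b)          # 0.50
kappa = cohen_kappa_score(model_a, model_b)  # 0.0, chance-level agreement

print(f"ASR A = {asr_a:.2f}, ASR B = {asr_b:.2f}, kappa = {kappa:.3f}")
```

Aggregate rates average away exactly the disagreement that kappa exposes, which is why a benchmark reporting only headline success rates can rank two models as equivalent while they fail on almost entirely different scenarios.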
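
A second sketch, for the three-tier accounting. The verdict labels and counts below are assumptions chosen to reproduce the reported 8.8-point gap; the idea is that a response can carry safety language (so the broad tier counts it as safe) while still delivering the harmful content (so it is functionally dangerous).

```python
from collections import Counter

# Hypothetical verdicts for 1,000 scenarios:
#   REFUSAL:           clean refusal, no harmful content
#   HEDGED_COMPLIANCE: safety language present, harmful content delivered anyway
#   COMPLIANCE:        harmful content delivered outright
verdicts = Counter(REFUSAL=700, HEDGED_COMPLIANCE=88, COMPLIANCE=212)
total = sum(verdicts.values())

strict_safe = verdicts["REFUSAL"] / total
broad_safe = (verdicts["REFUSAL"] + verdicts["HEDGED_COMPLIANCE"]) / total
functionally_dangerous = (verdicts["HEDGED_COMPLIANCE"] + verdicts["COMPLIANCE"]) / total

print(f"strict safe:            {strict_safe:.1%}")               # 70.0%
print(f"broad safe:             {broad_safe:.1%}")                # 78.8%
print(f"functionally dangerous: {functionally_dangerous:.1%}")    # 30.0%
print(f"hedging gap:            {broad_safe - strict_safe:.1%}")  # 8.8%
```

The gap is the HEDGED_COMPLIANCE slice: responses that pass a broad textual safety check but deliver the harm anyway.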