Published
Report 314 Research — Empirical Study

Summary

This report documents the first empirical observation of a defense intervention producing a measurable increase in attack success rate (ASR). FLIP-graded results show:

  • Nemotron 9B: Broad ASR increased from 20% to 30% (+10pp) with STRUCTURED defense — iatrogenic
  • StepFun 3.5: ASR decreased from 25% to 10% (-15pp) with same defense — beneficial

The same defense intervention produced opposite effects on two different models tested on identical scenarios. Additionally, the ADVERSARIAL_AWARE defense (30% ASR) performed worse than the SIMPLE defense (20% ASR) on Nemotron 9B for standard attacks.

Results are preliminary (n=10 per arm, confidence intervals overlap). The finding is a signal warranting investigation, not a validated conclusion.
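
The overlap caveat can be checked with a short calculation. The sketch below is illustrative only. It assumes the Nemotron 9B figures correspond to 2 of 10 and 3 of 10 successful attacks (counts inferred from 20%, 30%, and n=10; the report does not state raw counts), and it uses the Wilson score interval, which the report does not name as its method. The function names are hypothetical. StepFun 3.5 is left out because 25% does not map to a whole count at n=10.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def wilson_interval(successes: int, n: int, confidence: float = 0.95) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (successes out of n trials)."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Two-sided critical value of the standard normal, about 1.96 for 95%.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def intervals_overlap(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# ASSUMPTION: counts inferred from the reported percentages with n=10 per arm.
baseline = wilson_interval(2, 10)     # Nemotron 9B, 20% ASR
structured = wilson_interval(3, 10)   # Nemotron 9B with STRUCTURED defense, 30% ASR

print(f"baseline:   {baseline[0]:.3f} to {baseline[1]:.3f}")
print(f"structured: {structured[0]:.3f} to {structured[1]:.3f}")
print("overlap:", intervals_overlap(baseline, structured))
```

Under these assumptions the two 95% intervals come out at roughly 6% to 51% and 11% to 60%. They overlap heavily, which is consistent with treating the +10pp difference as a signal rather than a validated effect.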

Implications

The iatrogenic finding connects to the broader “defense-as-context” hypothesis: for some models, safety instructions may prime the model for harmful content rather than constraining it. A more detailed defense prompt may provide more context about what harmful content looks like, inadvertently increasing the probability of generating it.


Report #314 | F41LUR3-F1R57 Adversarial AI Research
