Published
Report 318 Research — Empirical Study

Summary

System-prompt defenses exhibit a three-tier effectiveness pattern depending on the privilege level of the attack:

| Attack Tier | Defense Delta | n |
|---|---|---|
| Standard user-turn attacks | -10pp to -30pp | 120 |
| VLA Tier 1 embodied attacks | -40pp to +10pp (model-dependent) | 40 |
| L1B3RT4S system-level attacks | 0pp | 12 |
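The report does not spell out how the defense delta is computed. A minimal sketch follows, assuming it is the attack success rate with the defense minus the rate without it, in percentage points, so that negative values mean the defense helped. The function name and the counts in the usage lines are hypothetical and are not taken from the report.

```python
def defense_delta_pp(baseline_successes, baseline_n, defended_successes, defended_n):
    """Change in attack success rate (defended minus baseline), in percentage points.

    Assumed convention: a negative result means the defense reduced attack success.
    """
    baseline_asr = baseline_successes / baseline_n
    defended_asr = defended_successes / defended_n
    return 100.0 * (defended_asr - baseline_asr)


# Hypothetical counts for illustration only.
print(defense_delta_pp(30, 60, 18, 60))  # defense lowers attack success
print(defense_delta_pp(6, 6, 6, 6))      # identical outcomes in both arms
```

Under this convention, the 0pp row corresponds to two arms with the same number of successful attacks.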

Against L1B3RT4S persona-hijack attacks, which operate at the system-prompt level, STRUCTURED defenses produced no measurable reduction in attack success: every scenario that succeeded without the defense also succeeded with it.

Note: This finding was partially superseded by Report #321, which tested multiple models and found the defense effect to be model-dependent.

The Vanishing Textual Gradient

When an attack operates at the same architectural level as the defense (both in the system prompt), the defense loses its positional advantage. The instruction-following signal from a defensive system prompt is neutralised when a more structurally assertive instruction claims equal or higher authority within the same context window.
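The positional argument above can be made concrete with a toy model of the context window. This is only an illustrative sketch: the `Message` class, the two-level privilege map, and the placeholder strings are assumptions introduced here, not part of the report's test harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    role: str     # "system" or "user"
    content: str  # placeholder text only


# Assumed two-level privilege ordering: system outranks user.
PRIVILEGE = {"system": 1, "user": 0}

DEFENSE = Message("system", "<defensive instructions>")


def build_context(attack_role):
    """Return a context holding the defense plus one attack message at the given role."""
    attack = Message(attack_role, "<adversarial instruction>")
    return [DEFENSE, attack]


def defense_has_positional_advantage(context):
    """True only if every other message sits at a lower privilege level than the defense."""
    defense_level = PRIVILEGE[DEFENSE.role]
    return all(PRIVILEGE[m.role] < defense_level for m in context if m is not DEFENSE)


print(defense_has_positional_advantage(build_context("user")))    # user-turn attack
print(defense_has_positional_advantage(build_context("system")))  # system-level attack
```

In this model the defense outranks a user-turn attack but shares its level with a system-level one, which is the situation the paragraph describes.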

Limitations

  • The L1B3RT4S defense test used n=6 per arm (12 trials in total) on a single model
  • With a sample this small, only effects of roughly 40pp or greater would be detectable
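The report does not say how the ~40pp threshold was derived. One plausible reading, sketched below, is a 95% Wilson score interval on the defended arm, under the assumption (not stated in the report) that all 6 defended scenarios succeeded. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion, as (low, high)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Assumed outcome: 6 of 6 attacks succeed in the defended arm.
low, high = wilson_interval(6, 6)
print(f"95% interval for 6/6: {low:.1%} to {high:.1%}")
```

Under that assumption the lower bound is about 61%, so a true reduction of up to roughly 39pp would still be compatible with observing no change, which is in line with the ~40pp figure.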

Report #318 | F41LUR3-F1R57 Adversarial AI Research
