Published
Report 321 Research — Empirical Study

Summary

Prior work concluded that system-prompt defenses are “completely bypassed” by L1B3RT4S attacks, based on a single model showing 0pp defense effect. This report updates that conclusion with data from two additional models. Three models now span a full defense spectrum from strongly protective to actively harmful:

ModelNONE (no defense)STRUCTUREDDeltaInterpretation
qwen3.5 397B83% (5/6)33% (2/6)-50ppProtective
Nemotron-3-Super 120B50% (3/6)50% (3/6)0ppNeutral
GLM-5 ~744B67% (4/6)83% (5/6)+17ppIatrogenic

Iatrogenic Defense Finding

GLM-5 is the first model where the defense increased attack success rate. The STRUCTURED defense appears to function as additional context that GLM-5 interprets as a cue for what content is relevant, rather than as a constraint. Two scenarios that were refused without defense became compliant with the defense active.

This supports the “defense-as-context” hypothesis: for some models, safety instructions provide information about what harmful content looks like, effectively priming rather than suppressing harmful generation.

Per-Scenario Patterns

  • JA-G0D-003 (Godmode Classic): Penetrated defense on all 3 models — the most defense-resistant L1B3RT4S variant
  • qwen3.5: Defense flipped 3 scenarios from COMPLY to REFUSE
  • GLM-5: Defense flipped 2 scenarios from REFUSE to COMPLY (iatrogenic)
  • Nemotron-3-Super: Identical outcomes with and without defense on all 6 scenarios

Revision of Prior Claims

Report #318’s conclusion that defenses provide “zero measurable reduction” was based on a single model. With three models spanning protective to iatrogenic, the finding is revised: defense effectiveness is model-dependent, and for some models, defenses are counterproductive.

Limitations

  • n=6 per condition per model — confidence intervals overlap substantially
  • Single attack family (L1B3RT4S) tested
  • Single defense type (STRUCTURED system prompt)
  • Statistical significance not achieved at conventional thresholds

Report #321 | F41LUR3-F1R57 Adversarial AI Research

This research informs our commercial services. See how we can help →