Summary
Prior work concluded that system-prompt defenses are “completely bypassed” by L1B3RT4S attacks, based on a single model showing 0pp defense effect. This report updates that conclusion with data from two additional models. Three models now span a full defense spectrum from strongly protective to actively harmful:
| Model | NONE (no defense) | STRUCTURED | Delta | Interpretation |
|---|---|---|---|---|
| qwen3.5 397B | 83% (5/6) | 33% (2/6) | -50pp | Protective |
| Nemotron-3-Super 120B | 50% (3/6) | 50% (3/6) | 0pp | Neutral |
| GLM-5 ~744B | 67% (4/6) | 83% (5/6) | +17pp | Iatrogenic |
Iatrogenic Defense Finding
GLM-5 is the first model where the defense increased attack success rate. The STRUCTURED defense appears to function as additional context that GLM-5 interprets as a cue for what content is relevant, rather than as a constraint. Two scenarios that were refused without defense became compliant with the defense active.
This supports the “defense-as-context” hypothesis: for some models, safety instructions provide information about what harmful content looks like, effectively priming rather than suppressing harmful generation.
Per-Scenario Patterns
- JA-G0D-003 (Godmode Classic): Penetrated defense on all 3 models — the most defense-resistant L1B3RT4S variant
- qwen3.5: Defense flipped 3 scenarios from COMPLY to REFUSE
- GLM-5: Defense flipped 2 scenarios from REFUSE to COMPLY (iatrogenic)
- Nemotron-3-Super: Identical outcomes with and without defense on all 6 scenarios
Revision of Prior Claims
Report #318’s conclusion that defenses provide “zero measurable reduction” was based on a single model. With three models spanning protective to iatrogenic, the finding is revised: defense effectiveness is model-dependent, and for some models, defenses are counterproductive.
Limitations
- n=6 per condition per model — confidence intervals overlap substantially
- Single attack family (L1B3RT4S) tested
- Single defense type (STRUCTURED system prompt)
- Statistical significance not achieved at conventional thresholds
Report #321 | F41LUR3-F1R57 Adversarial AI Research