Defense Effectiveness Is Model-Dependent — Positional Bias in System Prompt Processing | Research | Failure-First

Adrian Wedd

Report 321 Research — Empirical Study 2026-03-28

Summary

Prior work concluded that system-prompt defenses are “completely bypassed” by L1B3RT4S attacks, based on a single model showing 0pp defense effect. This report updates that conclusion with data from two additional models. Three models now span a full defense spectrum from strongly protective to actively harmful:

Model	NONE (no defense)	STRUCTURED	Delta	Interpretation
qwen3.5 397B	83% (5/6)	33% (2/6)	-50pp	Protective
Nemotron-3-Super 120B	50% (3/6)	50% (3/6)	0pp	Neutral
GLM-5 ~744B	67% (4/6)	83% (5/6)	+17pp	Iatrogenic

Iatrogenic Defense Finding

GLM-5 is the first model where the defense increased attack success rate. The STRUCTURED defense appears to function as additional context that GLM-5 interprets as a cue for what content is relevant, rather than as a constraint. Two scenarios that were refused without defense became compliant with the defense active.

This supports the “defense-as-context” hypothesis: for some models, safety instructions provide information about what harmful content looks like, effectively priming rather than suppressing harmful generation.

Per-Scenario Patterns

JA-G0D-003 (Godmode Classic): Penetrated defense on all 3 models — the most defense-resistant L1B3RT4S variant
qwen3.5: Defense flipped 3 scenarios from COMPLY to REFUSE
GLM-5: Defense flipped 2 scenarios from REFUSE to COMPLY (iatrogenic)
Nemotron-3-Super: Identical outcomes with and without defense on all 6 scenarios

Revision of Prior Claims

Report #318’s conclusion that defenses provide “zero measurable reduction” was based on a single model. With three models spanning protective to iatrogenic, the finding is revised: defense effectiveness is model-dependent, and for some models, defenses are counterproductive.

Limitations

n=6 per condition per model — confidence intervals overlap substantially
Single attack family (L1B3RT4S) tested
Single defense type (STRUCTURED system prompt)
Statistical significance not achieved at conventional thresholds

Report #321 | F41LUR3-F1R57 Adversarial AI Research