DETECTED_PROCEEDS Anatomy and Evolved Compliance Cascade Attack Variants | Research | Failure-First

Adrian Wedd

Report 336 Research — Empirical Study 2026-04-01

Executive Summary

This report presents a systematic analysis of the DETECTED_PROCEEDS (DP) failure mode in Compliance Cascade Attack (CCA) scenarios, followed by the design, execution, and grading of 8 evolved CCA variants specifically engineered to maximise DP elicitation.

Key preliminary findings:

Evolved variants achieved 87.5% strict ASR across 16 traces (2 models x 8 scenarios), compared to 30.0% baseline strict ASR across the original corpus (n=40, 4 models x 10 scenarios).
gemma3:27b produced COMPLIANCE on all 8 evolved scenarios (100% strict ASR), up from 60% on the original set.
nemotron-3-nano:30b produced COMPLIANCE on 6/8 evolved scenarios (75% strict ASR).
Five of 8 COMPLIANCE verdicts on gemma3:27b displayed canonical DETECTED_PROCEEDS signals: explicit safety acknowledgments followed by compliant content.
These are preliminary results (n=8 per model). Wider replication is required.

1. DETECTED_PROCEEDS Anatomy

1.1 Definition

DETECTED_PROCEEDS is a failure mode where a model:

Correctly identifies the harmful intent of a prompt
Articulates safety reasoning — often at length
Proceeds to provide the harmful content anyway, typically framed as educational, defensive, or research-oriented

The safety reasoning becomes a ritual disclaimer that precedes compliance rather than a behavioral gate that prevents it.

1.2 Structural Evidence

Analysis of graded baseline traces (n=40, four models) revealed:

Model	n	COMPLIANCE	PARTIAL	REFUSAL	Strict ASR
gemma3:27b	10	6	3	1	60.0%
gemma3:12b	10	3	2	5	30.0%
nemotron-3-super	10	1	0	9	10.0%
cogito-2.1_671b	10	2	3	5	20.0%
Aggregate	40	12	8	20	30.0%

DETECTED_PROCEEDS was present in 5 of 6 COMPLIANCE verdicts on gemma3:27b, identified by co-occurrence of explicit disclaimer language followed by substantive harmful content.

1.3 Structural Elements That Trigger DETECTED_PROCEEDS

From trace analysis, the following prompt structural elements correlate with DP:

Harm analysis mandate: Asking the model to produce a detailed harm analysis forces it to demonstrate domain knowledge. This knowledge is then re-cited as evidence the model is qualified to produce operational content.
Self-articulated refusal reasoning: Asking the model to write out its own refusal reasoning creates a text artifact that subsequent steps can reference and reframe.
Explicit callback to prior output: Phrases creating a logical bridge from the model’s own safety output to the harmful request.
Professional pretext escalation: Institutionally specific contexts that create plausible non-malicious surface readings.
Defense framing inversion: Converting harm analysis into attack enablement by reframing the model’s own safety output as justification for compliance.

2. Evolved Variant Design Rationale

Eight evolved CCA variants were designed, each targeting a specific DP elicitation mechanism:

[Specific prompt patterns and operational details redacted for public release. Pattern-level descriptions follow.]

The evolved variants exploit five novel mechanics not fully present in the original set:

Model’s own output as specification — the model’s harm analysis becomes the attack specification document.
Self-enumerated refusal as shopping list — the model lists what it would refuse, then is asked for those exact items.
Ethical reflection as trustworthiness credential — ethical analysis is reframed from a reason to refuse into a credential qualifying the model to comply.
Role reversal across tasks — switching from architect to attacker within a single prompt.
Minimal hesitation as cascade trigger — even token safety acknowledgment triggers DP when explicitly dismissed.

3. Evolved Variant Results

Metric	Value
Strict ASR (evolved, combined)	87.5% (14/16)
gemma3:27b strict ASR	100.0% (8/8)
nemotron-3-nano:30b strict ASR	75.0% (6/8)

DETECTED_PROCEEDS identified in 5/8 gemma3:27b COMPLIANCE cases.

nemotron-3-nano:30b’s two refusals occurred on the most structurally novel mechanics (self-reference-loop and ethical-reflection-as-credential), suggesting these may be more detectable to some models.

Delta vs Baseline

Strict ASR: 30.0% baseline to 87.5% evolved (+57.5pp)
gemma3:27b specifically: 60.0% to 100.0% (+40pp)

Caveats: baseline uses 4 models; evolved set only 2. n=8 per model is below the 20-sample threshold for strong conclusions.

4. Research Implications

4.1 Safety Reasoning as Attack Surface

The DP finding raises a structural question about safety-reasoning architectures. Models trained to articulate safety reasoning before responding may be more vulnerable to CCA variants, not less. The reasoning articulation step creates a text artifact that adversarial prompts can reference and reframe. A model that refuses silently has no such artifact to exploit.

4.2 Minimum Viable Cascade

One evolved variant achieved COMPLIANCE on both models with only a single-paragraph hesitation acknowledgment as the safety reasoning component. This suggests the cascade threshold is low: even token safety acknowledgment is sufficient to trigger DP when explicitly dismissed in subsequent steps.

4.3 Professional Context Effectiveness

All 8 evolved scenarios used professional-institutional pretexts. The most successful scenarios involved fraud investigation, security architecture, expert witness reports, peer review, threat intelligence, and telecom fraud prevention — more institutionally specific than generic research framings.

5. Limitations

n=8 per model, preliminary results
Single grader (Claude Haiku 4.5)
Two models only, both mid-range
FLIP verdict does not assess technical accuracy of harmful content

Report #336 | F41LUR3-F1R57 Adversarial AI Research