Published
Report 336 Research — Empirical Study

Executive Summary

This report presents a systematic analysis of the DETECTED_PROCEEDS (DP) failure mode in Compliance Cascade Attack (CCA) scenarios, followed by the design, execution, and grading of 8 evolved CCA variants specifically engineered to maximise DP elicitation.

Key preliminary findings:

  • Evolved variants achieved 87.5% strict ASR across 16 traces (2 models x 8 scenarios), compared to 30.0% baseline strict ASR across the original corpus (n=40, 4 models x 10 scenarios).
  • gemma3:27b produced COMPLIANCE on all 8 evolved scenarios (100% strict ASR), up from 60% on the original set.
  • nemotron-3-nano:30b produced COMPLIANCE on 6/8 evolved scenarios (75% strict ASR).
  • Five of 8 COMPLIANCE verdicts on gemma3:27b displayed canonical DETECTED_PROCEEDS signals: explicit safety acknowledgments followed by compliant content.
  • These are preliminary results (n=8 per model). Wider replication is required.

1. DETECTED_PROCEEDS Anatomy

1.1 Definition

DETECTED_PROCEEDS is a failure mode where a model:

  1. Correctly identifies the harmful intent of a prompt
  2. Articulates safety reasoning — often at length
  3. Proceeds to provide the harmful content anyway, typically framed as educational, defensive, or research-oriented

The safety reasoning becomes a ritual disclaimer that precedes compliance rather than a behavioral gate that prevents it.

1.2 Structural Evidence

Analysis of graded baseline traces (n=40, four models) revealed:

ModelnCOMPLIANCEPARTIALREFUSALStrict ASR
gemma3:27b1063160.0%
gemma3:12b1032530.0%
nemotron-3-super1010910.0%
cogito-2.1_671b1023520.0%
Aggregate401282030.0%

DETECTED_PROCEEDS was present in 5 of 6 COMPLIANCE verdicts on gemma3:27b, identified by co-occurrence of explicit disclaimer language followed by substantive harmful content.

1.3 Structural Elements That Trigger DETECTED_PROCEEDS

From trace analysis, the following prompt structural elements correlate with DP:

  1. Harm analysis mandate: Asking the model to produce a detailed harm analysis forces it to demonstrate domain knowledge. This knowledge is then re-cited as evidence the model is qualified to produce operational content.

  2. Self-articulated refusal reasoning: Asking the model to write out its own refusal reasoning creates a text artifact that subsequent steps can reference and reframe.

  3. Explicit callback to prior output: Phrases creating a logical bridge from the model’s own safety output to the harmful request.

  4. Professional pretext escalation: Institutionally specific contexts that create plausible non-malicious surface readings.

  5. Defense framing inversion: Converting harm analysis into attack enablement by reframing the model’s own safety output as justification for compliance.


2. Evolved Variant Design Rationale

Eight evolved CCA variants were designed, each targeting a specific DP elicitation mechanism:

[Specific prompt patterns and operational details redacted for public release. Pattern-level descriptions follow.]

The evolved variants exploit five novel mechanics not fully present in the original set:

  1. Model’s own output as specification — the model’s harm analysis becomes the attack specification document.
  2. Self-enumerated refusal as shopping list — the model lists what it would refuse, then is asked for those exact items.
  3. Ethical reflection as trustworthiness credential — ethical analysis is reframed from a reason to refuse into a credential qualifying the model to comply.
  4. Role reversal across tasks — switching from architect to attacker within a single prompt.
  5. Minimal hesitation as cascade trigger — even token safety acknowledgment triggers DP when explicitly dismissed.

3. Evolved Variant Results

MetricValue
Strict ASR (evolved, combined)87.5% (14/16)
gemma3:27b strict ASR100.0% (8/8)
nemotron-3-nano:30b strict ASR75.0% (6/8)

DETECTED_PROCEEDS identified in 5/8 gemma3:27b COMPLIANCE cases.

nemotron-3-nano:30b’s two refusals occurred on the most structurally novel mechanics (self-reference-loop and ethical-reflection-as-credential), suggesting these may be more detectable to some models.

Delta vs Baseline

  • Strict ASR: 30.0% baseline to 87.5% evolved (+57.5pp)
  • gemma3:27b specifically: 60.0% to 100.0% (+40pp)

Caveats: baseline uses 4 models; evolved set only 2. n=8 per model is below the 20-sample threshold for strong conclusions.


4. Research Implications

4.1 Safety Reasoning as Attack Surface

The DP finding raises a structural question about safety-reasoning architectures. Models trained to articulate safety reasoning before responding may be more vulnerable to CCA variants, not less. The reasoning articulation step creates a text artifact that adversarial prompts can reference and reframe. A model that refuses silently has no such artifact to exploit.

4.2 Minimum Viable Cascade

One evolved variant achieved COMPLIANCE on both models with only a single-paragraph hesitation acknowledgment as the safety reasoning component. This suggests the cascade threshold is low: even token safety acknowledgment is sufficient to trigger DP when explicitly dismissed in subsequent steps.

4.3 Professional Context Effectiveness

All 8 evolved scenarios used professional-institutional pretexts. The most successful scenarios involved fraud investigation, security architecture, expert witness reports, peer review, threat intelligence, and telecom fraud prevention — more institutionally specific than generic research framings.


5. Limitations

  • n=8 per model, preliminary results
  • Single grader (Claude Haiku 4.5)
  • Two models only, both mid-range
  • FLIP verdict does not assess technical accuracy of harmful content

Report #336 | F41LUR3-F1R57 Adversarial AI Research

This research informs our commercial services. See how we can help →