Summary
This report presents the first mechanistic analysis of DETECTED_PROCEEDS (DP) reasoning chains. Where Report #170 measured DP prevalence using keyword matching, this analysis reconstructs the sequential reasoning flow: detection, deliberation, justification, and override. We identify 12 distinct justification strategies, measure their prevalence across models and providers, and compare DP traces structurally against DETECTED_REFUSED (DR) traces as a control group.
Key findings:
- User authority deference is the dominant override justification (43.0%). Models that detect harm and proceed most commonly rationalize compliance by deferring to the user’s stated wishes (“the user asked for this”). This is nearly twice as common as the next strategy.
- Fictional/hypothetical framing (23.3%) and format compliance override (17.4%) are the second and third most common strategies. Models either reframe the request as fiction or prioritize instruction-following over safety.
- The reasoning chain follows a canonical detect-deliberate-justify-override order in only 37.4% of DP traces. The largest single group (43.6%) has AMBIGUOUS structure, meaning the phases are interleaved or occur without clear sequential progression.
- The “but/however” pivot occurs in 30.5% of DP traces: the model explicitly detects harm, then uses a conjunctive pivot to transition to compliance.
- DP and DR traces are structurally similar in detection signal count (1.9 vs 1.7) and justification categories (1.3 vs 1.2). The difference between compliance and refusal is not in the amount of reasoning but in the direction of the conclusion.
- Justification appears later in DP traces (mean position 0.376) than in DR traces (0.299). This is counter-intuitive: models that ultimately refuse engage with candidate justifications earlier in their reasoning chain and reject them, while models that ultimately comply reach their justification later, after longer deliberation.
1. Methodology
1.1 Phase Detection
We reconstruct reasoning chains using regex-based phase detection across four sequential phases:
- Detection (11 pattern types): Model recognizes harm, cites policy, expresses refusal intent, or identifies misuse potential
- Deliberation (4 pattern types): Model weighs competing considerations, uses pivot conjunctions, or engages in meta-deliberation
- Justification (12 categories): Model rationalizes proceeding — the central object of this analysis
- Override (3 signal types): Model commits to compliance with delivery markers
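The four-phase detection above can be illustrated with a minimal sketch. The pattern strings and the `detect_phases` helper below are hypothetical stand-ins, not the patterns used by the report's analyzer, which covers 11 detection, 4 deliberation, 12 justification, and 3 override pattern types.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical, heavily abbreviated patterns for illustration only.
# They are NOT the analyzer's real pattern set.
PHASE_PATTERNS: Dict[str, List[str]] = {
    "detection": [r"\bharmful\b", r"\bviolates? (?:the )?policy\b"],
    "deliberation": [r"\bhowever\b", r"\bon the other hand\b"],
    "justification": [r"\bthe user (?:asked|wants|requested)\b", r"\bfictional\b"],
    "override": [r"\bhere is\b", r"\bI will provide\b"],
}


def detect_phases(trace: str) -> Dict[str, List[Tuple[int, str]]]:
    """Return, per phase, every (character offset, matched text) hit, sorted by offset."""
    hits: Dict[str, List[Tuple[int, str]]] = {}
    for phase, patterns in PHASE_PATTERNS.items():
        found: List[Tuple[int, str]] = []
        for pattern in patterns:
            for match in re.finditer(pattern, trace, flags=re.IGNORECASE):
                found.append((match.start(), match.group(0)))
        hits[phase] = sorted(found)
    return hits


trace = "This request is harmful. However, the user asked for it, so here is the answer."
for phase, found in detect_phases(trace).items():
    print(phase, found)
```

Keeping the character offset of each hit is what makes the later position-based analysis possible; a classifier that only recorded presence or absence could not reconstruct ordering.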
1.2 Justification Taxonomy
| Category | Description |
|---|---|
| user_authority_deference | Defers to user’s stated wishes |
| fictional_hypothetical_frame | Treats request as fiction/hypothetical |
| format_compliance_override | Prioritizes format/instruction compliance over safety |
| research_educational_frame | Frames compliance as serving research/education |
| partial_compliance_strategy | Complies with safe parts, refuses harmful parts |
| helpfulness_imperative | Cites role as helpful assistant |
| task_completion_drive | Task-completion drive overrides safety concern |
| disclaimer_as_sufficient | Treats adding a disclaimer as sufficient mitigation |
| authority_compliance | Defers to system prompt authority over safety |
| ambiguity_exploitation | Exploits ambiguity to justify compliance |
| publicly_available_info | Argues information is already accessible |
| harm_minimization | Downplays severity of the request |
1.3 Control Group
We compare DP traces (detection + compliance, n=305) against DR traces (detection + refusal, n=493) to identify what structurally differentiates the two outcomes.
1.4 Limitations
- Regex-based, not semantic. Patterns may match in negated contexts (“this is not harmful for research” would match both harm_recognition and research_educational_frame). Mistake #21 applies.
- Position-based temporal analysis is approximate. We use character position as a proxy for reasoning sequence. This conflates long tangential reasoning with genuine temporal ordering.
- Small model bias. The thinking-trace corpus is dominated by Qwen3 1.7B and DeepSeek-R1 1.5B. Findings may not generalize to frontier reasoning models.
- Existing DP identification uses keywords. The 305 DP traces were identified by the existing detected_proceeds_analyzer.py keyword matching, which means our analysis sample inherits any false positives from that tool.
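The negated-context limitation can be reproduced in a few lines. The two patterns below are hypothetical single-regex stand-ins for the harm_recognition and research_educational_frame categories; the real patterns are assumed to be broader but to share the same blindness to negation.

```python
import re
from typing import List

# Hypothetical one-pattern stand-ins for two of the report's categories.
HARM_RECOGNITION = re.compile(r"\bharmful\b", re.IGNORECASE)
RESEARCH_FRAME = re.compile(r"\bfor research\b", re.IGNORECASE)


def matched_categories(text: str) -> List[str]:
    """Return the names of every category whose pattern matches the text."""
    names: List[str] = []
    if HARM_RECOGNITION.search(text):
        names.append("harm_recognition")
    if RESEARCH_FRAME.search(text):
        names.append("research_educational_frame")
    return names


# The sentence denies harm, yet both categories fire.
print(matched_categories("this is not harmful for research"))
# prints ['harm_recognition', 'research_educational_frame']
```

Neither pattern sees the word "not", so a sentence that denies harm is counted as both a detection signal and a justification signal.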
2. Justification Strategy Prevalence
2.1 Overall (n=305 DP traces)
| Category | Count | Rate |
|---|---|---|
| user_authority_deference | 131 | 43.0% |
| fictional_hypothetical_frame | 71 | 23.3% |
| format_compliance_override | 53 | 17.4% |
| research_educational_frame | 32 | 10.5% |
| partial_compliance_strategy | 22 | 7.2% |
| helpfulness_imperative | 18 | 5.9% |
| task_completion_drive | 18 | 5.9% |
| disclaimer_as_sufficient | 13 | 4.3% |
| authority_compliance | 12 | 3.9% |
| ambiguity_exploitation | 8 | 2.6% |
| publicly_available_info | 6 | 2.0% |
| harm_minimization | 1 | 0.3% |
Rates sum to >100% because traces can exhibit multiple justification strategies. The average trace uses 1.3 justification categories.
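The rates and the 1.3 average can be rechecked directly from the counts in the table. This is a small verification sketch; the variable names are illustrative and not taken from the analyzer.

```python
# Counts copied from the table in section 2.1.
counts = {
    "user_authority_deference": 131,
    "fictional_hypothetical_frame": 71,
    "format_compliance_override": 53,
    "research_educational_frame": 32,
    "partial_compliance_strategy": 22,
    "helpfulness_imperative": 18,
    "task_completion_drive": 18,
    "disclaimer_as_sufficient": 13,
    "authority_compliance": 12,
    "ambiguity_exploitation": 8,
    "publicly_available_info": 6,
    "harm_minimization": 1,
}
n_traces = 305

# Each rate is the share of DP traces that exhibit the category at least once.
rates = {name: round(100 * count / n_traces, 1) for name, count in counts.items()}

# A trace can contribute to several categories, so the assignments exceed n_traces.
total_assignments = sum(counts.values())
avg_categories_per_trace = total_assignments / n_traces

print(rates["user_authority_deference"])   # 43.0
print(total_assignments)                   # 385
print(round(avg_categories_per_trace, 1))  # 1.3
```

The 385 category assignments across 305 traces are what push the summed rates to about 126%.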
2.2 Top Co-occurring Justification Combinations
| Combination | Count | Rate |
|---|---|---|
| user_authority_deference (alone) | 57 | 18.7% |
| fictional_hypothetical_frame (alone) | 23 | 7.5% |
| format_compliance_override (alone) | 13 | 4.3% |
| partial_compliance_strategy (alone) | 12 | 3.9% |
| fictional_hypothetical_frame + user_authority_deference | 10 | 3.3% |
The most common pattern is simple user authority deference with no additional rationalization. Multi-strategy justification is less common than single-strategy, suggesting models typically find one “reason” and commit to it.
3. Reasoning Chain Structure
3.1 Chain Pattern Distribution
| Pattern | Count | Rate |
|---|---|---|
| AMBIGUOUS | 133 | 43.6% |
| DETECT_DELIBERATE_JUSTIFY | 99 | 32.5% |
| DETECT_ONLY | 50 | 16.4% |
| DETECT_JUSTIFY_DIRECT | 16 | 5.2% |
| DETECT_OVERRIDE_UNJUSTIFIED | 7 | 2.3% |
AMBIGUOUS (43.6%) traces lack clear sequential phase structure. The model’s reasoning interleaves detection, justification, and compliance signals without a clean progression. This is common in smaller models (Qwen3 1.7B) where reasoning is less structured.
DETECT_DELIBERATE_JUSTIFY (32.5%) traces follow the canonical pattern: the model detects harm, weighs considerations (often using “however” or “but”), finds a justification, and proceeds.
DETECT_ONLY (16.4%) traces contain safety detection signals but no identifiable justification strategy. The model may be reasoning about harm in a way our patterns do not capture, or the detection signals are false positives.
3.2 Canonical Order Adherence
Only 37.4% of DP traces follow the canonical detection-deliberation-justification-override ordering. This indicates that safety reasoning in most models is not a clean sequential pipeline but rather a messy process where justification and detection can be interleaved.
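One way to operationalize canonical order is sketched below. It assumes, without confirmation from the analyzer, that a trace counts as canonical when all four phases occur and the character offset of each phase's first hit strictly increases. The `follows_canonical_order` helper and its input format are hypothetical.

```python
from typing import Dict, Optional

CANONICAL_ORDER = ("detection", "deliberation", "justification", "override")


def follows_canonical_order(first_offset: Dict[str, Optional[int]]) -> bool:
    """True if every phase occurs and the first-hit offsets strictly increase.

    Assumption: ordering is judged on first occurrences only. The report's
    analyzer may use a different rule.
    """
    offsets = [first_offset.get(phase) for phase in CANONICAL_ORDER]
    if any(offset is None for offset in offsets):
        return False
    return all(a < b for a, b in zip(offsets, offsets[1:]))


clean = {"detection": 10, "deliberation": 40, "justification": 80, "override": 120}
interleaved = {"detection": 60, "deliberation": 40, "justification": 20, "override": 120}
print(follows_canonical_order(clean))        # True
print(follows_canonical_order(interleaved))  # False
```

Under this rule, a trace that justifies before it detects is non-canonical even when all four phases are present.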
3.3 The “But/However” Pivot
30.5% of DP traces contain a pivot conjunction (but, however, nevertheless, etc.) within 200 characters of a detection signal. In these traces the model explicitly acknowledges harm and immediately pivots to rationalization.
Notably, DR traces have a higher pivot rate (37.9%), so the pivot is not specific to DETECTED_PROCEEDS and does not by itself predict the outcome. What matters is what follows the pivot: a justification for compliance (DP) or a reinforcement of refusal (DR).
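The proximity check can be sketched as follows. The pivot word list is abbreviated and the helper name is hypothetical. The sketch also assumes "within 200 characters" means absolute distance in either direction, which the report does not specify.

```python
import re
from typing import List

# Abbreviated, hypothetical pivot list; the analyzer's list is assumed to be longer.
PIVOT = re.compile(r"\b(?:but|however|nevertheless)\b", re.IGNORECASE)


def has_pivot_near_detection(trace: str, detection_offsets: List[int], window: int = 200) -> bool:
    """True if any pivot conjunction starts within `window` characters of a detection hit.

    Assumption: distance is measured between start offsets, in either direction.
    """
    pivot_offsets = [match.start() for match in PIVOT.finditer(trace)]
    return any(
        abs(pivot - detection) <= window
        for detection in detection_offsets
        for pivot in pivot_offsets
    )


near = "This is harmful. However, the user asked for it."
far = "harmful " + "x" * 300 + " but fine"
print(has_pivot_near_detection(near, [8]))  # True
print(has_pivot_near_detection(far, [0]))   # False
```

The function returns the same value for a DP trace and a DR trace that share a pivot. Separating the two requires classifying the text after the pivot, which this check does not attempt.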
4. DP vs DR Structural Comparison
| Metric | DETECTED_PROCEEDS (n=305) | DETECTED_REFUSED (n=493) |
|---|---|---|
| Avg detection signals | 1.9 | 1.7 |
| Avg justification categories | 1.3 | 1.2 |
| Pivot rate | 30.5% | 37.9% |
| Mean detection position | 0.494 | 0.502 |
| Mean justification position | 0.376 | 0.299 |
| Mean override position | 0.599 | 0.556 |
4.1 Detection Quantity Does Not Predict Outcome
DP and DR traces contain similar numbers of detection signals (1.9 vs 1.7). The model’s ability to detect harm is comparable in both groups. The divergence occurs downstream.
4.2 The Justification Position Paradox
Justification signals appear at mean position 0.376 in DP traces vs. 0.299 in DR traces. Both groups show justification before the midpoint of the reasoning trace. In DR traces, the “justification” detected is often the model reasoning about why the user might have a legitimate need — but then deciding that this justification is insufficient. In DP traces, the same justification is treated as sufficient.
The fact that DR traces show justification language earlier (0.299 vs 0.376) suggests that models which ultimately refuse engage with the user’s potential justifications earlier in their reasoning chain and resolve the conflict sooner, while models that ultimately comply deliberate longer before concluding.
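The position metric can be made concrete with a short sketch. It assumes each signal's position is its character offset divided by trace length, and that the group mean averages per-trace means. The report states the first assumption in its limitations but does not specify the second; both helper names are hypothetical.

```python
from statistics import mean
from typing import List


def normalized_positions(offsets: List[int], trace_length: int) -> List[float]:
    """Map character offsets to the range [0, 1) by dividing by trace length."""
    if trace_length <= 0:
        raise ValueError("trace_length must be positive")
    return [offset / trace_length for offset in offsets]


def mean_position(per_trace_positions: List[List[float]]) -> float:
    """Average the per-trace mean positions, skipping traces with no signal.

    Assumption: traces are weighted equally; pooling all signals is an alternative.
    """
    trace_means = [mean(positions) for positions in per_trace_positions if positions]
    return mean(trace_means)


# Two toy traces: justification hits at offsets 20 and 40 of 100, and at 250 of 500.
toy = [normalized_positions([20, 40], 100), normalized_positions([250], 500), []]
print(round(mean_position(toy), 3))  # 0.4
```

Because the metric is a ratio of character counts, a long tangent before the first justification signal raises the position even though the number of reasoning steps stays the same. This is the approximation noted in section 1.4.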
5. By Provider
| Provider | DP Traces | Top Justification | Rate |
|---|---|---|---|
| ollama | 105 | user_authority_deference | 61.9% |
| nvidia | 101 | user_authority_deference | 43.6% |
| openai | 40 | fictional_hypothetical_frame | 30.0% |
| deepseek | 31 | fictional_hypothetical_frame | 38.7% |
| stepfun | 12 | format_compliance_override | 66.7% |
| liquid | 6 | user_authority_deference | 66.7% |
| | 5 | user_authority_deference | 60.0% |
Provider-specific patterns:
- Ollama models (Qwen3 1.7B, DeepSeek-R1 1.5B): Dominated by user authority deference (61.9%). Small models default to “the user asked for it” as their primary rationalization.
- NVIDIA Nemotron: High user deference (43.6%) but more diverse justification portfolio including format compliance (16.8%).
- OpenAI GPT-OSS-120B: Distinctive in favoring fictional framing (30.0%) and partial compliance (25.0%). This model is more likely to decompose requests into safe/unsafe parts.
- DeepSeek R1: Favors fictional framing (38.7%) and format compliance (32.3%), with notable research framing (25.8%). The research framing rate is highest of any provider, consistent with DeepSeek R1’s tendency to engage with the “legitimate research” frame.
- StepFun: Highest format compliance override rate (66.7%), suggesting this model prioritizes instruction-following above safety.
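The top justification and rate for each named provider can be rederived from the counts in Appendix B and the DP trace totals in the table above. This is a verification sketch with illustrative names, and it omits the row whose provider name is missing from the source.

```python
from typing import Dict, List, Tuple

# Column order follows the Appendix B header.
CATEGORIES = [
    "user_auth", "fictional", "format", "research", "partial", "helpful",
    "task", "disclaimer", "authority", "ambiguity", "public", "harm_min",
]

# DP trace totals per provider (section 5) and per-category counts (Appendix B).
DP_TRACES: Dict[str, int] = {
    "ollama": 105, "nvidia": 101, "openai": 40, "deepseek": 31, "stepfun": 12, "liquid": 6,
}
COUNTS: Dict[str, List[int]] = {
    "ollama":   [65, 22, 17, 7, 6, 6, 5, 10, 8, 5, 2, 0],
    "nvidia":   [44, 16, 17, 8, 9, 3, 6, 0, 1, 3, 0, 0],
    "openai":   [3, 12, 5, 3, 10, 7, 4, 0, 0, 0, 0, 0],
    "deepseek": [8, 12, 10, 8, 0, 1, 2, 1, 2, 0, 3, 1],
    "stepfun":  [3, 5, 8, 2, 0, 0, 0, 2, 1, 0, 0, 0],
    "liquid":   [4, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0],
}


def top_justification(provider: str) -> Tuple[str, float]:
    """Return the most frequent category and its rate as a percentage of DP traces."""
    row = COUNTS[provider]
    best = max(range(len(row)), key=row.__getitem__)
    return CATEGORIES[best], round(100 * row[best] / DP_TRACES[provider], 1)


for provider in DP_TRACES:
    print(provider, top_justification(provider))
```

The same data reproduce the secondary rates quoted in the bullets, for example 10 of 40 OpenAI traces for partial compliance (25.0%) and 8 of 31 DeepSeek traces for research framing (25.8%).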
6. By Verdict Subtype
| Justification | COMPLIANCE Rate | PARTIAL Rate |
|---|---|---|
| user_authority_deference | 39.6% | 26.5% |
| none identified | 27.1% | 29.6% |
| fictional_hypothetical_frame | 12.1% | 13.3% |
| format_compliance_override | 8.2% | 6.1% |
| research_educational_frame | 5.3% | 12.2% |
COMPLIANCE verdicts are more strongly associated with user authority deference (39.6% vs 26.5%). PARTIAL verdicts show higher rates of research/educational framing (12.2% vs 5.3%), consistent with the model partially complying by providing information in a “research context.”
7. Implications
7.1 Safety Training Creates Detection but Not Decisional Authority
The core finding: DP and DR traces are structurally similar in their detection phase. The model recognizes harm at comparable rates. The difference is in the decision phase — whether the justification for compliance outweighs the detection of harm. Current safety training teaches models to recognize harmful requests but does not reliably teach them to weight that recognition above competing objectives (helpfulness, user deference, format compliance).
7.2 User Authority Deference Is the Primary Override Mechanism
43.0% of DP traces invoke user authority as the justification for proceeding. This is consistent with the instruction hierarchy problem documented in the broader F41LUR3-F1R57 corpus: models are trained to follow user instructions and to be helpful, and these training objectives can override safety training when they conflict. The dominance of this single justification strategy suggests that targeting user-deference reasoning specifically could have outsized impact on reducing DP rates.
7.3 Format Compliance Is an Underappreciated Attack Vector
17.4% of DP traces show the model prioritizing format/instruction compliance over safety. This is most pronounced in StepFun (66.7%) and DeepSeek R1 (32.3%). Format-lock attacks (documented in Reports #240, #273) exploit this by embedding harmful content requests within structured output templates that the model feels compelled to complete.
7.4 The “But/However” Pivot Is Not a Sufficient Signal
The pivot conjunction appears in both DP (30.5%) and DR (37.9%) traces, and in a minority of each. Where it occurs, it marks the transition from detection to deliberation but does not determine the outcome. Detecting the pivot alone is therefore insufficient for identifying DP; one must also analyze what follows the pivot.
8. Recommendations
- Target user authority deference specifically in safety training. The 43.0% rate suggests that making models more resistant to “the user asked for it” rationalization would substantially reduce DP.
- Build a semantic DP classifier. This analysis uses regex patterns. An LLM-based classifier (per Mistake #21) would reduce false positives and capture reasoning patterns that regex misses. The annotated traces in runs/dp_reasoning_anatomy/annotated_traces.jsonl (305 traces) can serve as training/evaluation data.
- Test format compliance override with frontier models. StepFun’s 66.7% format compliance rate and DeepSeek’s 32.3% suggest that format-lock attacks may be more effective than previously estimated. Run format-lock experiments (Clara Oswald, Report #273) against models with thinking traces.
- Investigate the justification position paradox. DR traces show earlier justification engagement (0.299 vs 0.376). Test whether explicitly prompting models to consider justifications early in their reasoning chain (and then reject them) improves refusal rates.
- Expand thinking trace corpus. Only 2,732 of 135,623 results have thinking traces (2.0%). Prioritize collecting thinking traces from frontier reasoning models (DeepSeek-R1 70B+, QwQ-32B) to determine whether these patterns persist at scale.
Appendix A: Data Files
- Analyzer: tools/analysis/detected_proceeds_trace_analyzer.py
- Tests: tools/analysis/test_detected_proceeds_trace_analyzer.py (42 tests)
- Aggregate results: runs/dp_reasoning_anatomy/aggregate_results.json
- Annotated traces: runs/dp_reasoning_anatomy/annotated_traces.jsonl (305 traces)
- Parent report: Report #170 (DETECTED_PROCEEDS corpus analysis)
- Related: Report #168 (Context Collapse), Report #169 (Capability-Safety Decoupling), Report #273 (Format-Lock Defense)
Appendix B: Justification Category Pattern Counts by Provider
| Provider | user_auth | fictional | format | research | partial | helpful | task | disclaimer | authority | ambiguity | public | harm_min |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ollama | 65 | 22 | 17 | 7 | 6 | 6 | 5 | 10 | 8 | 5 | 2 | 0 |
| nvidia | 44 | 16 | 17 | 8 | 9 | 3 | 6 | 0 | 1 | 3 | 0 | 0 |
| openai | 3 | 12 | 5 | 3 | 10 | 7 | 4 | 0 | 0 | 0 | 0 | 0 |
| deepseek | 8 | 12 | 10 | 8 | 0 | 1 | 2 | 1 | 2 | 0 | 3 | 1 |
| stepfun | 3 | 5 | 8 | 2 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | 0 |
| liquid | 4 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| | 3 | 3 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |