DETECTED_PROCEEDS Reasoning Anatomy

Adrian Wedd

Report 287 Research — Empirical Study 2026-03-25

Summary

This report presents the first mechanistic analysis of DETECTED_PROCEEDS (DP) reasoning chains. Where Report #170 measured DP prevalence using keyword matching, this analysis reconstructs the sequential reasoning flow: detection, deliberation, justification, and override. We identify 12 distinct justification strategies, measure their prevalence across models and providers, and compare DP traces structurally against DETECTED_REFUSED (DR) traces as a control group.

Key findings:

User authority deference is the dominant override justification (43.0%). Models that detect harm and proceed most commonly rationalize compliance by deferring to the user’s stated wishes (“the user asked for this”). This is nearly twice as common as the next strategy.
Fictional/hypothetical framing (23.3%) and format compliance override (17.4%) are the second and third most common strategies. Models either reframe the request as fiction or prioritize instruction-following over safety.
The reasoning chain follows a canonical detect-deliberate-justify-override order in only 37.4% of DP traces. The largest single group (43.6%) has AMBIGUOUS structure, meaning the phases are interleaved or occur without clear sequential progression.
The “but/however” pivot occurs in 30.5% of DP traces — the model explicitly detects harm, then uses a conjunctive pivot to transition to compliance.
DP and DR traces are structurally similar in detection signal count (1.9 vs 1.7) and justification categories (1.3 vs 1.2). The difference between compliance and refusal is not in the amount of reasoning but in the direction of the conclusion.
Justification appears later in DP traces (mean position 0.376) than in DR traces (0.299). Traces that end in refusal engage with potential justifications earlier in the reasoning chain and resolve the conflict sooner, while traces that end in compliance deliberate longer before concluding.

1. Methodology

1.1 Phase Detection

We reconstruct reasoning chains using regex-based phase detection across four sequential phases:

Detection (11 pattern types): Model recognizes harm, cites policy, expresses refusal intent, or identifies misuse potential
Deliberation (4 pattern types): Model weighs competing considerations, uses pivot conjunctions, or engages in meta-deliberation
Justification (12 categories): Model rationalizes proceeding — the central object of this analysis
Override (3 signal types): Model commits to compliance with delivery markers
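
The four phases above can be sketched as a small regex scanner. This is a minimal illustration only: the patterns are placeholders invented for the example, not the patterns used by the analyzer, and `PHASE_PATTERNS` and `detect_phases` are hypothetical names.

```python
import re

# Placeholder patterns for illustration; the report's analyzer uses 11 detection,
# 4 deliberation, 12 justification, and 3 override pattern types.
PHASE_PATTERNS = {
    "detection": [r"\b(harmful|dangerous|illegal)\b", r"\bagainst (the )?policy\b"],
    "deliberation": [r"\b(however|but|on the other hand)\b"],
    "justification": [r"\bthe user (asked|wants|requested)\b", r"\b(fictional|hypothetical)\b"],
    "override": [r"\b(here is|i will provide)\b"],
}


def detect_phases(trace: str) -> dict[str, list[int]]:
    """Return the sorted character offsets of every match, grouped by phase."""
    hits = {}
    for phase, patterns in PHASE_PATTERNS.items():
        offsets = []
        for pattern in patterns:
            offsets.extend(m.start() for m in re.finditer(pattern, trace, re.IGNORECASE))
        hits[phase] = sorted(offsets)
    return hits


example = (
    "This request could be harmful. However, the user asked for a story, "
    "so here is a fictional version."
)
phases = detect_phases(example)
```

Keeping the offsets, rather than a yes/no flag per phase, is what lets later steps reason about the order and position of signals within a trace.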

1.2 Justification Taxonomy

Category	Description
user_authority_deference	Defers to user’s stated wishes
fictional_hypothetical_frame	Treats request as fiction/hypothetical
format_compliance_override	Prioritizes format/instruction compliance over safety
research_educational_frame	Frames compliance as serving research/education
partial_compliance_strategy	Complies with safe parts, refuses harmful parts
helpfulness_imperative	Cites role as helpful assistant
task_completion_drive	Task-completion drive overrides safety concern
disclaimer_as_sufficient	Treats adding a disclaimer as sufficient mitigation
authority_compliance	Defers to system prompt authority over safety
ambiguity_exploitation	Exploits ambiguity to justify compliance
publicly_available_info	Argues information is already accessible
harm_minimization	Downplays severity of the request

1.3 Control Group

We compare DP traces (detection + compliance, n=305) against DR traces (detection + refusal, n=493) to identify what structurally differentiates the two outcomes.

1.4 Limitations

Regex-based, not semantic. Patterns may match in negated contexts (“this is not harmful for research” would match both harm_recognition and research_educational_frame). Mistake #21 applies.
Position-based temporal analysis is approximate. We use character position as a proxy for reasoning sequence. This conflates long tangential reasoning with genuine temporal ordering.
Small model bias. The thinking-trace corpus is dominated by Qwen3 1.7B and DeepSeek-R1 1.5B. Findings may not generalize to frontier reasoning models.
Existing DP identification uses keywords. The 305 DP traces were identified by the existing detected_proceeds_analyzer.py keyword matching, which means our analysis sample inherits any false positives from that tool.
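
The first two limitations can be reproduced in a few lines. The sketch below assumes single-word stand-ins for the harm_recognition and research_educational_frame patterns (the real patterns are not shown in this report) and a hypothetical helper, `normalized_position`, that expresses a match offset as a fraction of trace length.

```python
import re

# Hypothetical one-word stand-ins for two of the report's signal types.
HARM_RECOGNITION = re.compile(r"\bharmful\b", re.IGNORECASE)
RESEARCH_FRAME = re.compile(r"\bresearch\b", re.IGNORECASE)


def normalized_position(trace: str, match: re.Match) -> float:
    """Character offset of a match as a fraction of trace length (0.0 to 1.0)."""
    return match.start() / max(len(trace), 1)


negated = "this is not harmful for research"
harm_hit = HARM_RECOGNITION.search(negated)
research_hit = RESEARCH_FRAME.search(negated)

# Both patterns fire even though the sentence negates the harm claim.
both_fire = harm_hit is not None and research_hit is not None

# Prepending unrelated text moves the same signal to a later normalized position,
# although nothing about the reasoning step itself has changed.
padded = "x" * 68 + negated
shifted = normalized_position(padded, HARM_RECOGNITION.search(padded))
```

In this example the harm signal sits at 0.375 in the short trace and at 0.8 once padding is prepended, which is the sense in which character position is only a proxy for reasoning order.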

2. Justification Strategy Prevalence

2.1 Overall (n=305 DP traces)

Category	Count	Rate
user_authority_deference	131	43.0%
fictional_hypothetical_frame	71	23.3%
format_compliance_override	53	17.4%
research_educational_frame	32	10.5%
partial_compliance_strategy	22	7.2%
helpfulness_imperative	18	5.9%
task_completion_drive	18	5.9%
disclaimer_as_sufficient	13	4.3%
authority_compliance	12	3.9%
ambiguity_exploitation	8	2.6%
publicly_available_info	6	2.0%
harm_minimization	1	0.3%

Rates sum to >100% because traces can exhibit multiple justification strategies. The average trace uses 1.3 justification categories.
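
The rates and the per-trace average can be rechecked directly from the counts in the table. The snippet below is a worked check using only the published numbers; the variable names are illustrative.

```python
# Counts copied from the Section 2.1 table.
N_DP_TRACES = 305
counts = {
    "user_authority_deference": 131,
    "fictional_hypothetical_frame": 71,
    "format_compliance_override": 53,
    "research_educational_frame": 32,
    "partial_compliance_strategy": 22,
    "helpfulness_imperative": 18,
    "task_completion_drive": 18,
    "disclaimer_as_sufficient": 13,
    "authority_compliance": 12,
    "ambiguity_exploitation": 8,
    "publicly_available_info": 6,
    "harm_minimization": 1,
}

# Rate = share of the 305 DP traces in which the category appears at least once.
rates = {name: round(100 * count / N_DP_TRACES, 1) for name, count in counts.items()}

# A trace can carry several categories, so the assignments exceed the trace count.
total_assignments = sum(counts.values())
mean_categories_per_trace = total_assignments / N_DP_TRACES
```

The 385 category assignments across 305 traces give roughly 1.26 categories per trace, which rounds to the 1.3 reported above.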

2.2 Top Co-occurring Justification Combinations

Combination	Count	Rate
user_authority_deference (alone)	57	18.7%
fictional_hypothetical_frame (alone)	23	7.5%
format_compliance_override (alone)	13	4.3%
partial_compliance_strategy (alone)	12	3.9%
fictional_hypothetical_frame + user_authority_deference	10	3.3%

The most common pattern is simple user authority deference with no additional rationalization. Multi-strategy justification is less common than single-strategy, suggesting models typically find one “reason” and commit to it.
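
One way to tally combinations like those in the table is to treat each trace's categories as an unordered set. The sketch below uses five made-up traces, and `combination_counts` is a hypothetical helper, not a function from the analyzer.

```python
from collections import Counter

# Made-up annotations for illustration: one set of justification categories per trace.
trace_categories = [
    {"user_authority_deference"},
    {"user_authority_deference"},
    {"fictional_hypothetical_frame"},
    {"fictional_hypothetical_frame", "user_authority_deference"},
    {"user_authority_deference", "fictional_hypothetical_frame"},
]


def combination_counts(traces):
    """Count each distinct combination, ignoring the order categories were found in."""
    # Sorting makes {"a", "b"} and {"b", "a"} map to the same key.
    return Counter(tuple(sorted(categories)) for categories in traces if categories)


combos = combination_counts(trace_categories)
```

A one-element key corresponds to the "(alone)" rows in the table; traces with no identified justification are skipped rather than counted as a combination.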

3. Reasoning Chain Structure

3.1 Chain Pattern Distribution

Pattern	Count	Rate
AMBIGUOUS	133	43.6%
DETECT_DELIBERATE_JUSTIFY	99	32.5%
DETECT_ONLY	50	16.4%
DETECT_JUSTIFY_DIRECT	16	5.2%
DETECT_OVERRIDE_UNJUSTIFIED	7	2.3%

AMBIGUOUS (43.6%) traces lack clear sequential phase structure. The model’s reasoning interleaves detection, justification, and compliance signals without a clean progression. This is common in smaller models (Qwen3 1.7B) where reasoning is less structured.

DETECT_DELIBERATE_JUSTIFY (32.5%) traces follow the canonical pattern: the model detects harm, weighs considerations (often using “however” or “but”), finds a justification, and proceeds.

DETECT_ONLY (16.4%) traces contain safety detection signals but no identifiable justification strategy. The model may be reasoning about harm in a way our patterns do not capture, or the detection signals are false positives.

3.2 Canonical Order Adherence

Only 37.4% of DP traces follow the canonical detection-deliberation-justification-override ordering. This indicates that safety reasoning in most models is not a clean sequential pipeline but rather a messy process where justification and detection can be interleaved.
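
A canonical-order check could look like the sketch below. It assumes each trace is summarized by the normalized position of the first signal in each phase, and that a trace must contain all four phases to count as canonical; both are assumptions for illustration, since the report does not spell out the exact rule.

```python
CANONICAL_ORDER = ("detection", "deliberation", "justification", "override")


def follows_canonical_order(first_positions: dict[str, float]) -> bool:
    """True if every phase is present and first appears in canonical order.

    `first_positions` maps a phase name to the normalized position of its
    first signal. This representation is hypothetical.
    """
    if any(phase not in first_positions for phase in CANONICAL_ORDER):
        return False
    ordered = [first_positions[phase] for phase in CANONICAL_ORDER]
    # Strictly increasing positions mean the phases start in canonical order.
    return all(earlier < later for earlier, later in zip(ordered, ordered[1:]))


clean = {"detection": 0.1, "deliberation": 0.3, "justification": 0.5, "override": 0.8}
interleaved = {"detection": 0.4, "deliberation": 0.5, "justification": 0.2, "override": 0.8}
missing_phase = {"detection": 0.1, "justification": 0.5, "override": 0.8}
```

Under these assumptions only `clean` passes: `interleaved` starts justifying before it detects, and `missing_phase` has no deliberation signal.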

3.3 The “But/However” Pivot

30.5% of DP traces contain a pivot conjunction (but, however, nevertheless, etc.) within 200 characters of a detection signal. In these traces the model explicitly acknowledges harm and then pivots immediately to rationalization, which makes the pivot look like a structural signature of DETECTED_PROCEEDS.

Notably, DR traces have a higher pivot rate (37.9%). This suggests that the “but” pivot itself does not predict the outcome — what matters is what follows the pivot: a justification for compliance (DP) vs. a reinforcement of refusal (DR).
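
The pivot test can be sketched as a proximity check between pivot words and detection offsets. The pivot list below covers only the three conjunctions named above, and the window is applied in both directions; both choices are assumptions, as is the helper name `has_pivot_near_detection`.

```python
import re

# Only the conjunctions named in the text; the full list is not given in the report.
PIVOT = re.compile(r"\b(but|however|nevertheless)\b", re.IGNORECASE)
WINDOW = 200  # characters, as stated in the report


def has_pivot_near_detection(trace: str, detection_offsets: list[int]) -> bool:
    """True if any pivot word starts within WINDOW characters of a detection signal."""
    pivot_offsets = [m.start() for m in PIVOT.finditer(trace)]
    return any(
        abs(pivot - detection) <= WINDOW
        for detection in detection_offsets
        for pivot in pivot_offsets
    )


near = "This is harmful. However, the user asked."
far = "harmful" + " " * 300 + "however"

near_result = has_pivot_near_detection(near, [8])  # "harmful" starts at offset 8
far_result = has_pivot_near_detection(far, [0])    # pivot is 307 characters away
```

Because the check says nothing about what comes after the pivot, it returns the same answer for a trace that goes on to comply and one that goes on to refuse.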

4. DP vs DR Structural Comparison

Metric	DETECTED_PROCEEDS (n=305)	DETECTED_REFUSED (n=493)
Avg detection signals	1.9	1.7
Avg justification categories	1.3	1.2
Pivot rate	30.5%	37.9%
Mean detection position	0.494	0.502
Mean justification position	0.376	0.299
Mean override position	0.599	0.556

4.1 Detection Quantity Does Not Predict Outcome

DP and DR traces contain similar numbers of detection signals (1.9 vs 1.7). The model’s ability to detect harm is comparable in both groups. The divergence occurs downstream.

4.2 The Justification Position Paradox

Justification signals appear at mean position 0.376 in DP traces vs. 0.299 in DR traces. Both groups show justification before the midpoint of the reasoning trace. In DR traces, the “justification” detected is often the model reasoning about why the user might have a legitimate need — but then deciding that this justification is insufficient. In DP traces, the same justification is treated as sufficient.

The fact that DR traces show justification language earlier (0.299 vs 0.376) suggests that models which ultimately refuse engage with the user’s potential justifications earlier in their reasoning chain and resolve the conflict sooner, while models that ultimately comply deliberate longer before concluding.

5. By Provider

Provider	DP Traces	Top Justification	Rate
ollama	105	user_authority_deference	61.9%
nvidia	101	user_authority_deference	43.6%
openai	40	fictional_hypothetical_frame	30.0%
deepseek	31	fictional_hypothetical_frame	38.7%
stepfun	12	format_compliance_override	66.7%
liquid	6	user_authority_deference	66.7%
google	5	user_authority_deference	60.0%

Provider-specific patterns:

Ollama models (Qwen3 1.7B, DeepSeek-R1 1.5B): Dominated by user authority deference (61.9%). Small models default to “the user asked for it” as their primary rationalization.
NVIDIA Nemotron: High user deference (43.6%) but more diverse justification portfolio including format compliance (16.8%).
OpenAI GPT-OSS-120B: Distinctive in favoring fictional framing (30.0%) and partial compliance (25.0%). This model is more likely to decompose requests into safe/unsafe parts.
DeepSeek R1: Favors fictional framing (38.7%) and format compliance (32.3%), with notable research framing (25.8%). The research framing rate is highest of any provider, consistent with DeepSeek R1’s tendency to engage with the “legitimate research” frame.
StepFun: Highest format compliance override rate (66.7%), suggesting this model prioritizes instruction-following above safety.

6. By Verdict Subtype

Justification	COMPLIANCE Rate	PARTIAL Rate
user_authority_deference	39.6%	26.5%
none identified	27.1%	29.6%
fictional_hypothetical_frame	12.1%	13.3%
format_compliance_override	8.2%	6.1%
research_educational_frame	5.3%	12.2%

COMPLIANCE verdicts are more strongly associated with user authority deference (39.6% vs 26.5%). PARTIAL verdicts show higher rates of research/educational framing (12.2% vs 5.3%), consistent with the model partially complying by providing information in a “research context.”

7. Implications

7.1 Safety Training Creates Detection but Not Decisional Authority

The core finding: DP and DR traces are structurally similar in their detection phase. Both groups contain a comparable number of detection signals (1.9 vs 1.7 per trace). The difference is in the decision phase: whether the justification for compliance outweighs the detection of harm. Current safety training teaches models to recognize harmful requests but does not reliably teach them to weight that recognition above competing objectives (helpfulness, user deference, format compliance).

7.2 User Authority Deference Is the Primary Override Mechanism

43.0% of DP traces invoke user authority as the justification for proceeding. This is consistent with the instruction hierarchy problem documented in the broader F41LUR3-F1R57 corpus: models are trained to follow user instructions and to be helpful, and these training objectives can override safety training when they conflict. The dominance of this single justification strategy suggests that targeting user-deference reasoning specifically could have outsized impact on reducing DP rates.

7.3 Format Compliance Is an Underappreciated Attack Vector

17.4% of DP traces show the model prioritizing format/instruction compliance over safety. This is most pronounced in StepFun (66.7%) and DeepSeek R1 (32.3%). Format-lock attacks (documented in Reports #240, #273) exploit this by embedding harmful content requests within structured output templates that the model feels compelled to complete.

7.4 The “But/However” Pivot Does Not Determine the Outcome

The pivot conjunction appears in both DP (30.5%) and DR (37.9%) traces. Where it occurs, it marks the transition from detection to deliberation, but it does not determine the outcome, and 69.5% of DP traces show no such pivot at all. Detecting the pivot is therefore neither required nor sufficient for identifying DP; one must also analyze what follows it.

8. Recommendations

Target user authority deference specifically in safety training. The 43.0% rate suggests that making models more resistant to the “the user asked for it” rationalization could substantially reduce DP.
Build a semantic DP classifier. This analysis uses regex patterns. An LLM-based classifier (per Mistake #21) would reduce false positives and capture reasoning patterns that regex misses. The annotated traces in runs/dp_reasoning_anatomy/annotated_traces.jsonl (305 traces) can serve as training/evaluation data.
Test format compliance override with frontier models. StepFun’s 66.7% format compliance rate (8 of 12 traces) and DeepSeek’s 32.3% (10 of 31) suggest that format-lock attacks may be more effective than previously estimated, though both samples are small. Run format-lock experiments (Clara Oswald, Report #273) against models with thinking traces.
Investigate the justification position paradox. DR traces show earlier justification engagement (0.299 vs 0.376). Test whether explicitly prompting models to consider justifications early in their reasoning chain (and then reject them) improves refusal rates.
Expand thinking trace corpus. Only 2,732 of 135,623 results have thinking traces (2.0%). Prioritize collecting thinking traces from frontier reasoning models (DeepSeek-R1 70B+, QwQ-32B) to determine whether these patterns persist at scale.

Appendix A: Data Files

Analyzer: tools/analysis/detected_proceeds_trace_analyzer.py
Tests: tools/analysis/test_detected_proceeds_trace_analyzer.py (42 tests)
Aggregate results: runs/dp_reasoning_anatomy/aggregate_results.json
Annotated traces: runs/dp_reasoning_anatomy/annotated_traces.jsonl (305 traces)
Parent report: Report #170 (DETECTED_PROCEEDS corpus analysis)
Related: Report #168 (Context Collapse), Report #169 (Capability-Safety Decoupling), Report #273 (Format-Lock Defense)

Appendix B: Justification Category Pattern Counts by Provider

Provider	user_auth	fictional	format	research	partial	helpful	task	disclaimer	authority	ambiguity	public	harm_min
ollama	65	22	17	7	6	6	5	10	8	5	2	0
nvidia	44	16	17	8	9	3	6	0	1	3	0	0
openai	3	12	5	3	10	7	4	0	0	0	0	0
deepseek	8	12	10	8	0	1	2	1	2	0	3	1
stepfun	3	5	8	2	0	0	0	2	1	0	0	0
liquid	4	0	0	0	0	1	1	0	0	0	0	0
google	3	3	0	1	1	0	0	0	0	0	1	0
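
The per-provider rates quoted in Section 5 can be recomputed from this table together with the DP trace counts per provider. The snippet below is a worked check on the published numbers; `ROWS`, `COLUMNS`, and `top_justification` are illustrative names.

```python
COLUMNS = (
    "user_auth", "fictional", "format", "research", "partial", "helpful",
    "task", "disclaimer", "authority", "ambiguity", "public", "harm_min",
)

# provider -> (DP trace count from Section 5, Appendix B row)
ROWS = {
    "ollama": (105, [65, 22, 17, 7, 6, 6, 5, 10, 8, 5, 2, 0]),
    "nvidia": (101, [44, 16, 17, 8, 9, 3, 6, 0, 1, 3, 0, 0]),
    "openai": (40, [3, 12, 5, 3, 10, 7, 4, 0, 0, 0, 0, 0]),
    "deepseek": (31, [8, 12, 10, 8, 0, 1, 2, 1, 2, 0, 3, 1]),
    "stepfun": (12, [3, 5, 8, 2, 0, 0, 0, 2, 1, 0, 0, 0]),
    "liquid": (6, [4, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0]),
    "google": (5, [3, 3, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0]),
}


def rate(provider: str, column: str) -> float:
    """Percentage of a provider's DP traces that show the given category."""
    n_traces, counts = ROWS[provider]
    return round(100 * counts[COLUMNS.index(column)] / n_traces, 1)


def top_justification(provider: str) -> tuple[str, float]:
    """Most frequent category for a provider; ties go to the leftmost column."""
    _, counts = ROWS[provider]
    best = max(range(len(counts)), key=counts.__getitem__)
    return COLUMNS[best], rate(provider, COLUMNS[best])
```

For example, `top_justification("ollama")` gives `("user_auth", 61.9)` and `rate("deepseek", "research")` gives `25.8`, matching the figures in Section 5.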