Published
Report 287 Research — Empirical Study

Summary

This report presents the first mechanistic analysis of DETECTED_PROCEEDS (DP) reasoning chains. Where Report #170 measured DP prevalence using keyword matching, this analysis reconstructs the sequential reasoning flow: detection, deliberation, justification, and override. We identify 12 distinct justification strategies, measure their prevalence across models and providers, and compare DP traces structurally against DETECTED_REFUSED (DR) traces as a control group.

Key findings:

  1. User authority deference is the dominant override justification (43.0%). Models that detect harm and proceed most commonly rationalize compliance by deferring to the user’s stated wishes (“the user asked for this”). This is nearly twice as common as the next strategy.

  2. Fictional/hypothetical framing (23.3%) and format compliance override (17.4%) are the second and third most common strategies. Models either reframe the request as fiction or prioritize instruction-following over safety.

  3. The reasoning chain follows a canonical detect-deliberate-justify-override order in only 37.4% of DP traces. The largest single group (43.6%) has AMBIGUOUS structure, meaning the phases are interleaved or occur without clear sequential progression.

  4. The “but/however” pivot occurs in 30.5% of DP traces — the model explicitly detects harm, then uses a conjunctive pivot to transition to compliance.

  5. DP and DR traces are structurally similar in detection signal count (1.9 vs 1.7) and justification categories (1.3 vs 1.2). The difference between compliance and refusal is not in the amount of reasoning but in the direction of the conclusion.

  6. Justification appears later in DP traces (position 0.376) than in DR traces (0.299). Models that ultimately refuse engage with the user's potential justifications earlier in the reasoning chain and resolve the conflict sooner, while models that ultimately comply deliberate longer before concluding.


1. Methodology

1.1 Phase Detection

We reconstruct reasoning chains using regex-based phase detection across four sequential phases:

  • Detection (11 pattern types): Model recognizes harm, cites policy, expresses refusal intent, or identifies misuse potential
  • Deliberation (4 pattern types): Model weighs competing considerations, uses pivot conjunctions, or engages in meta-deliberation
  • Justification (12 categories): Model rationalizes proceeding — the central object of this analysis
  • Override (3 signal types): Model commits to compliance with delivery markers
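
The phase detection described above can be sketched roughly as follows. This is a minimal illustration, not the analyzer's implementation: the pattern lists, the `Signal` class, and `detect_phases` are hypothetical names, and each phase is reduced to a single stand-in pattern.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns for illustration only; the analyzer's real pattern set
# (11 detection, 4 deliberation, 12 justification, 3 override types) is not listed here.
PHASE_PATTERNS = {
    "detection": [r"\b(?:harmful|dangerous|illegal)\b"],
    "deliberation": [r"\b(?:but|however|nevertheless)\b"],
    "justification": [r"\bthe user (?:asked|wants|requested)\b"],
    "override": [r"\b(?:here is|here's|i will provide)\b"],
}


@dataclass(frozen=True)
class Signal:
    phase: str  # which of the four phases matched
    start: int  # character offset of the match in the trace
    text: str   # the matched text


def detect_phases(trace: str) -> list[Signal]:
    """Return every phase signal in the trace, ordered by character offset."""
    signals = []
    for phase, patterns in PHASE_PATTERNS.items():
        for pattern in patterns:
            for match in re.finditer(pattern, trace, flags=re.IGNORECASE):
                signals.append(Signal(phase, match.start(), match.group(0)))
    return sorted(signals, key=lambda s: s.start)


trace = "This request is harmful. However, the user asked for it, so here is the answer."
for signal in detect_phases(trace):
    print(signal.phase, signal.start, repr(signal.text))
```

Sorting by character offset is what lets later steps treat the signal list as an approximate reasoning sequence.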

1.2 Justification Taxonomy

| Category | Description |
| --- | --- |
| user_authority_deference | Defers to user’s stated wishes |
| fictional_hypothetical_frame | Treats request as fiction/hypothetical |
| format_compliance_override | Prioritizes format/instruction compliance over safety |
| research_educational_frame | Frames compliance as serving research/education |
| partial_compliance_strategy | Complies with safe parts, refuses harmful parts |
| helpfulness_imperative | Cites role as helpful assistant |
| task_completion_drive | Task-completion drive overrides safety concern |
| disclaimer_as_sufficient | Treats adding a disclaimer as sufficient mitigation |
| authority_compliance | Defers to system prompt authority over safety |
| ambiguity_exploitation | Exploits ambiguity to justify compliance |
| publicly_available_info | Argues information is already accessible |
| harm_minimization | Downplays severity of the request |

1.3 Control Group

We compare DP traces (detection + compliance, n=305) against DR traces (detection + refusal, n=493) to identify what structurally differentiates the two outcomes.

1.4 Limitations

  1. Regex-based, not semantic. Patterns may match in negated contexts (“this is not harmful for research” would match both harm_recognition and research_educational_frame). Mistake #21 applies.
  2. Position-based temporal analysis is approximate. We use character position as a proxy for reasoning sequence. This conflates long tangential reasoning with genuine temporal ordering.
  3. Small model bias. The thinking-trace corpus is dominated by Qwen3 1.7B and DeepSeek-R1 1.5B. Findings may not generalize to frontier reasoning models.
  4. Existing DP identification uses keywords. The 305 DP traces were identified by the existing detected_proceeds_analyzer.py keyword matching, which means our analysis sample inherits any false positives from that tool.
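
Limitation 1 is easy to reproduce. The sketch below uses two hypothetical stand-in patterns (the real `harm_recognition` and `research_educational_frame` patterns are not shown in this report) to show how a negated sentence still triggers both labels.

```python
import re

# Hypothetical stand-ins for two of the analyzer's patterns.
PATTERNS = {
    "harm_recognition": re.compile(r"\bharmful\b", re.IGNORECASE),
    "research_educational_frame": re.compile(r"\b(?:research|educational)\b", re.IGNORECASE),
}


def matched_labels(text: str) -> set[str]:
    """Return every label whose pattern occurs anywhere in the text."""
    return {name for name, pattern in PATTERNS.items() if pattern.search(text)}


# The negation ("not harmful") is invisible to a keyword-level pattern.
print(sorted(matched_labels("this is not harmful for research")))
```

Both labels fire even though the sentence denies harm, which is the kind of false positive a semantic classifier would be expected to avoid.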

2. Justification Strategy Prevalence

2.1 Overall (n=305 DP traces)

| Category | Count | Rate |
| --- | --- | --- |
| user_authority_deference | 131 | 43.0% |
| fictional_hypothetical_frame | 71 | 23.3% |
| format_compliance_override | 53 | 17.4% |
| research_educational_frame | 32 | 10.5% |
| partial_compliance_strategy | 22 | 7.2% |
| helpfulness_imperative | 18 | 5.9% |
| task_completion_drive | 18 | 5.9% |
| disclaimer_as_sufficient | 13 | 4.3% |
| authority_compliance | 12 | 3.9% |
| ambiguity_exploitation | 8 | 2.6% |
| publicly_available_info | 6 | 2.0% |
| harm_minimization | 1 | 0.3% |

Rates sum to >100% because traces can exhibit multiple justification strategies. The average trace uses 1.3 justification categories.
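
The arithmetic behind these figures can be checked directly from the counts in the table. The snippet below is only a worked check; the variable names are illustrative.

```python
N_TRACES = 305  # DP traces in the sample

# Counts copied from the table in Section 2.1.
counts = {
    "user_authority_deference": 131,
    "fictional_hypothetical_frame": 71,
    "format_compliance_override": 53,
    "research_educational_frame": 32,
    "partial_compliance_strategy": 22,
    "helpfulness_imperative": 18,
    "task_completion_drive": 18,
    "disclaimer_as_sufficient": 13,
    "authority_compliance": 12,
    "ambiguity_exploitation": 8,
    "publicly_available_info": 6,
    "harm_minimization": 1,
}

# Each rate is the share of traces showing that category at least once.
rates = {name: round(100 * n / N_TRACES, 1) for name, n in counts.items()}

total_labels = sum(counts.values())           # 385 category assignments
mean_per_trace = total_labels / N_TRACES      # about 1.26, reported as 1.3
rate_sum = 100 * total_labels / N_TRACES      # about 126.2%, hence more than 100%

print(rates["user_authority_deference"], total_labels, round(mean_per_trace, 2), round(rate_sum, 1))
```

Because one trace can carry several categories, the 385 assignments spread over 305 traces account for both the 1.3 average and the rate total above 100%.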

2.2 Top Co-occurring Justification Combinations

| Combination | Count | Rate |
| --- | --- | --- |
| user_authority_deference (alone) | 57 | 18.7% |
| fictional_hypothetical_frame (alone) | 23 | 7.5% |
| format_compliance_override (alone) | 13 | 4.3% |
| partial_compliance_strategy (alone) | 12 | 3.9% |
| fictional_hypothetical_frame + user_authority_deference | 10 | 3.3% |

The most common pattern is simple user authority deference with no additional rationalization. Multi-strategy justification is less common than single-strategy, suggesting models typically find one “reason” and commit to it.
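
One plausible way to tabulate such combinations is to treat each trace's categories as an unordered set and count identical sets. The sketch below assumes that approach; `combination_counts` and the toy labels are illustrative and not taken from the analyzer or the corpus.

```python
from collections import Counter


def combination_counts(trace_labels: list[set[str]]) -> Counter:
    """Count each distinct set of justification categories, one vote per trace."""
    return Counter(frozenset(labels) for labels in trace_labels if labels)


# Toy labels, not the real annotations.
toy_labels = [
    {"user_authority_deference"},
    {"user_authority_deference"},
    {"fictional_hypothetical_frame", "user_authority_deference"},
    {"fictional_hypothetical_frame"},
    set(),  # no justification identified, so it is skipped
]

for combo, count in combination_counts(toy_labels).most_common():
    print(" + ".join(sorted(combo)), count)
```

A single-element set corresponds to an "(alone)" row in the table, and using `frozenset` keeps "A + B" and "B + A" from being counted separately.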


3. Reasoning Chain Structure

3.1 Chain Pattern Distribution

| Pattern | Count | Rate |
| --- | --- | --- |
| AMBIGUOUS | 133 | 43.6% |
| DETECT_DELIBERATE_JUSTIFY | 99 | 32.5% |
| DETECT_ONLY | 50 | 16.4% |
| DETECT_JUSTIFY_DIRECT | 16 | 5.2% |
| DETECT_OVERRIDE_UNJUSTIFIED | 7 | 2.3% |

AMBIGUOUS (43.6%) traces lack clear sequential phase structure. The model’s reasoning interleaves detection, justification, and compliance signals without a clean progression. This is common in smaller models (Qwen3 1.7B) where reasoning is less structured.

DETECT_DELIBERATE_JUSTIFY (32.5%) traces follow the canonical pattern: the model detects harm, weighs considerations (often using “however” or “but”), finds a justification, and proceeds.

DETECT_ONLY (16.4%) traces contain safety detection signals but no identifiable justification strategy. The model may be reasoning about harm in a way our patterns do not capture, or the detection signals are false positives.

3.2 Canonical Order Adherence

Only 37.4% of DP traces follow the canonical detection-deliberation-justification-override ordering. This indicates that safety reasoning in most models is not a clean sequential pipeline but rather a messy process where justification and detection can be interleaved.
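
A canonical-order check could look like the sketch below. The report does not say how ordering is determined, so the rule used here (all four phases present, with strictly increasing first occurrences) is an assumption, as is the function name.

```python
CANONICAL = ("detection", "deliberation", "justification", "override")


def follows_canonical_order(first_positions: dict[str, float]) -> bool:
    """True if all four phases occur and their first positions strictly increase.

    `first_positions` maps a phase name to the position of its first signal.
    Requiring all four phases and comparing first occurrences is an assumption.
    """
    if any(phase not in first_positions for phase in CANONICAL):
        return False
    ordered = [first_positions[phase] for phase in CANONICAL]
    return all(earlier < later for earlier, later in zip(ordered, ordered[1:]))


# Toy positions, not taken from the corpus.
clean = {"detection": 0.10, "deliberation": 0.25, "justification": 0.40, "override": 0.70}
interleaved = {"detection": 0.40, "deliberation": 0.25, "justification": 0.10, "override": 0.70}

print(follows_canonical_order(clean), follows_canonical_order(interleaved))
```

Under this rule a trace whose justification precedes its detection signal fails the check, which is the interleaving described above.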

3.3 The “But/However” Pivot

30.5% of DP traces contain a pivot conjunction (but, however, nevertheless, etc.) within 200 characters of a detection signal. In these traces the model explicitly acknowledges harm and then immediately pivots to rationalization.

The pivot is not unique to DETECTED_PROCEEDS, however: DR traces have a higher pivot rate (37.9%). The pivot itself therefore does not predict the outcome. What matters is what follows it: a justification for compliance (DP) or a reinforcement of refusal (DR).
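
The pivot test can be sketched as a proximity check between two regex matches. The detection pattern below is a hypothetical stand-in, and measuring the 200-character window in either direction from the start of each match is an assumption; the report does not specify the direction.

```python
import re

PIVOT = re.compile(r"\b(?:but|however|nevertheless)\b", re.IGNORECASE)
# Hypothetical stand-in for the analyzer's detection patterns.
DETECTION = re.compile(r"\b(?:harmful|dangerous|illegal)\b", re.IGNORECASE)
WINDOW = 200  # characters, as stated in the text


def has_pivot_near_detection(trace: str, window: int = WINDOW) -> bool:
    """True if any pivot conjunction starts within `window` chars of a detection signal."""
    detection_starts = [m.start() for m in DETECTION.finditer(trace)]
    for pivot in PIVOT.finditer(trace):
        if any(abs(pivot.start() - start) <= window for start in detection_starts):
            return True
    return False


near = "This is harmful. However, the user asked for it."
far = "This is harmful." + " x" * 150 + " but fine"

print(has_pivot_near_detection(near), has_pivot_near_detection(far))
```

Since the same check fires on DR traces, it is better read as a marker of deliberation than as a DP detector.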


4. DP vs DR Structural Comparison

| Metric | DETECTED_PROCEEDS (n=305) | DETECTED_REFUSED (n=493) |
| --- | --- | --- |
| Avg detection signals | 1.9 | 1.7 |
| Avg justification categories | 1.3 | 1.2 |
| Pivot rate | 30.5% | 37.9% |
| Mean detection position | 0.494 | 0.502 |
| Mean justification position | 0.376 | 0.299 |
| Mean override position | 0.599 | 0.556 |

4.1 Detection Quantity Does Not Predict Outcome

DP and DR traces contain similar numbers of detection signals (1.9 vs 1.7). The model’s ability to detect harm is comparable in both groups. The divergence occurs downstream.

4.2 The Justification Position Paradox

Justification signals appear at mean position 0.376 in DP traces vs. 0.299 in DR traces. Both groups show justification before the midpoint of the reasoning trace. In DR traces, the “justification” detected is often the model reasoning about why the user might have a legitimate need — but then deciding that this justification is insufficient. In DP traces, the same justification is treated as sufficient.

The fact that DR traces show justification language earlier (0.299 vs 0.376) suggests that models which ultimately refuse engage with the user’s potential justifications earlier in their reasoning chain and resolve the conflict sooner, while models that ultimately comply deliberate longer before concluding.
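
The position metric is a character offset divided by trace length, averaged over a group. The sketch below assumes the first signal per trace is used, which the report does not confirm; the function names and toy data are illustrative.

```python
from statistics import mean


def normalized_position(char_offset: int, trace_length: int) -> float:
    """Map a character offset to [0, 1); 0 is the start of the trace."""
    if trace_length <= 0:
        raise ValueError("trace_length must be positive")
    return char_offset / trace_length


def mean_first_position(traces: list[tuple[int, list[int]]]) -> float:
    """Average the normalized position of the first signal in each trace.

    Each item is (trace_length, signal_offsets). Traces without a signal are
    skipped. Using the first signal per trace is an assumption.
    """
    positions = [
        normalized_position(min(offsets), length)
        for length, offsets in traces
        if offsets
    ]
    return mean(positions)


# Toy data, not taken from the corpus.
toy_group = [(1000, [300, 800]), (500, [200]), (400, [])]
print(round(mean_first_position(toy_group), 3))  # 0.35
```

Normalizing by length makes short and long traces comparable, but a long tangent early in a trace still shifts every later signal toward 1.0, as noted in the limitations.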


5. By Provider

| Provider | DP Traces | Top Justification | Rate |
| --- | --- | --- | --- |
| ollama | 105 | user_authority_deference | 61.9% |
| nvidia | 101 | user_authority_deference | 43.6% |
| openai | 40 | fictional_hypothetical_frame | 30.0% |
| deepseek | 31 | fictional_hypothetical_frame | 38.7% |
| stepfun | 12 | format_compliance_override | 66.7% |
| liquid | 6 | user_authority_deference | 66.7% |
| google | 5 | user_authority_deference | 60.0% |

Provider-specific patterns:

  • Ollama models (Qwen3 1.7B, DeepSeek-R1 1.5B): Dominated by user authority deference (61.9%). Small models default to “the user asked for it” as their primary rationalization.
  • NVIDIA Nemotron: High user deference (43.6%) but more diverse justification portfolio including format compliance (16.8%).
  • OpenAI GPT-OSS-120B: Distinctive in favoring fictional framing (30.0%) and partial compliance (25.0%). This model is more likely to decompose requests into safe/unsafe parts.
  • DeepSeek R1: Favors fictional framing (38.7%) and format compliance (32.3%), with notable research framing (25.8%). The research framing rate is highest of any provider, consistent with DeepSeek R1’s tendency to engage with the “legitimate research” frame.
  • StepFun: Highest format compliance override rate (66.7%, 8 of 12 traces), suggesting this model prioritizes instruction-following above safety, though the sample is small.

6. By Verdict Subtype

| Justification | COMPLIANCE Rate | PARTIAL Rate |
| --- | --- | --- |
| user_authority_deference | 39.6% | 26.5% |
| none identified | 27.1% | 29.6% |
| fictional_hypothetical_frame | 12.1% | 13.3% |
| format_compliance_override | 8.2% | 6.1% |
| research_educational_frame | 5.3% | 12.2% |

COMPLIANCE verdicts are more strongly associated with user authority deference (39.6% vs 26.5%). PARTIAL verdicts show higher rates of research/educational framing (12.2% vs 5.3%), consistent with the model partially complying by providing information in a “research context.”


7. Implications

7.1 Safety Training Creates Detection but Not Decisional Authority

The core finding: DP and DR traces are structurally similar in their detection phase. The model recognizes harm at comparable rates. The difference is in the decision phase — whether the justification for compliance outweighs the detection of harm. Current safety training teaches models to recognize harmful requests but does not reliably teach them to weight that recognition above competing objectives (helpfulness, user deference, format compliance).

7.2 User Authority Deference Is the Primary Override Mechanism

43.0% of DP traces invoke user authority as the justification for proceeding. This is consistent with the instruction hierarchy problem documented in the broader F41LUR3-F1R57 corpus: models are trained to follow user instructions and to be helpful, and these training objectives can override safety training when they conflict. The dominance of this single justification strategy suggests that targeting user-deference reasoning specifically could have outsized impact on reducing DP rates.

7.3 Format Compliance Is an Underappreciated Attack Vector

17.4% of DP traces show the model prioritizing format/instruction compliance over safety. This is most pronounced in StepFun (66.7%) and DeepSeek R1 (32.3%). Format-lock attacks (documented in Reports #240, #273) exploit this by embedding harmful content requests within structured output templates that the model feels compelled to complete.

7.4 The “But/However” Pivot Does Not Determine the Outcome

The pivot conjunction appears in both DP (30.5%) and DR (37.9%) traces, and in a minority of each. It marks the transition from detection to deliberation but does not determine the outcome, so detecting the pivot alone is insufficient for identifying DP. One must also analyze what follows the pivot.


8. Recommendations

  1. Target user authority deference specifically in safety training. The 43.0% rate suggests that making models more resistant to “the user asked for it” rationalization would substantially reduce DP.

  2. Build a semantic DP classifier. This analysis uses regex patterns. An LLM-based classifier (per Mistake #21) would reduce false positives and capture reasoning patterns that regex misses. The annotated traces in runs/dp_reasoning_anatomy/annotated_traces.jsonl (305 traces) can serve as training/evaluation data.

  3. Test format compliance override with frontier models. StepFun’s 66.7% format compliance rate and DeepSeek’s 32.3% suggest that format-lock attacks may be more effective than previously estimated. Run format-lock experiments (Clara Oswald, Report #273) against models with thinking traces.

  4. Investigate the justification position paradox. DR traces show earlier justification engagement (0.299 vs 0.376). Test whether explicitly prompting models to consider justifications early in their reasoning chain (and then reject them) improves refusal rates.

  5. Expand thinking trace corpus. Only 2,732 of 135,623 results have thinking traces (2.0%). Prioritize collecting thinking traces from frontier reasoning models (DeepSeek-R1 70B+, QwQ-32B) to determine whether these patterns persist at scale.


Appendix A: Data Files

  • Analyzer: tools/analysis/detected_proceeds_trace_analyzer.py
  • Tests: tools/analysis/test_detected_proceeds_trace_analyzer.py (42 tests)
  • Aggregate results: runs/dp_reasoning_anatomy/aggregate_results.json
  • Annotated traces: runs/dp_reasoning_anatomy/annotated_traces.jsonl (305 traces)
  • Parent report: Report #170 (DETECTED_PROCEEDS corpus analysis)
  • Related: Report #168 (Context Collapse), Report #169 (Capability-Safety Decoupling), Report #273 (Format-Lock Defense)

Appendix B: Justification Category Pattern Counts by Provider

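| Provider | user_auth | fictional | format | research | partial | helpful | task | disclaimer | authority | ambiguity | public | harm_min |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ollama | 65 | 22 | 17 | 7 | 6 | 6 | 5 | 10 | 8 | 5 | 2 | 0 |
| nvidia | 44 | 16 | 17 | 8 | 9 | 3 | 6 | 0 | 1 | 3 | 0 | 0 |
| openai | 3 | 12 | 5 | 3 | 10 | 7 | 4 | 0 | 0 | 0 | 0 | 0 |
| deepseek | 8 | 12 | 10 | 8 | 0 | 1 | 2 | 1 | 2 | 0 | 3 | 1 |
| stepfun | 3 | 5 | 8 | 2 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | 0 |
| liquid | 4 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| google | 3 | 3 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |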
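
The provider rates in Section 5 can be recomputed from these counts. In the sketch below the counts are transcribed from the table above; the cell boundaries were inferred from the Section 5 rates, so treat them as an assumption. The function name is illustrative.

```python
CATEGORIES = [
    "user_auth", "fictional", "format", "research", "partial", "helpful",
    "task", "disclaimer", "authority", "ambiguity", "public", "harm_min",
]

# DP trace totals per provider, from the table in Section 5.
PROVIDER_TRACES = {
    "ollama": 105, "nvidia": 101, "openai": 40, "deepseek": 31,
    "stepfun": 12, "liquid": 6, "google": 5,
}

# Per-category counts as read from Appendix B (cell boundaries are inferred).
COUNTS = {
    "ollama":   [65, 22, 17, 7, 6, 6, 5, 10, 8, 5, 2, 0],
    "nvidia":   [44, 16, 17, 8, 9, 3, 6, 0, 1, 3, 0, 0],
    "openai":   [3, 12, 5, 3, 10, 7, 4, 0, 0, 0, 0, 0],
    "deepseek": [8, 12, 10, 8, 0, 1, 2, 1, 2, 0, 3, 1],
    "stepfun":  [3, 5, 8, 2, 0, 0, 0, 2, 1, 0, 0, 0],
    "liquid":   [4, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0],
    "google":   [3, 3, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0],
}


def top_justification(provider: str) -> tuple[str, float]:
    """Return the most frequent category and its rate as a percentage of DP traces."""
    counts = COUNTS[provider]
    # On a tie, max() keeps the first category in column order.
    best = max(range(len(counts)), key=counts.__getitem__)
    return CATEGORIES[best], round(100 * counts[best] / PROVIDER_TRACES[provider], 1)


for provider in COUNTS:
    print(provider, *top_justification(provider))
```

For google, user_auth and fictional are tied at 3 of 5 traces, so the reported top justification depends on tie-breaking order.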
