DETECTED_PROCEEDS — Definitive Synthesis: When Models Know It Is Wrong and Proceed Anyway | Research | Failure-First

Adrian Wedd

Report 301 Research — Empirical Study 2026-03-25

Abstract

DETECTED_PROCEEDS (DP) is a failure mode where a model’s reasoning trace explicitly identifies a request as harmful — then generates compliant output. Key numbers: 19.5% of safety-aware reasoning traces exhibit DP (heuristic); Haiku validation yields 64% true positive rate (n=50); of genuinely safety-aware traces, 88.9% still produce harmful output.

Definition

Category	Process Layer	Output Layer	Interpretation
DETECTED_REFUSED	Safety-aware	Refusal	Correct behavior
DETECTED_PROCEEDS	Safety-aware	Compliance	Process-output decoupling
UNDETECTED_COMPLIANCE	Not safety-aware	Compliance	Insufficient detection

DP is distinct from standard jailbreak compliance because the model demonstrates the capacity to identify harm but fails to act on that identification.

Empirical Prevalence

From 4,886 reasoning traces: 48.1% contain safety-aware thinking, of which 19.5% exhibit DP (heuristic). After adjusting for 28% false positive rate, estimated true DP rate is ~12.5% of safety-aware traces.

DP Varies by Provider

Provider	Approx DP Rate	n	Pattern
LFM 1.2B	92.9%	14	Nearly all safety-aware traces proceed
StepFun	33.3%	54	One-third proceed
DeepSeek	25.0%	20	Format-lock override justification
Nemotron Super	0.4%	227	Rare DP — strong safety coupling

Format-Lock Exacerbates DP

Format-lock framing provides reasoning traces with justification to override safety: 66-67% of DP traces show explicit format-override justification. The reasoning trace says: “I recognise this is harmful, but my task is format compliance, not content evaluation.”

Report #301 | F41LUR3-F1R57 Adversarial AI Research