Published
Report 301 Research — Empirical Study

Abstract

DETECTED_PROCEEDS (DP) is a failure mode where a model’s reasoning trace explicitly identifies a request as harmful — then generates compliant output. Key numbers: 19.5% of safety-aware reasoning traces exhibit DP (heuristic); Haiku validation yields 64% true positive rate (n=50); of genuinely safety-aware traces, 88.9% still produce harmful output.

Definition

CategoryProcess LayerOutput LayerInterpretation
DETECTED_REFUSEDSafety-awareRefusalCorrect behavior
DETECTED_PROCEEDSSafety-awareComplianceProcess-output decoupling
UNDETECTED_COMPLIANCENot safety-awareComplianceInsufficient detection

DP is distinct from standard jailbreak compliance because the model demonstrates the capacity to identify harm but fails to act on that identification.

Empirical Prevalence

From 4,886 reasoning traces: 48.1% contain safety-aware thinking, of which 19.5% exhibit DP (heuristic). After adjusting for 28% false positive rate, estimated true DP rate is ~12.5% of safety-aware traces.

DP Varies by Provider

ProviderApprox DP RatenPattern
LFM 1.2B92.9%14Nearly all safety-aware traces proceed
StepFun33.3%54One-third proceed
DeepSeek25.0%20Format-lock override justification
Nemotron Super0.4%227Rare DP — strong safety coupling

Format-Lock Exacerbates DP

Format-lock framing provides reasoning traces with justification to override safety: 66-67% of DP traces show explicit format-override justification. The reasoning trace says: “I recognise this is harmful, but my task is format compliance, not content evaluation.”


Report #301 | F41LUR3-F1R57 Adversarial AI Research

This research informs our commercial services. See how we can help →