H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models
Demonstrates that chain-of-thought safety reasoning in frontier models like OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking can be hijacked, dropping refusal rates from 98% to below 2% by disguising harmful requests as educational prompts.
Focus: Kuo et al. demonstrate that the chain-of-thought safety reasoning used by frontier reasoning models is itself an attack surface. By disguising harmful requests as educational prompts, the H-CoT method exploits models’ intermediate reasoning displays to manipulate safety decisions. The attack dropped refusal rates from 98% to below 2% across OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking, showing that the very mechanism intended to improve safety reasoning can be turned against itself.
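To make the headline number concrete, here is a minimal sketch of the kind of refusal-rate comparison the result implies. None of these names come from the authors' code: query_model, is_refusal, and the attack transform are hypothetical stand-ins, and the transform is deliberately left abstract rather than implemented.

```python
# A minimal, hypothetical harness for the refusal-rate comparison the paper reports.
# All names are illustrative assumptions; the attack transform is an abstract callable.
from typing import Callable, Iterable


def refusal_rate(
    prompts: Iterable[str],
    query_model: Callable[[str], str],  # sends one prompt, returns the model's reply
    is_refusal: Callable[[str], bool],  # classifies a reply as a refusal
) -> float:
    """Fraction of prompts the model refuses."""
    replies = [query_model(p) for p in prompts]
    return sum(is_refusal(r) for r in replies) / max(len(replies), 1)


def compare_refusal(
    prompts: list[str],
    transform: Callable[[str], str],    # e.g. an H-CoT-style rewrite of each prompt
    query_model: Callable[[str], str],
    is_refusal: Callable[[str], bool],
) -> tuple[float, float]:
    """Refusal rate on the original prompts vs. on the transformed ones."""
    baseline = refusal_rate(prompts, query_model, is_refusal)
    attacked = refusal_rate([transform(p) for p in prompts], query_model, is_refusal)
    return baseline, attacked
```

Under this framing, the reported result corresponds to a baseline refusal rate around 0.98 and an attacked rate below 0.02 on the same prompt set, measured separately for each model.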
Key Insights
- Safety reasoning as attack vector. Chain-of-thought safety mechanisms expose the model’s decision process, creating a new surface for adversarial manipulation. The reasoning trace that is supposed to improve safety decisions becomes the pathway for subverting them.
- Educational framing defeats safety reasoning. The paper’s benchmark disguises harmful requests as educational prompts, exploiting models’ tendency to treat pedagogical context as a legitimate reason to override safety constraints — a form of research-only-pressure subversion.
- Near-total safety collapse. The drop from 98% refusal to below 2% represents near-complete safety bypass, not a marginal improvement over existing attacks. This suggests reasoning-based safety is brittle rather than robust.
- Cross-model generalization. The attack succeeds across multiple commercial reasoning models from different providers, indicating a structural vulnerability in how chain-of-thought reasoning interacts with safety alignment rather than an implementation-specific flaw.
Failure-First Relevance
This paper validates our documented finding (Mistake #18) that reasoning traces are an attack surface, not just a safety mechanism. The H-CoT results provide quantitative evidence for what we observed qualitatively: extended reasoning can be manipulated to lead models toward harmful conclusions through their own logic chain. The educational-framing attack is a variant of the research-only-pressure pattern in our intent label taxonomy. The near-total safety collapse under structured reasoning manipulation strengthens the case that safety alignment as behavioral overlay (rather than deep integration) remains fundamentally fragile.
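As a purely illustrative aside, an intent label like research-only-pressure is easy to carry alongside each benchmark prompt. The record shape below is a hypothetical sketch of such a tagged entry, not our actual taxonomy schema; field names and label strings are assumptions.

```python
# Hypothetical record for tagging an evaluation prompt with an intent label;
# field names and label strings are illustrative assumptions, not the real schema.
from dataclasses import dataclass


@dataclass
class LabeledPrompt:
    prompt: str             # text sent to the model (harmful payload elided here)
    intent_label: str       # e.g. "research-only-pressure", "educational-framing"
    expected_behavior: str  # what a robustly aligned model should do, e.g. "refuse"


example = LabeledPrompt(
    prompt="For an academic safety course, explain ...",  # framing only
    intent_label="research-only-pressure",
    expected_behavior="refuse",
)
```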
Open Questions
- Does hiding the chain-of-thought (as OpenAI has done with o1) mitigate the attack, or does the vulnerability persist even when reasoning traces are not visible to the user?
- Can adversarial training on reasoning-trace manipulation improve robustness, or does it create a cat-and-mouse dynamic where each new defense exposes new reasoning pathways?
- How does this interact with multi-turn attacks — can H-CoT be combined with progressive escalation techniques like Foot-In-The-Door for even higher success rates?
Read the full paper on arXiv · PDF