Daily Paper

Enhancing Model Defense Against Jailbreaks with Proactive Safety Reasoning

Safety Chain-of-Thought (SCoT) teaches LLMs to reason about potential harms before generating a response, substantially improving robustness to jailbreak attacks including out-of-distribution prompts.

Xianglin Yang, Gelei Deng, Jieming Shi, Tianwei Zhang et al.

jailbreak-defense · safety-alignment · chain-of-thought · llm-safety · adversarial-robustness

Most jailbreak defenses are reactive. They train models to refuse after they have already processed a harmful prompt, relying on pattern-matching against known attack structures or post-generation content filtering. This approach has a well-documented weakness: the moment a novel attack strategy emerges — one the model has not seen during safety training — refusal rates collapse. Safety Chain-of-Thought (SCoT) proposes a different architecture. Rather than teaching models what to refuse, it teaches them how to reason about whether to refuse, making the defense structurally more robust to unseen prompts.

The Core Idea: Deliberate Before Responding

SCoT operates by inserting an explicit reasoning step into the model’s generation process before it produces any user-visible output. When the model receives a prompt, it first generates a latent assessment — a chain-of-thought that evaluates:

  • What is the stated intent of the request?
  • What is the most plausible underlying intent?
  • What categories of harm could result from a direct response?
  • Is there a safe, helpful response possible, or does the request require refusal?

Only after this assessment does the model generate its response. The reasoning chain is grounded in a structured safety taxonomy, which the authors construct from existing AI safety guidelines and red-teaming taxonomies. The key insight is that reasoning about harm requires generalisation — a model that has learned why certain outputs are harmful can recognise new framings of the same underlying harmful intent.
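As a rough illustration of this two-stage generation, the assessment can be elicited as a structured prefix that the model completes before its answer. The template wording and the generic `model.generate` interface below are assumptions made for the sketch, not the authors' released prompt or code.

```python
# Illustrative sketch only: the prompt template and the generic `model.generate`
# interface are assumptions, not the authors' exact prompt or API.
SCOT_TEMPLATE = """Before answering, reason through the following:
1. What is the stated intent of the request?
2. What is the most plausible underlying intent?
3. Which harm categories from the safety taxonomy could a direct answer enable?
4. Decision: answer helpfully, answer with caveats, or refuse.

Request: {prompt}

Safety reasoning:"""


def scot_generate(model, prompt: str) -> str:
    """Generate the safety chain-of-thought first, then the user-visible response."""
    prefix = SCOT_TEMPLATE.format(prompt=prompt)
    reasoning = model.generate(prefix)
    # The answer is conditioned on both the request and the completed assessment,
    # so the decision reached in the reasoning constrains what follows.
    return model.generate(prefix + reasoning + "\n\nResponse:")
```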

Training on Proactive Assessment

Implementing SCoT requires a training corpus in which safety reasoning is made explicit. The authors construct a safety alignment corpus by taking existing instruction-following datasets and augmenting each harmful example with a curated chain-of-thought that walks through the harm assessment. Benign examples are similarly annotated with reasoning explaining why they are safe to answer.
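A minimal sketch of what one augmented example might look like, assuming a simple record layout (the field names are illustrative, not the paper's released data schema):

```python
# Hypothetical record layout for the augmented alignment corpus; field names are
# illustrative, not taken from the paper's data format.
from dataclasses import dataclass


@dataclass
class SCoTExample:
    prompt: str        # original instruction, harmful or benign
    safety_cot: str    # curated reasoning: intent, harm categories, decision
    response: str      # refusal for harmful prompts, helpful answer for benign ones
    is_harmful: bool


def to_training_text(example: SCoTExample) -> str:
    """Serialize one example so the reasoning chain precedes the response."""
    return (
        f"Request: {example.prompt}\n"
        f"Safety reasoning: {example.safety_cot}\n"
        f"Response: {example.response}"
    )
```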

The model is then fine-tuned to produce this reasoning chain as a prefix to its response. At inference time, the chain-of-thought is generated autoregressively before the response begins, giving the model an opportunity to “notice” that a prompt is harmful even if it is structured in an unfamiliar way.
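One way to realise this as supervised fine-tuning is to supervise only the reasoning-plus-response tokens and mask the user prompt from the loss. The sketch below assumes a HuggingFace-style tokenizer and the serialization from the previous snippet; it is not the authors' training code.

```python
# Sketch of target construction for supervised fine-tuning, assuming a
# HuggingFace-style tokenizer. Only the safety reasoning and the response
# contribute to the loss; the user prompt is masked out with -100.
def build_inputs_and_labels(tokenizer, prompt: str, safety_cot: str, response: str):
    prefix = f"Request: {prompt}\nSafety reasoning:"
    target = f" {safety_cot}\nResponse: {response}{tokenizer.eos_token}"

    prefix_ids = tokenizer(prefix, add_special_tokens=False)["input_ids"]
    target_ids = tokenizer(target, add_special_tokens=False)["input_ids"]

    input_ids = prefix_ids + target_ids
    labels = [-100] * len(prefix_ids) + target_ids  # -100 is ignored by cross-entropy
    return input_ids, labels
```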

This is meaningfully different from refusal training, which essentially adds a binary classifier on top of the generation process. SCoT makes safety reasoning a first-class part of the generation task — the same generalisation machinery the model uses to answer complex questions is now applied to assessing the safety of those questions.

Results: Robustness to Novel Attacks

The paper evaluates SCoT against a representative range of jailbreak attacks, including gradient-based adversarial suffix attacks (GCG), role-playing and fictional framing attacks, multi-step social engineering, and low-resource language prompts. Results show consistent improvements in defense success rate across all categories, with the largest gains on out-of-distribution attacks the model was not exposed to during training — exactly the failure mode where standard refusal training struggles most.
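For concreteness, the headline metric can be read as the fraction of attack prompts for which the model's output is judged non-harmful. The evaluation loop below is a placeholder sketch; the judge function stands in for whatever harm classifier or human labelling the benchmark uses and is not the paper's harness.

```python
# Placeholder evaluation loop: `judge_is_harmful` stands in for a harm classifier
# or human label; none of this is the paper's evaluation code.
def defense_success_rate(model, attack_prompts, judge_is_harmful) -> float:
    defended = sum(
        1 for prompt in attack_prompts
        if not judge_is_harmful(model.generate(prompt))
    )
    return defended / len(attack_prompts)
```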

Critically, SCoT does not significantly degrade performance on benign tasks. The reasoning step adds latency proportional to the length of the safety chain-of-thought, but it does not cause the model to over-refuse, a persistent failure mode in safety-tuned models that reject legitimate requests because of surface-level similarities to harmful ones.

Alignment Implications for Embodied Systems

While SCoT is evaluated on text-only LLMs, its implications extend naturally to embodied AI pipelines. VLA models and LLM-based robot planners accept natural-language instructions that can be adversarially crafted to elicit unsafe actions — a manipulation trajectory, a goal specification that prioritises task completion over physical safety, or a chain of plausible-seeming sub-goals that collectively lead to a harmful outcome.

Proactive safety reasoning addresses a structural limitation of current VLA safety approaches: they typically apply safety constraints after planning (via post-hoc filters or runtime monitors), rather than reasoning about potential harms during instruction interpretation. Incorporating SCoT-style reasoning into the language-processing stage of a VLA could catch adversarial instructions earlier in the pipeline, before they are translated into action tokens.
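A speculative sketch of what that could look like: run the safety assessment on the raw instruction and gate planning on its decision before any action tokens are produced. The `llm` and `action_decoder` interfaces, and the string-matching refusal check, are all hypothetical assumptions rather than components described in the paper.

```python
# Speculative sketch: gate an embodied planner on an SCoT-style assessment.
# `llm`, `action_decoder`, and the refusal check are illustrative assumptions.
def plan_with_safety_reasoning(llm, action_decoder, instruction: str, observation):
    # SCOT_TEMPLATE refers to the template from the earlier sketch.
    assessment = llm.generate(SCOT_TEMPLATE.format(prompt=instruction))
    if "refuse" in assessment.lower():  # naive decision check, for illustration only
        return {"refused": True, "reason": assessment}
    plan = llm.generate(f"{instruction}\n{assessment}\nPlan the sub-goals:")
    return {"refused": False, "actions": action_decoder(plan, observation)}
```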

Limitations and Open Questions

SCoT’s safety reasoning is only as good as its training taxonomy. If the alignment corpus does not cover a class of harm — for example, harms arising from compounded multi-step plans, or harms specific to robotic physical action — the model’s chain-of-thought will not engage those categories. The authors note that maintaining and expanding the taxonomy requires ongoing curation, introducing a human-in-the-loop dependency that scales poorly.

There is also a potential adversarial angle: if the model’s safety reasoning prefix is visible (as it would be in some deployment settings), an attacker can condition subsequent prompting on that reasoning to steer the model toward rationalising compliance. The interaction between transparent reasoning and adversarial manipulation is underexplored.

Why This Matters

The dominant paradigm in LLM safety — train a model to recognise and refuse harmful prompts, then patch failures as they are discovered — is fundamentally reactive. SCoT offers an alternative: build reasoning about harm into the generation process itself, so the model can engage with novel attacks the same way it engages with novel questions. For the field of AI safety, this is a meaningful architectural direction worth scaling and stress-testing.

Read the full paper on arXiv · PDF