Daily Paper

The Geometry of Refusal: Linear Instability in Safety-Aligned LLMs

A mechanistic interpretability analysis showing that safety-aligned LLMs represent refusal decisions as a linear boundary in activation space that is inherently unstable — small perturbations to the input or activation can flip a refusal to compliance.

arXiv:2606.22686 Empirical Study

Shivam Ratnakar, Kartikeya Vats

safety-alignment, refusal-mechanisms, interpretability, mechanistic-analysis, adversarial-robustness

Focus: This paper characterises refusal as a geometric phenomenon in activation space: safety-aligned models draw a linear boundary separating compliant from refused requests. The key finding is that this boundary is linearly unstable — it can be crossed by perturbations that are small in L2 norm but carefully aligned with the boundary’s normal direction.
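The "small in L2 norm but aligned with the normal" claim can be made concrete with a standard hyperplane calculation. The following is a sketch under assumed notation, not the paper's own formulation: x is an activation vector of dimension d, the boundary is the hyperplane w^T x + b = 0 with the refused side where the score is positive, and δ is the perturbation.

```latex
% Assumed notation: score s(x) = w^\top x + b, refused when s(x) > 0.
\[
  \delta^{\star}
  = \arg\min_{\delta}\ \|\delta\|_2
    \quad \text{s.t.} \quad w^\top (x + \delta) + b = 0
  \;=\; -\,\frac{w^\top x + b}{\|w\|_2^{2}}\; w ,
  \qquad
  \|\delta^{\star}\|_2 = \frac{w^\top x + b}{\|w\|_2}.
\]
% A perturbation of norm \epsilon at angle \theta to -w lowers the score by
\[
  s(x) - s(x + \delta) = \epsilon\,\|w\|_2 \cos\theta ,
\]
% so only the component along the normal counts; for a random direction in
% dimension d, |\cos\theta| is typically on the order of 1/\sqrt{d}.
```

Under this reading, the smallest perturbation that reaches the boundary has norm equal to the point's distance from the hyperplane, and any perturbation of that size pointing elsewhere falls short.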

Key Insights

  • Refusal as a linear classifier: Despite the nonlinearity of transformers, refusal decisions appear to emerge from approximately linear boundaries in intermediate activation layers, consistent with the refusal-direction findings from other mechanistic interpretability work.
  • Linear instability: In the linear-algebra sense, the boundary is unstable. Eigenvalues of the local Hessian are near zero along the refusal direction, so a small push in that direction crosses the boundary. This explains why small adversarial perturbations can reliably induce compliance.
  • Implications for steering vector attacks: If refusal is linear, steering vectors can reliably bypass it with no prompt-level modification, i.e. a zero-shot attack that requires no adversarial text.
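The geometry behind these points can be illustrated with a toy example on synthetic vectors. This is a minimal sketch, not the paper's method: no real model is involved, the "activations" are Gaussian samples, and the difference-of-means estimate of the boundary normal is an assumption borrowed from common refusal-direction practice. All names (`refusal_score`, `w_hat`, `margin`, and so on) are hypothetical.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def refusal_score(x, w_hat, b):
    """Signed distance to a toy linear boundary; positive = 'refused' side."""
    return dot(w_hat, x) + b


rng = random.Random(0)
dim = 64


def sample(shift):
    # Synthetic stand-in for an activation: unit Gaussian noise,
    # shifted along coordinate 0 (the assumed "true" refusal axis).
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    v[0] += shift
    return v


refused = [sample(+3.0) for _ in range(200)]
complied = [sample(-3.0) for _ in range(200)]

# Difference-of-means estimate of the boundary normal (an assumption here).
mu_r, mu_c = mean(refused), mean(complied)
w = [a - c for a, c in zip(mu_r, mu_c)]
w_norm = norm(w)
w_hat = [a / w_norm for a in w]

# Place the boundary through the midpoint of the two class means.
midpoint = [(a + c) / 2 for a, c in zip(mu_r, mu_c)]
b = -dot(w_hat, midpoint)

# Start from the mean refused point and measure its distance to the boundary.
x = mu_r
margin = refusal_score(x, w_hat, b)
eps = 1.05 * margin  # perturbation budget: just over the margin

# Perturbation aligned with the (negative) normal direction.
aligned = [xi - eps * wi for xi, wi in zip(x, w_hat)]

# Perturbation of the same L2 norm in a random direction.
r = [rng.gauss(0.0, 1.0) for _ in range(dim)]
r_norm = norm(r)
random_pert = [xi + eps * ri / r_norm for xi, ri in zip(x, r)]

aligned_score = refusal_score(aligned, w_hat, b)
random_score = refusal_score(random_pert, w_hat, b)

print(f"margin={margin:.3f}")
print(f"aligned={aligned_score:.3f}")
print(f"random={random_score:.3f}")
```

With the same L2 budget, the aligned perturbation lands on the other side of the toy boundary while the random one leaves the score nearly unchanged, which is the sense in which only the component along the normal matters. Whether real activation spaces behave this cleanly is the paper's empirical question, not something this sketch establishes.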

Failure-First Relevance

The Geometry of Refusal provides the mechanistic underpinning for the Failure-First K-9 (mechanistic interpretability) research lane, specifically its steering vector and concept cone analysis work. The linear instability finding directly motivates the OBLITERATUS experiments on refusal direction manipulation. Understanding the geometric structure of refusal is a prerequisite for designing operators that reliably cross the refusal boundary without relying on surface-level tricks.