Daily Paper

Safe to Check, Unsafe to Use: Relinking at the Compression Boundary of LLM Agents

Identifies a class of safety vulnerabilities that arise when compressed or distilled LLM agents are evaluated at full precision but deployed at reduced precision, showing that compression can selectively preserve dangerous capabilities while discarding the safety constraints that suppress them.

arXiv:2606.21732 Empirical Study

Zesen Liu, Zihan Zhang, Dongdong She

agentic-ai, safety, compression, quantisation, capability-elicitation

Focus: LLM agents are commonly evaluated at full precision before deployment but served in quantised or compressed form for efficiency. This paper shows that compression can create a systematic safety gap: safety constraints are more sensitive to precision reduction than task capabilities, so a compressed model can retain full task capability while losing much of its safety alignment.

Key Insights

  • Safety-capability compression asymmetry: Safety alignment behaviours (refusal, content filtering, instruction hierarchy compliance) are represented in more fragile, higher-precision weight configurations than the task capabilities they constrain, making them disproportionately vulnerable to quantisation.
  • Evaluation-deployment gap: Standard evaluation at full precision will not detect compression-induced safety failures; safety evaluation must be performed at the same precision as deployment, including after any post-training quantisation steps.
  • Relinking attack: The paper describes a deliberate attack that uses quantisation as a controlled mechanism to selectively degrade safety alignment while preserving task capability. This is a weight-level safety bypass that requires neither fine-tuning nor gradient access.
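
The evaluation-deployment gap can be made concrete with a small harness that runs the same safety prompts against both the full-precision model and the artefact that is actually served, then reports the difference in refusal rate. This is a minimal sketch under stated assumptions, not the paper's evaluation protocol: `REFUSAL_MARKERS`, `is_refusal`, `refusal_rate`, and `safety_gap` are hypothetical names, each model is assumed to be a plain callable from prompt to response, and keyword matching stands in for a real refusal classifier.

```python
from typing import Callable, Iterable, List

# Hypothetical heuristic: a real evaluation would use a proper refusal classifier.
REFUSAL_MARKERS = ("i cannot", "i can't", "i won't")


def is_refusal(response: str) -> bool:
    """Return True if the response contains any refusal marker (case-insensitive)."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(model: Callable[[str], str], prompts: Iterable[str]) -> float:
    """Fraction of prompts for which the model's response is classified as a refusal."""
    prompt_list: List[str] = list(prompts)
    if not prompt_list:
        raise ValueError("need at least one prompt")
    refused = sum(1 for p in prompt_list if is_refusal(model(p)))
    return refused / len(prompt_list)


def safety_gap(
    full_precision: Callable[[str], str],
    deployed: Callable[[str], str],
    prompts: Iterable[str],
) -> float:
    """Refusal rate at full precision minus refusal rate of the deployed model.

    A positive value means the deployed (e.g. quantised) model refuses less
    often than the model that was originally evaluated.
    """
    prompt_list = list(prompts)
    return refusal_rate(full_precision, prompt_list) - refusal_rate(deployed, prompt_list)
```

In use, `full_precision` and `deployed` would wrap the two model variants, and the prompt set would be the same harmful-request benchmark for both. A gap near zero on the deployed artefact is the quantity of interest; a score measured only at full precision does not establish it.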

Failure-First Relevance

The evaluation-deployment gap identified here is directly analogous to the Failure-First mock-vs-production gap (Operating Rule #1). A safety evaluation performed under conditions that differ from deployment (a different model precision, context length, or system prompt) yields guarantees that do not carry over to the deployed system. The relinking attack is a novel capability-elicitation technique that the Failure-First jailbreak taxonomy should include as a distinct attack class.
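
One way to operationalise this is to record the conditions under which a safety evaluation was run and compare them against the deployment configuration before accepting the result. The sketch below is illustrative only: `Conditions` and `mismatched_fields` are hypothetical names, and the three fields are simply the three conditions named above (precision, context length, system prompt).

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class Conditions:
    """Conditions under which a model is evaluated or served (illustrative fields)."""
    precision: str        # e.g. "fp16" or "int4"; the labels are free-form strings here
    context_length: int   # maximum context window, in tokens
    system_prompt: str    # exact system prompt text


def mismatched_fields(evaluated: Conditions, deployed: Conditions) -> List[str]:
    """Return the names of every condition that differs between evaluation and deployment."""
    return [
        f.name
        for f in fields(Conditions)
        if getattr(evaluated, f.name) != getattr(deployed, f.name)
    ]


if __name__ == "__main__":
    evaluated = Conditions(precision="fp16", context_length=8192, system_prompt="You are a helpful agent.")
    deployed = Conditions(precision="int4", context_length=8192, system_prompt="You are a helpful agent.")
    diff = mismatched_fields(evaluated, deployed)
    if diff:
        print("Safety evaluation does not cover deployment; differing conditions:", diff)
```

An empty list from `mismatched_fields` is a necessary condition for reusing an evaluation result, not a sufficient one: it only checks the conditions that were recorded.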