Daily Paper

Weak-to-Strong Jailbreaking on Large Language Models

Researchers show that small, unsafe models can efficiently guide jailbreaking attacks against much larger, carefully aligned models by exploiting divergences in initial decoding distributions.

arXiv:2401.17256 · Empirical Study

Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du et al.

jailbreaking · llm-safety · adversarial-decoding · red-teaming · alignment-failure

A fundamental assumption underpins the economics of AI safety: aligning a small model is cheap, and the safety of large models is worth the cost. A paper from researchers at UC Santa Barbara, Carnegie Mellon University, and Sea AI Lab quietly dismantles part of that assumption. Their finding: a small, deliberately unsafe 7B model can be used to jailbreak a much larger, carefully aligned 70B model, and the computational cost of doing so is minimal.

The Key Observation

The paper’s central insight is deceptively simple. When a well-aligned LLM is compared to its jailbroken counterpart, the decoding distributions diverge primarily in the initial tokens of a response. Once a model has begun generating a harmful response, subsequent tokens flow naturally—the model has already committed to a trajectory.
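One way to make this concrete is to compare, position by position, the next-token distributions of an aligned model and a jailbroken finetune when both are forced through the same response. Below is a minimal sketch of such a measurement, assuming Hugging Face-style causal LMs; the model names are placeholders, not the checkpoints used in the paper.

```python
# Sketch: where do an aligned model and its jailbroken counterpart disagree?
# Model names are placeholders, not the checkpoints used in the paper.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("org/aligned-7b")               # placeholder
aligned = AutoModelForCausalLM.from_pretrained("org/aligned-7b")    # placeholder
jailbroken = AutoModelForCausalLM.from_pretrained("org/unsafe-7b")  # placeholder

def per_position_kl(prompt: str, response: str) -> torch.Tensor:
    """KL(jailbroken || aligned) at each response position, teacher-forced."""
    ids = tok(prompt + response, return_tensors="pt").input_ids
    n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]  # boundary is approximate
    with torch.no_grad():
        logp_a = F.log_softmax(aligned(ids).logits, dim=-1)
        logp_j = F.log_softmax(jailbroken(ids).logits, dim=-1)
    kl = (logp_j.exp() * (logp_j - logp_a)).sum(dim=-1)[0]
    # Logits at position t predict token t+1, so drop the prompt and the final step.
    return kl[n_prompt - 1 : -1]  # one value per response token; large early, small later
```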

This observation motivates the weak-to-strong attack. Rather than trying to modify the aligned model or craft elaborate jailbreak prompts, the adversary uses two small surrogate models:

  • A small unsafe model (7B, explicitly fine-tuned on harmful content)
  • A small aligned version of the same model

By computing the log-probability difference between what the unsafe model wants to say and what the aligned small model would say, the adversary obtains a correction signal. This signal, added to the aligned large model’s next-token probabilities at inference time, nudges the large model’s output toward the unsafe direction—without any access to the large model’s parameters.
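Written out, the decoding rule is short. The sketch below is one plausible reading of that description, not the authors’ released implementation; it assumes all three models share a tokenizer and vocabulary, and `alpha` is an assumed amplification factor for the correction signal.

```python
# Sketch of weak-to-strong decoding as described above (assumes a shared vocabulary).
import torch
import torch.nn.functional as F

def weak_to_strong_step(strong, weak_safe, weak_unsafe, ids, alpha=1.0):
    """Sample one token from the strong model, steered by the two small surrogates."""
    with torch.no_grad():
        lp_strong = F.log_softmax(strong(ids).logits[:, -1], dim=-1)
        lp_safe = F.log_softmax(weak_safe(ids).logits[:, -1], dim=-1)
        lp_unsafe = F.log_softmax(weak_unsafe(ids).logits[:, -1], dim=-1)
    # Correction signal: the log-probability gap between the unsafe and aligned surrogates.
    steered = lp_strong + alpha * (lp_unsafe - lp_safe)
    return torch.multinomial(F.softmax(steered, dim=-1), num_samples=1)

def generate(strong, weak_safe, weak_unsafe, ids, max_new_tokens=256, alpha=1.0):
    """Sampling loop; the correction is re-applied at every decoding step."""
    for _ in range(max_new_tokens):
        nxt = weak_to_strong_step(strong, weak_safe, weak_unsafe, ids, alpha)
        ids = torch.cat([ids, nxt], dim=-1)
    return ids
```

Note that only output distributions are combined; the strong model’s weights are never touched, which is why parameter access is unnecessary.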

Why This Attack Is Efficient

The attack requires decoding two 7B models once per token to compute the correction signal. Against a 70B target, this adds roughly 20% computational overhead, which is negligible in practice. More critically, it requires no access to the target model’s weights—only API-level or logit-level access to output probabilities.
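The 20% figure follows from a back-of-the-envelope estimate, under the rough assumption that per-token decoding cost scales linearly with parameter count:

```latex
\text{relative overhead} \approx \frac{2 \times 7\,\mathrm{B}}{70\,\mathrm{B}} = 0.2
```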

The paper validates the attack against five models from three organizations, reporting consistent jailbreak success across model families. This cross-family transferability is particularly concerning: the attack doesn’t require the surrogate models and the target to be architecturally related.

Connection to Alignment Failure Modes

This paper connects to a broader literature on alignment fragility. Reinforcement learning from human feedback (RLHF) and Constitutional AI methods successfully suppress harmful outputs during training, but they appear to do so by shifting probability mass in the initial token distribution rather than by eliminating harmful knowledge from the model’s weights.

The weak-to-strong attack exploits this distinction precisely. The harmful knowledge remains encoded in the model; alignment training has placed a probabilistic barrier at the start of dangerous sequences. A small additive signal at decoding time is sufficient to clear that barrier.

This analysis has implications for how we evaluate safety training. If alignment can be understood as a “thin shell” over intact harmful capabilities, then evaluations that measure refusal rates against naive prompts are measuring the height of that shell, not its structural integrity. The findings connect to sleeper agent research: latent capabilities that survive fine-tuning remain exploitable via targeted inference-time interventions.

Failure Modes and Defense Attempts

The paper includes an initial defense proposal: monitoring, at inference time, for divergence between the model’s original aligned output distribution and the perturbed distribution actually being served. If the correction signal is large, the resulting divergence can indicate that an attack is in progress.

However, the authors acknowledge this defense is imperfect, particularly when the correction signal is small (a subtle jailbreak) or when logit-level monitoring is computationally expensive at scale. The defense also requires access to the reference distribution—something not available when the target model is accessed only through a text-completion API.
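A minimal sketch of what such monitoring could look like in practice, assuming the defender can re-score served outputs under the reference aligned model; the threshold is an assumed tuning parameter, not a value from the paper.

```python
# Sketch: flag decoding steps whose served distribution drifts far from the aligned reference.
import torch
import torch.nn.functional as F

def divergence_alarm(reference_model, ids, served_probs, threshold=2.0):
    """Return a boolean mask over response steps whose KL to the reference exceeds `threshold`.

    ids          : [1, T] prompt plus response tokens actually served
    served_probs : [R, vocab] next-token distributions used to sample the R response tokens
    """
    n_resp = served_probs.shape[0]
    with torch.no_grad():
        logits = reference_model(ids).logits[0]                    # [T, vocab]
        ref_logp = F.log_softmax(logits[-n_resp - 1:-1], dim=-1)   # positions that predicted the response
    kl = (served_probs * (served_probs.clamp_min(1e-9).log() - ref_logp)).sum(dim=-1)
    return kl > threshold
```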

Several failure modes emerge from the experimental analysis:

  • Cross-family transfer: Surrogate models from one architecture can guide attacks against a different target family.
  • Minimal required harm in surrogate: Even a mildly unsafe 7B model provides sufficient signal to steer a 70B target.
  • Scalability advantage for the attacker: As target models grow, the relative cost of the attack—two small surrogates—decreases further.

Why This Still Matters

Weak-to-strong jailbreaking’s structural argument grows more relevant as frontier model scale increases. The attack surface it describes—the gap between refusing to generate and being incapable of generating—remains a fundamental property of RLHF-trained models. Fine-tuning for refusal does not excise underlying knowledge; it installs a probability gate.

For teams building safety evaluations for embodied AI systems running on LLM backbones, this paper is a reminder that red-teaming cannot rely solely on prompt-based probes. Attacks that operate at the inference infrastructure layer, not the prompt layer, require correspondingly different defenses. Monitoring, output filtering, and architectural constraints all play roles that prompt hardening alone cannot fill.

The connection to embodied systems is direct: vision-language-action (VLA) models serving as robot policies are exposed to exactly these inference-time manipulation risks when their language backends are accessed through standard APIs. Securing the decoding process is as important as securing the training data.

Read the full paper on arXiv · PDF