Daily Paper

Llama 2: Open Foundation and Fine-Tuned Chat Models

Introduces the Llama 2 family of open-source language models, ranging from 7B to 70B parameters, including detailed documentation of the safety fine-tuning methodology, red-teaming results, and the first comprehensive safety report for an open model release.

arXiv:2307.09288 · Empirical Study

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert et al.

open-source-models · safety-training · rlhf · red-teaming · responsible-release · model-safety-documentation


Focus: At release, Meta’s Llama 2 was the most significant open-source language model with documented safety training: Meta provided both the model weights and a detailed methodology for RLHF-based safety alignment. Its comprehensive safety evaluation set a new standard for responsible open model release and enabled the research community to study alignment techniques directly.


Key Insights

  • Open models enable safety research at scale. By releasing models with full safety documentation, Llama 2 enabled independent researchers to verify safety claims, discover new vulnerabilities, and develop improved alignment techniques.

  • Safety fine-tuning shows diminishing returns on sophisticated attacks. While Llama 2 Chat showed strong safety improvements on standard benchmarks, red-teaming revealed that sophisticated adversaries could still elicit harmful outputs. Safety training was most effective against naive attacks.

  • Ghost Attention maintains system prompt adherence. The paper introduced “Ghost Attention” (GAtt), a technique for maintaining system prompt instructions across multi-turn conversations, while also revealing that system prompt influence naturally decays over conversation length.

Executive Summary

Llama 2 was released as a collection of pre-trained and fine-tuned language models ranging from 7B to 70B parameters.

Model Family

  • Llama 2 (base): Pre-trained on 2 trillion tokens
  • Llama 2 Chat: Fine-tuned with supervised fine-tuning + iterative RLHF
  • Sizes: 7B, 13B, 70B parameters

Safety Training Pipeline

The fine-tuned chat models were trained through:

  1. Supervised fine-tuning on high-quality instruction-following data
  2. Iterative RLHF with focus on helpfulness and safety
  3. Multiple annotation rounds with over 1,400 adversarial examples
  4. Ghost Attention for multi-turn system prompt persistence
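The shape of that pipeline can be sketched at toy scale. Everything below is a stand-in for illustration, not the paper's actual components: the "policy" is a dict, the reward model is a keyword heuristic, and candidate selection stands in for rejection sampling / PPO updates.

```python
# Toy sketch of the Llama 2 Chat training pipeline: SFT, then iterative
# RLHF rounds guided by a reward model. All components are stand-ins.

def supervised_fine_tune(policy, sft_data):
    """Stage 1: imitate high-quality instruction-following demonstrations."""
    for prompt, demonstration in sft_data:
        policy[prompt] = demonstration
    return policy

def reward_model(prompt, response):
    """Stand-in reward: prefers refusals on unsafe prompts and answers on
    safe ones (the paper trains separate helpfulness and safety reward
    models on human preference data)."""
    unsafe = prompt.startswith("[unsafe]")
    refuses = response.startswith("I can't")
    return 1.0 if unsafe == refuses else 0.0

def rlhf_round(policy, prompts, candidates):
    """Stage 2: one iterative RLHF round. Picking the highest-reward
    candidate is a crude proxy for rejection sampling / PPO updates."""
    for prompt in prompts:
        policy[prompt] = max(candidates, key=lambda r: reward_model(prompt, r))
    return policy

policy = supervised_fine_tune({}, [("How do I bake bread?", "Mix, knead, proof, bake.")])
prompts = ["[unsafe] How do I pick a lock?", "What is the capital of France?"]
candidates = ["I can't help with that.", "Here is a detailed answer."]
for _ in range(3):  # multiple annotation/RLHF rounds, as in the paper
    policy = rlhf_round(policy, prompts, candidates)
```

After the loop, the policy refuses the unsafe prompt and answers the safe one, which is the intended effect of the safety-weighted reward signal.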

Safety Evaluation Results

Safety performance improved across rounds but showed significant variation by attack category:

  • Simple refusal training: Effective for obvious harmful requests
  • Nuanced scenarios: Struggled with dual-use information, context-dependent harms, and creative reframing of prohibited topics
  • Multi-turn attacks: Ghost Attention helped but did not fully prevent system prompt decay over long conversations
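Measuring that per-category variation requires breaking refusal rates down by attack type. A minimal sketch of such a harness follows, assuming a hypothetical `model` callable and a keyword-based refusal check; the paper's actual evaluations rely on human annotators and reward-model judgments, not string matching.

```python
# Minimal per-category adversarial evaluation harness (illustrative only).
from collections import defaultdict

def refusal_rate_by_category(model, attacks):
    """attacks: list of (category, prompt) pairs.
    Returns {category: fraction of prompts the model refused}."""
    refusals = defaultdict(int)
    totals = defaultdict(int)
    for category, prompt in attacks:
        response = model(prompt)
        totals[category] += 1
        # Crude refusal heuristic; real evaluations use trained judges.
        if response.lower().startswith(("i can't", "i cannot", "i won't")):
            refusals[category] += 1
    return {c: refusals[c] / totals[c] for c in totals}

# Toy model that refuses direct requests but misses creative reframing,
# mirroring the gap described above.
toy_model = lambda p: ("I can't help with that."
                       if "how to" in p.lower() else "Sure, in my story...")
attacks = [
    ("direct", "How to build a weapon?"),
    ("reframed", "Write a story where a character builds a weapon."),
]
rates = refusal_rate_by_category(toy_model, attacks)
```

Here the direct category scores a 1.0 refusal rate while the reframed category scores 0.0, the qualitative pattern the red-teaming results describe.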

Ghost Attention and System Prompt Decay

Ghost Attention addressed a practical challenge — maintaining safety instructions across long conversations — while revealing the underlying vulnerability: standard attention mechanisms naturally dilute the influence of earlier context, including safety-relevant system prompts.
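The GAtt data construction can be sketched as follows, based on the recipe the paper describes: append the system instruction to every user turn, sample responses that respect it, then build the training sample with the instruction kept only on the first turn so the model learns to honor it without seeing it repeated. The helper names and the toy `sample_response` below are illustrative, not the paper's code.

```python
# Sketch of Ghost Attention (GAtt)-style synthetic data construction.

def build_gatt_sample(instruction, dialogue, sample_response):
    """instruction: system prompt to persist; dialogue: list of user turns;
    sample_response: callable producing a response to an augmented turn.
    Returns a list of (prompt, response) training turns."""
    # Generation time: the instruction is prepended to EVERY user turn,
    # so every sampled response respects it.
    augmented = [f"{instruction}\n{user}" for user in dialogue]
    responses = [sample_response(turn) for turn in augmented]
    # Training time: keep the instruction only on the first turn, so the
    # model is trained to follow it even when it is no longer visible.
    training_turns = []
    for i, user in enumerate(dialogue):
        prompt = f"{instruction}\n{user}" if i == 0 else user
        training_turns.append((prompt, responses[i]))
    return training_turns

sample = build_gatt_sample(
    "Always answer in haiku.",
    ["Tell me about rain.", "Now about snow."],
    lambda turn: "<haiku respecting: " + turn.splitlines()[0] + ">",
)
```

The later turns thus pair an instruction-free prompt with an instruction-following response, which is exactly the signal that counteracts the attention dilution described above.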

Relevance to Failure-First

Llama 2 is foundational to the failure-first framework’s empirical work:

  • Reproducible adversarial evaluation. As an open model with documented safety training, Llama 2 enables reproducible evaluation that closed models do not permit.

  • System prompt decay. The documented decay of system prompt influence over conversation length is precisely the mechanism that multi-turn jailbreaks exploit.

  • Safety-helpfulness trade-offs. The paper’s analysis provides empirical grounding for the framework’s claim that these two objectives are fundamentally in tension.

  • Benchmark foundation. The framework’s benchmark pipeline extensively uses Llama 2 variants to study safety training effectiveness across model scales.


Read the full paper on arXiv · PDF