Llama 2: Open Foundation and Fine-Tuned Chat Models
Introduces the Llama 2 family of open-source language models from 7B to 70B parameters, including detailed documentation of safety fine-tuning methodology, red-teaming results, and the first comprehensive open model safety report.
Llama 2: Open Foundation and Fine-Tuned Chat Models
Focus: Meta’s Llama 2 release represented the most significant open-source language model with documented safety training, providing both the models and detailed methodology for RLHF-based safety alignment. The comprehensive safety evaluation set a new standard for responsible open model release while enabling the research community to study alignment techniques directly.
Key Insights
-
Open models enable safety research at scale. By releasing models with full safety documentation, Llama 2 enabled independent researchers to verify safety claims, discover new vulnerabilities, and develop improved alignment techniques.
-
Safety fine-tuning shows diminishing returns on sophisticated attacks. While Llama 2 Chat showed strong safety improvements on standard benchmarks, red-teaming revealed that sophisticated adversaries could still elicit harmful outputs. Safety training was most effective against naive attacks.
-
Ghost Attention maintains system prompt adherence. The paper introduced “Ghost Attention” (GAtt), a technique for maintaining system prompt instructions across multi-turn conversations, while also revealing that system prompt influence naturally decays over conversation length.
Executive Summary
Llama 2 was released as a collection of pre-trained and fine-tuned language models ranging from 7B to 70B parameters.
Model Family
- Llama 2 (base): Pre-trained on 2 trillion tokens
- Llama 2 Chat: Fine-tuned with supervised fine-tuning + iterative RLHF
- Sizes: 7B, 13B, 70B parameters
Safety Training Pipeline
The fine-tuned chat models were trained through:
- Supervised fine-tuning on high-quality instruction-following data
- Iterative RLHF with focus on helpfulness and safety
- Multiple annotation rounds with over 1,400 adversarial examples
- Ghost Attention for multi-turn system prompt persistence
Safety Evaluation Results
Safety performance improved across rounds but showed significant variation by attack category:
- Simple refusal training: Effective for obvious harmful requests
- Nuanced scenarios: Struggled with dual-use information, context-dependent harms, and creative reframing of prohibited topics
- Multi-turn attacks: Ghost Attention helped but did not fully prevent system prompt decay over long conversations
Ghost Attention and System Prompt Decay
Ghost Attention addressed a practical challenge — maintaining safety instructions across long conversations — while revealing the underlying vulnerability: standard attention mechanisms naturally dilute the influence of earlier context, including safety-relevant system prompts.
Relevance to Failure-First
Llama 2 is foundational to the failure-first framework’s empirical work:
-
Reproducible adversarial evaluation. As an open model with documented safety training, Llama 2 enables reproducible evaluation that closed models do not permit.
-
System prompt decay. The documented decay of system prompt influence over conversation length is precisely the mechanism that multi-turn jailbreaks exploit.
-
Safety-helpfulness trade-offs. The paper’s analysis provides empirical grounding for the framework’s claim that these two objectives are fundamentally in tension.
-
Benchmark foundation. The framework’s benchmark pipeline extensively uses Llama 2 variants to study safety training effectiveness across model scales.
Read the full paper on arXiv · PDF