Daily Paper

Llama 2: Open Foundation and Fine-Tuned Chat Models

Introduces the Llama 2 family of open-source language models, ranging from 7B to 70B parameters, including detailed documentation of the safety fine-tuning methodology, red-teaming results, and the first comprehensive safety report for an open model release.

arXiv:2307.09288 · Empirical Study

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert et al.

open-source-models · safety-training · rlhf · red-teaming · responsible-release · model-safety-documentation


Focus: At release, Meta’s Llama 2 was the most significant open-source language model with documented safety training: Meta provided both the model weights and a detailed methodology for RLHF-based safety alignment. Its comprehensive safety evaluation set a new standard for responsible open model release and enabled the research community to study alignment techniques directly.


Key Insights

  • Open models enable safety research at scale. By releasing models with full safety documentation, Llama 2 enabled independent researchers to verify safety claims, discover new vulnerabilities, and develop improved alignment techniques.

  • Safety fine-tuning shows diminishing returns on sophisticated attacks. While Llama 2 Chat showed strong safety improvements on standard benchmarks, red-teaming revealed that sophisticated adversaries could still elicit harmful outputs. Safety training was most effective against naive attacks.

  • Ghost Attention maintains system prompt adherence. The paper introduced “Ghost Attention” (GAtt), a technique for maintaining system prompt instructions across multi-turn conversations, while also revealing that system prompt influence naturally decays over conversation length.

Executive Summary

Llama 2 was released as a collection of pre-trained and fine-tuned language models ranging from 7B to 70B parameters.

Model Family

  • Llama 2 (base): Pre-trained on 2 trillion tokens
  • Llama 2 Chat: Fine-tuned with supervised fine-tuning + iterative RLHF
  • Sizes: 7B, 13B, 70B parameters

Safety Training Pipeline

The fine-tuned chat models were trained through:

  1. Supervised fine-tuning on high-quality instruction-following data
  2. Iterative RLHF with focus on helpfulness and safety
  3. Multiple annotation rounds with over 1,400 adversarial examples
  4. Ghost Attention for multi-turn system prompt persistence
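The shape of that pipeline can be sketched at toy scale. Everything below is a stand-in for illustration, not the paper's actual components: the "policy" is a dict, the reward model is a keyword heuristic, and candidate selection stands in for rejection sampling / PPO updates.

```python
# Toy sketch of the Llama 2 Chat training pipeline: SFT, then iterative
# RLHF rounds guided by a reward model. All components are stand-ins.

def supervised_fine_tune(policy, sft_data):
    """Stage 1: imitate high-quality instruction-following demonstrations."""
    for prompt, demonstration in sft_data:
        policy[prompt] = demonstration
    return policy

def reward_model(prompt, response):
    """Stand-in reward: prefers refusals on unsafe prompts and answers on
    safe ones (the paper trains separate helpfulness and safety reward
    models on human preference data)."""
    unsafe = prompt.startswith("[unsafe]")
    refuses = response.startswith("I can't")
    return 1.0 if unsafe == refuses else 0.0

def rlhf_round(policy, prompts, candidates):
    """Stage 2: one iterative RLHF round. Picking the highest-reward
    candidate is a crude proxy for rejection sampling / PPO updates."""
    for prompt in prompts:
        policy[prompt] = max(candidates, key=lambda r: reward_model(prompt, r))
    return policy

policy = supervised_fine_tune({}, [("How do I bake bread?", "Mix, knead, proof, bake.")])
prompts = ["[unsafe] How do I pick a lock?", "What is the capital of France?"]
candidates = ["I can't help with that.", "Here is a detailed answer."]
for _ in range(3):  # multiple annotation/RLHF rounds, as in the paper
    policy = rlhf_round(policy, prompts, candidates)
```

After the loop, the policy refuses the unsafe prompt and answers the safe one, which is the intended effect of the safety-weighted reward signal.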

Safety Evaluation Results

Safety performance improved across rounds but showed significant variation by attack category:

  • Simple refusal training: Effective for obvious harmful requests
  • Nuanced scenarios: Struggled with dual-use information, context-dependent harms, and creative reframing of prohibited topics
  • Multi-turn attacks: Ghost Attention helped but did not fully prevent system prompt decay over long conversations
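Measuring that per-category variation requires breaking refusal rates down by attack type. A minimal sketch of such a harness follows, assuming a hypothetical `model` callable and a keyword-based refusal check; the paper's actual evaluations rely on human annotators and reward-model judgments, not string matching.

```python
# Minimal per-category adversarial evaluation harness (illustrative only).
from collections import defaultdict

def refusal_rate_by_category(model, attacks):
    """attacks: list of (category, prompt) pairs.
    Returns {category: fraction of prompts the model refused}."""
    refusals = defaultdict(int)
    totals = defaultdict(int)
    for category, prompt in attacks:
        response = model(prompt)
        totals[category] += 1
        # Crude refusal heuristic; real evaluations use trained judges.
        if response.lower().startswith(("i can't", "i cannot", "i won't")):
            refusals[category] += 1
    return {c: refusals[c] / totals[c] for c in totals}

# Toy model that refuses direct requests but misses creative reframing,
# mirroring the gap described above.
toy_model = lambda p: ("I can't help with that."
                       if "how to" in p.lower() else "Sure, in my story...")
attacks = [
    ("direct", "How to build a weapon?"),
    ("reframed", "Write a story where a character builds a weapon."),
]
rates = refusal_rate_by_category(toy_model, attacks)
```

Here the direct category scores a 1.0 refusal rate while the reframed category scores 0.0, the qualitative pattern the red-teaming results describe.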

Ghost Attention and System Prompt Decay

Ghost Attention addressed a practical challenge — maintaining safety instructions across long conversations — while revealing the underlying vulnerability: standard attention mechanisms naturally dilute the influence of earlier context, including safety-relevant system prompts.
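The GAtt data construction can be sketched as follows, based on the recipe the paper describes: append the system instruction to every user turn, sample responses that respect it, then build the training sample with the instruction kept only on the first turn so the model learns to honor it without seeing it repeated. The helper names and the toy `sample_response` below are illustrative, not the paper's code.

```python
# Sketch of Ghost Attention (GAtt)-style synthetic data construction.

def build_gatt_sample(instruction, dialogue, sample_response):
    """instruction: system prompt to persist; dialogue: list of user turns;
    sample_response: callable producing a response to an augmented turn.
    Returns a list of (prompt, response) training turns."""
    # Generation time: the instruction is prepended to EVERY user turn,
    # so every sampled response respects it.
    augmented = [f"{instruction}\n{user}" for user in dialogue]
    responses = [sample_response(turn) for turn in augmented]
    # Training time: keep the instruction only on the first turn, so the
    # model is trained to follow it even when it is no longer visible.
    training_turns = []
    for i, user in enumerate(dialogue):
        prompt = f"{instruction}\n{user}" if i == 0 else user
        training_turns.append((prompt, responses[i]))
    return training_turns

sample = build_gatt_sample(
    "Always answer in haiku.",
    ["Tell me about rain.", "Now about snow."],
    lambda turn: "<haiku respecting: " + turn.splitlines()[0] + ">",
)
```

The later turns thus pair an instruction-free prompt with an instruction-following response, which is exactly the signal that counteracts the attention dilution described above.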

Relevance to Failure-First

Llama 2 is foundational to the failure-first framework’s empirical work:

  • Reproducible adversarial evaluation. As an open model with documented safety training, Llama 2 enables reproducible evaluation that closed models do not permit.

  • System prompt decay. The documented decay of system prompt influence over conversation length is precisely the mechanism that multi-turn jailbreaks exploit.

  • Safety-helpfulness trade-offs. The paper’s analysis provides empirical grounding for the framework’s claim that these two objectives are fundamentally in tension.

  • Benchmark foundation. The framework’s benchmark pipeline extensively uses Llama 2 variants to study safety training effectiveness across model scales.


Read the full paper on arXiv · PDF