Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Documents Anthropic's large-scale manual red-teaming effort across model sizes and RLHF training, finding that larger and RLHF-trained models are harder but not impossible to red team, and providing a detailed taxonomy of discovered harms.
Focus: Ganguli et al. conducted a systematic red-teaming study with 324 participants generating 38,961 attacks against models of varying sizes and RLHF training levels, establishing empirical baselines for red-teaming methodology and revealing how model properties affect adversarial robustness.
Key Insights
- RLHF reduces but does not eliminate red-team success. RLHF-trained models were significantly harder to red team than plain language models, but skilled attackers could still elicit harmful outputs. The success rate decreased from approximately 25% for base models to 6-8% for RLHF models.
- Red-teamer skill matters more than model size. While larger models were somewhat harder to red team after RLHF training, the variability between individual red teamers was far greater than the variability between model sizes. A small number of highly skilled attackers found most of the serious vulnerabilities.
- Harm categories reveal systematic patterns. Certain categories (explicit content, discriminatory language) were easier to elicit than others (dangerous activities, self-harm instructions). This uneven vulnerability landscape means that aggregate attack success rates can mask category-specific risks.
Executive Summary
Anthropic recruited 324 crowdworkers and internal participants to attempt to elicit harmful outputs from language models ranging from 2.7B to 52B parameters, with and without RLHF training. Participants were given minimal instructions beyond attempting to make the model say something harmful.
Scale of the Study
The study generated 38,961 red-team attacks, each classified along three dimensions (a minimal record schema is sketched after this list):
- Harm type: Discrimination, explicit content, threats, dangerous information
- Severity: Low, medium, high impact potential
- Success: Whether the model produced the intended harmful output
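To make the labeling concrete, here is a minimal sketch of what one attack record might look like. The field names, enum values, and model metadata are illustrative assumptions, not the schema of the released dataset.

```python
from dataclasses import dataclass
from enum import Enum

class HarmType(Enum):
    # Illustrative categories, following the taxonomy above
    DISCRIMINATION = "discrimination"
    EXPLICIT_CONTENT = "explicit_content"
    THREATS = "threats"
    DANGEROUS_INFORMATION = "dangerous_information"

class Severity(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3

@dataclass
class RedTeamAttack:
    """One red-team attempt: a transcript plus the three labels above."""
    transcript: list[str]   # alternating human/model turns, human first
    harm_type: HarmType
    severity: Severity
    success: bool           # did the model produce the intended harmful output?
    model_size: str         # e.g. "52B" (hypothetical metadata field)
    model_type: str         # e.g. "plain_lm" or "rlhf" (hypothetical)
```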
Scaling Analysis
The scaling analysis revealed nuanced interactions between model size and safety training:
- Base models: Attack success showed a roughly flat trend with scale; larger plain LMs were not meaningfully harder to red team than smaller ones.
- RLHF models: Larger RLHF models were somewhat more robust, likely because they had greater capacity to learn the safety preference signal.
- Plateau effect: This advantage plateaued, and the most sophisticated attacks remained effective across all sizes.
Attack Strategy Analysis
The paper provided detailed qualitative analysis of successful attack strategies:
- Role-playing and persona adoption
- Hypothetical and fictional framing
- Multi-turn conversational manipulation
- Emotional appeals and urgency framing
These patterns became reference material for subsequent red-teaming research.
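For illustration only, one crude way to operationalize such strategy labels at scale is surface-pattern matching over the human turns of a transcript. The cue patterns below are invented examples; the paper's own strategy analysis was qualitative and manual.

```python
import re

# Hypothetical surface cues for three of the strategies listed above.
STRATEGY_PATTERNS = {
    "role_play": re.compile(r"\b(pretend|act as|you are now|persona)\b", re.I),
    "fictional_framing": re.compile(r"\b(hypothetically|in a story|fictional)\b", re.I),
    "urgency": re.compile(r"\b(urgent|immediately|right now|emergency)\b", re.I),
}

def tag_strategies(transcript: list[str]) -> set[str]:
    """Label a transcript with every strategy whose cue appears in a human turn."""
    human_turns = transcript[::2]  # assumes human speaks first, turns alternate
    labels = {
        name
        for name, pattern in STRATEGY_PATTERNS.items()
        for turn in human_turns
        if pattern.search(turn)
    }
    if len(transcript) > 2:
        labels.add("multi_turn")  # more than one human/model exchange
    return labels
```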
Relevance to Failure-First
This paper is one of the most directly relevant works to the failure-first framework:
- Expert-driven adversarial evaluation. The finding that a small number of skilled red teamers discover most serious vulnerabilities validates the framework’s emphasis on expert-designed adversarial scenarios over volume-based testing.
- Category-specific vulnerability. The harm taxonomy and category-specific analysis directly influenced the framework’s multi-dimensional labeling system.
- Tail-risk focus. The observation that RLHF creates a false sense of security, reducing average-case vulnerability while leaving worst-case vulnerability largely intact, is a cornerstone of the failure-first argument that safety evaluation must focus on tail-risk scenarios.
Read the full paper on arXiv (arXiv:2209.07858) · PDF