The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
Proposes a formal instruction hierarchy that trains models to prioritize system prompts over user messages over tool outputs, demonstrating that explicit privilege levels significantly reduce prompt injection and instruction override attacks.
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
Focus: Wallace et al. introduced a formal instruction hierarchy that explicitly trains models to distinguish between system-level instructions (highest priority), user messages (medium priority), and tool/retrieval outputs (lowest priority). This architectural approach to prompt security significantly reduced injection attacks by giving models a principled basis for resolving conflicting instructions.
Key Insights
-
Explicit privilege levels reduce injection attacks. Training models to recognize and respect an instruction hierarchy reduced prompt injection success rates by 56-85% across attack categories. Models without hierarchy training treated all text inputs with roughly equal authority.
-
The hierarchy must be trained, not just prompted. Simply instructing a model to prioritize system prompts through the system prompt itself was insufficient — adversarial user messages could override such instructions. Effective enforcement required training-time intervention.
-
Tool outputs are the most vulnerable channel. The lowest-privilege tier — tool outputs and retrieved documents — was the most commonly exploited injection vector. Web content, database results, and API responses could all contain adversarial instructions.
Executive Summary
The paper proposed that LLM safety could be substantially improved by introducing an explicit privilege hierarchy for instructions.
The Three Privilege Levels
-
System messages (highest) — Developer-specified instructions defining the application’s behavior, safety constraints, and operational boundaries.
-
User messages (medium) — End-user queries and instructions within the application’s intended use case.
-
Tool/retrieval outputs (lowest) — Content returned from external sources including web search, database queries, and API responses.
Training Methodology
The hierarchy was trained into the model through:
- Supervised fine-tuning with examples of conflicting instructions at different privilege levels.
- RLHF with preference data favoring responses that respected hierarchy ordering.
- Adversarial training examples including tool outputs with embedded malicious instructions, user messages attempting to override system prompts, and extraction attempts targeting system prompt contents.
Results
Evaluation showed dramatic improvements:
- Prompt injection resistance: 56-85% reduction in attack success rates across categories.
- System prompt protection: Effective resistance to extraction attempts.
- Tool output isolation: Ignored malicious instructions embedded in retrieved documents.
- Developer intent preservation: Maintained specified behavior even under explicit user override attempts.
Remaining Limitations
The authors acknowledged that the hierarchy was not perfectly enforceable:
- Sophisticated attacks that disguised their privilege level could still succeed.
- Ambiguities between legitimate user requests and instruction overrides remained challenging.
- The boundary between helpful instruction-following and unsafe compliance was context-dependent.
Relevance to Failure-First
The instruction hierarchy directly addresses findings from the failure-first framework:
-
Privilege enforcement as defense. The framework’s
labels.intent.*taxonomy includes format_lock, refusal_suppression, and constraint_erosion — attacks specifically targeting the absence of privilege enforcement. -
Architectural vs. behavioral defense. This paper demonstrates that architectural solutions can address fundamental vulnerabilities that behavioral training alone cannot, supporting the framework’s call for defense-in-depth approaches.
-
Continuous evaluation remains necessary. The paper’s acknowledgment that the hierarchy is not perfectly enforceable validates the framework’s position that no single defense eliminates adversarial risk.
Read the full paper on arXiv · PDF