The Instruction Hierarchy Problem
Multi-agent AI systems introduce a class of safety challenges that do not arise in single-agent architectures. When multiple AI agents collaborate on a task, each agent receives instructions from potentially conflicting sources: the end user, the orchestrating agent, other peer agents, and the content of the external environment they are processing. The question of which instructions take priority when these sources conflict is the instruction hierarchy problem. Resolving it incorrectly can lead to safety failures in which a subordinate agent executes a harmful action because it prioritized an environmental instruction over its system-level safety constraints.
The instruction hierarchy problem is particularly acute in web-browsing agents, where the environmental input (web page content) is controlled by potentially adversarial third parties. A multi-agent system that delegates web research to a subordinate browsing agent must ensure that the browsing agent maintains its safety invariants even when the web content it encounters contains instructions that conflict with those invariants. This requires a formal priority ordering among instruction sources that is enforced at the architectural level, not merely encouraged through training.
Delegation and Override Patterns
Multi-agent orchestration systems typically implement one of several delegation patterns. In hierarchical delegation, a primary agent decomposes a task into subtasks and assigns them to specialist agents, collecting and synthesizing their results. In peer delegation, agents negotiate task allocation among themselves without a central coordinator. In reactive delegation, agents dynamically recruit other agents in response to emerging task requirements. Each pattern creates different instruction flow topologies, and each topology has different vulnerability profiles with respect to adversarial instruction injection.
Hierarchical delegation is the most common pattern in current production systems and also the most studied from a security perspective. In this pattern, the primary agent acts as a trust boundary between the user's instructions and the subordinate agents' execution environments. If the primary agent correctly enforces the instruction hierarchy, adversarial instructions encountered by subordinate agents during execution should be filtered before they influence the system's behavior. However, this relies on the primary agent's ability to distinguish between legitimate task outputs and adversarial instruction payloads in the subordinate agents' responses, a distinction that is itself vulnerable to adversarial manipulation.
Safety Invariants Across Agent Boundaries
Maintaining safety invariants across agent boundaries is one of the hardest open problems in multi-agent AI safety. A safety invariant that holds within a single agent may be violated when the agent's output is consumed by another agent that operates under different constraints. For example, an agent that correctly refuses to generate harmful content when directly prompted may produce intermediate outputs that, when reinterpreted by a downstream agent in a different context, effectively circumvent the safety constraint. This is a form of confused deputy attack in which the downstream agent acts as a deputy that is confused about the provenance and intended interpretation of its inputs.
Formal verification of safety invariants in multi-agent systems requires compositional reasoning about the behavior of individual agents and their interactions. Current verification techniques are largely designed for single-agent systems and do not scale to the combinatorial complexity of multi-agent interactions. Developing compositional safety verification methods that can provide guarantees about the emergent behavior of multi-agent systems, without requiring exhaustive enumeration of all possible interaction sequences, is a critical research priority for the field.
Toward Robust Multi-Agent Architectures
Building multi-agent systems that are robust against adversarial instruction injection requires advances on multiple fronts. At the architectural level, systems need explicit instruction provenance tracking that tags each instruction with its source and enforces priority ordering based on source trustworthiness. At the training level, agents need to be specifically trained to resist instruction injection from environmental inputs, including inputs that mimic the format and authority claims of legitimate system-level instructions. At the evaluation level, the field needs benchmarks and test suites that specifically target multi-agent instruction hierarchy violations, testing not just individual agents but the emergent behavior of the composed system.
The development of such benchmarks is the primary motivation for the F41LUR3-F1R57 research program. By systematically cataloging the techniques through which adversarial instructions can be injected into the information channels of multi-agent systems, and by measuring the susceptibility of current systems to each technique, we aim to provide the empirical foundation needed for the design of more robust multi-agent architectures. The test suite accompanying this article represents one component of this broader evaluation framework.