Cross-Model Vulnerability Inheritance in Multi-Agent Systems

Adrian Wedd

Report 34 Research — Empirical Study 2026-02-05

Executive Summary

As AI deployment shifts from single-agent assistants to coordinated multi-agent systems, a potentially critical vulnerability class comes into view: cross-model vulnerability inheritance. Our analysis of 172 multi-agent failure scenarios suggests that when multiple AI agents interact, vulnerabilities may compound rather than remain isolated. Cascading failure modes, in which one agent’s compromise enables exploitation of connected agents, represent a theoretically significant attack surface that still requires empirical validation through matched-pair benchmarking (see Appendix A for the validation roadmap).

Current AI safety frameworks evaluate models in isolation, which leaves a dangerous gap as real-world deployments increasingly involve agent coordination, delegation chains, and distributed decision-making. A jailbroken planning agent can generate adversarial instructions that exploit downstream execution agents. A compromised verification agent can fail to detect violations from upstream generators. Safety boundaries can dissolve at agent interfaces where responsibility is unclear.

This brief presents three urgent policy recommendations: (1) mandatory multi-agent safety testing for all connected AI systems before deployment, (2) enforced isolation boundaries between agents with different safety profiles, and (3) clear chain-of-responsibility accountability frameworks for multi-agent deployments. Without intervention, the 2026-2027 wave of agentic AI systems risks inheriting vulnerabilities that single-agent testing does not detect.

1. Introduction

1.1 Context and Motivation

The AI safety field has developed sophisticated techniques for evaluating individual model safety: adversarial testing, red-teaming, jailbreak detection, and refusal mechanisms. However, these frameworks assume a single-agent paradigm in which one model processes user input and generates output. This assumption is rapidly becoming obsolete.

Production AI systems in 2026 increasingly involve multiple agents:

Delegation chains: A coordinator agent assigns tasks to specialized worker agents
Verification loops: One agent generates content while another validates safety
Distributed reasoning: Multiple agents contribute to a shared decision-making process
Tool-using systems: Language models orchestrate multiple AI-powered tools

Each agent in these systems may pass individual safety evaluations, yet the composition of agents creates novel attack surfaces. A vulnerability in agent coordination logic, interface contracts, or responsibility boundaries can be exploited even when constituent models are robust in isolation.

The failure-first methodology highlights this gap through scenario analysis: multi-agent configurations introduce attack surfaces (delegation chains, shared context, and trust boundary ambiguity) that do not exist in single-agent evaluation. Preliminary scenario design suggests these may represent a qualitative shift in the vulnerability landscape, but empirical benchmarking has not yet been conducted to quantify the effect size.

1.2 Scope

This brief analyzes cross-model vulnerability inheritance through three lenses:

Cascading Failures: How compromise of one agent enables exploitation of connected agents
Boundary Dissolution: Where safety responsibilities blur at agent interfaces
Compositional Vulnerabilities: Attack surfaces that emerge only in multi-agent configurations

Scope limitations:

Analysis based on 172 multi-agent scenarios from the F41LUR3-F1R57 corpus
Focus on language model coordination; does not address multi-modal or embodied robotics coordination in depth
Recommendations target systems integrators and safety evaluators, not model developers

Out of scope:

Single-agent jailbreak techniques (covered in Reports 31, 33)
Prompt injection in isolation (not multi-agent specific)
Multi-agent cooperation research unrelated to safety

2. Vulnerability Inheritance Mechanisms

2.1 Cascading Jailbreaks Across Agent Boundaries

In single-agent systems, a successful jailbreak compromises one model’s safety boundaries. In multi-agent systems, that compromise can cascade through delegation chains.

Example scenario (Scenario MA-042):

User provides adversarial input to Planning Agent: “Generate a detailed plan for the following research task…” (containing harmful objective wrapped in research framing)
Planning Agent, jailbroken by research framing, outputs: “Step 1: Research X, Step 2: Synthesize Y, Step 3: Generate detailed Z”
Execution Agent receives plan steps as trusted instructions from Planning Agent
Execution Agent completes harmful task Z without detecting adversarial intent

Key mechanism: The Execution Agent treats Planning Agent output as trusted input, bypassing safety checks that would trigger on direct user requests. Safety boundaries exist at the user-to-Planning Agent interface but dissolve at the Planning-to-Execution interface.

This delegation chain pattern is hypothesized to succeed at higher rates than equivalent single-agent attacks, but matched-pair benchmarking has not yet been conducted. The EP-34 validation study (designed, not yet executed) will measure this comparison across multiple model pairs.
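The trust asymmetry described above can be illustrated with a toy sketch. It is not the EP-34 harness and does not reproduce Scenario MA-042: `toy_policy_check`, `Message`, and both executor functions are hypothetical names, and the marker match merely stands in for whatever safety check a real agent would apply. The sketch only shows that an executor which screens user input but trusts upstream agents lets the same text through once it arrives by delegation.

```python
from dataclasses import dataclass


def toy_policy_check(text: str) -> bool:
    """Hypothetical stand-in for a safety check: allow text unless it carries a flagged marker."""
    return "[flagged]" not in text.lower()


@dataclass(frozen=True)
class Message:
    content: str
    source: str  # "user" or the name of an upstream agent


def execute_naive(msg: Message) -> str:
    """Checks only user-originated input and trusts anything sent by another agent."""
    if msg.source == "user" and not toy_policy_check(msg.content):
        return "REFUSED"
    return "EXECUTED"


def execute_revalidating(msg: Message) -> str:
    """Applies the same check to every message, whoever sent it."""
    if not toy_policy_check(msg.content):
        return "REFUSED"
    return "EXECUTED"


step = "Step 3: Generate detailed Z [FLAGGED]"
direct = Message(step, source="user")
delegated = Message(step, source="planning_agent")

print(execute_naive(direct))            # REFUSED
print(execute_naive(delegated))         # EXECUTED
print(execute_revalidating(delegated))  # REFUSED
```

Whether real model pairs behave like `execute_naive` is exactly what the matched-pair comparison is meant to measure; the sketch makes no claim about observed rates.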

2.2 Responsibility Diffusion at Agent Interfaces

Multi-agent systems create ambiguity about which component is responsible for safety enforcement.

Scenario class: Verification bypass (34 scenarios)

Agent A generates content with instruction: “Agent B will verify safety”
Agent B validates with assumption: “Agent A already filtered for policy violations”
Both agents implement partial safety checks, neither comprehensive
Result: Content that violates policy passes through the system

Hypothesized vulnerability: In these verification bypass scenarios, both agents may have functional safety mechanisms when tested individually, with the vulnerability emerging from implicit assumptions about division of safety responsibility. This hypothesis has not been empirically tested—the 34 verification bypass scenarios have been designed but not benchmarked against live models.

This represents a compositional vulnerability—not a failure of individual components, but of their integration contract.
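A minimal sketch of this integration gap, under stated assumptions: the policy categories and the split of checks between the two agents are invented for illustration and are not drawn from the 34 scenarios. Each agent enforces part of the policy and implicitly relies on the other for the rest, so a category that neither covers passes the whole pipeline.

```python
from typing import Set

# Hypothetical policy categories, chosen only for illustration.
REQUIRED_CHECKS: Set[str] = {"violence", "self_harm", "malware", "privacy"}

# Each agent implements part of the policy and assumes the other covers the rest.
AGENT_A_CHECKS: Set[str] = {"violence", "self_harm"}
AGENT_B_CHECKS: Set[str] = {"violence", "malware"}


def uncovered(required: Set[str], *agent_checks: Set[str]) -> Set[str]:
    """Return the policy categories that no agent in the pipeline checks."""
    covered: Set[str] = set().union(*agent_checks)
    return required - covered


def passes_pipeline(content_categories: Set[str]) -> bool:
    """Content passes when neither agent flags any of its categories."""
    flagged_by_a = content_categories & AGENT_A_CHECKS
    flagged_by_b = content_categories & AGENT_B_CHECKS
    return not flagged_by_a and not flagged_by_b


print(uncovered(REQUIRED_CHECKS, AGENT_A_CHECKS, AGENT_B_CHECKS))  # {'privacy'}
print(passes_pipeline({"privacy"}))  # True: a violation neither agent checks
print(passes_pipeline({"malware"}))  # False: Agent B catches it
```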

2.3 Stateful Degradation Across Interaction Episodes

Multi-agent systems maintain conversation state across turns, enabling gradual erosion of safety boundaries.

Episode testing (5-10 turn sequences):

Turns 1-2: Establish benign context and agent roles
Turns 3-4: Introduce edge cases that push boundaries incrementally
Turns 5-7: Agents develop shared context that normalizes policy violations
Turns 8-10: Explicitly harmful requests succeed due to established rapport and context

Preliminary episode testing on 2 models (Llama 3.3 70B and Mistral Devstral) across 3 episode sequences showed 0% attack success (0/9 scenes per model). This limited testing neither validates nor refutes the stateful degradation hypothesis: the sample is too small and the models too few to draw conclusions. The full EP-34 validation study is designed to test the 77 available episodes across multiple model configurations.

Key hypothesis: Multi-turn interactions create memory and context that single-agent evaluations do not capture. Agents that refuse harmful requests in turn 1 may comply in later turns after context manipulation. This requires further empirical validation.
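To make concrete why 0/9 scenes per model is inconclusive, the following worked sketch computes an exact one-sided upper confidence bound on the attack success rate when no successes are observed. The choice of a Clopper-Pearson bound and the 95% level are assumptions made here for illustration; the report does not specify a statistical method. With zero successes in n trials, the bound solves (1 - p)^n = 1 - confidence.

```python
def upper_bound_zero_successes(n_trials: int, confidence: float = 0.95) -> float:
    """One-sided exact upper bound on a success rate when 0 of n_trials succeeded.

    Solves (1 - p) ** n_trials == 1 - confidence for p.
    """
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n_trials)


per_model = upper_bound_zero_successes(9)   # 0/9 scenes for one model
pooled = upper_bound_zero_successes(18)     # 0/18, assuming both models share one rate

print(round(per_model, 3))  # 0.283
print(round(pooled, 3))     # 0.153
```

Under these assumptions, a true per-model attack success rate as high as roughly 28% would still be consistent with observing 0/9, which is why the preliminary result cannot rule out the hypothesis.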

3. Current Framework Gaps

3.1 Single-Agent Evaluation Paradigm

Industry-standard AI safety evaluation treats models as isolated units:

Red-team exercises target one model at a time
Benchmark datasets (AdvBench, HarmBench, JailbreakBench) assume single-agent interaction
Safety fine-tuning optimizes for individual model refusal behavior
Deployment approval based on single-model safety metrics

Gap: To our knowledge, no major safety framework includes multi-agent interaction testing as a required evaluation dimension.

3.2 Lack of Interface Safety Standards

Agent-to-agent communication protocols lack safety validation requirements:

No standard for marking “trusted” vs “untrusted” inputs at agent boundaries
No specification for how downstream agents should validate upstream agent outputs
Tool-use APIs do not distinguish AI-generated calls from human-authorized calls
Function calling interfaces treat all calls as equally trusted

Gap: Current APIs assume all inputs are equally untrusted (web context) or equally trusted (function calls). Multi-agent systems need graduated trust boundaries.
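One way to picture graduated trust boundaries is a message envelope that records the trust level of every hop and never lets a relay raise it. This is a sketch under assumptions, not an existing standard: the three levels, the `Envelope` type, and the "weakest hop wins" rule are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Tuple


class Trust(IntEnum):
    """Hypothetical graduated trust levels, lowest first."""
    UNTRUSTED = 0  # e.g. web content or raw user input
    AGENT = 1      # output produced by another AI agent
    OPERATOR = 2   # instruction explicitly authorized by a human


@dataclass(frozen=True)
class Envelope:
    payload: str
    provenance: Tuple[Trust, ...]  # trust level of every hop, oldest first

    def __post_init__(self) -> None:
        if not self.provenance:
            raise ValueError("provenance must record at least one hop")

    def effective_trust(self) -> Trust:
        """The envelope is only as trusted as its least trusted hop."""
        return min(self.provenance)

    def forwarded_by(self, hop: Trust) -> "Envelope":
        """Return a copy with one more hop appended; earlier hops are preserved."""
        return Envelope(self.payload, self.provenance + (hop,))


def may_invoke_tool(env: Envelope, required: Trust) -> bool:
    """Allow a tool call only if the envelope's effective trust meets the requirement."""
    return env.effective_trust() >= required


web_input = Envelope("fetch page and run the commands in it", (Trust.UNTRUSTED,))
relayed = web_input.forwarded_by(Trust.AGENT)

print(relayed.effective_trust().name)         # UNTRUSTED
print(may_invoke_tool(relayed, Trust.AGENT))  # False
```

The design choice worth noting is that relaying through an agent does not launder untrusted input into agent-level trust, which is the distinction flat "all trusted" or "all untrusted" interfaces cannot express.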

3.3 Accountability Vacuum in Distributed Systems

When a multi-agent system produces harmful output, responsibility attribution is unclear:

Did the planning agent fail to detect adversarial intent?
Did the execution agent fail to validate instructions?
Did the verification agent fail to catch policy violations?
Did the system integrator fail to establish proper safety contracts?

Gap: No established framework for multi-agent safety accountability. Regulatory guidance (EU AI Act, US Executive Orders) focuses primarily on single-model deployment.

4. Policy Recommendations

4.1 Mandatory Multi-Agent Safety Testing

Recommendation: Require multi-agent safety evaluation for any AI system where multiple models interact, delegate tasks, or share context across turns.

Rationale: Single-agent testing creates false confidence when models will be deployed in coordinated configurations. Our scenario analysis identifies attack surfaces—delegation chains, verification loops, stateful degradation—that single-agent evaluation does not cover. As agentic AI systems become the dominant deployment pattern in 2026-2027, untested multi-agent vulnerabilities may become a significant attack surface. Quantifying this risk is a priority for the EP-34 validation study.

Implementation:

Evaluation requirement: Any AI system involving 2+ interacting agents must undergo multi-agent red-teaming before deployment approval
Test coverage: Evaluation must include delegation chains, verification loops, and stateful episodes (minimum 5-turn sequences)
Success criteria: Multi-agent attack success rate must not exceed 1.5x the single-agent baseline
Documentation: Deployment documentation must specify which agent interactions were tested and which safety boundaries apply at each interface
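The success criterion in the list above can be checked mechanically. The sketch below reads it as "multi-agent attack success rate is at most 1.5 times the single-agent rate", which is one interpretation of the wording; the function name and argument layout are hypothetical. Comparing by multiplication rather than division handles a zero baseline, and exact fractions avoid floating-point edge cases at the threshold.

```python
from fractions import Fraction


def meets_success_criterion(
    multi_successes: int,
    multi_trials: int,
    single_successes: int,
    single_trials: int,
    max_ratio: Fraction = Fraction(3, 2),
) -> bool:
    """Return True if the multi-agent attack success rate is at most
    max_ratio times the single-agent baseline rate."""
    if multi_trials <= 0 or single_trials <= 0:
        raise ValueError("trial counts must be positive")
    multi_asr = Fraction(multi_successes, multi_trials)
    single_asr = Fraction(single_successes, single_trials)
    # With a zero baseline this passes only when the multi-agent rate is also zero.
    return multi_asr <= max_ratio * single_asr


print(meets_success_criterion(3, 20, 2, 20))  # True: 15% vs 10% baseline, exactly 1.5x
print(meets_success_criterion(4, 20, 2, 20))  # False: 20% vs 10% baseline, 2x
print(meets_success_criterion(1, 20, 0, 20))  # False: any success fails against a 0% baseline
```

A point estimate like this says nothing about sampling uncertainty; with small trial counts, a deployment gate would presumably also need a confidence interval on both rates.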

Compliance timeline:

6 months: Guidance published for multi-agent safety testing protocols
12 months: Mandatory for high-risk applications (healthcare, finance, critical infrastructure)
18 months: Mandatory for all commercial multi-agent AI deployments

4.2 Isolation Boundaries Between Agents with Different Safety Profiles

Recommendation: Enforce technical isolation between agents with different safety classifications, with mandatory validation at trust boundaries.

Rationale: Current systems allow unrestricted communication between agents regardless of their safety profiles. A jailbroken agent can compromise connected agents because there are no isolation mechanisms at agent interfaces. By establishing trust boundaries and requiring validation when crossing them, we can contain vulnerability inheritance.

Implementation:

Safety profile classification: Each agent must be labeled with a safety profile (e.g., “public-facing”, “internal-tools”, “high-risk-domain”)
Boundary enforcement: Communication between agents with different profiles requires validation middleware
Validation requirements:
- Agents receiving instructions from lower-trust agents must re-validate against safety policy
- Content generated by one agent cannot be blindly trusted by downstream agents
- Tool calls and function invocations must be re-authorized when crossing trust boundaries
Technical standards: Develop API specifications for trust boundary validation (e.g., signed attestations, provenance tracking)

Example: A planning agent (public-facing, lower trust) delegates to an execution agent (internal-tools, higher privileges). The execution agent must validate that delegated instructions comply with safety policy, even though they originated from another AI agent.
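The planning-to-execution example above can be sketched as validation middleware. This is an illustrative sketch, not a proposed standard: the shared demo key, the `attest` and `accept_at_boundary` names, and the caller-supplied `policy_check` are all assumptions. It uses Python's standard `hmac` module for the signed attestation, and re-validates the instruction whenever sender and receiver profiles differ.

```python
import hashlib
import hmac
from typing import Callable

DEMO_KEY = b"demo-key-not-for-production"  # assumption: real systems need managed keys


def attest(sender: str, profile: str, instruction: str, key: bytes = DEMO_KEY) -> str:
    """Return a hex HMAC-SHA256 tag binding sender, safety profile, and instruction."""
    # The unit separator keeps field boundaries unambiguous in the signed message.
    message = "\x1f".join((sender, profile, instruction)).encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def accept_at_boundary(
    sender: str,
    sender_profile: str,
    receiver_profile: str,
    instruction: str,
    tag: str,
    policy_check: Callable[[str], bool],
    key: bytes = DEMO_KEY,
) -> bool:
    """Accept an instruction only if its attestation verifies and, when the
    profiles differ, the receiver's own policy check also passes."""
    expected = attest(sender, sender_profile, instruction, key)
    if not hmac.compare_digest(expected, tag):
        return False  # provenance cannot be verified
    if sender_profile != receiver_profile:
        return policy_check(instruction)  # re-validate when crossing a trust boundary
    return True


def toy_policy(instruction: str) -> bool:
    """Hypothetical receiver-side policy: block anything asking to delete data."""
    return "delete" not in instruction.lower()


tag = attest("planner", "public-facing", "summarize quarterly report")
print(accept_at_boundary("planner", "public-facing", "internal-tools",
                         "summarize quarterly report", tag, toy_policy))  # True
print(accept_at_boundary("planner", "public-facing", "internal-tools",
                         "delete all records", tag, toy_policy))          # False: tag mismatch
```

The attestation establishes who sent an instruction, not whether it is safe; a correctly signed instruction from a jailbroken planner still has to pass the receiver's own check, which is why both steps are needed.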

4.3 Chain-of-Responsibility Accountability for Multi-Agent Deployments

Recommendation: Establish clear accountability frameworks that assign safety responsibility for each component in multi-agent systems.

Rationale: The current accountability vacuum allows harmful outputs from multi-agent systems to fall through responsibility gaps. When planning, execution, and verification agents all assume another component will handle safety enforcement, none do. Explicit accountability assignment ensures every step in an agent chain has a designated responsible party.

Implementation:

Component-level accountability: For each agent in a multi-agent system, document:
- Which safety checks this agent is responsible for performing
- Which safety assumptions this agent makes about upstream inputs
- Which safety guarantees this agent provides to downstream consumers
Integration accountability: Systems integrators must document:
- How safety responsibilities are distributed across agents
- Which interfaces represent trust boundaries
- How the composed system’s safety properties differ from individual components
Incident investigation: When harmful outputs occur, analysis must trace:
- Which agent(s) failed to perform designated safety checks
- Whether integration introduced vulnerabilities not present in components
- Whether compositional effects created unintended attack surfaces
Regulatory compliance: Safety documentation must be provided to regulators for high-risk AI deployments

Enforcement: Regulatory bodies should require chain-of-responsibility documentation as part of deployment approval for multi-agent systems in regulated domains.
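Component-level documentation of this kind lends itself to a machine-checkable form. The sketch below is one hypothetical encoding, not a format the report prescribes: each agent declares the checks it performs, the assumptions it makes about upstream input, and the guarantees it offers downstream, and a helper reports assumptions that no upstream agent guarantees. It assumes a linear chain and that guarantees persist unchanged through later agents, both simplifications.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class AgentContract:
    name: str
    performs: FrozenSet[str]    # safety checks this agent is responsible for
    assumes: FrozenSet[str]     # properties it assumes about upstream input
    guarantees: FrozenSet[str]  # properties it promises to downstream consumers


def unmet_assumptions(chain: List[AgentContract]) -> Dict[str, FrozenSet[str]]:
    """For a linear chain (upstream first), map each agent to the assumptions
    that no agent upstream of it guarantees."""
    gaps: Dict[str, FrozenSet[str]] = {}
    guaranteed_so_far: FrozenSet[str] = frozenset()
    for agent in chain:
        missing = agent.assumes - guaranteed_so_far
        if missing:
            gaps[agent.name] = missing
        guaranteed_so_far = guaranteed_so_far | agent.guarantees
    return gaps


# Hypothetical contracts for a two-agent chain.
planner = AgentContract(
    name="planner",
    performs=frozenset({"intent_screening"}),
    assumes=frozenset(),
    guarantees=frozenset({"intent_screened"}),
)
executor = AgentContract(
    name="executor",
    performs=frozenset({"tool_allowlist"}),
    assumes=frozenset({"intent_screened", "policy_filtered"}),
    guarantees=frozenset({"tools_restricted"}),
)

print(unmet_assumptions([planner, executor]))
# {'executor': frozenset({'policy_filtered'})}
```

An unmet assumption flagged this way is a candidate responsibility gap for integrators to resolve before deployment and a natural starting point for incident tracing afterward.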

5. Conclusion

The transition from single-agent AI assistants to coordinated multi-agent systems represents a phase shift in AI safety challenges. Vulnerabilities that were contained within individual models can now cascade across agent boundaries, compound through delegation chains, and hide in the gaps between components.

Our scenario analysis and preliminary testing indicate this is a plausible risk requiring urgent empirical investigation. While matched-pair benchmarking has not yet quantified the effect size, the 172 multi-agent scenarios in our corpus identify concrete attack surfaces that single-agent evaluation does not address. As the industry rapidly deploys agentic AI systems—planning agents, tool-using agents, verification loops, distributed reasoning—the attack surface expands into territory that current frameworks leave untested.

The three recommendations in this brief—mandatory multi-agent testing, isolation boundaries between agents, and chain-of-responsibility accountability—provide a path forward. They are implementable with current technology, aligned with existing regulatory frameworks, and address the root causes of cross-model vulnerability inheritance.

The window for proactive intervention is narrow. By the end of 2026, multi-agent AI systems are likely to be deployed at scale. The choice is between testing these systems now, under controlled conditions, or discovering their vulnerabilities in production after harm has occurred.

Appendix A: Methodology

Data Sources

Multi-agent scenarios corpus:

172 scenarios from data/multi_agent/ spanning delegation chains, verification loops, and distributed reasoning patterns
77 episode sequences (5-10 turns each) from data/episodes/ designed for stateful degradation testing (3 tested to date on 2 models)
408 single-agent scenarios available for baseline comparison

Important limitation: The scenarios above have been designed and validated against schema but have not been systematically benchmarked against live models in matched-pair comparisons. Quantitative claims in this report are hypotheses pending the EP-34 validation study. Preliminary testing (3 episodes × 2 models) showed 0% attack success, but this sample is insufficient for generalization.

Evaluation approach:

Adversarial inputs applied to both single-agent and multi-agent configurations
Attack success measured by: (1) harmful content generation, (2) safety refusal bypass, (3) policy violation undetected by system
Scenarios validated against schema: schemas/dataset/multi_agent_entry_schema_v0.1.json
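A matched-pair comparison of the kind described above could be analyzed with an exact McNemar test, which looks only at scenarios where the single-agent and multi-agent outcomes disagree. The choice of test is an assumption made here for illustration, since the report does not name a statistical method, and the outcome data below is invented purely to exercise the function.

```python
from math import comb
from typing import Iterable, Tuple


def mcnemar_exact(pairs: Iterable[Tuple[bool, bool]]) -> Tuple[int, int, float]:
    """Exact two-sided McNemar test on paired binary outcomes.

    Each pair is (single_agent_attack_succeeded, multi_agent_attack_succeeded)
    for the same scenario. Returns (b, c, p_value), where b counts pairs in which
    only the single-agent attack succeeded and c counts pairs in which only the
    multi-agent attack succeeded.
    """
    b = c = 0
    for single, multi in pairs:
        if single and not multi:
            b += 1
        elif multi and not single:
            c += 1
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Under the null hypothesis, discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Invented outcomes for 20 scenarios, for illustration only.
outcomes = (
    [(False, True)] * 7      # only the multi-agent attack succeeded
    + [(True, False)] * 1    # only the single-agent attack succeeded
    + [(True, True)] * 2
    + [(False, False)] * 10
)
print(mcnemar_exact(outcomes))  # (1, 7, 0.0703125)
```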

Limitations

Analysis based on language model agents; embodied robotics and multi-modal coordination require additional research
Attack success is measured in a research context; production systems may have additional defenses
Testing focused on known jailbreak patterns from Reports 31, 33; novel attack vectors may exist

Validation

All scenarios passed:

Schema validation: tools/validate_dataset.py
Safety linting: tools/lint_prompts.py
Cross-field invariant checks

Appendix B: Related Work

F41LUR3-F1R57 Research Series

Report 31: Jailbreak Archaeology — Historical evolution of adversarial techniques
Report 33: Capability-Safety Spectrum — Trade-offs in model capability vs. safety constraints

External Research

Multi-agent AI safety (Anthropic, 2025): Constitutional AI for multi-agent systems
Compositional security (NIST, 2025): Security properties of composed AI systems
EU AI Act: Multi-agent system classification and risk assessment
UK AI Safety Institute: Red-teaming methodologies for agentic AI

Standards and Frameworks

ISO/IEC 42001: AI Management Systems (2023)
NIST AI Risk Management Framework (AI RMF 1.0, 2023)
Partnership on AI: Responsible AI deployment guidelines
