AgentBreeder: Mitigating the AI Safety Risks of Multi-Agent Scaffolds via Self-Improvement
- URL: http://arxiv.org/abs/2502.00757v4
- Date: Tue, 14 Oct 2025 09:49:14 GMT
- Title: AgentBreeder: Mitigating the AI Safety Risks of Multi-Agent Scaffolds via Self-Improvement
- Authors: J Rosser, Jakob Foerster,
- Abstract summary: Scaffolding Large Language Models (LLMs) into multi-agent systems often improves performance on complex tasks.<n>We introduce AgentBreeder, a framework for multi-objective self-improving evolutionary search over scaffolds.
- Score: 5.5068092331643035
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Scaffolding Large Language Models (LLMs) into multi-agent systems often improves performance on complex tasks, but the safety impact of such scaffolds has not been thoroughly explored. We introduce AgentBreeder, a framework for multi-objective self-improving evolutionary search over scaffolds. We evaluate discovered scaffolds on widely recognized reasoning, mathematics, and safety benchmarks and compare them with popular baselines. In "blue" mode, we see a 79.4% average uplift in safety benchmark performance while maintaining or improving capability scores. In "red" mode, we find adversarially weak scaffolds emerging concurrently with capability optimization. Our work demonstrates the risks of multi-agent scaffolding and provides a framework for mitigating them. Code is available at https://github.com/jrosseruk/AgentBreeder.
Related papers
- Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents [68.20752678837377]
We propose a principled taxonomy that transforms single-turn harmful tasks into multi-turn attack sequences.<n>Using this taxonomy, we construct MT-AgentRisk, the first benchmark to evaluate multi-turn tool-using agent safety.<n>We propose ToolShield, a training-free, tool-agnostic, self-exploration defense.
arXiv Detail & Related papers (2026-02-13T18:38:18Z) - AgentArk: Distilling Multi-Agent Intelligence into a Single LLM Agent [57.10083973844841]
AgentArk is a novel framework to distill multi-agent dynamics into the weights of a single model.<n>We investigate three hierarchical distillation strategies across various models, tasks, scaling, and scenarios.<n>By shifting the burden of computation from inference to training, the distilled models preserve the efficiency of one agent while exhibiting strong reasoning and self-correction performance of multiple agents.
arXiv Detail & Related papers (2026-02-03T19:18:28Z) - AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security [126.49733412191416]
Current guardrail models lack agentic risk awareness and transparency in risk diagnosis.<n>We propose a unified three-dimensional taxonomy that categorizes agentic risks by their source (where), failure mode (how), and consequence (what)<n>We introduce a new fine-grained agentic safety benchmark (ATBench) and a Diagnostic Guardrail framework for agent safety and security (AgentDoG)
arXiv Detail & Related papers (2026-01-26T13:45:41Z) - ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack [52.17935054046577]
We present ReasAlign, a model-level solution to improve safety alignment against indirect prompt injection attacks.<n>ReasAlign incorporates structured reasoning steps to analyze user queries, detect conflicting instructions, and preserve the continuity of the user's intended tasks.
arXiv Detail & Related papers (2026-01-15T08:23:38Z) - Exposing Weak Links in Multi-Agent Systems under Adversarial Prompting [5.544819942438653]
We present SafeAgents, a framework for fine-grained security assessment of multi-agent systems.<n>We conduct a study across five widely adopted multi-agent architectures.<n>Our findings reveal that common design patterns carry significant vulnerabilities.
arXiv Detail & Related papers (2025-11-14T04:22:49Z) - AgentGit: A Version Control Framework for Reliable and Scalable LLM-Powered Multi-Agent Systems [7.408263799616532]
We present AgentGit, a framework that brings Git-like rollback and branching to multi-agent systems (MAS)<n>We show that AgentGit significantly reduces redundant, runtime and token usage, and supports parallel exploration across multiple branches.<n>This work offers a practical path to more robust MAS design and enables error recovery, safe exploration, computation, and A/B testing in collaborative AI systems.
arXiv Detail & Related papers (2025-11-01T17:11:31Z) - Extending the OWASP Multi-Agentic System Threat Modeling Guide: Insights from Multi-Agent Security Research [0.8057006406834466]
This work translates recent anticipatory research in multi-agent security (MASEC) into practical guidance for addressing challenges unique to large language model (LLM)-driven multi-agent architectures.<n>We introduce additional threat classes and scenarios grounded in practical MAS deployments, highlighting risks from benign goal drift, cross-agent propagation, affective prompt framing, and multi-agent backdoors.<n>This work complements the framework of robustness by expanding its applicability to increasingly complex, autonomous, and adaptive multi-agent systems.
arXiv Detail & Related papers (2025-08-13T13:47:55Z) - Kaleidoscopic Teaming in Multi Agent Simulations [75.47388708240042]
We argue that existing red teaming or safety evaluation frameworks fall short in evaluating safety risks in complex behaviors, thought processes and actions taken by agents.<n>We introduce new in-context optimization techniques that can be used in our kaleidoscopic teaming framework to generate better scenarios for safety analysis.<n>We present appropriate metrics that can be used along with our framework to measure safety of agents.
arXiv Detail & Related papers (2025-06-20T23:37:17Z) - AgentSwift: Efficient LLM Agent Design via Value-guided Hierarchical Search [58.98450205734779]
Large language model (LLM) agents have demonstrated strong capabilities across diverse domains.<n>Existing agent search methods suffer from three major limitations.<n>We introduce a comprehensive framework to address these challenges.
arXiv Detail & Related papers (2025-06-06T12:07:23Z) - SentinelAgent: Graph-based Anomaly Detection in Multi-Agent Systems [11.497269773189254]
We present a system-level anomaly detection framework tailored for large language model (LLM)-based multi-agent systems (MAS)<n>We propose a graph-based framework that models agent interactions as dynamic execution graphs, enabling semantic anomaly detection at node, edge, and path levels.<n>Second, we introduce a pluggable SentinelAgent, an LLM-powered oversight agent that observes, analyzes, and intervenes in MAS execution based on security policies and contextual reasoning.
arXiv Detail & Related papers (2025-05-30T04:25:19Z) - MAPS: A Multilingual Benchmark for Global Agent Performance and Security [8.275240552134338]
We propose MAPS, a benchmark suite designed to evaluate agentic AI systems across diverse languages and tasks.<n>We translate each dataset into ten diverse languages, resulting in 805 unique tasks and 8,855 total language-specific instances.<n>We observe consistent degradation in both performance and security when transitioning from English to other languages.
arXiv Detail & Related papers (2025-05-21T18:42:00Z) - AgentVigil: Generic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents [54.29555239363013]
We propose a generic black-box fuzzing framework, AgentVigil, to automatically discover and exploit indirect prompt injection vulnerabilities.<n>We evaluate AgentVigil on two public benchmarks, AgentDojo and VWA-adv, where it achieves 71% and 70% success rates against agents based on o3-mini and GPT-4o.<n>We apply our attacks in real-world environments, successfully misleading agents to navigate to arbitrary URLs, including malicious sites.
arXiv Detail & Related papers (2025-05-09T07:40:17Z) - AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security [74.22452069013289]
AegisLLM is a cooperative multi-agent defense against adversarial attacks and information leakage.
We show that scaling agentic reasoning system at test-time substantially enhances robustness without compromising model utility.
Comprehensive evaluations across key threat scenarios, including unlearning and jailbreaking, demonstrate the effectiveness of AegisLLM.
arXiv Detail & Related papers (2025-04-29T17:36:05Z) - DoomArena: A framework for Testing AI Agents Against Evolving Security Threats [84.94654617852322]
We present DoomArena, a security evaluation framework for AI agents.
It is a plug-in framework and integrates easily into realistic agentic frameworks.
It is modular and decouples the development of attacks from details of the environment in which the agent is deployed.
arXiv Detail & Related papers (2025-04-18T20:36:10Z) - AutoAdvExBench: Benchmarking autonomous exploitation of adversarial example defenses [66.87883360545361]
AutoAdvExBench is a benchmark to evaluate if large language models (LLMs) can autonomously exploit defenses to adversarial examples.
We design a strong agent that is capable of breaking 75% of CTF-like ("homework exercise") adversarial example defenses.
We show that this agent is only able to succeed on 13% of the real-world defenses in our benchmark, indicating the large gap between difficulty in attacking "real" code, and CTF-like code.
arXiv Detail & Related papers (2025-03-03T18:39:48Z) - Scalable Safe Multi-Agent Reinforcement Learning for Multi-Agent System [1.0124625066746598]
Existing Multi-Agent Reinforcement Learning (MARL) algorithms that rely solely on reward shaping are ineffective in ensuring safety.<n>We propose a novel framework, Scalable Safe MARL (SS-MARL), to enhance the safety and scalability of MARL methods.<n>We show that SS-MARL achieves a better trade-off between optimality and safety compared to baselines, and its scalability significantly outperforms the latest methods in scenarios with a large number of agents.
arXiv Detail & Related papers (2025-01-23T15:01:19Z) - Agent-SafetyBench: Evaluating the Safety of LLM Agents [72.92604341646691]
We introduce Agent-SafetyBench, a comprehensive benchmark to evaluate the safety of large language models (LLMs)<n>Agent-SafetyBench encompasses 349 interaction environments and 2,000 test cases, evaluating 8 categories of safety risks and covering 10 common failure modes frequently encountered in unsafe interactions.<n>Our evaluation of 16 popular LLM agents reveals a concerning result: none of the agents achieves a safety score above 60%.
arXiv Detail & Related papers (2024-12-19T02:35:15Z) - Multimodal Situational Safety [73.63981779844916]
We present the first evaluation and analysis of a novel safety challenge termed Multimodal Situational Safety.
For an MLLM to respond safely, whether through language or action, it often needs to assess the safety implications of a language query within its corresponding visual context.
We develop the Multimodal Situational Safety benchmark (MSSBench) to assess the situational safety performance of current MLLMs.
arXiv Detail & Related papers (2024-10-08T16:16:07Z) - AutoPenBench: Benchmarking Generative Agents for Penetration Testing [42.681170697805726]
This paper introduces AutoPenBench, an open benchmark for evaluating generative agents in automated penetration testing.
We present a comprehensive framework that includes 33 tasks, each representing a vulnerable system that the agent has to attack.
We show the benefits of AutoPenBench by testing two agent architectures: a fully autonomous and a semi-autonomous supporting human interaction.
arXiv Detail & Related papers (2024-10-04T08:24:15Z) - Breaking Agents: Compromising Autonomous LLM Agents Through Malfunction Amplification [35.16099878559559]
Large language models (LLMs) have experienced significant development and are being deployed in real-world applications.
We introduce a new type of attack that causes malfunctions by misleading the agent into executing repetitive or irrelevant actions.
Our experiments reveal that these attacks can induce failure rates exceeding 80% in multiple scenarios.
arXiv Detail & Related papers (2024-07-30T14:35:31Z) - Diffusion Models for Offline Multi-agent Reinforcement Learning with Safety Constraints [0.0]
We introduce an innovative framework integrating diffusion models within the Multi-agent Reinforcement Learning paradigm.
This approach notably enhances the safety of actions taken by multiple agents through risk mitigation while modeling coordinated action.
arXiv Detail & Related papers (2024-06-30T16:05:31Z) - Dissecting Adversarial Robustness of Multimodal LM Agents [70.2077308846307]
We manually create 200 targeted adversarial tasks and evaluation scripts in a realistic threat model on top of VisualWebArena.
We find that we can successfully break latest agents that use black-box frontier LMs, including those that perform reflection and tree search.
We also use ARE to rigorously evaluate how the robustness changes as new components are added.
arXiv Detail & Related papers (2024-06-18T17:32:48Z) - AgentGym: Evolving Large Language Model-based Agents across Diverse Environments [116.97648507802926]
Large language models (LLMs) are considered a promising foundation to build such agents.
We take the first step towards building generally-capable LLM-based agents with self-evolution ability.
We propose AgentGym, a new framework featuring a variety of environments and tasks for broad, real-time, uni-format, and concurrent agent exploration.
arXiv Detail & Related papers (2024-06-06T15:15:41Z) - SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models [107.82336341926134]
SALAD-Bench is a safety benchmark specifically designed for evaluating Large Language Models (LLMs)
It transcends conventional benchmarks through its large scale, rich diversity, intricate taxonomy spanning three levels, and versatile functionalities.
arXiv Detail & Related papers (2024-02-07T17:33:54Z) - Evil Geniuses: Delving into the Safety of LLM-based Agents [35.49857256840015]
Large language models (LLMs) have revitalized in large language models (LLMs)
This paper delves into the safety of LLM-based agents from three perspectives: agent quantity, role definition, and attack level.
arXiv Detail & Related papers (2023-11-20T15:50:09Z) - AgentVerse: Facilitating Multi-Agent Collaboration and Exploring
Emergent Behaviors [93.38830440346783]
We propose a multi-agent framework framework that can collaboratively adjust its composition as a greater-than-the-sum-of-its-parts system.
Our experiments demonstrate that framework framework can effectively deploy multi-agent groups that outperform a single agent.
In view of these behaviors, we discuss some possible strategies to leverage positive ones and mitigate negative ones for improving the collaborative potential of multi-agent groups.
arXiv Detail & Related papers (2023-08-21T16:47:11Z) - AGI Agent Safety by Iteratively Improving the Utility Function [0.0]
We present an AGI safety layer that creates a special dedicated input terminal to support the iterative improvement of an AGI agent's utility function.
We show ongoing work in mapping it to a Causal Influence Diagram (CID)
We then present the design of a learning agent, a design that wraps the safety layer around either a known machine learning system, or a potential future AGI-level learning system.
arXiv Detail & Related papers (2020-07-10T14:30:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.