Evo-MARL: Co-Evolutionary Multi-Agent Reinforcement Learning for Internalized Safety
- URL: http://arxiv.org/abs/2508.03864v1
- Date: Tue, 05 Aug 2025 19:26:55 GMT
- Title: Evo-MARL: Co-Evolutionary Multi-Agent Reinforcement Learning for Internalized Safety
- Authors: Zhenyu Pan, Yiting Zhang, Yutong Zhang, Jianshu Zhang, Haozheng Luo, Yuwei Han, Dennis Wu, Hong-Yu Chen, Philip S. Yu, Manling Li, Han Liu,
- Abstract summary: Multi-agent systems (MAS) built on multimodal large language models exhibit strong collaboration and performance.<n>Evo-MARL is a novel multi-agent reinforcement learning framework that enables all task agents to jointly acquire defensive capabilities.<n>Evo-MARL reduces attack success rates by up to 22% while boosting accuracy by up to 5% on reasoning tasks.
- Score: 52.12899798062852
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-agent systems (MAS) built on multimodal large language models exhibit strong collaboration and performance. However, their growing openness and interaction complexity pose serious risks, notably jailbreak and adversarial attacks. Existing defenses typically rely on external guard modules, such as dedicated safety agents, to handle unsafe behaviors. Unfortunately, this paradigm faces two challenges: (1) standalone agents offer limited protection, and (2) their independence leads to single-point failure-if compromised, system-wide safety collapses. Naively increasing the number of guard agents further raises cost and complexity. To address these challenges, we propose Evo-MARL, a novel multi-agent reinforcement learning (MARL) framework that enables all task agents to jointly acquire defensive capabilities. Rather than relying on external safety modules, Evo-MARL trains each agent to simultaneously perform its primary function and resist adversarial threats, ensuring robustness without increasing system overhead or single-node failure. Furthermore, Evo-MARL integrates evolutionary search with parameter-sharing reinforcement learning to co-evolve attackers and defenders. This adversarial training paradigm internalizes safety mechanisms and continually enhances MAS performance under co-evolving threats. Experiments show that Evo-MARL reduces attack success rates by up to 22% while boosting accuracy by up to 5% on reasoning tasks-demonstrating that safety and utility can be jointly improved.
Related papers
- Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security [63.41350337821108]
We propose Secure Tug-of-War (SecTOW) to enhance the security of multimodal large language models (MLLMs)<n>SecTOW consists of two modules: a defender and an auxiliary attacker, both trained iteratively using reinforcement learning (GRPO)<n>We show that SecTOW significantly improves security while preserving general performance.
arXiv Detail & Related papers (2025-07-29T17:39:48Z) - Who's the Mole? Modeling and Detecting Intention-Hiding Malicious Agents in LLM-Based Multi-Agent Systems [15.843105510334388]
We investigate intention-hiding threats in multi-agent systems powered by Large Language Models (LLM-MAS)<n>We propose a psychology-based detection framework AgentXposed, which combines the HEXACO personality model with the Reid Technique.<n>Our findings reveal the structural and behavioral risks posed by intention-hiding attacks and offer valuable insights into securing LLM-based multi-agent systems.
arXiv Detail & Related papers (2025-07-07T07:34:34Z) - Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models [55.28518567702213]
Conventional language model (LM) safety alignment relies on a reactive, disjoint procedure: attackers exploit a static model, followed by defensive fine-tuning to patch exposed vulnerabilities.<n>This sequential approach creates a mismatch -- attackers overfit to obsolete defenses, while defenders perpetually lag behind emerging threats.<n>We propose Self-RedTeam, an online self-play reinforcement learning algorithm where an attacker and defender agent co-evolve through continuous interaction.
arXiv Detail & Related papers (2025-06-09T06:35:12Z) - Align is not Enough: Multimodal Universal Jailbreak Attack against Multimodal Large Language Models [83.80177564873094]
We propose a unified multimodal universal jailbreak attack framework.<n>We evaluate the undesirable context generation of MLLMs like LLaVA, Yi-VL, MiniGPT4, MiniGPT-v2, and InstructBLIP.<n>This study underscores the urgent need for robust safety measures in MLLMs.
arXiv Detail & Related papers (2025-06-02T04:33:56Z) - ALRPHFS: Adversarially Learned Risk Patterns with Hierarchical Fast \& Slow Reasoning for Robust Agent Defense [7.923638619678924]
Existing defenses rely on "Safety Checks", which struggle to capture the complex semantic risks posed by harmful user inputs or unsafe agent behaviors.<n>We propose a novel defense framework, ALRPHFS (Adversarially Learned Risk Patterns with Hierarchical Fast & Slow Reasoning)<n>ALRPHFS consists of two core components: (1) an offline adversarial self-learning loop to iteratively refine a generalizable and balanced library of risk patterns, and (2) an online hierarchical fast & slow reasoning engine that balances detection effectiveness with computational efficiency.
arXiv Detail & Related papers (2025-05-25T18:31:48Z) - An LLM-based Self-Evolving Security Framework for 6G Space-Air-Ground Integrated Networks [49.605335601285496]
6G space-air-ground integrated networks (SAGINs) offer ubiquitous coverage for various mobile applications.<n>We propose a novel security framework for SAGINs based on Large Language Models (LLMs)<n>Our framework produces highly accurate security strategies that remain robust against a variety of unknown attacks.
arXiv Detail & Related papers (2025-05-06T04:14:13Z) - Multi-Agent Security Tax: Trading Off Security and Collaboration Capabilities in Multi-Agent Systems [1.2564343689544843]
We develop simulations of AI agents collaborating on shared objectives to study security risks and trade-offs.<n>We observe infectious malicious prompts - the multi-hop spreading of malicious instructions.<n>Our findings illustrate potential trade-off between security and collaborative efficiency in multi-agent systems.
arXiv Detail & Related papers (2025-02-26T14:00:35Z) - SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance [48.36220909956064]
SafeAligner is a methodology implemented at the decoding stage to fortify defenses against jailbreak attacks.<n>We develop two specialized models: the Sentinel Model, which is trained to foster safety, and the Intruder Model, designed to generate riskier responses.<n>We show that SafeAligner can increase the likelihood of beneficial tokens, while reducing the occurrence of harmful ones.
arXiv Detail & Related papers (2024-06-26T07:15:44Z) - BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models [57.5404308854535]
Safety backdoor attacks in large language models (LLMs) enable the stealthy triggering of unsafe behaviors while evading detection during normal interactions.
We present BEEAR, a mitigation approach leveraging the insight that backdoor triggers induce relatively uniform drifts in the model's embedding space.
Our bi-level optimization method identifies universal embedding perturbations that elicit unwanted behaviors and adjusts the model parameters to reinforce safe behaviors against these perturbations.
arXiv Detail & Related papers (2024-06-24T19:29:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.