Testing and Enhancing Multi-Agent Systems for Robust Code Generation
- URL: http://arxiv.org/abs/2510.10460v1
- Date: Sun, 12 Oct 2025 05:45:04 GMT
- Title: Testing and Enhancing Multi-Agent Systems for Robust Code Generation
- Authors: Zongyi Lyu, Songqiang Chen, Zhenlan Ji, Liwen Wang, Shuai Wang, Daoyuan Wu, Wenxuan Wang, Shing-Chi Cheung,
- Abstract summary: Multi-agent systems (MASs) have emerged as a promising paradigm for automated code generation.<n>Despite their prosperous development and adoption, their robustness remains pressingly under-explored.<n>This paper presents the first comprehensive study examining the robustness of MASs for code generation through a fuzzing-based testing approach.
- Score: 21.38351747327572
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-agent systems (MASs) have emerged as a promising paradigm for automated code generation, demonstrating impressive performance on established benchmarks by decomposing complex coding tasks across specialized agents with different roles. Despite their prosperous development and adoption, their robustness remains pressingly under-explored, raising critical concerns for real-world deployment. This paper presents the first comprehensive study examining the robustness of MASs for code generation through a fuzzing-based testing approach. By designing a fuzzing pipeline incorporating semantic-preserving mutation operators and a novel fitness function, we assess mainstream MASs across multiple datasets and LLMs. Our findings reveal substantial robustness flaws of various popular MASs: they fail to solve 7.9%-83.3% of problems they initially resolved successfully after applying the semantic-preserving mutations. Through comprehensive failure analysis, we identify a common yet largely overlooked cause of the robustness issue: miscommunications between planning and coding agents, where plans lack sufficient detail and coding agents misinterpret intricate logic, aligning with the challenges inherent in a multi-stage information transformation process. Accordingly, we also propose a repairing method that encompasses multi-prompt generation and introduces a new monitor agent to address this issue. Evaluation shows that our repairing method effectively enhances the robustness of MASs by solving 40.0%-88.9% of identified failures. Our work uncovers critical robustness flaws in MASs and provides effective mitigation strategies, contributing essential insights for developing more reliable MASs for code generation.
Related papers
- MAS-FIRE: Fault Injection and Reliability Evaluation for LLM-Based Multi-Agent Systems [38.44649280816596]
We propose MAS-FIRE, a systematic framework for fault injection and reliability evaluation of Multi-Agent Systems.<n>We define a taxonomy of 15 fault types covering intra-agent cognitive errors and inter-agent coordination failures.<n>Applying MAS-FIRE to three representative MAS architectures, we uncover a rich set of fault-tolerant behaviors.
arXiv Detail & Related papers (2026-02-23T13:47:43Z) - Multi-Agent End-to-End Vulnerability Management for Mitigating Recurring Vulnerabilities [9.2997229083124]
Traditional static analysis methods struggle to precisely capture contextual dependencies.<n> recurring vulnerabilities emerge repeatedly due to code reuse and shared logic.<n>We present MAVM, a multi-agent framework for end-to-end recurring vulnerability management.
arXiv Detail & Related papers (2026-01-25T09:35:05Z) - Agentic Confidence Calibration [67.50096917021521]
Holistic Trajectory (HTC) is a novel diagnostic framework for AI agents.<n>HTC consistently surpasses strong baselines in both calibration and discrimination.<n>HTC provides interpretability by revealing the signals behind failure.
arXiv Detail & Related papers (2026-01-22T09:08:25Z) - Explainable and Fine-Grained Safeguarding of LLM Multi-Agent Systems via Bi-Level Graph Anomaly Detection [76.91230292971115]
Large language model (LLM)-based multi-agent systems (MAS) have shown strong capabilities in solving complex tasks.<n>XG-Guard is an explainable and fine-grained safeguarding framework for detecting malicious agents in MAS.
arXiv Detail & Related papers (2025-12-21T13:46:36Z) - Rethinking the Reliability of Multi-agent System: A Perspective from Byzantine Fault Tolerance [16.514747521376915]
Large language models (LLMs) have established LLM-based agents as a major branch of multi-agent systems (MAS)<n>In this work, we investigate and quantify the reliability of LLM-based agents from the perspective of Byzantine fault tolerance.<n>We design CP-WBFT, a confidence probe-based weighted Byzantine Fault Tolerant consensus mechanism to enhance the stability of MAS with different topologies.
arXiv Detail & Related papers (2025-11-13T15:20:12Z) - Metacognitive Self-Correction for Multi-Agent System via Prototype-Guided Next-Execution Reconstruction [58.51530390018909]
Large Language Model based multi-agent systems excel at collaborative problem solving but remain brittle to cascading errors.<n>We present MASC, a metacognitive framework that endows MAS with real-time, unsupervised, step-level error detection and self-correction.
arXiv Detail & Related papers (2025-10-16T05:35:37Z) - Who is Introducing the Failure? Automatically Attributing Failures of Multi-Agent Systems via Spectrum Analysis [10.235089248238108]
We propose FAMAS, the first spectrum-based failure attribution approach for MASs.<n>The core idea of FAMAS is to estimate, from variations across repeated MAS executions, the likelihood that each agent action is responsible for the failure.<n>In particular, we propose a novel suspiciousness formula tailored to MASs, which integrates two key factor groups, namely the agent behavior group and the action behavior group.
arXiv Detail & Related papers (2025-09-17T07:50:44Z) - MASteer: Multi-Agent Adaptive Steer Strategy for End-to-End LLM Trustworthiness Repair [24.187162194500317]
MASteer is an end-to-end framework for trustworthiness repair in Large Language Models (LLMs)<n>It integrates two core components: AutoTester, a multi-agent system that generates diverse, high-quality steer samples tailored to developer needs; and AutoRepairer, which constructs adaptive steering strategies with anchor vectors for automated, context-aware strategy selection during inference.<n>Experiments show MASteer consistently outperforms baselines, improving metrics by 15.36% on LLaMA-3.1-8B-Chat and 4.21% on Qwen-3-8B-Chat, while maintaining general model capabilities.
arXiv Detail & Related papers (2025-08-09T12:20:00Z) - MAS-ZERO: Designing Multi-Agent Systems with Zero Supervision [76.42361936804313]
We introduce MAS-ZERO, the first self-evolved, inference-time framework for automatic MAS design.<n> MAS-ZERO employs meta-level design to iteratively generate, evaluate, and refine MAS configurations tailored to each problem instance.
arXiv Detail & Related papers (2025-05-21T00:56:09Z) - MAMM-Refine: A Recipe for Improving Faithfulness in Generation with Multi-Agent Collaboration [63.31211701741323]
We extend multi-agent multi-model reasoning to generation, specifically to improving faithfulness through refinement.<n>We design intrinsic evaluations for each subtask, with our findings indicating that both multi-agent (multiple instances) and multi-model (diverse LLM types) approaches benefit error detection and critiquing.<n>We consolidate these insights into a final "recipe" called Multi-Agent Multi-Model Refinement (MAMM-Refine), where multi-agent and multi-model collaboration significantly boosts performance.
arXiv Detail & Related papers (2025-03-19T14:46:53Z) - Why Do Multi-Agent LLM Systems Fail? [91.39266556855513]
We present MAST (Multi-Agent System Failure taxonomy), the first empirically grounded taxonomy designed to understand MAS failures.<n>We analyze seven popular MAS frameworks across over 200 tasks, involving six expert human annotators.<n>We identify 14 unique failure modes, organized into 3 overarching categories, (i) specification issues, (ii) inter-agent misalignment, and (iii) task verification.
arXiv Detail & Related papers (2025-03-17T19:04:38Z) - When Disagreements Elicit Robustness: Investigating Self-Repair Capabilities under LLM Multi-Agent Disagreements [56.29265568399648]
We argue that disagreements prevent premature consensus and expand the explored solution space.<n>Disagreements on task-critical steps can derail collaboration depending on the topology of solution paths.
arXiv Detail & Related papers (2025-02-21T02:24:43Z) - LLMs as Continuous Learners: Improving the Reproduction of Defective Code in Software Issues [62.12404317786005]
EvoCoder is a continuous learning framework for issue code reproduction.
Our results show a 20% improvement in issue reproduction rates over existing SOTA methods.
arXiv Detail & Related papers (2024-11-21T08:49:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.