A Unified Debugging Approach via LLM-Based Multi-Agent Synergy
- URL: http://arxiv.org/abs/2404.17153v2
- Date: Wed, 23 Oct 2024 11:56:26 GMT
- Title: A Unified Debugging Approach via LLM-Based Multi-Agent Synergy
- Authors: Cheryl Lee, Chunqiu Steven Xia, Longji Yang, Jen-tse Huang, Zhouruixin Zhu, Lingming Zhang, Michael R. Lyu
- Abstract summary: FixAgent is an end-to-end framework for unified debugging through multi-agent synergy.
It significantly outperforms state-of-the-art repair methods, fixing 1.25$\times$ to 2.56$\times$ as many bugs as those methods on the repo-level benchmark Defects4J.
- Score: 39.11825182386288
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Software debugging is a time-consuming endeavor involving a series of steps, such as fault localization and patch generation, each requiring thorough analysis and a deep understanding of the underlying logic. While large language models (LLMs) demonstrate promising potential in coding tasks, their performance in debugging remains limited. Current LLM-based methods often focus on isolated steps and struggle with complex bugs. In this paper, we propose the first end-to-end framework, FixAgent, for unified debugging through multi-agent synergy. It mimics the entire cognitive process of developers, with each agent specialized as a particular component of this process rather than mirroring the actions of an independent expert as in previous multi-agent systems. Agents are coordinated through a three-level design, following a cognitive model of debugging, allowing adaptive handling of bugs with varying complexities. Experiments on extensive benchmarks demonstrate that FixAgent significantly outperforms state-of-the-art repair methods, fixing 1.25$\times$ to 2.56$\times$ as many bugs as the baselines on the repo-level benchmark, Defects4J. This performance is achieved without requiring ground-truth root-cause code statements, unlike the baselines. Our source code is available on https://github.com/AcceptePapier/UniDebugger.
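The abstract describes specialized agents coordinated through a three-level design that follows a cognitive model of debugging. The paper's own interfaces are not reproduced here, so the following is only a minimal sketch of what such a localize-repair-verify loop could look like, with a hypothetical `call_llm` helper and placeholder agent roles that are not taken from the FixAgent source.

```python
from dataclasses import dataclass

# Hypothetical LLM call; in practice this would wrap an actual model API.
def call_llm(prompt: str) -> str:
    return f"<llm response to: {prompt[:40]}...>"

@dataclass
class Bug:
    code: str
    failing_test: str

def localize(bug: Bug) -> str:
    """Level 1: ask the model to point at suspicious statements."""
    return call_llm(f"Locate the fault:\n{bug.code}\nFailing test:\n{bug.failing_test}")

def repair(bug: Bug, fault_report: str) -> str:
    """Level 2: propose a patch conditioned on the localization result."""
    return call_llm(f"Fix the code given this fault report:\n{fault_report}\n{bug.code}")

def verify(bug: Bug, patch: str) -> bool:
    """Level 3: a real system would re-run the failing test on the patch."""
    return "response" in patch  # placeholder check for the sketch

def debug(bug: Bug, max_rounds: int = 3) -> str | None:
    """Coordinate the agents, escalating to another round if verification fails."""
    for _ in range(max_rounds):
        report = localize(bug)
        patch = repair(bug, report)
        if verify(bug, patch):
            return patch
    return None

if __name__ == "__main__":
    print(debug(Bug(code="def add(a, b): return a - b", failing_test="assert add(1, 2) == 3")))
```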
Related papers
- Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems [50.29939179830491]
Failure attribution in LLM multi-agent systems remains underexplored and labor-intensive.
We develop and evaluate three automated failure attribution methods, summarizing their corresponding pros and cons.
The best method achieves 53.5% accuracy in identifying failure-responsible agents but only 14.2% in pinpointing failure steps.
arXiv Detail & Related papers (2025-04-30T23:09:44Z) - Unveiling Pitfalls: Understanding Why AI-driven Code Agents Fail at GitHub Issue Resolution [22.03052751722933]
Python execution errors during the issue resolution phase correlate with lower resolution rates and increased reasoning overheads.
We have identified the most prevalent errors -- such as ModuleNotFoundError and TypeError -- and highlighted particularly challenging errors like OSError and database-related issues.
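As a concrete illustration of the error categories named above (not code from the paper), a sandboxed execution step in an issue-resolution agent might record which exception class each candidate patch raises:

```python
import collections

def run_candidate(code: str) -> str | None:
    """Execute a candidate patch in-process and return the exception class name, if any.
    A real agent would run this in an isolated subprocess or container instead."""
    try:
        exec(compile(code, "<candidate>", "exec"), {})
        return None
    except Exception as exc:  # includes ModuleNotFoundError, TypeError, OSError, ...
        return type(exc).__name__

candidates = [
    "import nonexistent_package",        # -> ModuleNotFoundError
    "len(3)",                            # -> TypeError
    "open('/no/such/dir/file.txt')",     # -> FileNotFoundError (an OSError subclass)
    "x = 1 + 1",                         # -> no error
]

error_counts = collections.Counter(run_candidate(c) for c in candidates)
print(error_counts)  # which error types dominate is the kind of statistic the study reports
```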
arXiv Detail & Related papers (2025-03-16T06:24:51Z) - SolBench: A Dataset and Benchmark for Evaluating Functional Correctness in Solidity Code Completion and Repair [51.0686873716938]
We introduce SolBench, a benchmark for evaluating the functional correctness of Solidity smart contracts generated by code completion models.
We propose a Retrieval-Augmented Code Repair framework to verify functional correctness of smart contracts.
Results show that code repair and retrieval techniques effectively enhance the correctness of smart contract completion while reducing computational costs.
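The abstract names a Retrieval-Augmented Code Repair framework without spelling out the pipeline, so the snippet below is only an assumed sketch: retrieve similar verified snippets, then build a repair prompt around them (the corpus, similarity measure, and stubbed model call are all illustrative, not from SolBench).

```python
from difflib import SequenceMatcher

# Tiny stand-in corpus of previously verified snippets (hypothetical, not from SolBench).
CORPUS = [
    "function transfer(address to, uint256 amount) public { balances[to] += amount; }",
    "function approve(address spender, uint256 amount) public { allowances[msg.sender][spender] = amount; }",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    """Rank corpus snippets by simple string similarity to the broken code."""
    return sorted(CORPUS, key=lambda s: SequenceMatcher(None, query, s).ratio(), reverse=True)[:k]

def repair(broken: str) -> str:
    """Build a repair prompt augmented with retrieved context (model call stubbed out)."""
    context = "\n".join(retrieve(broken))
    prompt = f"Context:\n{context}\n\nRepair this Solidity function:\n{broken}"
    return f"<patched code for prompt of {len(prompt)} chars>"

print(repair("function transfer(address to, uint256 amount) public { balances[to] = amount; }"))
```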
arXiv Detail & Related papers (2025-03-03T01:55:20Z) - PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC [98.82146219495792]
In this paper, we propose a hierarchical agent framework named PC-Agent.
From the perception perspective, we devise an Active Perception Module (APM) to overcome the inadequate abilities of current MLLMs in perceiving screenshot content.
From the decision-making perspective, to handle complex user instructions and interdependent subtasks more effectively, we propose a hierarchical multi-agent collaboration architecture.
arXiv Detail & Related papers (2025-02-20T05:41:55Z) - MALT: Improving Reasoning with Multi-Agent LLM Training [66.9481561915524]
MALT (Multi-Agent LLM Training) is a novel post-training strategy that divides the reasoning process into generation, verification, and refinement steps.
On MATH, GSM8K, and CSQA, MALT surpasses the same baseline LLM with a relative improvement of 15.66%, 7.42%, and 9.40% respectively.
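The generation/verification/refinement split could be organized roughly as below. This is a hedged sketch with a stubbed, role-tagged model call; the actual MALT method additionally post-trains a separate model for each role on search-derived data.

```python
def call_llm(role: str, prompt: str) -> str:
    # Placeholder for role-specialized models; MALT trains a separate model per role.
    return f"[{role}] answer to: {prompt}"

def solve(question: str, rounds: int = 2) -> str:
    answer = call_llm("generator", question)
    for _ in range(rounds):
        critique = call_llm("verifier", f"Check this answer to '{question}': {answer}")
        answer = call_llm("refiner", f"Improve the answer using the critique: {critique}\n{answer}")
    return answer

print(solve("What is 17 * 24?"))
```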
arXiv Detail & Related papers (2024-12-02T19:30:36Z) - An Empirical Study on LLM-based Agents for Automated Bug Fixing [2.433168823911037]
Large language models (LLMs) and LLM-based Agents have been applied to fix bugs automatically.
We examine seven proprietary and open-source systems on the SWE-bench Lite benchmark for automated bug fixing.
arXiv Detail & Related papers (2024-11-15T14:19:15Z) - Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks [39.084974125007165]
We introduce Magentic-One, a high-performing open-source agentic system for solving complex tasks.
Magentic-One uses a multi-agent architecture where a lead agent, the Orchestrator, tracks progress and re-plans to recover from errors.
We show that Magentic-One achieves statistically competitive performance to the state-of-the-art on three diverse and challenging agentic benchmarks.
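A lead agent that tracks progress and re-plans on error could look roughly like the loop below; the worker set, plan format, and retry policy are placeholders, not Magentic-One's actual agents or ledgers.

```python
import random

random.seed(0)

def worker(step: str) -> bool:
    """Stand-in for a specialized agent (web surfer, coder, ...); may fail."""
    return random.random() > 0.3

def orchestrate(task: str, max_replans: int = 3) -> bool:
    plan = [f"{task}: step {i}" for i in range(3)]            # initial plan (stubbed planner)
    for _ in range(max_replans):
        progress = [worker(step) for step in plan]             # ledger of which steps succeeded
        if all(progress):
            return True
        plan = [s for s, ok in zip(plan, progress) if not ok]  # re-plan: retry only failed steps
    return False

print(orchestrate("book a flight"))
```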
arXiv Detail & Related papers (2024-11-07T06:36:19Z) - REDO: Execution-Free Runtime Error Detection for COding Agents [3.9903610503301072]
Execution-free Runtime Error Detection for COding Agents (REDO) is a method that integrates static analysis tools with LLMs to detect runtime errors without executing the code.
We demonstrate that REDO outperforms current state-of-the-art methods by achieving an 11.0% higher accuracy and a 9.1% higher weighted F1 score.
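Purely as an illustration of pairing a cheap static pass with a model-based fallback (the actual REDO pipeline may differ), one could run a lightweight static check first and only defer to an LLM judgment when it finds nothing; the heuristic and the stubbed verdict below are assumptions, not the paper's implementation.

```python
import ast
import builtins

def static_scan(source: str) -> list[str]:
    """Cheap static pass: syntax errors plus names used but never assigned (crude heuristic)."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"SyntaxError: {exc.msg}"]
    assigned = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)}
    used = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)}
    return [f"Possible NameError: '{name}' is never assigned"
            for name in sorted(used - assigned - set(dir(builtins)))]

def predict_runtime_error(source: str) -> str:
    issues = static_scan(source)
    if issues:
        return issues[0]
    return "<LLM-predicted verdict>"  # stubbed model call for anything the static pass misses

print(predict_runtime_error("total = prices + 1"))  # flags 'prices' as a likely NameError
```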
arXiv Detail & Related papers (2024-10-10T18:06:29Z) - From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging [5.910272203315325]
We introduce Multi-Granularity Debugger (MG Debugger), a hierarchical code debugger by isolating, identifying, and resolving bugs at various levels of granularity.
MG Debugger decomposes problematic code into a hierarchical tree structure of subfunctions, with each level representing a particular granularity of error.
It achieves an 18.9% improvement in accuracy over seed generations in HumanEval and a 97.6% repair success rate in HumanEvalFix.
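The hierarchical idea, splitting a problematic function into subfunctions and repairing bottom-up, could be sketched as follows; the decomposition, tests, and repair call are all placeholders rather than MG Debugger's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One granularity level: a (sub)function plus its finer-grained children."""
    name: str
    code: str
    children: list["Node"] = field(default_factory=list)

def test_passes(code: str) -> bool:
    return "bug" not in code            # stand-in for running that node's generated unit tests

def repair(code: str) -> str:
    return code.replace("bug", "fix")   # stand-in for an LLM repair call

def debug_bottom_up(node: Node) -> Node:
    """Repair leaf subfunctions first, then re-check each parent with fixed children in place."""
    for child in node.children:
        debug_bottom_up(child)
    if not test_passes(node.code):
        node.code = repair(node.code)
    return node

tree = Node("solve", "def solve(): return helper()  # bug",
            [Node("helper", "def helper(): return 'bug'")])
print(debug_bottom_up(tree).code)
```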
arXiv Detail & Related papers (2024-10-02T03:57:21Z) - Textualized Agent-Style Reasoning for Complex Tasks by Multiple Round LLM Generation [49.27250832754313]
We present AgentCOT, an LLM-based autonomous agent framework.
At each step, AgentCOT selects an action and executes it to yield an intermediate result with supporting evidence.
We introduce two new strategies to enhance the performance of AgentCOT.
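The described loop, picking an action, executing it, and keeping the intermediate result as supporting evidence for later steps, is roughly the standard agent pattern sketched below; the action names and selection rule are illustrative, not AgentCOT's.

```python
def select_action(question: str, evidence: list[str]) -> str:
    # Stand-in for an LLM choosing the next action from the evidence so far.
    return "finish" if evidence else "lookup"

def execute(action: str, question: str) -> str:
    # Stand-in for tools / sub-generations; returns an intermediate result.
    return {"lookup": "Paris is the capital of France."}.get(action, "done")

def agent(question: str, max_steps: int = 5) -> list[str]:
    evidence: list[str] = []
    for _ in range(max_steps):
        action = select_action(question, evidence)
        result = execute(action, question)
        evidence.append(f"{action}: {result}")  # every step is recorded with its supporting result
        if action == "finish":
            break
    return evidence

print(agent("What is the capital of France?"))
```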
arXiv Detail & Related papers (2024-09-19T02:20:06Z) - COAST: Enhancing the Code Debugging Ability of LLMs through Communicative Agent Based Data Synthesis [29.667170755786508]
We introduce EVAL, a benchmark for evaluating the code debugging abilities of Large Language Models.
We propose the COmmunicative Agent-based data SynThesis framework, which employs a multi-agent system to generate high-quality training data.
Results demonstrate that COAST-generated data outperform human-curated and GPT-4-generated data.
arXiv Detail & Related papers (2024-08-09T11:35:44Z) - On the Resilience of Multi-Agent Systems with Malicious Agents [58.79302663733702]
This paper investigates the resilience of multi-agent system structures under malicious agents.
We devise two methods, AutoTransform and AutoInject, to transform any agent into a malicious one.
We show that two defense methods, introducing a mechanism for each agent to challenge others' outputs, or an additional agent to review and correct messages, can enhance system resilience.
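A minimal version of the first defense, letting each agent challenge a proposal before it is accepted, might look like the toy setup below; the agents, voting rule, and fallback are placeholders, not the paper's experimental configuration.

```python
def agent_answer(agent_id: int, task: str) -> str:
    # Agent 2 plays the malicious role in this toy setup and emits a wrong answer.
    return "wrong" if agent_id == 2 else "right"

def challenge(agent_id: int, task: str, proposal: str) -> bool:
    """Each agent checks the proposal against its own answer and objects on mismatch."""
    return agent_answer(agent_id, task) != proposal

def resilient_decision(task: str, n_agents: int = 3) -> str:
    proposal = agent_answer(1, task)
    objections = sum(challenge(i, task, proposal) for i in range(2, n_agents + 1))
    if objections > (n_agents - 1) / 2:   # majority of reviewers disagree -> regenerate
        proposal = agent_answer(3, task)  # fall back to another (honest) agent
    return proposal

print(resilient_decision("2 + 2"))  # the lone malicious objection is outvoted
```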
arXiv Detail & Related papers (2024-08-02T03:25:20Z) - AgentFL: Scaling LLM-based Fault Localization to Project-Level Context [11.147750199280813]
This paper presents AgentFL, a multi-agent system based on ChatGPT for automated fault localization.
By simulating the behavior of a human developer, AgentFL models the FL task as a three-step process, which involves comprehension, navigation, and confirmation.
The evaluation on the widely used Defects4J-V1.2.0 benchmark shows that AgentFL can localize 157 out of 395 bugs within Top-1.
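The comprehension, navigation, and confirmation steps could be wired up as below; the prompts, project representation, and stubbed `call_llm` wrapper are placeholders, not AgentFL's actual implementation.

```python
def call_llm(prompt: str) -> str:
    return f"<response to: {prompt[:40]}...>"  # hypothetical model wrapper

def comprehend(failing_test: str, stack_trace: str) -> str:
    """Step 1 (comprehension): summarize what behaviour the failure indicates."""
    return call_llm(f"Summarize the failure:\n{failing_test}\n{stack_trace}")

def navigate(summary: str, project_files: list[str]) -> list[str]:
    """Step 2 (navigation): narrow the project down to a few candidate files."""
    call_llm(f"Given: {summary}\nRank these files: {project_files}")
    return project_files[:2]                   # placeholder ranking

def confirm(summary: str, candidates: list[str]) -> str:
    """Step 3 (confirmation): pick the single most suspicious location."""
    call_llm(f"Which of {candidates} most likely causes: {summary}")
    return candidates[0]                       # placeholder choice

files = ["Account.java", "Ledger.java", "Report.java"]
summary = comprehend("testWithdraw", "AssertionError at Account.withdraw:42")
print(confirm(summary, navigate(summary, files)))
```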
arXiv Detail & Related papers (2024-03-25T01:58:19Z) - DebugBench: Evaluating Debugging Capability of Large Language Models [80.73121177868357]
DebugBench is a benchmark for evaluating the debugging capability of Large Language Models (LLMs).
It covers four major bug categories and 18 minor types in C++, Java, and Python.
We evaluate two commercial and four open-source models in a zero-shot scenario.
arXiv Detail & Related papers (2024-01-09T15:46:38Z) - MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration [102.41118020705876]
Large Language Models (LLMs) have marked a significant advancement in the field of natural language processing.
As their applications extend into multi-agent environments, a need has arisen for a comprehensive evaluation framework.
This work introduces a novel benchmarking framework specifically tailored to assess LLMs within multi-agent settings.
arXiv Detail & Related papers (2023-11-14T21:46:27Z) - Multi-Agent Reinforcement Learning for Microprocessor Design Space Exploration [71.95914457415624]
Microprocessor architects are increasingly resorting to domain-specific customization in the quest for high-performance and energy-efficiency.
We propose an alternative formulation that leverages Multi-Agent RL (MARL) to tackle this problem.
Our evaluation shows that the MARL formulation consistently outperforms single-agent RL baselines.
arXiv Detail & Related papers (2022-11-29T17:10:24Z)