Related papers: MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution

MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution

URL: http://arxiv.org/abs/2403.17927v2
Date: Thu, 27 Jun 2024 12:40:12 GMT
Title: MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution
Authors: Wei Tao, Yucheng Zhou, Yanlin Wang, Wenqiang Zhang, Hongyu Zhang, Yu Cheng,
Abstract summary: Large Language Models (LLMs) have shown promise in code generation but face difficulties in resolving GitHub issues. We propose a novel Multi-Agent framework for GitHub Issue reSolution, MAGIS, consisting of four agents customized for software evolution.
Score: 47.850418420195304
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In software development, resolving the emergent issues within GitHub repositories is a complex challenge that involves not only the incorporation of new code but also the maintenance of existing code. Large Language Models (LLMs) have shown promise in code generation but face difficulties in resolving Github issues, particularly at the repository level. To overcome this challenge, we empirically study the reason why LLMs fail to resolve GitHub issues and analyze the major factors. Motivated by the empirical findings, we propose a novel LLM-based Multi-Agent framework for GitHub Issue reSolution, MAGIS, consisting of four agents customized for software evolution: Manager, Repository Custodian, Developer, and Quality Assurance Engineer agents. This framework leverages the collaboration of various agents in the planning and coding process to unlock the potential of LLMs to resolve GitHub issues. In experiments, we employ the SWE-bench benchmark to compare MAGIS with popular LLMs, including GPT-3.5, GPT-4, and Claude-2. MAGIS can resolve 13.94% GitHub issues, significantly outperforming the baselines. Specifically, MAGIS achieves an eight-fold increase in resolved ratio over the direct application of GPT-4, the advanced LLM.

Related papers

MutaGReP: Execution-Free Repository-Grounded Plan Search for Code-Use [92.28400093066212]
MutaGReP is an approach to search for plans that decompose a user request into natural language steps grounded in a large code repository. Our plans use less than 5% of the 128K context window for GPT-4o but rival the coding performance of GPT-4o with a context window filled with the repo.
arXiv Detail & Related papers (2025-02-21T18:58:17Z)
SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution [56.9361004704428]
Large Language Models (LLMs) have demonstrated remarkable proficiency across a variety of complex tasks. SWE-Fixer is a novel open-source framework designed to effectively and efficiently resolve GitHub issues. We assess our approach on the SWE-Bench Lite and Verified benchmarks, achieving state-of-the-art performance among open-source models.
arXiv Detail & Related papers (2025-01-09T07:54:24Z)
What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated as compared to canonical solutions. We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types. We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
VersiCode: Towards Version-controllable Code Generation [58.82709231906735]
Large Language Models (LLMs) have made tremendous strides in code generation, but existing research fails to account for the dynamic nature of software development. We propose two novel tasks aimed at bridging this gap: version-specific code completion (VSCC) and version-aware code migration (VACM) We conduct an extensive evaluation on VersiCode, which reveals that version-controllable code generation is indeed a significant challenge.
arXiv Detail & Related papers (2024-06-11T16:15:06Z)
CodeR: Issue Resolving with Multi-Agent and Task Graphs [21.499576889342343]
GitHub issue resolving has attracted significant attention from academia and industry. We propose CodeR, which adopts a multi-agent framework and pre-defined task graphs to Repair & Resolve reported bugs. On SWE-bench lite, CodeR is able to solve 28.33% of issues, when submitting only once for each issue.
arXiv Detail & Related papers (2024-06-03T13:13:35Z)
On the effectiveness of Large Language Models for GitHub Workflows [9.82254417875841]
Large Language Models (LLMs) have demonstrated their effectiveness in various software development tasks. We perform the first comprehensive study to understand the effectiveness of LLMs on five workflow-related tasks with different levels of prompts. Our evaluation of three state-of-art LLMs and their fine-tuned variants revealed various interesting findings on the current effectiveness and drawbacks of LLMs.
arXiv Detail & Related papers (2024-03-19T05:14:12Z)
GitAgent: Facilitating Autonomous Agent with GitHub by Tool Extension [81.44231422624055]
A growing area of research focuses on Large Language Models (LLMs) equipped with external tools capable of performing diverse tasks. In this paper, we introduce GitAgent, an agent capable of achieving the autonomous tool extension from GitHub.
arXiv Detail & Related papers (2023-12-28T15:47:30Z)
ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks. To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z)
The GitHub Recent Bugs Dataset for Evaluating LLM-based Debugging Applications [20.339673903885483]
Large Language Models (LLMs) have demonstrated strong natural language processing and code synthesis capabilities. Details about LLM training data are often not made public, which has caused concern as to whether existing bug benchmarks are included. We present the GitHub Recent Bugs dataset, which includes 76 real-world Java bugs that were gathered after the OpenAI data cut-off point.
arXiv Detail & Related papers (2023-10-20T02:37:44Z)
SWE-bench: Can Language Models Resolve Real-World GitHub Issues? [80.52201658231895]
SWE-bench is an evaluation framework consisting of $2,294$ software engineering problems drawn from real GitHub issues and corresponding pull requests across $12$ popular Python repositories. We show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues.
arXiv Detail & Related papers (2023-10-10T16:47:29Z)

This list is automatically generated from the titles and abstracts of the papers in this site.