MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution
- URL: http://arxiv.org/abs/2403.17927v2
- Date: Thu, 27 Jun 2024 12:40:12 GMT
- Title: MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution
- Authors: Wei Tao, Yucheng Zhou, Yanlin Wang, Wenqiang Zhang, Hongyu Zhang, Yu Cheng
- Abstract summary: Large Language Models (LLMs) have shown promise in code generation but face difficulties in resolving GitHub issues.
We propose a novel Multi-Agent framework for GitHub Issue reSolution, MAGIS, consisting of four agents customized for software evolution.
- Score: 47.850418420195304
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In software development, resolving the emergent issues within GitHub repositories is a complex challenge that involves not only the incorporation of new code but also the maintenance of existing code. Large Language Models (LLMs) have shown promise in code generation but face difficulties in resolving GitHub issues, particularly at the repository level. To overcome this challenge, we empirically study the reason why LLMs fail to resolve GitHub issues and analyze the major factors. Motivated by the empirical findings, we propose a novel LLM-based Multi-Agent framework for GitHub Issue reSolution, MAGIS, consisting of four agents customized for software evolution: Manager, Repository Custodian, Developer, and Quality Assurance Engineer agents. This framework leverages the collaboration of various agents in the planning and coding process to unlock the potential of LLMs to resolve GitHub issues. In experiments, we employ the SWE-bench benchmark to compare MAGIS with popular LLMs, including GPT-3.5, GPT-4, and Claude-2. MAGIS can resolve 13.94% of GitHub issues, significantly outperforming the baselines. Specifically, MAGIS achieves an eight-fold increase in resolved ratio over the direct application of GPT-4, the advanced LLM.
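The abstract names four agent roles but does not say how they hand work to one another. The following is a minimal sketch of one plausible arrangement, not the authors' implementation. The function names, the `Task` type, the word-overlap file ranking, the `llm` and `review` callables, and the retry loop are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical interfaces: any prompt-in/text-out model and any pass/fail check.
LLM = Callable[[str], str]
Review = Callable[[str], bool]


@dataclass
class Task:
    path: str         # file the Developer is asked to edit
    instruction: str  # plain-language description of the change


def _tokens(text: str) -> set[str]:
    """Lowercased identifier-like words in a piece of text."""
    return set(re.findall(r"[a-z_]+", text.lower()))


def repository_custodian(issue: str, repo: dict[str, str], top_k: int = 2) -> list[str]:
    """Pick candidate files by word overlap with the issue (a stand-in for real retrieval)."""
    issue_words = _tokens(issue)
    scored = [(len(issue_words & _tokens(text)), path) for path, text in repo.items()]
    ranked = sorted(scored, key=lambda item: (-item[0], item[1]))
    return [path for score, path in ranked[:top_k] if score > 0]


def manager(issue: str, paths: list[str]) -> list[Task]:
    """Turn the issue into one task per candidate file."""
    return [Task(path, f"Update {path} to resolve: {issue}") for path in paths]


def developer(task: Task, repo: dict[str, str], llm: LLM) -> str:
    """Ask the model for a new version of the file named in the task."""
    prompt = f"{task.instruction}\n\nCurrent content:\n{repo[task.path]}"
    return llm(prompt)


def qa_engineer(old: str, new: str, review: Review) -> bool:
    """Accept a candidate only if it changes the file and passes the review check."""
    return new != old and review(new)


def resolve(issue: str, repo: dict[str, str], llm: LLM, review: Review,
            max_rounds: int = 3) -> dict[str, str]:
    """Run the four roles in sequence and return a patched copy of the repository."""
    patched = dict(repo)  # leave the caller's repository untouched
    for task in manager(issue, repository_custodian(issue, repo)):
        for _ in range(max_rounds):
            candidate = developer(task, patched, llm)
            if qa_engineer(patched[task.path], candidate, review):
                patched[task.path] = candidate
                break
    return patched
```

Here `repo` maps file paths to file contents, and `resolve` returns a new mapping in which only the files accepted by the review step differ. Passing the model and the review check in as callables keeps the sketch runnable with simple stubs.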
Related papers
- LLM-based Content Classification Approach for GitHub Repositories by the README Files [2.212685917364911]
Large Language Models (LLMs) have shown great performance in many text-based tasks. In this study, an approach is developed to fine-tune LLMs for automatically classifying different sections of GitHub files. This approach outperforms current state-of-the-art methods and has achieved an overall F1 score of 0.98.
arXiv Detail & Related papers (2025-07-29T15:09:38Z) - Open-Source LLMs Collaboration Beats Closed-Source LLMs: A Scalable Multi-Agent System [51.04535721779685]
This paper aims to demonstrate the potential and strengths of open-source collectives. We propose SMACS, a scalable multi-agent collaboration system (MACS) framework with high performance. Experiments on eight mainstream benchmarks validate the effectiveness of our SMACS.
arXiv Detail & Related papers (2025-07-14T16:17:11Z) - SwingArena: Competitive Programming Arena for Long-context GitHub Issue Solving [90.32201622392137]
We present SwingArena, a competitive evaluation framework for Large Language Models (LLMs). Unlike traditional static benchmarks, SwingArena models the collaborative process of software by pairing LLMs as iterations, who generate patches, and reviewers, who create test cases and verify the patches through continuous integration (CI) pipelines.
arXiv Detail & Related papers (2025-05-29T18:28:02Z) - RepoMaster: Autonomous Exploration and Understanding of GitHub Repositories for Complex Task Solving [9.477917878478188]
RepoMaster is an autonomous agent framework designed to explore and reuse GitHub repositories for solving complex tasks. RepoMaster constructs function-call graphs, module-dependency graphs, and hierarchical code trees to identify essential components. On our newly released GitTaskBench, RepoMaster lifts the task-pass rate from 24.1% to 62.9% while reducing token usage by 95%.
arXiv Detail & Related papers (2025-05-27T08:35:05Z) - SweRank: Software Issue Localization with Code Ranking [109.3289316191729]
SweRank is an efficient retrieve-and-rerank framework for software issue localization. We construct SweLoc, a large-scale dataset curated from public GitHub repositories. We show that SweRank achieves state-of-the-art performance, outperforming both prior ranking models and costly agent-based systems.
arXiv Detail & Related papers (2025-05-07T19:44:09Z) - OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution [34.087547492498224]
The GitHub issue resolution task aims to resolve issues reported in repositories automatically. With advances in large language models (LLMs), this task has gained increasing attention. We propose OmniGIRL, a GitHub Issue ResoLution benchmark that is multilingual, multimodal, and multi-domain.
arXiv Detail & Related papers (2025-05-07T17:51:10Z) - MutaGReP: Execution-Free Repository-Grounded Plan Search for Code-Use [92.28400093066212]
MutaGReP is an approach to search for plans that decompose a user request into natural language steps grounded in a large code repository.
Our plans use less than 5% of the 128K context window for GPT-4o but rival the coding performance of GPT-4o with a context window filled with the repo.
arXiv Detail & Related papers (2025-02-21T18:58:17Z) - SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution [56.9361004704428]
Large Language Models (LLMs) have demonstrated remarkable proficiency across a variety of complex tasks.
SWE-Fixer is a novel open-source framework designed to effectively and efficiently resolve GitHub issues.
We assess our approach on the SWE-Bench Lite and Verified benchmarks, achieving state-of-the-art performance among open-source models.
arXiv Detail & Related papers (2025-01-09T07:54:24Z) - What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated as compared to canonical solutions.
We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z) - VersiCode: Towards Version-controllable Code Generation [58.82709231906735]
Large Language Models (LLMs) have made tremendous strides in code generation, but existing research fails to account for the dynamic nature of software development.
We propose two novel tasks aimed at bridging this gap: version-specific code completion (VSCC) and version-aware code migration (VACM).
We conduct an extensive evaluation on VersiCode, which reveals that version-controllable code generation is indeed a significant challenge.
arXiv Detail & Related papers (2024-06-11T16:15:06Z) - CodeR: Issue Resolving with Multi-Agent and Task Graphs [21.499576889342343]
GitHub issue resolving has attracted significant attention from academia and industry.
We propose CodeR, which adopts a multi-agent framework and pre-defined task graphs to Repair & Resolve reported bugs.
On SWE-bench lite, CodeR is able to solve 28.33% of issues, when submitting only once for each issue.
arXiv Detail & Related papers (2024-06-03T13:13:35Z) - On the effectiveness of Large Language Models for GitHub Workflows [9.82254417875841]
Large Language Models (LLMs) have demonstrated their effectiveness in various software development tasks.
We perform the first comprehensive study to understand the effectiveness of LLMs on five workflow-related tasks with different levels of prompts.
Our evaluation of three state-of-art LLMs and their fine-tuned variants revealed various interesting findings on the current effectiveness and drawbacks of LLMs.
arXiv Detail & Related papers (2024-03-19T05:14:12Z) - GitAgent: Facilitating Autonomous Agent with GitHub by Tool Extension [81.44231422624055]
A growing area of research focuses on Large Language Models (LLMs) equipped with external tools capable of performing diverse tasks.
In this paper, we introduce GitAgent, an agent capable of achieving the autonomous tool extension from GitHub.
arXiv Detail & Related papers (2023-12-28T15:47:30Z) - ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z) - The GitHub Recent Bugs Dataset for Evaluating LLM-based Debugging Applications [20.339673903885483]
Large Language Models (LLMs) have demonstrated strong natural language processing and code synthesis capabilities.
Details about LLM training data are often not made public, which has caused concern as to whether existing bug benchmarks are included.
We present the GitHub Recent Bugs dataset, which includes 76 real-world Java bugs that were gathered after the OpenAI data cut-off point.
arXiv Detail & Related papers (2023-10-20T02:37:44Z) - SWE-bench: Can Language Models Resolve Real-World GitHub Issues? [80.52201658231895]
SWE-bench is an evaluation framework consisting of 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories.
We show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues.
arXiv Detail & Related papers (2023-10-10T16:47:29Z)