When Large Language Models Confront Repository-Level Automatic Program
Repair: How Well They Done?
- URL: http://arxiv.org/abs/2403.00448v1
- Date: Fri, 1 Mar 2024 11:07:41 GMT
- Title: When Large Language Models Confront Repository-Level Automatic Program
Repair: How Well They Done?
- Authors: Yuxiao Chen, Jingzheng Wu, Xiang Ling, Changjiang Li, Zhiqing Rui,
Tianyue Luo, Yanjun Wu
- Abstract summary: We introduce RepoBugs, a new benchmark comprising 124 typical repository-level bugs from open-source repositories.
Preliminary experiments using GPT-3.5, given only the function where the error is located, reveal that the repair rate on RepoBugs is only 22.58%.
We propose a simple and universal repository-level context extraction method (RLCE) designed to provide more precise context for repository-level code repair tasks.
- Score: 13.693311241492827
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, large language models (LLMs) have demonstrated substantial
potential in addressing automatic program repair (APR) tasks. However, the
current evaluation of these models for APR tasks focuses solely on the limited
context of the single function or file where the bug is located, overlooking
the valuable information in the repository-level context. This paper
investigates the performance of popular LLMs in handling repository-level
repair tasks. We introduce RepoBugs, a new benchmark comprising 124 typical
repository-level bugs from open-source repositories. Preliminary experiments
using GPT-3.5, given only the function where the error is located, reveal that
the repair rate on RepoBugs is only 22.58%, diverging significantly from
GPT-3.5's performance on function-level bugs in related studies. This
underscores the importance of providing repository-level context when
addressing bugs at this level. However, the repository-level context offered by
the preliminary method often proves redundant and imprecise, and it easily exceeds
the prompt-length limit of LLMs. To solve this problem, we propose a simple and
universal repository-level context extraction method (RLCE) designed to provide
more precise context for repository-level code repair tasks. Evaluations of
three mainstream LLMs show that RLCE significantly enhances the ability to
repair repository-level bugs. The improvement reaches a maximum of 160%
compared to the preliminary method. Additionally, we conduct a comprehensive
analysis of the effectiveness and limitations of RLCE, along with the capacity
of LLMs to address repository-level bugs, offering valuable insights for future
research.
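The abstract gives only the high-level idea of RLCE; no implementation appears on this page. As a purely illustrative sketch, assuming a Python target repository, a repository-level context extraction step might look like the following. The definition index, the name-based relevance filter, and the PROMPT_BUDGET constant are hypothetical details chosen for this sketch, not the authors' method:

```python
# Illustrative sketch only: the indexing scheme, name-based relevance
# heuristic, and character budget below are assumptions, not RLCE itself.
import ast
from pathlib import Path

PROMPT_BUDGET = 8_000  # stand-in for an LLM prompt-length limit, in characters

def index_definitions(repo_root: str) -> dict[str, str]:
    """Map each top-level function/class name in the repo to its source text."""
    index: dict[str, str] = {}
    for path in Path(repo_root).rglob("*.py"):
        try:
            source = path.read_text(encoding="utf-8")
            tree = ast.parse(source)
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that do not parse cleanly
        for node in tree.body:
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef,
                                 ast.ClassDef)):
                index[node.name] = ast.get_source_segment(source, node) or ""
    return index

def extract_context(buggy_function_src: str, index: dict[str, str]) -> str:
    """Select only the definitions the buggy function references, keeping
    the prompt precise and within the length budget."""
    referenced = {
        node.id
        for node in ast.walk(ast.parse(buggy_function_src))
        if isinstance(node, ast.Name)
    }
    pieces: list[str] = []
    used = 0
    for name in sorted(referenced & index.keys()):
        snippet = index[name]
        if used + len(snippet) > PROMPT_BUDGET:
            break  # drop remaining context rather than overflow the prompt
        pieces.append(snippet)
        used += len(snippet)
    return "\n\n".join(pieces)
```

Pruning the context to definitions the buggy function actually references targets both failure modes the abstract names: redundant, imprecise context and prompts that overflow the model's length limit.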
Related papers
- Where's the Bug? Attention Probing for Scalable Fault Localization [18.699014321422023]
We present Bug Attention Probe (BAP), a method which learns state-of-the-art fault localization without any direct localization labels.
BAP is significantly more efficient than prompting, outperforming large open-weight models at a small fraction of the computational cost.
arXiv Detail & Related papers (2025-02-19T18:59:32Z)
- RepoAudit: An Autonomous LLM-Agent for Repository-Level Code Auditing [8.846583362353169]
This work introduces an autonomous LLM-agent, RepoAudit, to enable precise and efficient repository-level code auditing.
RepoAudit explores the code repository on demand, analyzing data-flow facts along different feasible program paths in individual functions.
Our experiment shows that RepoAudit successfully finds 38 true bugs in 15 real-world systems, consuming 0.44 hours and $2.54 per project on average.
arXiv Detail & Related papers (2025-01-30T05:56:30Z)
- A Real-World Benchmark for Evaluating Fine-Grained Issue Solving Capabilities of Large Language Models [11.087034068992653]
FAUN-Eval is a benchmark specifically designed to evaluate the Fine-grAined issUe solviNg capabilities of LLMs.
It is constructed using a dataset curated from 30 well-known GitHub repositories.
We evaluate ten LLMs with FAUN-Eval, including four closed-source and six open-source models.
arXiv Detail & Related papers (2024-11-27T03:25:44Z)
- What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated than canonical solutions.
We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
- On the Impacts of Contexts on Repository-Level Code Generation [5.641402231731082]
We present RepoExec, a novel benchmark designed to evaluate repository-level code generation.
We focus on three key aspects: executability, functional correctness through comprehensive test case generation, and accurate utilization of cross-file contexts.
arXiv Detail & Related papers (2024-06-17T10:45:22Z)
- How to Understand Whole Software Repository? [64.19431011897515]
An excellent understanding of the whole repository will be the critical path to Automatic Software Engineering (ASE).
We develop a novel method named RepoUnderstander that guides agents to comprehensively understand whole repositories.
To better utilize the repository-level knowledge, we guide the agents to summarize, analyze, and plan.
arXiv Detail & Related papers (2024-06-03T15:20:06Z)
- Class-Level Code Generation from Natural Language Using Iterative, Tool-Enhanced Reasoning over Repository [4.767858874370881]
We introduce RepoClassBench, a benchmark designed to rigorously evaluate LLMs in generating class-level code within real-world repositories.
RepoClassBench includes "Natural Language to Class generation" tasks across Java, Python & C# from a selection of repositories.
We introduce Retrieve-Repotools-Reflect (RRR), a novel approach that equips LLMs with static analysis tools to iteratively navigate & reason about repository-level context.
arXiv Detail & Related papers (2024-04-22T03:52:54Z)
- How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z)
- See, Say, and Segment: Teaching LMMs to Overcome False Premises [67.36381001664635]
We propose a cascading and joint training approach for LMMs to solve this task.
Our resulting model can "see" by detecting whether objects are present in an image, "say" by telling the user if they are not, and finally "segment" by outputting the mask of the desired objects if they exist.
arXiv Detail & Related papers (2023-12-13T18:58:04Z)
- Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs [59.596335292426105]
This paper collects the first open-source dataset to evaluate safeguards in large language models.
We train several BERT-like classifiers to achieve results comparable with GPT-4 on automatic safety evaluation.
arXiv Detail & Related papers (2023-08-25T14:02:12Z)
- Self-Checker: Plug-and-Play Modules for Fact-Checking with Large Language Models [75.75038268227554]
Self-Checker is a framework comprising a set of plug-and-play modules that facilitate fact-checking.
This framework provides a fast and efficient way to construct fact-checking systems in low-resource environments.
arXiv Detail & Related papers (2023-05-24T01:46:07Z)