A Deep Dive into Large Language Models for Automated Bug Localization and Repair
- URL: http://arxiv.org/abs/2404.11595v3
- Date: Fri, 10 May 2024 16:36:52 GMT
- Title: A Deep Dive into Large Language Models for Automated Bug Localization and Repair
- Authors: Soneya Binta Hossain, Nan Jiang, Qiang Zhou, Xiaopeng Li, Wen-Hao Chiang, Yingjun Lyu, Hoan Nguyen, Omer Tripp,
- Abstract summary: Large language models (LLMs) have shown impressive effectiveness in various software engineering tasks, including automated program repair (APR)
In this study, we take a deep dive into automated bug fixing utilizing LLMs.
This methodological separation of bug localization and fixing using different LLMs enables effective integration of diverse contextual information.
Toggle achieves the new state-of-the-art (SOTA) performance on the CodeXGLUE code refinement benchmark.
- Score: 12.756202755547024
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have shown impressive effectiveness in various software engineering tasks, including automated program repair (APR). In this study, we take a deep dive into automated bug fixing utilizing LLMs. In contrast to many deep learning-based APR methods that assume known bug locations, rely on line-level localization tools, or address bug prediction and fixing in one step, our approach uniquely employs LLMs to predict bug location at the token level and subsequently utilizes them for bug fixing. This methodological separation of bug localization and fixing using different LLMs enables effective integration of diverse contextual information and improved incorporation of inductive biases. We introduce Toggle: Token-Granulated Bug Localization and Repair, a comprehensive program repair framework that integrates a bug localization model, an adjustment unit, and a bug-fixing model. Toggle takes a buggy function as input and generates a complete corrected function. We investigate various styles of prompting to the bug fixing model to identify the most effective prompts that better utilize the inductive bias and significantly outperform others. Toggle achieves the new state-of-the-art (SOTA) performance on the CodeXGLUE code refinement benchmark, and exhibits better and comparable performance on several other widely-used APR datasets, including Defects4J.
Related papers
- Supporting Cross-language Cross-project Bug Localization Using Pre-trained Language Models [2.5121668584771837]
Existing techniques often struggle with generalizability and deployment due to their reliance on application-specific data.
This paper proposes a novel pre-trained language model (PLM) based technique for bug localization that transcends project and language boundaries.
arXiv Detail & Related papers (2024-07-03T01:09:36Z) - LLMs-as-Instructors: Learning from Errors Toward Automating Model Improvement [93.38736019287224]
"LLMs-as-Instructors" framework autonomously enhances the training of smaller target models.
Inspired by the theory of "Learning from Errors", this framework employs an instructor LLM to meticulously analyze the specific errors within a target model.
Within this framework, we implement two strategies: "Learning from Error," which focuses solely on incorrect responses to tailor training data, and "Learning from Error by Contrast", which uses contrastive learning to analyze both correct and incorrect responses for a deeper understanding of errors.
arXiv Detail & Related papers (2024-06-29T17:16:04Z) - AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models [95.09157454599605]
Large Language Models (LLMs) are becoming increasingly powerful, but they still exhibit significant but subtle weaknesses.
Traditional benchmarking approaches cannot thoroughly pinpoint specific model deficiencies.
We introduce a unified framework, AutoDetect, to automatically expose weaknesses in LLMs across various tasks.
arXiv Detail & Related papers (2024-06-24T15:16:45Z) - An Empirical Evaluation of Pre-trained Large Language Models for Repairing Declarative Formal Specifications [5.395614997568524]
This paper presents a systematic investigation into the capacity of Large Language Models (LLMs) for repairing declarative specifications in Alloy.
We propose a novel repair pipeline that integrates a dual-agent LLM framework, comprising a Repair Agent and a Prompt Agent.
Our study reveals that LLMs, particularly GPT-4 variants, outperform existing techniques in terms of repair efficacy, albeit with a marginal increase in runtime and token usage.
arXiv Detail & Related papers (2024-04-17T03:46:38Z) - An Empirical Study of Automated Vulnerability Localization with Large Language Models [21.84971967029474]
Large Language Models (LLMs) have shown potential in various domains, yet their effectiveness in vulnerability localization remains underexplored.
Our investigation encompasses 10+ leading LLMs suitable for code analysis, including ChatGPT and various open-source models.
We explore the efficacy of these LLMs using 4 distinct paradigms: zero-shot learning, one-shot learning, discriminative fine-tuning, and generative fine-tuning.
arXiv Detail & Related papers (2024-03-30T08:42:10Z) - A Novel Approach for Automatic Program Repair using Round-Trip
Translation with Large Language Models [50.86686630756207]
Research shows that grammatical mistakes in a sentence can be corrected by translating it to another language and back.
Current generative models for Automatic Program Repair (APR) are pre-trained on source code and fine-tuned for repair.
This paper proposes bypassing the fine-tuning step and using Round-Trip Translation (RTT): translation of code from one programming language to another programming or natural language, and back.
arXiv Detail & Related papers (2024-01-15T22:36:31Z) - On Task Performance and Model Calibration with Supervised and
Self-Ensembled In-Context Learning [71.44986275228747]
In-context learning (ICL) has become an efficient approach propelled by the recent advancements in large language models (LLMs)
However, both paradigms are prone to suffer from the critical problem of overconfidence (i.e., miscalibration)
arXiv Detail & Related papers (2023-12-21T11:55:10Z) - STEAM: Simulating the InTeractive BEhavior of ProgrAMmers for Automatic
Bug Fixing [37.70518599085676]
We introduce a novel stage-wise framework named STEAM to simulate the collaborative nature of bug resolution.
We decompose the bug fixing task into four distinct stages: bug reporting, bug diagnosis, patch generation, and patch verification.
Our evaluation on the widely adopted bug-fixing benchmark demonstrates that STEAM has achieved a new state-of-the-art level of bug-fixing performance.
arXiv Detail & Related papers (2023-08-28T09:56:12Z) - Automatically Correcting Large Language Models: Surveying the landscape
of diverse self-correction strategies [104.32199881187607]
Large language models (LLMs) have demonstrated remarkable performance across a wide array of NLP tasks.
A promising approach to rectify these flaws is self-correction, where the LLM itself is prompted or guided to fix problems in its own output.
This paper presents a comprehensive review of this emerging class of techniques.
arXiv Detail & Related papers (2023-08-06T18:38:52Z) - BigIssue: A Realistic Bug Localization Benchmark [89.8240118116093]
BigIssue is a benchmark for realistic bug localization.
We provide a general benchmark with a diversity of real and synthetic Java bugs.
We hope to advance the state of the art in bug localization, in turn improving APR performance and increasing its applicability to the modern development cycle.
arXiv Detail & Related papers (2022-07-21T20:17:53Z) - Adversarial Patch Generation for Automated Program Repair [0.0]
NEVERMORE is a novel learning-based mechanism inspired by the adversarial nature of bugs and fixes.
NEVERMORE is built upon the Generative Adrial Networks architecture and trained on historical bug fixes to generate repairs that closely mimic human-produced fixes.
Our empirical evaluation on 500 real-world bugs demonstrates the effectiveness of NEVERMORE in bug-fixing, generating repairs that match human fixes for 21.2% of the examined bugs.
arXiv Detail & Related papers (2020-12-21T00:34:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.