The Fact Selection Problem in LLM-Based Program Repair
- URL: http://arxiv.org/abs/2404.05520v2
- Date: Tue, 9 Apr 2024 10:01:23 GMT
- Title: The Fact Selection Problem in LLM-Based Program Repair
- Authors: Nikhil Parasaram, Huijie Yan, Boyu Yang, Zineb Flahy, Abriele Qudsi, Damian Ziaber, Earl Barr, Sergey Mechtaev,
- Abstract summary: We show that each fact, ranging from simple syntactic details like code context to semantic information previously unexplored in the context of Python projects, is beneficial.
Importantly, we discovered that the effectiveness of program repair prompts is non-monotonic over the number of used facts.
We develop a basic statistical model, named Maniple, which selects facts specific to a given bug to include in the prompt.
- Score: 3.7005619077967133
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent research has shown that incorporating bug-related facts, such as stack traces and GitHub issues, into prompts enhances the bug-fixing capabilities of large language models (LLMs). Considering the ever-increasing context window of these models, a critical question arises: what and how many facts should be included in prompts to maximise the chance of correctly fixing bugs? To answer this question, we conducted a large-scale study, employing over 19K prompts featuring various combinations of seven diverse facts to rectify 314 bugs from open-source Python projects within the BugsInPy benchmark. Our findings revealed that each fact, ranging from simple syntactic details like code context to semantic information previously unexplored in the context of LLMs such as angelic values, is beneficial. Specifically, each fact aids in fixing some bugs that would remain unresolved or only be fixed with a low success rate without it. Importantly, we discovered that the effectiveness of program repair prompts is non-monotonic over the number of used facts; using too many facts leads to subpar outcomes. These insights led us to define the fact selection problem: determining the optimal set of facts for inclusion in a prompt to maximise LLM's performance on a given task instance. We found that there is no one-size-fits-all set of facts for bug repair. Therefore, we developed a basic statistical model, named Maniple, which selects facts specific to a given bug to include in the prompt. This model significantly surpasses the performance of the best generic fact set. To underscore the significance of the fact selection problem, we benchmarked Maniple against the state-of-the-art zero-shot, non-conversational LLM-based bug repair methods. On our testing dataset of 157 bugs, Maniple repairs 88 bugs, 17% above the best configuration.
Related papers
- BLAZE: Cross-Language and Cross-Project Bug Localization via Dynamic Chunking and Hard Example Learning [1.9854146581797698]
BLAZE is an approach that employs dynamic chunking and hard example learning.
It fine-tunes a GPT-based model using challenging bug cases to enhance cross-project and cross-language bug localization.
BLAZE achieves up to an increase of 120% in Top 1 accuracy, 144% in Mean Average Precision (MAP), and 100% in Mean Reciprocal Rank (MRR)
arXiv Detail & Related papers (2024-07-24T20:44:36Z) - LLAMAFUZZ: Large Language Model Enhanced Greybox Fuzzing [6.042114639413868]
Specialized fuzzers can handle complex structured data, but require additional efforts in grammar and suffer from low throughput.
In this paper, we explore the potential of utilizing the Large Language Model to enhance greybox fuzzing for structured data.
Our LLM-based fuzzer, LLAMAFUZZ, integrates the power of LLM to understand and mutate structured data to fuzzing.
arXiv Detail & Related papers (2024-06-11T20:48:28Z) - A Deep Dive into Large Language Models for Automated Bug Localization and Repair [12.756202755547024]
Large language models (LLMs) have shown impressive effectiveness in various software engineering tasks, including automated program repair (APR)
In this study, we take a deep dive into automated bug fixing utilizing LLMs.
This methodological separation of bug localization and fixing using different LLMs enables effective integration of diverse contextual information.
Toggle achieves the new state-of-the-art (SOTA) performance on the CodeXGLUE code refinement benchmark.
arXiv Detail & Related papers (2024-04-17T17:48:18Z) - DebugBench: Evaluating Debugging Capability of Large Language Models [80.73121177868357]
DebugBench is a benchmark for Large Language Models (LLMs)
It covers four major bug categories and 18 minor types in C++, Java, and Python.
We evaluate two commercial and four open-source models in a zero-shot scenario.
arXiv Detail & Related papers (2024-01-09T15:46:38Z) - The Earth is Flat? Unveiling Factual Errors in Large Language Models [89.94270049334479]
Large Language Models (LLMs) like ChatGPT are in various applications due to their extensive knowledge from pre-training and fine-tuning.
Despite this, they are prone to generating factual and commonsense errors, raising concerns in critical areas like healthcare, journalism, and education.
We introduce a novel, automatic testing framework, FactChecker, aimed at uncovering factual inaccuracies in LLMs.
arXiv Detail & Related papers (2024-01-01T14:02:27Z) - A Critical Review of Large Language Model on Software Engineering: An Example from ChatGPT and Automated Program Repair [19.123640635549524]
Large Language Models (LLMs) have been gaining increasing attention and demonstrated promising performance across a variety of software engineering tasks.
This paper reviews the bug-fixing capabilities of ChatGPT on a clean APR benchmark with different research objectives.
ChatGPT is able to fix 109 out of 151 buggy programs using the basic prompt within 35 independent rounds, outperforming state-of-the-art LLMs CodeT5 and PLBART by 27.5% and 62.4% prediction accuracy.
arXiv Detail & Related papers (2023-10-13T06:11:47Z) - Can Large Language Models Infer Causation from Correlation? [104.96351414570239]
We test the pure causal inference skills of large language models (LLMs)
We formulate a novel task Corr2Cause, which takes a set of correlational statements and determines the causal relationship between the variables.
We show that these models achieve almost close to random performance on the task.
arXiv Detail & Related papers (2023-06-09T12:09:15Z) - How Predictable Are Large Language Model Capabilities? A Case Study on
BIG-bench [52.11481619456093]
We study the performance prediction problem on experiment records from BIG-bench.
An $R2$ score greater than 95% indicates the presence of learnable patterns within the experiment records.
We find a subset as informative as BIG-bench Hard for evaluating new model families, while being $3times$ smaller.
arXiv Detail & Related papers (2023-05-24T09:35:34Z) - MQuAKE: Assessing Knowledge Editing in Language Models via Multi-Hop
Questions [80.69639629733484]
We present a benchmark, MQuAKE, comprising multi-hop questions that assess whether edited models correctly answer questions.
We propose a memory-based approach, MeLLo, which stores all edited facts externally while prompting the language model iteratively to generate answers consistent with the edited facts.
arXiv Detail & Related papers (2023-05-24T06:48:41Z) - Using Developer Discussions to Guide Fixing Bugs in Software [51.00904399653609]
We propose using bug report discussions, which are available before the task is performed and are also naturally occurring, avoiding the need for additional information from developers.
We demonstrate that various forms of natural language context derived from such discussions can aid bug-fixing, even leading to improved performance over using commit messages corresponding to the oracle bug-fixing commits.
arXiv Detail & Related papers (2022-11-11T16:37:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.