Evaluating Diverse Large Language Models for Automatic and General Bug
Reproduction
- URL: http://arxiv.org/abs/2311.04532v2
- Date: Thu, 9 Nov 2023 02:19:20 GMT
- Title: Evaluating Diverse Large Language Models for Automatic and General Bug
Reproduction
- Authors: Sungmin Kang, Juyeon Yoon, Nargiz Askarbekkyzy, Shin Yoo
- Abstract summary: Large language models (LLMs) have been demonstrated to be adept at natural language processing and code generation.
Our proposed technique LIBRO could successfully reproduce about one-third of all bugs in the widely used Defects4J benchmark.
- Score: 12.851941377433285
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Bug reproduction is a critical developer activity that is also challenging to
automate, as bug reports are often in natural language and thus can be
difficult to transform to test cases consistently. As a result, existing
techniques mostly focused on crash bugs, which are easier to automatically
detect and verify. In this work, we overcome this limitation by using large
language models (LLMs), which have been demonstrated to be adept at natural
language processing and code generation. By prompting LLMs to generate
bug-reproducing tests, and via a post-processing pipeline to automatically
identify promising generated tests, our proposed technique LIBRO could
successfully reproduce about one-third of all bugs in the widely used Defects4J
benchmark. Furthermore, our extensive evaluation on 15 LLMs, including 11
open-source LLMs, suggests that open-source LLMs also demonstrate substantial
potential, with the StarCoder LLM achieving 70% of the reproduction performance
of the closed-source OpenAI LLM code-davinci-002 on the large Defects4J
benchmark, and 90% of performance on a held-out bug dataset likely not part of
any LLM's training data. In addition, our experiments on LLMs of different
sizes show that bug reproduction using LIBRO improves as LLM size increases,
providing information as to which LLMs can be used with the LIBRO pipeline.
Related papers
- SpecTool: A Benchmark for Characterizing Errors in Tool-Use LLMs [77.79172008184415]
SpecTool is a new benchmark to identify error patterns in LLM output on tool-use tasks.
We show that even the most prominent LLMs exhibit these error patterns in their outputs.
Researchers can use the analysis and insights from SPECTOOL to guide their error mitigation strategies.
arXiv Detail & Related papers (2024-11-20T18:56:22Z) - ROCODE: Integrating Backtracking Mechanism and Program Analysis in Large Language Models for Code Generation [31.363781211927947]
Large language models (LLMs) have achieved impressive performance in code generation.
LLMs are susceptible to error accumulation during code generation.
We propose ROCODE, which integrates the backtracking mechanism and program analysis into LLMs for code generation.
arXiv Detail & Related papers (2024-11-11T16:39:13Z) - Improving the Ability of Pre-trained Language Model by Imparting Large Language Model's Experience [4.814313782484443]
Large Language Models (LLMs) and pre-trained Language Models (LMs) have achieved impressive success on many software engineering tasks.
We use LLMs to generate domain-specific data, thereby improving the performance of pre-trained LMs on the target tasks.
arXiv Detail & Related papers (2024-08-16T06:37:59Z) - Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses.
Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives.
The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z) - What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated as compared to canonical solutions.
We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z) - LLatrieval: LLM-Verified Retrieval for Verifiable Generation [67.93134176912477]
Verifiable generation aims to let the large language model (LLM) generate text with supporting documents.
We propose LLatrieval (Large Language Model Verified Retrieval), where the LLM updates the retrieval result until it verifies that the retrieved documents can sufficiently support answering the question.
Experiments show that LLatrieval significantly outperforms extensive baselines and achieves state-of-the-art results.
arXiv Detail & Related papers (2023-11-14T01:38:02Z) - The GitHub Recent Bugs Dataset for Evaluating LLM-based Debugging
Applications [20.339673903885483]
Large Language Models (LLMs) have demonstrated strong natural language processing and code synthesis capabilities.
Details about LLM training data are often not made public, which has caused concern as to whether existing bug benchmarks are included.
We present the GitHub Recent Bugs dataset, which includes 76 real-world Java bugs that were gathered after the OpenAI data cut-off point.
arXiv Detail & Related papers (2023-10-20T02:37:44Z) - Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks.
How do we evaluate the capabilities of LLMs to consistently produce factually correct answers?
We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z) - Towards Generating Functionally Correct Code Edits from Natural Language
Issue Descriptions [11.327913840111378]
We introduce Defects4J-NL2Fix, a dataset of 283 Java programs from the popular Defects4J dataset augmented with high-level descriptions of bug fixes.
We empirically evaluate the performance of several state-of-the-art LLMs for the this task.
arXiv Detail & Related papers (2023-04-07T18:58:33Z) - Check Your Facts and Try Again: Improving Large Language Models with
External Knowledge and Automated Feedback [127.75419038610455]
Large language models (LLMs) are able to generate human-like, fluent responses for many downstream tasks.
This paper proposes a LLM-Augmenter system, which augments a black-box LLM with a set of plug-and-play modules.
arXiv Detail & Related papers (2023-02-24T18:48:43Z) - Large Language Models are Few-shot Testers: Exploring LLM-based General
Bug Reproduction [14.444294152595429]
The number of tests added in open source repositories due to issues was about 28% of the corresponding project test suite size.
We propose LIBRO, a framework that uses Large Language Models (LLMs), which have been shown to be capable of performing code-related tasks.
Our evaluation of LIBRO shows that, on the widely studied Defects4J benchmark, LIBRO can generate failure reproducing test cases for 33% of all studied cases.
arXiv Detail & Related papers (2022-09-23T10:50:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.