Large Language Models are Few-shot Testers: Exploring LLM-based General Bug Reproduction
- URL: http://arxiv.org/abs/2209.11515v3
- Date: Tue, 25 Jul 2023 03:47:36 GMT
- Title: Large Language Models are Few-shot Testers: Exploring LLM-based General Bug Reproduction
- Authors: Sungmin Kang, Juyeon Yoon, Shin Yoo
- Abstract summary: The number of tests added in open source repositories due to issues was about 28% of the corresponding project test suite size.
We propose LIBRO, a framework that uses Large Language Models (LLMs), which have been shown to be capable of performing code-related tasks.
Our evaluation of LIBRO shows that, on the widely studied Defects4J benchmark, LIBRO can generate failure reproducing test cases for 33% of all studied cases.
- Score: 14.444294152595429
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many automated test generation techniques have been developed to aid
developers with writing tests. To facilitate full automation, most existing
techniques aim to either increase coverage, or generate exploratory inputs.
However, existing test generation techniques largely fall short of achieving
more semantic objectives, such as generating tests to reproduce a given bug
report. Reproducing bugs is nonetheless important, as our empirical study shows
that the number of tests added in open source repositories due to issues was
about 28% of the corresponding project test suite size. Meanwhile, due to the
difficulties of transforming the expected program semantics in bug reports into
test oracles, existing failure reproduction techniques tend to deal exclusively
with program crashes, a small subset of all bug reports. To automate test
generation from general bug reports, we propose LIBRO, a framework that uses
Large Language Models (LLMs), which have been shown to be capable of performing
code-related tasks. Since LLMs themselves cannot execute the target buggy code,
we focus on post-processing steps that help us discern when LLMs are effective,
and rank the produced tests according to their validity. Our evaluation of
LIBRO shows that, on the widely studied Defects4J benchmark, LIBRO can generate
failure reproducing test cases for 33% of all studied cases (251 out of 750),
while suggesting a bug reproducing test in first place for 149 bugs. To
mitigate data contamination, we also evaluate LIBRO against 31 bug reports
submitted after the collection of the LLM training data terminated: LIBRO
produces bug reproducing tests for 32% of the studied bug reports. Overall, our
results show LIBRO has the potential to significantly enhance developer
efficiency by automatically generating tests from bug reports.
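As a rough illustration of the pipeline the abstract describes, the sketch below assumes a generic completion API and a build/test harness; the placeholder functions `llm_complete` and `compile_and_run` are illustrative stand-ins, not LIBRO's actual interfaces. A prompt is built from the bug report, candidate tests are sampled, and the candidates are ranked by whether they compile and fail on the buggy version.

```python
# Hedged sketch of an LLM-based bug-reproduction pipeline (not LIBRO's actual
# implementation): prompt an LLM with the bug report, post-process the candidate
# tests, and rank them by whether they fail on the buggy program.
from dataclasses import dataclass

@dataclass
class CandidateTest:
    source: str
    compiles: bool = False
    fails_on_buggy_version: bool = False

def llm_complete(prompt: str, n_samples: int = 5) -> list[str]:
    """Placeholder for a few-shot completion API; returns dummy candidates here."""
    return [f"// candidate JUnit test {i}\n" for i in range(n_samples)]

def compile_and_run(test_source: str) -> tuple[bool, bool]:
    """Placeholder harness: returns (compiled_ok, test_failed_on_buggy_version)."""
    return False, False

def reproduce_bug(report_title: str, report_body: str) -> list[CandidateTest]:
    prompt = (
        "Write a JUnit test method that reproduces the bug described below.\n"
        f"Issue title: {report_title}\nIssue body: {report_body}\n"
    )
    candidates = [CandidateTest(source=s) for s in llm_complete(prompt)]
    for c in candidates:
        c.compiles, c.fails_on_buggy_version = compile_and_run(c.source)
    # Rank candidates so that compiling, failure-reproducing tests come first,
    # mirroring the idea of ranking produced tests by their validity.
    return sorted(candidates,
                  key=lambda c: (c.fails_on_buggy_version, c.compiles),
                  reverse=True)
```

The post-processing and ranking step is what compensates for the fact that the LLM itself cannot execute the target buggy code.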
Related papers
- ProjectTest: A Project-level LLM Unit Test Generation Benchmark and Impact of Error Fixing Mechanisms [48.43237545197775]
Unit test generation has become a promising and important use case of LLMs.
ProjectTest is a project-level benchmark for unit test generation covering Python, Java, and JavaScript.
arXiv Detail & Related papers (2025-02-10T15:24:30Z)
- Design choices made by LLM-based test generators prevent them from finding bugs [0.850206009406913]
This paper critically examines whether recent LLM-based test generation tools, such as Codium CoverAgent and CoverUp, can effectively find bugs or unintentionally validate faulty code.
Using real human-written buggy code as input, we evaluate these tools, showing how LLM-generated tests can fail to detect bugs and, more alarmingly, how their design can worsen the situation by validating bugs in the generated test suite and rejecting bug-revealing tests.
arXiv Detail & Related papers (2024-12-18T18:33:26Z)
- TestART: Improving LLM-based Unit Testing via Co-evolution of Automated Generation and Repair Iteration [7.833381226332574]
Large language models (LLMs) have demonstrated remarkable capabilities in generating unit test cases.
We propose TestART, a novel unit test generation method.
TestART improves LLM-based unit testing via co-evolution of automated generation and repair iteration.
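A minimal sketch of what a generation-and-repair iteration of this kind might look like; `generate_test`, `run_test`, and `repair_test` are hypothetical placeholders for an LLM call and a test runner, not TestART's actual components.

```python
# Illustrative generate-then-repair loop (a sketch, not TestART's implementation):
# draft a unit test, run it, and feed any failure back to the model for repair
# until the test passes or the iteration budget is exhausted.
def generate_test(focal_method_src: str) -> str:
    """Placeholder LLM call that drafts a unit test for the given method."""
    return "def test_focal():\n    assert True\n"

def run_test(test_src: str) -> tuple[bool, str]:
    """Placeholder test runner: returns (passed, error_message)."""
    return True, ""

def repair_test(test_src: str, error_message: str) -> str:
    """Placeholder LLM call that rewrites the test using the runner feedback."""
    return test_src

def generate_and_repair(focal_method_src: str, max_rounds: int = 4) -> str:
    test = generate_test(focal_method_src)
    for _ in range(max_rounds):
        passed, error = run_test(test)
        if passed:
            break
        # Co-evolution step: the failing test and its error message drive the
        # next generation round.
        test = repair_test(test, error)
    return test
```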
arXiv Detail & Related papers (2024-08-06T10:52:41Z)
- What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated than canonical solutions.
We develop a taxonomy of bugs in incorrect code that includes three categories and 12 sub-categories, and analyze the root causes of common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
- Leveraging Large Language Models for Efficient Failure Analysis in Game Development [47.618236610219554]
This paper proposes a new approach to automatically identify which change in the code caused a test to fail.
The method leverages Large Language Models (LLMs) to associate error messages with the corresponding code changes causing the failure.
Our approach reaches an accuracy of 71% in our newly created dataset, which comprises issues reported by developers at EA over a period of one year.
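As an illustration of the idea described above, the sketch below asks an LLM to associate an error message with the most likely culprit among recent code changes; the `CodeChange` record and the `llm_rank` call are hypothetical placeholders, not the paper's actual interfaces.

```python
# Sketch of associating a failure with the code change that likely caused it
# (an illustration, not the paper's implementation): list recent changes in the
# prompt and ask the LLM to pick the culprit commit.
from dataclasses import dataclass

@dataclass
class CodeChange:
    commit_id: str
    diff: str

def llm_rank(prompt: str) -> str:
    """Placeholder LLM call expected to answer with a single commit id."""
    return ""

def find_culprit(error_message: str, recent_changes: list[CodeChange]) -> str:
    listing = "\n\n".join(f"[{c.commit_id}]\n{c.diff}" for c in recent_changes)
    prompt = (
        "A test started failing with the error below. Which of the recent code "
        "changes most likely caused the failure? Answer with the commit id only.\n"
        f"Error message:\n{error_message}\n\nRecent changes:\n{listing}"
    )
    return llm_rank(prompt).strip()
```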
arXiv Detail & Related papers (2024-06-11T09:21:50Z)
- DebugBench: Evaluating Debugging Capability of Large Language Models [80.73121177868357]
DebugBench is a benchmark for evaluating the debugging capability of Large Language Models (LLMs).
It covers four major bug categories and 18 minor types in C++, Java, and Python.
We evaluate two commercial and four open-source models in a zero-shot scenario.
arXiv Detail & Related papers (2024-01-09T15:46:38Z)
- Evaluating Diverse Large Language Models for Automatic and General Bug Reproduction [12.851941377433285]
Large language models (LLMs) have been demonstrated to be adept at natural language processing and code generation.
Our proposed technique LIBRO could successfully reproduce about one-third of all bugs in the widely used Defects4J benchmark.
arXiv Detail & Related papers (2023-11-08T08:42:30Z)
- Automatic Generation of Test Cases based on Bug Reports: a Feasibility Study with Large Language Models [4.318319522015101]
Most testing procedures still rely on test cases written by humans to form test suites.
Existing automated approaches produce test cases that are either simple (e.g., unit tests) or that require precise specifications.
We investigate the feasibility of performing this generation by leveraging large language models (LLMs) and using bug reports as inputs.
arXiv Detail & Related papers (2023-10-10T05:30:12Z)
- Effective Test Generation Using Pre-trained Large Language Models and Mutation Testing [13.743062498008555]
We introduce MuTAP to improve the effectiveness of test cases generated by Large Language Models (LLMs) at revealing bugs.
MuTAP can generate effective test cases even in the absence of natural language descriptions of the programs under test (PUTs).
Our results show that our proposed method is able to detect up to 28% more faulty human-written code snippets.
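A hedged sketch of how mutation testing can serve as feedback for LLM test generation, in the spirit of the summary above but not MuTAP's actual implementation; `llm_generate_tests`, `make_mutants`, and `kills` are illustrative placeholders.

```python
# Illustrative mutation-guided test generation loop (a sketch, not MuTAP's code):
# if LLM-generated tests leave mutants of the program under test alive, feed a
# surviving mutant back into the prompt and ask for stronger assertions.
def llm_generate_tests(prompt: str) -> str:
    """Placeholder LLM call returning test code."""
    return ""

def make_mutants(put_source: str) -> list[str]:
    """Toy mutation operator: flip the first '==' to '!=' if present."""
    return [put_source.replace("==", "!=", 1)] if "==" in put_source else []

def kills(test_code: str, mutant_source: str) -> bool:
    """Placeholder: run the tests against the mutant; True if a test fails."""
    return False

def mutation_guided_tests(put_source: str, rounds: int = 3) -> str:
    tests = llm_generate_tests(f"Write unit tests for:\n{put_source}")
    for _ in range(rounds):
        survivors = [m for m in make_mutants(put_source) if not kills(tests, m)]
        if not survivors:
            break  # all mutants killed; the tests are strong enough
        tests = llm_generate_tests(
            f"This mutant survives the current tests:\n{survivors[0]}\n"
            f"Current tests:\n{tests}\nAdd assertions that kill it."
        )
    return tests
```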
arXiv Detail & Related papers (2023-08-31T08:48:31Z)
- Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation.
We propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
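A minimal sketch of an explain-then-fix loop of this kind, assuming a placeholder `llm` completion call and illustrative few-shot demonstrations; this is not the paper's published prompt or code.

```python
# Sketch of an explain-then-fix ("self-debugging") loop, illustrative only:
# the model explains its own prediction line by line and revises it when the
# explanation reveals a mismatch with the task description.
FEW_SHOT_DEMOS = "<a few worked explain-then-fix examples would go here>"

def llm(prompt: str) -> str:
    """Placeholder completion API."""
    return ""

def self_debug(task_description: str, predicted_code: str, rounds: int = 2) -> str:
    code = predicted_code
    for _ in range(rounds):
        explanation = llm(
            f"{FEW_SHOT_DEMOS}\nTask: {task_description}\n"
            f"Code:\n{code}\nExplain what the code does, line by line."
        )
        revised = llm(
            f"{FEW_SHOT_DEMOS}\nTask: {task_description}\nCode:\n{code}\n"
            f"Explanation:\n{explanation}\n"
            "If the explanation reveals a bug, output fixed code; "
            "otherwise repeat the code unchanged."
        )
        if revised.strip() == code.strip():
            break  # no further change proposed; stop iterating
        code = revised
    return code
```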
arXiv Detail & Related papers (2023-04-11T10:43:43Z)
- Beyond Accuracy: Behavioral Testing of NLP models with CheckList [66.42971817954806]
CheckList is a task-agnostic methodology for testing NLP models.
CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation.
In a user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
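As a toy illustration of the behavioral-testing idea (written without the actual `checklist` package, so its API is not misrepresented), the sketch below shows a Minimum Functionality Test on templated negation inputs and an invariance test that perturbs a name and expects the prediction to stay the same; `predict_sentiment` is a stand-in for the model under test.

```python
# Two CheckList-style behavioural tests, sketched by hand for illustration:
# an MFT on simple templated inputs and an INV test using a perturbation that
# should not change the prediction.
def predict_sentiment(text: str) -> str:
    """Placeholder for the NLP model under test."""
    return "positive"

def mft_negation() -> bool:
    # Minimum Functionality Test: templated negation examples the model
    # should handle correctly.
    cases = {"I do not like this film.": "negative",
             "I don't love this product.": "negative"}
    return all(predict_sentiment(t) == expected for t, expected in cases.items())

def inv_name_change() -> bool:
    # Invariance test: swapping a person's name should not flip the prediction.
    original = "Alice really enjoyed the service."
    perturbed = "Priya really enjoyed the service."
    return predict_sentiment(original) == predict_sentiment(perturbed)

if __name__ == "__main__":
    print("negation MFT passed:", mft_negation())
    print("name-change INV passed:", inv_name_change())
```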
arXiv Detail & Related papers (2020-05-08T15:48:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.