Large Language Models of Code Fail at Completing Code with Potential Bugs
- URL: http://arxiv.org/abs/2306.03438v2
- Date: Fri, 1 Dec 2023 01:27:37 GMT
- Title: Large Language Models of Code Fail at Completing Code with Potential Bugs
- Authors: Tuan Dinh, Jinman Zhao, Samson Tan, Renato Negrinho, Leonard Lausen, Sheng Zha, George Karypis
- Abstract summary: We study the buggy-code completion problem inspired by real-time code suggestion.
We find that the presence of potential bugs significantly degrades the generation performance of the high-performing Code-LLMs.
- Score: 30.80172644795715
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models of code (Code-LLMs) have recently brought tremendous
advances to code completion, a fundamental feature of programming assistance
and code intelligence. However, most existing works ignore the possible
presence of bugs in the code context for generation, which are inevitable in
software development. Therefore, we introduce and study the buggy-code
completion problem, inspired by the realistic scenario of real-time code
suggestion where the code context contains potential bugs -- anti-patterns that
can become bugs in the completed program. To systematically study the task, we
introduce two datasets: one with synthetic bugs derived from semantics-altering
operator changes (buggy-HumanEval) and one with realistic bugs derived from
user submissions to coding problems (buggy-FixEval). We find that the presence
of potential bugs significantly degrades the generation performance of the
high-performing Code-LLMs. For instance, the passing rates of CODEGEN-2B-MONO
on buggy-HumanEval test cases drop by more than 50% given a single potential
bug in the context. Finally, we investigate several post-hoc methods for
mitigating the adverse effect of potential bugs and find that there remains a
significant gap in post-mitigation performance.
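To make the setting concrete, here is a minimal, hypothetical illustration in the spirit of buggy-HumanEval (the function, tests, and operator flip below are invented for exposition, not taken from the dataset): a one-token, semantics-altering operator change in the code prefix turns it into a potential bug, so a completion that is correct for the clean prefix yields a failing program.

```python
# Hypothetical illustration in the spirit of buggy-HumanEval (not an actual
# dataset instance). A semantics-altering operator change in the completion
# prefix is a potential bug: the natural continuation now fails the tests.

CLEAN_PREFIX = '''\
def below_threshold(numbers, t):
    """Return True if every number in the list is below threshold t."""
    for n in numbers:
        if n >= t:
'''

# One-token operator flip (">=" -> "<="): the prefix still looks plausible,
# but any continuation consistent with the clean version is now wrong.
BUGGY_PREFIX = CLEAN_PREFIX.replace(">=", "<=")

# A natural continuation a Code-LLM might produce for either prefix:
COMPLETION = '''\
            return False
    return True
'''

def passes_tests(program: str) -> bool:
    """Assemble the completed program and check two simple test cases."""
    namespace = {}
    exec(program, namespace)
    f = namespace["below_threshold"]
    return f([1, 2, 3], 10) is True and f([1, 20, 3], 10) is False

print(passes_tests(CLEAN_PREFIX + COMPLETION))  # True:  completion is correct
print(passes_tests(BUGGY_PREFIX + COMPLETION))  # False: the potential bug is realized
```

buggy-FixEval follows the same prefix/completion framing but draws its potential bugs from real user submissions to coding problems rather than from synthetic operator changes.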
Related papers
- Bugs in Large Language Models Generated Code: An Empirical Study [12.625305075672456]
Large Language Models (LLMs) for code have gained significant attention recently.
Similar to human-written code, LLM-generated code is prone to bugs.
This paper examines a sample of 333 bugs collected from code generated using three leading LLMs.
arXiv Detail & Related papers (2024-03-13T20:12:01Z)
- DebugBench: Evaluating Debugging Capability of Large Language Models [80.73121177868357]
DebugBench is a benchmark for Large Language Models (LLMs)
It covers four major bug categories and 18 minor types in C++, Java, and Python.
We evaluate two commercial and four open-source models in a zero-shot scenario.
arXiv Detail & Related papers (2024-01-09T15:46:38Z)
- Automated Bug Generation in the era of Large Language Models [6.0770779409377775]
BugFarm transforms arbitrary code into multiple complex bugs.
We present a comprehensive evaluation of 435k+ bugs from over 1.9M mutants generated by BugFarm.
arXiv Detail & Related papers (2023-10-03T20:01:51Z)
- PreciseBugCollector: Extensible, Executable and Precise Bug-fix Collection [8.79879909193717]
We introduce PreciseBugCollector, a precise, multi-language bug collection approach.
It is based on two novel components: a bug tracker to map the repositories with external bug repositories to trace bug type information, and a bug injector to generate project-specific bugs.
To date, PreciseBugCollector comprises 1,057,818 bugs extracted from 2,968 open-source projects.
arXiv Detail & Related papers (2023-09-12T13:47:44Z)
- Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation.
We propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations; a rough sketch of this loop follows the entry.
arXiv Detail & Related papers (2023-04-11T10:43:43Z)
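As a rough sketch of the Self-Debugging loop described in the entry above (a minimal illustration under assumptions: `generate` and `run_tests` are hypothetical stand-ins for the model call and the test harness, not the paper's API):

```python
# Minimal sketch of a self-debugging loop. `generate(prompt)` and
# `run_tests(code)` are hypothetical stand-ins for an LLM call and a test
# harness; the paper's actual method conditions the model on few-shot
# demonstrations, which this outer loop does not capture.

def self_debug(problem, generate, run_tests, max_rounds=3):
    """Iteratively ask the model to repair its program using test feedback."""
    code = generate(problem)
    for _ in range(max_rounds):
        passed, error = run_tests(code)
        if passed:
            break
        # Feed the failing program and its error message back to the model.
        feedback = (
            f"{problem}\n\nYour previous solution:\n{code}\n\n"
            f"It failed with:\n{error}\nExplain the bug and fix it."
        )
        code = generate(feedback)
    return code
```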
- Generation Probabilities Are Not Enough: Uncertainty Highlighting in AI Code Completions [54.55334589363247]
We study whether conveying information about uncertainty enables programmers to more quickly and accurately produce code.
We find that highlighting tokens with the highest predicted likelihood of being edited leads to faster task completion and more targeted edits.
arXiv Detail & Related papers (2023-02-14T18:43:34Z)
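A minimal sketch of the highlighting policy that finding suggests (the tokens and edit-likelihood scores below are invented for illustration; the study concerns how such uncertainty is conveyed to programmers, not this particular function):

```python
# Hypothetical sketch: choose which completion tokens to highlight, given
# per-token edit-likelihood scores from some model (the scores here are
# made up). The idea is to flag the tokens most likely to need edits.

def tokens_to_highlight(tokens, edit_probs, top_k=3):
    """Return indices of the top_k tokens most likely to be edited."""
    ranked = sorted(range(len(tokens)), key=lambda i: edit_probs[i], reverse=True)
    return sorted(ranked[:top_k])

tokens = ["return", "sorted", "(", "items", ",", "reverse", "=", "False", ")"]
edit_probs = [0.02, 0.10, 0.01, 0.15, 0.01, 0.30, 0.02, 0.55, 0.01]
print([tokens[i] for i in tokens_to_highlight(tokens, edit_probs)])
# ['items', 'reverse', 'False']: the spans a programmer should double-check
```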
- Explaining Software Bugs Leveraging Code Structures in Neural Machine Translation [5.079750706023254]
Bugsplainer generates natural language explanations for software bugs by learning from a large corpus of bug-fix commits.
Our evaluation using three performance metrics shows that Bugsplainer can generate understandable and good explanations according to Google's standard.
We also conduct a developer study involving 20 participants where the explanations from Bugsplainer were found to be more accurate, more precise, more concise and more useful than the baselines.
arXiv Detail & Related papers (2022-12-08T22:19:45Z)
- Using Developer Discussions to Guide Fixing Bugs in Software [51.00904399653609]
We propose using bug report discussions, which are available before the task is performed and are also naturally occurring, avoiding the need for additional information from developers.
We demonstrate that various forms of natural language context derived from such discussions can aid bug-fixing, even leading to improved performance over using commit messages corresponding to the oracle bug-fixing commits.
arXiv Detail & Related papers (2022-11-11T16:37:33Z)
- BigIssue: A Realistic Bug Localization Benchmark [89.8240118116093]
BigIssue is a benchmark for realistic bug localization.
We provide a general benchmark with a diversity of real and synthetic Java bugs.
We hope to advance the state of the art in bug localization, in turn improving APR performance and increasing its applicability to the modern development cycle.
arXiv Detail & Related papers (2022-07-21T20:17:53Z)
- Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
arXiv Detail & Related papers (2021-05-20T17:58:42Z)
- Predicting Vulnerability In Large Codebases With Deep Code Representation [6.357681017646283]
As software engineers write code for various modules, various types of errors often get introduced. The same or similar issues/bugs that were fixed in the past (albeit in different modules) tend to be reintroduced into production code. We developed a novel AI-based system that uses a deep representation of the Abstract Syntax Tree (AST) created from the source code, together with an active feedback loop.
arXiv Detail & Related papers (2020-04-24T13:18:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.