LLMs are Bug Replicators: An Empirical Study on LLMs' Capability in Completing Bug-prone Code
- URL: http://arxiv.org/abs/2503.11082v1
- Date: Fri, 14 Mar 2025 04:48:38 GMT
- Title: LLMs are Bug Replicators: An Empirical Study on LLMs' Capability in Completing Bug-prone Code
- Authors: Liwei Guo, Sixiang Ye, Zeyu Sun, Xiang Chen, Yuxia Zhang, Bo Wang, Jie M. Zhang, Zheng Li, Yong Liu
- Abstract summary: Large Language Models (LLMs) have demonstrated remarkable performance in code completion. This paper presents the first empirical study evaluating the performance of LLMs in completing bug-prone code.
- Score: 24.048639099281324
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have demonstrated remarkable performance in code completion. However, the training data used to develop these models often contain a significant amount of buggy code. Yet, it remains unclear to what extent these buggy instances influence LLMs' performance when tackling bug-prone code completion tasks. To fill this gap, this paper presents the first empirical study evaluating the performance of LLMs in completing bug-prone code. Through extensive experiments on 7 LLMs and the Defects4J dataset, we analyze LLMs' accuracy, robustness, and limitations in this challenging context. Our experimental results show that completing bug-prone code is significantly more challenging for LLMs than completing normal code. Notably, in bug-prone tasks, the likelihood of LLMs generating correct code is nearly the same as generating buggy code, and it is substantially lower than in normal code completion tasks (e.g., 12.27% vs. 29.85% for GPT-4). To our surprise, 44.44% of the bugs LLMs make are completely identical to the pre-fix version, indicating that LLMs have been seriously biased by historical bugs when completing code. Additionally, we investigate the effectiveness of existing post-processing techniques and find that while they can improve consistency, they do not significantly reduce error rates in bug-prone code scenarios. Our research highlights the limitations of current LLMs in handling bug-prone code and underscores the need for improved models and post-processing strategies to enhance code completion accuracy in real-world development environments.
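To make the evaluation setup concrete, below is a minimal sketch of how a bug-prone completion task derived from Defects4J might be scored: the model completes the line that was later fixed, and its completion is compared against both the pre-fix (buggy) line and the post-fix (correct) line. The helper names (`normalize`, `classify_completion`) and the whitespace-insensitive matching rule are illustrative assumptions, not the paper's actual pipeline.

```python
# Illustrative sketch only: the task construction and matching rule are
# assumptions, not the paper's exact implementation.

def normalize(line: str) -> str:
    """Whitespace-insensitive normalization for line comparison (assumption)."""
    return " ".join(line.split())

def classify_completion(completion: str, buggy_line: str, fixed_line: str) -> str:
    """Label an LLM completion for a single bug-prone location.

    buggy_line: the pre-fix line from the Defects4J buggy version
    fixed_line: the corresponding line from the developer-fixed version
    """
    c = normalize(completion)
    if c == normalize(fixed_line):
        return "correct"           # matches the developer fix
    if c == normalize(buggy_line):
        return "reproduced_bug"    # identical to the historical buggy code
    return "other_error"           # neither the fix nor the known bug

# Example: a completion that replicates the historical bug.
print(classify_completion(
    completion="if (index <= list.size())",
    buggy_line="if (index <= list.size())",
    fixed_line="if (index < list.size())",
))  # -> "reproduced_bug"
```

Under a rule like this, the paper's finding that 44.44% of LLM-introduced bugs are identical to the pre-fix version corresponds to the "reproduced_bug" outcome.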
Related papers
- How Accurately Do Large Language Models Understand Code? [4.817546726074033]
Large Language Models (LLMs) are increasingly used in post-development tasks such as code repair and testing.
Quantifying code comprehension is challenging due to its abstract nature and the lack of a standardized metric.
This paper presents the first large-scale empirical investigation into LLMs' ability to understand code.
arXiv Detail & Related papers (2025-04-06T05:59:29Z) - Quantizing Large Language Models for Code Generation: A Differentiated Replication [51.85505914274633]
Large Language Models (LLMs) have shown an impressive capability in code generation and, specifically, in automatically implementing requirements described in natural language. However, LLMs pose significant challenges related to their memory (and, consequently, carbon) footprint. The new frontier for LLM quantization is 4-bit precision, resulting in an average memory footprint reduction of 70%.
arXiv Detail & Related papers (2025-03-10T09:26:08Z) - Fixing Function-Level Code Generation Errors for Foundation Large Language Models [6.137340149146578]
We conduct an empirical study of the generation errors and analyze their causes, identifying 19 categories of error causes.
Our empirical analysis indicated that three of these causes can be directly fixed.
We propose a fixing method called LlmFix, which addresses these three types of errors through a three-step process.
arXiv Detail & Related papers (2024-09-01T09:40:15Z) - What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complex than canonical solutions.
We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z) - Towards Understanding the Characteristics of Code Generation Errors Made by Large Language Models [10.519984835232359]
Large Language Models (LLMs) have demonstrated unprecedented capabilities in code generation. We conducted an in-depth analysis of code generation errors across six representative LLMs on the HumanEval dataset. Our findings highlight several challenges in locating and fixing code generation errors made by LLMs.
arXiv Detail & Related papers (2024-06-13T01:29:52Z) - Bugs in Large Language Models Generated Code: An Empirical Study [12.625305075672456]
Large Language Models (LLMs) for code have gained significant attention recently.
Similar to human-written code, LLM-generated code is prone to bugs.
This paper examines a sample of 333 bugs collected from code generated using three leading LLMs.
arXiv Detail & Related papers (2024-03-13T20:12:01Z) - DebugBench: Evaluating Debugging Capability of Large Language Models [80.73121177868357]
DebugBench is a benchmark for evaluating the debugging capability of Large Language Models (LLMs).
It covers four major bug categories and 18 minor types in C++, Java, and Python.
We evaluate two commercial and four open-source models in a zero-shot scenario.
arXiv Detail & Related papers (2024-01-09T15:46:38Z) - ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z) - Evaluating Diverse Large Language Models for Automatic and General Bug Reproduction [12.851941377433285]
Large language models (LLMs) have been demonstrated to be adept at natural language processing and code generation.
Our proposed technique LIBRO could successfully reproduce about one-third of all bugs in the widely used Defects4J benchmark.
arXiv Detail & Related papers (2023-11-08T08:42:30Z) - The GitHub Recent Bugs Dataset for Evaluating LLM-based Debugging Applications [20.339673903885483]
Large Language Models (LLMs) have demonstrated strong natural language processing and code synthesis capabilities.
Details about LLM training data are often not made public, which has caused concern as to whether existing bug benchmarks are included.
We present the GitHub Recent Bugs dataset, which includes 76 real-world Java bugs that were gathered after the OpenAI data cut-off point.
arXiv Detail & Related papers (2023-10-20T02:37:44Z) - TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety.
Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z) - DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models [79.01926242857613]
Large language models (LLMs) are prone to hallucinations, generating content that deviates from facts seen during pretraining.
We propose a simple decoding strategy for reducing hallucinations with pretrained LLMs.
We find that this Decoding by Contrasting Layers (DoLa) approach is able to better surface factual knowledge and reduce the generation of incorrect facts.
arXiv Detail & Related papers (2023-09-07T17:45:31Z)
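The DoLa entry above describes decoding by contrasting a mature (final) layer with a premature (earlier) layer. The following is a minimal sketch of that contrast for a single decoding step, assuming the per-layer logits are already available; the paper's dynamic premature-layer selection is omitted, and the function name and `alpha` threshold are illustrative assumptions.

```python
import numpy as np

def log_softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

def contrast_layers(final_logits: np.ndarray,
                    premature_logits: np.ndarray,
                    alpha: float = 0.1) -> np.ndarray:
    """Score next-token candidates by contrasting two layers (illustrative sketch).

    Tokens the final layer deems implausible (probability below alpha times the
    maximum final-layer probability) are masked out before the contrast.
    """
    log_p_final = log_softmax(final_logits)
    log_p_premature = log_softmax(premature_logits)
    plausible = log_p_final >= np.log(alpha) + log_p_final.max()
    scores = np.where(plausible, log_p_final - log_p_premature, -np.inf)
    return scores  # greedy decoding would pick scores.argmax()

# Toy example with a 5-token vocabulary.
rng = np.random.default_rng(0)
final = rng.normal(size=5)
premature = rng.normal(size=5)
print(contrast_layers(final, premature).argmax())
```

The intuition is that tokens whose probability grows between the premature and the final layer reflect knowledge added by later layers, which DoLa associates with more factual continuations.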