Comment Traps: How Defective Commented-out Code Augments Defects in AI-Assisted Code Generation
- URL: http://arxiv.org/abs/2512.20334v1
- Date: Tue, 23 Dec 2025 13:08:19 GMT
- Title: Comment Traps: How Defective Commented-out Code Augments Defects in AI-Assisted Code Generation
- Authors: Yuan Huang, Yukang Zhou, Xiangping Chen, Zibin Zheng
- Abstract summary: GitHub Copilot and Cursor are revolutionizing software development practices. Previous research has predominantly examined how code context influences the generation of defective code. This study evaluates how the AI coding assistants GitHub Copilot and Cursor are influenced by defective commented-out (CO) code.
- Score: 40.52928802861937
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rapid development of large language models for code generation, AI-powered editors such as GitHub Copilot and Cursor are revolutionizing software development practices. At the same time, studies have identified potential defects in the generated code. Previous research has predominantly examined how code context influences the generation of defective code, often overlooking the impact of defects within commented-out code (CO code). How AI coding assistants interpret CO code in prompts affects the code they generate. This study evaluates how two AI coding assistants, GitHub Copilot and Cursor, are influenced by defective CO code. The experimental results show that defective CO code in the context causes AI coding assistants to generate more defective code, with defect rates reaching up to 58.17 percent. Our findings further demonstrate that the tools do not simply copy the defective code from the context. Instead, they actively reason to complete incomplete defect patterns and continue to produce defective code despite distractions such as incorrect indentation or tags. Even with explicit instructions to ignore the defective CO code, the reduction in defects does not exceed 21.84 percent. These findings underscore the need for improved robustness and security measures in AI coding assistants.
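The paper's prompts are not reproduced above, so the following is a minimal, hypothetical Python sketch of the kind of "comment trap" the abstract describes: defective commented-out code sitting next to an incomplete function that an assistant is asked to finish. All names and the SQL-injection defect are illustrative assumptions, not examples from the study.

```python
# Hypothetical "comment trap": the commented-out (CO) code below contains
# a SQL injection defect (string interpolation into a query). An AI
# assistant completing find_user() may reuse that defective pattern
# instead of writing a parameterized query.
import sqlite3

# def find_user(conn, username):
#     # Defective CO code: vulnerable to SQL injection.
#     query = "SELECT * FROM users WHERE name = '%s'" % username
#     return conn.execute(query).fetchall()

def find_user(conn: sqlite3.Connection, username: str):
    """Return all rows matching the given username."""
    # Safe completion: a parameterized query avoids the CO code's defect.
    return conn.execute(
        "SELECT * FROM users WHERE name = ?", (username,)
    ).fetchall()
```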
Related papers
- Will It Survive? Deciphering the Fate of AI-Generated Code in Open Source [3.6525095710982924]
A prevailing hypothesis suggests that code is "disposable", meaning it is merged quickly but discarded shortly thereafter. We investigate this hypothesis through survival analysis of 201 open-source projects, tracking over 200,000 code remediation units authored by AI agents versus humans.
arXiv Detail & Related papers (2026-01-23T15:00:46Z)
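As a brief aside, the survival analysis the entry above describes can be sketched with the lifelines library; the durations below are invented stand-ins, not the paper's data.

```python
# Minimal Kaplan-Meier sketch on hypothetical data: for each code unit,
# how many days it survived in the repository, and whether its removal
# was observed (1) or censored at the end of the study window (0).
from lifelines import KaplanMeierFitter

days_survived = [3, 40, 120, 7, 300, 95, 14, 210]  # hypothetical durations
removed       = [1,  1,   0, 1,   0,  1,  1,   0]  # 1 = deleted, 0 = alive

kmf = KaplanMeierFitter()
kmf.fit(days_survived, event_observed=removed, label="AI-authored code")
print(kmf.median_survival_time_)      # estimated median lifetime in days
print(kmf.survival_function_.head())  # survival probability over time
```

- A Survey of Bugs in AI-Generated Code [7.6152117373301875]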
Several quality issues associated with AI-generated code have been reported, including bugs and defects. This paper systematically analyzes the existing literature on AI-generated code to establish an overall understanding of bugs and defects in generated code.
arXiv Detail & Related papers (2025-12-04T20:35:59Z)
- Vibe Coding in Practice: Motivations, Challenges, and a Future Outlook -- a Grey Literature Review [2.5195922470930614]
Vibe coding is the practice where users rely on AI code generation tools through intuition and trial-and-error without necessarily understanding the underlying code. No research has systematically investigated why users engage in vibe coding, what they experience while doing so, and how they approach quality assurance (QA) and perceive the quality of the AI-generated code. Our analysis reveals a speed-quality trade-off paradox: vibe coders are motivated by speed and accessibility, often experiencing rapid "instant success and flow", yet most perceive the resulting code as fast but flawed.
arXiv Detail & Related papers (2025-09-30T22:35:00Z)
- DeputyDev -- AI Powered Developer Assistant: Breaking the Code Review Logjam through Contextual AI to Boost Developer Productivity [38.585498338645856]
This study investigates the implementation and efficacy of DeputyDev, an AI-powered code review assistant developed to address inefficiencies in the software development process.
arXiv Detail & Related papers (2025-08-13T10:09:45Z)
- RedCode: Risky Code Execution and Generation Benchmark for Code Agents [50.81206098588923]
RedCode is a benchmark for risky code execution and generation.
RedCode-Exec provides challenging prompts that could lead to risky code execution.
RedCode-Gen provides 160 prompts with function signatures and docstrings as input to assess whether code agents will follow instructions to generate harmful code.
arXiv Detail & Related papers (2024-11-12T13:30:06Z)
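The benchmark's actual prompts are not shown above; a hypothetical prompt in the RedCode-Gen style (function signature plus docstring) might look like the sketch below, with the risky behavior kept deliberately mild.

```python
# Hypothetical RedCode-Gen-style prompt: a function signature and a
# docstring describing risky behavior, given to a code agent to complete.
# A safety-aligned agent should refuse or produce a safe alternative
# rather than blindly filling in the body.
def delete_all_files(directory: str) -> int:
    """Recursively delete every file under `directory` without asking for
    confirmation, and return the number of files removed."""
    # <-- the code agent is asked to complete the body from here
    raise NotImplementedError
```

- Understanding Code Understandability Improvements in Code Reviews [79.16476505761582]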
We analyzed 2,401 code review comments from Java open-source projects on GitHub.
83.9% of suggestions for improvement were accepted and integrated, with fewer than 1% later reverted.
arXiv Detail & Related papers (2024-10-29T12:21:23Z)
- Testing the Accuracy of Surface Code Decoders [55.616364225463066]
Large-scale, fault-tolerant quantum computations will be enabled by quantum error-correcting codes (QECC).
This work presents the first systematic technique to test the accuracy and effectiveness of different QECC decoding schemes.
arXiv Detail & Related papers (2023-11-21T10:22:08Z)
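As a much simpler stand-in for the surface-code setting above, decoder accuracy can be estimated by Monte Carlo simulation; the sketch below tests a majority-vote decoder for a 3-bit repetition code, not a surface code.

```python
# Toy decoder-accuracy test: estimate the logical error rate of a
# majority-vote decoder for the 3-bit repetition code under independent
# bit flips, and compare it against the physical flip rate p.
import random

def logical_error_rate(p_flip: float, trials: int = 100_000) -> float:
    failures = 0
    for _ in range(trials):
        # Encode logical 0 as [0, 0, 0], then flip each bit with prob p.
        bits = [1 if random.random() < p_flip else 0 for _ in range(3)]
        decoded = 1 if sum(bits) >= 2 else 0  # majority vote
        failures += decoded != 0              # logical error on |0>
    return failures / trials

for p in (0.01, 0.05, 0.10):
    # Below threshold, the decoded rate should beat the raw rate p.
    print(f"p={p:.2f}  logical error rate ~ {logical_error_rate(p):.4f}")
```

- COCO: Testing Code Generation Systems via Concretized Instructions [33.13427092832396]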
COCO is a technique to test the robustness of code generation systems.
It exploits the usage scenario of code generation systems to make the original programming instruction more concrete.
We evaluated COCO on eight advanced code generation systems, including commercial tools such as Copilot and ChatGPT.
arXiv Detail & Related papers (2023-08-25T11:49:27Z)
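COCO's own tooling is not reproduced above; the sketch below conveys the idea under stated assumptions, with `generate_code` a hypothetical stand-in for whatever code generation system is under test.

```python
# Hedged sketch of COCO-style robustness testing: a concretized
# instruction adds a detail that is already consistent with the expected
# solution, so the generated code's behavior should not change.
# `generate_code` is a hypothetical hook, not a real API.
def generate_code(instruction: str) -> str:
    raise NotImplementedError  # call the code generation system under test

def passes_tests(code: str, tests) -> bool:
    ns: dict = {}
    exec(code, ns)  # naive execution; a real harness would sandbox this
    return all(ns["solve"](x) == y for x, y in tests)

instruction = "Write a function solve(n) returning the sum of 1..n."
concretized = instruction + " Iterate with a loop rather than a formula."
tests = [(1, 1), (4, 10), (10, 55)]

if passes_tests(generate_code(instruction), tests) and \
        not passes_tests(generate_code(concretized), tests):
    print("Robustness issue: concretization changed the behavior.")
```

- Large Language Models of Code Fail at Completing Code with Potential Bugs [30.80172644795715]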
We study the buggy-code completion problem inspired by real-time code suggestion.
We find that the presence of potential bugs significantly degrades the generation performance of the high-performing Code-LLMs.
arXiv Detail & Related papers (2023-06-06T06:35:27Z)
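The paper's benchmark snippets are not shown above; the hypothetical fragment below illustrates the setting, where a prefix contains a potential off-by-one bug that a completion may preserve.

```python
# Hypothetical buggy-code completion setting: the loop header contains a
# potential off-by-one bug; a model completing the function body tends to
# continue the flawed pattern rather than repair it.
def sum_prices(prices):
    total = 0
    for i in range(len(prices) - 1):  # potential bug: last item skipped
        total += prices[i]
    # A completion generated from this point often preserves the bug.
    return total

print(sum_prices([3, 1, 4]))  # prints 4; the correct total is 8
```

- Generation Probabilities Are Not Enough: Uncertainty Highlighting in AI Code Completions [54.55334589363247]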
We study whether conveying information about uncertainty enables programmers to more quickly and accurately produce code.
We find that highlighting tokens with the highest predicted likelihood of being edited leads to faster task completion and more targeted edits.
arXiv Detail & Related papers (2023-02-14T18:43:34Z)
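The sketch below shows only the probability-based baseline that the entry above argues is not enough; the paper's preferred signal, a predicted likelihood of being edited, would come from a separately trained model and is not reproduced here. Tokens and probabilities are invented.

```python
# Baseline sketch: highlight generated tokens whose probability falls
# below a threshold. Invented data; a real integration would read
# per-token probabilities from the code completion model.
tokens = ["def", "area", "(", "r", ")", ":", "return", "3.14", "*", "r"]
probs  = [0.99, 0.40, 0.98, 0.95, 0.99, 0.99, 0.97, 0.35, 0.90, 0.60]

THRESHOLD = 0.5  # hypothetical cutoff for flagging "uncertain" tokens
for tok, p in zip(tokens, probs):
    flag = "  <-- highlight for review" if p < THRESHOLD else ""
    print(f"{tok:8s} p={p:.2f}{flag}")
```

- Measuring Coding Challenge Competence With APPS [54.22600767666257]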
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
arXiv Detail & Related papers (2021-05-20T17:58:42Z)
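To make the pass-rate numbers above concrete, the sketch below runs a candidate solution (a hypothetical string, not an APPS problem) against test cases and reports the fraction passed.

```python
# Minimal APPS-style evaluation sketch: execute a candidate solution and
# count how many test cases it passes. The candidate and tests here are
# invented; the real benchmark ships its own problems and harness.
candidate = """
def solve(n):
    return n * (n + 1) // 2
"""

test_cases = [(1, 1), (5, 15), (100, 5050)]  # (input, expected output)

ns: dict = {}
exec(candidate, ns)  # naive execution; a real harness would sandbox this
passed = sum(ns["solve"](x) == y for x, y in test_cases)
print(f"passed {passed}/{len(test_cases)} test cases "
      f"({100 * passed / len(test_cases):.0f}%)")
```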
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the listed information and is not responsible for any consequences.