Refining ChatGPT-Generated Code: Characterizing and Mitigating Code
Quality Issues
- URL: http://arxiv.org/abs/2307.12596v2
- Date: Fri, 15 Dec 2023 01:17:35 GMT
- Title: Refining ChatGPT-Generated Code: Characterizing and Mitigating Code
Quality Issues
- Authors: Yue Liu, Thanh Le-Cong, Ratnadira Widyasari, Chakkrit
Tantithamthavorn, Li Li, Xuan-Bach D. Le, David Lo
- Abstract summary: We systematically study the quality of 4,066 ChatGPT-generated programs implemented in two popular programming languages.
We identify and characterize potential issues with the quality of ChatGPT-generated code.
We find that ChatGPT can partially address these challenges, improving code quality by more than 20%, but there are still limitations and opportunities for improvement.
- Score: 17.7880460531813
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We systematically study the quality of 4,066 ChatGPT-generated
programs implemented in two popular programming languages, i.e., Java and
Python, for 2,033 programming tasks. The goal of this work is threefold. First,
we analyze the correctness of ChatGPT on code generation tasks and uncover the
factors that influence its effectiveness, including task difficulty,
programming language, the time at which tasks were introduced, and program size. Second,
we identify and characterize potential issues with the quality of
ChatGPT-generated code. Last, we provide insights into how these issues can be
mitigated. Experiments highlight that out of 4,066 programs generated by
ChatGPT, 2,756 programs are deemed correct, 1,082 programs provide wrong
outputs, and 177 programs contain compilation or runtime errors. Additionally,
we further analyze other characteristics of the generated code, such as code
style and maintainability, through static analysis tools, and find that 1,930
ChatGPT-generated code snippets suffer from maintainability issues.
Subsequently, we investigate ChatGPT's self-repairing ability and its
interaction with static analysis tools to fix the errors uncovered in the
previous step. Experiments suggest that ChatGPT can partially address these
challenges, improving code quality by more than 20%, but there are still
limitations and opportunities for improvement. Overall, our study provides
valuable insights into the current limitations of ChatGPT and offers a roadmap
for future research and development efforts to enhance the code generation
capabilities of AI models like ChatGPT.
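To make the mitigation step concrete, the sketch below shows one way the repair loop described in the abstract could look in practice: run a static analyzer over a generated snippet and feed its findings back to the model as a repair prompt. Pylint, the OpenAI chat client, the model name, and the prompt wording are all illustrative assumptions, not the paper's exact setup.
```python
# Minimal sketch of a static-analysis-guided repair loop.
# Assumptions (not from the paper): Pylint as the analyzer,
# the OpenAI chat API as the code model, Python-only snippets.
import subprocess
import tempfile

from openai import OpenAI  # assumed client; any chat LLM API would do

client = OpenAI()

def lint(code: str) -> str:
    """Run Pylint on a snippet and return its report text."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(
        ["pylint", "--score=n", path], capture_output=True, text=True
    )
    return result.stdout

def repair(code: str, max_rounds: int = 3) -> str:
    """Feed analyzer findings back to the model until the report is clean."""
    for _ in range(max_rounds):
        report = lint(code)
        if not report.strip():  # no remaining findings
            return code
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{
                "role": "user",
                "content": (
                    "Fix the issues a static analyzer reported in this "
                    f"Python code.\n\nCode:\n{code}\n\nReport:\n{report}\n"
                    "Return only the corrected code."
                ),
            }],
        )
        code = resp.choices[0].message.content
    return code
```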
Related papers
- Fight Fire with Fire: How Much Can We Trust ChatGPT on Source Code-Related Tasks? [10.389763758883975]
Recent studies proposed utilizing ChatGPT as both a developer and tester for collaborative software development.
We conduct a comprehensive empirical investigation to evaluate ChatGPT's self-verification capability in code generation, code completion, and program repair.
arXiv Detail & Related papers (2024-05-21T09:47:33Z)
- Exploring ChatGPT's Capabilities on Vulnerability Management [56.4403395100589]
We explore ChatGPT's capabilities on 6 tasks involving the complete vulnerability management process with a large-scale dataset containing 70,346 samples.
One notable example is ChatGPT's proficiency in tasks like generating titles for software bug reports.
Our findings reveal the difficulties encountered by ChatGPT and shed light on promising future directions.
arXiv Detail & Related papers (2023-11-11T11:01:13Z)
- Assessing the Promise and Pitfalls of ChatGPT for Automated Code Generation [2.0400340435492272]
This paper presents a comprehensive evaluation of the code generation capabilities of ChatGPT, a prominent large language model.
A dataset of 131 code-generation prompts across 5 categories was curated to enable robust analysis.
Code solutions were generated by both ChatGPT and humans for all prompts, resulting in 262 code samples.
arXiv Detail & Related papers (2023-11-05T12:56:40Z)
- Exploring the Potential of ChatGPT in Automated Code Refinement: An Empirical Study [0.0]
ChatGPT, a cutting-edge language model, has demonstrated impressive performance in various natural language processing tasks.
We conduct the first empirical study to understand the capabilities of ChatGPT in code review tasks.
Our results show that ChatGPT achieves higher EM and BLEU scores of 22.78 and 76.44, respectively, while the state-of-the-art method achieves only 15.50 and 62.88 on a high-quality code review dataset.
arXiv Detail & Related papers (2023-09-15T07:41:33Z)
- No Need to Lift a Finger Anymore? Assessing the Quality of Code Generation by ChatGPT [28.68768157452352]
This study examines the quality of code generation using ChatGPT.
We leverage 728 algorithm problems in five languages (i.e., C, C++, Java, Python, and JavaScript) and 18 CWEs with 54 code scenarios for the code generation task.
Our findings uncover potential issues and limitations that arise in the ChatGPT-based code generation.
arXiv Detail & Related papers (2023-08-09T10:01:09Z)
- FacTool: Factuality Detection in Generative AI -- A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios [87.12753459582116]
A wider range of tasks now faces an increasing risk of containing factual errors when handled by generative models.
We propose FacTool, a task and domain agnostic framework for detecting factual errors of texts generated by large language models.
arXiv Detail & Related papers (2023-07-25T14:20:51Z)
- Unmasking the giant: A comprehensive evaluation of ChatGPT's proficiency in coding algorithms and data structures [0.6990493129893112]
We evaluate ChatGPT's ability to generate correct solutions to the problems fed to it, its code quality, and nature of run-time errors thrown by its code.
We examine patterns in the test cases passed in order to gain insight into how incorrect ChatGPT's code is in such situations.
arXiv Detail & Related papers (2023-07-10T08:20:34Z)
- ChatLog: Carefully Evaluating the Evolution of ChatGPT Across Time [54.18651663847874]
ChatGPT has achieved great success and can be considered to have attained infrastructural status.
Existing benchmarks encounter two challenges: (1) disregard for periodic evaluation and (2) lack of fine-grained features.
We construct ChatLog, an ever-updating dataset with large-scale records of diverse long-form ChatGPT responses for 21 NLP benchmarks from March 2023 to the present.
arXiv Detail & Related papers (2023-04-27T11:33:48Z)
- To ChatGPT, or not to ChatGPT: That is the question! [78.407861566006]
This study provides a comprehensive and contemporary assessment of the most recent techniques in ChatGPT detection.
We have curated a benchmark dataset consisting of prompts from ChatGPT and humans, including diverse questions from medical, open Q&A, and finance domains.
Our evaluation results demonstrate that none of the existing methods can effectively detect ChatGPT-generated content.
arXiv Detail & Related papers (2023-04-04T03:04:28Z)
- Is ChatGPT a General-Purpose Natural Language Processing Task Solver? [113.22611481694825]
Large language models (LLMs) have demonstrated the ability to perform a variety of natural language processing (NLP) tasks zero-shot.
Recently, the debut of ChatGPT has drawn a great deal of attention from the natural language processing (NLP) community.
It is not yet known whether ChatGPT can serve as a generalist model that can perform many NLP tasks zero-shot.
arXiv Detail & Related papers (2023-02-08T09:44:51Z)
- Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
arXiv Detail & Related papers (2021-05-20T17:58:42Z)
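For reference, a test-case pass rate like the APPS figure above can be computed roughly as follows. This is a minimal sketch, assuming each problem supplies a standalone Python program plus (stdin, expected stdout) pairs; the real benchmark harness adds sandboxing and per-language runners.
```python
# Minimal sketch of an APPS-style test-case pass rate.
# Assumptions: each problem provides a standalone program plus
# (stdin, expected stdout) pairs; real harnesses add sandboxing.
import subprocess
import sys

def pass_rate(program: str, tests: list[tuple[str, str]]) -> float:
    """Fraction of test cases whose stdout matches the expected output."""
    passed = 0
    for stdin_text, expected in tests:
        try:
            result = subprocess.run(
                [sys.executable, "-c", program],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=5,  # guard against non-terminating programs
            )
        except subprocess.TimeoutExpired:
            continue  # a timeout counts as a failed case
        if result.returncode == 0 and result.stdout.strip() == expected.strip():
            passed += 1
    return passed / len(tests) if tests else 0.0

# Example: a correct echo-sum program passes both cases (rate 1.0).
prog = "a, b = map(int, input().split()); print(a + b)"
print(pass_rate(prog, [("1 2", "3"), ("5 7", "12")]))
```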