No Need to Lift a Finger Anymore? Assessing the Quality of Code Generation by ChatGPT
- URL: http://arxiv.org/abs/2308.04838v2
- Date: Sat, 13 Apr 2024 04:58:47 GMT
- Title: No Need to Lift a Finger Anymore? Assessing the Quality of Code Generation by ChatGPT
- Authors: Zhijie Liu, Yutian Tang, Xiapu Luo, Yuming Zhou, Liang Feng Zhang
- Abstract summary: This study examines the quality of code generation using ChatGPT.
We leverage 728 algorithm problems in five languages (i.e., C, C++, Java, Python, and JavaScript) and 18 CWEs with 54 code scenarios for the code generation task.
Our findings uncover potential issues and limitations in ChatGPT-based code generation.
- Score: 28.68768157452352
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have demonstrated impressive capabilities across various NLP tasks. Additionally, LLMs are highly valuable in supporting software engineering tasks, particularly code generation. Automatic code generation is the process of automatically producing source code or executable code from given specifications or requirements, improving developer productivity. In this study, we perform a systematic empirical assessment of the quality of code generated by ChatGPT. We leverage 728 algorithm problems in five languages (i.e., C, C++, Java, Python, and JavaScript) and 18 CWEs with 54 code scenarios for the code generation task. Our evaluation encompasses a comprehensive analysis of code snippets generated by ChatGPT, focusing on three critical aspects: correctness, complexity, and security. We also specifically investigate ChatGPT's ability to engage in a multi-round fixing process (i.e., ChatGPT's dialog ability) to facilitate code generation. By delving into the generated code and examining the experimental results, this work provides valuable insights into ChatGPT's performance on code generation tasks across the three critical aspects. Overall, our findings uncover potential issues and limitations in ChatGPT-based code generation and lay the groundwork for improving AI- and LLM-based code generation techniques.
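To make the multi-round fixing process concrete, here is a minimal sketch of such a dialog loop. It is an illustration under assumptions, not the paper's actual harness: `generate` and `run_tests` are hypothetical stand-ins for a ChatGPT call and the benchmark's test execution.

```python
# Minimal sketch of a multi-round fixing loop (hypothetical harness,
# not the paper's actual evaluation code). `generate` stands in for a
# ChatGPT call; `run_tests` returns (passed, feedback).
from typing import Callable

def multi_round_fix(problem: str,
                    generate: Callable[[list[dict]], str],
                    run_tests: Callable[[str], tuple[bool, str]],
                    max_rounds: int = 3) -> tuple[str, bool]:
    """Ask for code, then feed test failures back as follow-up turns."""
    messages = [{"role": "user", "content": problem}]
    code = generate(messages)
    for _ in range(max_rounds):
        passed, feedback = run_tests(code)
        if passed:
            return code, True
        # Continue the same dialog: report the failure, request a fix.
        messages.append({"role": "assistant", "content": code})
        messages.append({"role": "user",
                         "content": f"The code failed:\n{feedback}\nPlease fix it."})
        code = generate(messages)
    return code, run_tests(code)[0]
```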
Related papers
- Distinguishing LLM-generated from Human-written Code by Contrastive Learning [5.553326595990857]
Large language models (LLMs) have attracted significant attention due to their demonstrated ability to generate high-quality content for various tasks.
There are growing concerns regarding their potential risks in various fields, such as news, education, and software engineering.
This paper proposes a novel ChatGPT-generated code detector, CodeGPTSensor, based on a contrastive learning framework and a semantic encoder built with UniXcoder.
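As a rough illustration of the contrastive idea (not CodeGPTSensor's actual UniXcoder-based architecture), a pairwise loss can pull embeddings of same-origin code together and push human/LLM pairs apart:

```python
# Toy illustration of contrastive learning for an LLM-code detector,
# with random embeddings in place of a real semantic encoder.
import torch
import torch.nn.functional as F

def pairwise_contrastive_loss(emb: torch.Tensor, labels: torch.Tensor,
                              margin: float = 1.0) -> torch.Tensor:
    """emb: (N, D) code embeddings; labels: (N,) 0 = human, 1 = LLM."""
    dist = torch.cdist(emb, emb)                       # (N, N) pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # same-class mask
    pos = dist[same].pow(2).mean()                     # attract same-class pairs
    neg = F.relu(margin - dist[~same]).pow(2).mean()   # repel cross-class pairs
    return pos + neg

# Example: embeddings for 4 snippets, first two human, last two LLM.
emb = torch.randn(4, 8, requires_grad=True)
labels = torch.tensor([0, 0, 1, 1])
pairwise_contrastive_loss(emb, labels).backward()
```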
arXiv Detail & Related papers (2024-11-07T13:39:14Z)
- What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated than canonical solutions.
We develop a taxonomy of bugs for incorrect code that includes three categories and 12 sub-categories, and analyze the root causes of common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
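A minimal sketch of such a self-critique loop, using Python's built-in `compile` as a stand-in for compiler feedback and a hypothetical `generate` function for the LLM:

```python
# Sketch of the self-critique idea: check the candidate with the
# compiler (here Python's `compile`) and, on failure, ask the model
# to critique and rewrite. `generate` is a hypothetical LLM call.
from typing import Callable

def critique_and_fix(code: str, generate: Callable[[str], str],
                     rounds: int = 3) -> str:
    for _ in range(rounds):
        try:
            compile(code, "<candidate>", "exec")   # cheap syntax-level check
            return code                            # compiles: accept as-is
        except SyntaxError as err:
            prompt = (f"The following code fails to compile:\n{code}\n"
                      f"Compiler feedback: {err}\n"
                      "Critique the code, identify the bug type, and rewrite it.")
            code = generate(prompt)
    return code
```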
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
- CodeRAG-Bench: Can Retrieval Augment Code Generation? [78.37076502395699]
We conduct a systematic, large-scale analysis of code generation using retrieval-augmented generation.
We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks.
We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources.
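The core retrieval-augmented recipe can be sketched in a few lines; the token-overlap retriever below is purely illustrative, not one of CodeRAG-Bench's actual retrievers:

```python
# Minimal retrieval-augmented generation sketch: rank documents by
# token overlap with the task, then prepend the top hits to the prompt.
from collections import Counter

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    q = Counter(query.lower().split())
    def score(doc: str) -> int:
        return sum((q & Counter(doc.lower().split())).values())
    return sorted(docs, key=score, reverse=True)[:k]

def build_prompt(task: str, docs: list[str]) -> str:
    context = "\n---\n".join(retrieve(task, docs))
    return f"Relevant context:\n{context}\n\nTask: {task}\nWrite the code."

docs = ["bisect.insort inserts into a sorted list",
        "heapq implements a binary min-heap",
        "itertools.groupby groups consecutive keys"]
print(build_prompt("insert an element into a sorted list", docs))
```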
arXiv Detail & Related papers (2024-06-20T16:59:52Z)
- CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code [56.019447113206006]
Large Language Models (LLMs) have achieved remarkable progress in code generation.
CodeIP is a novel multi-bit watermarking technique that embeds additional information to preserve provenance details.
Experiments conducted on a real-world dataset across five programming languages demonstrate the effectiveness of CodeIP.
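A highly simplified sketch of the multi-bit idea: hash tokens into two partitions and bias sampling toward the partition that encodes the next payload bit. CodeIP's real scheme is grammar-guided, boosting only tokens that are syntactically valid at the current position; this toy omits that constraint.

```python
# Simplified multi-bit logit-bias watermark (not CodeIP's actual
# grammar-guided scheme; the grammar/type constraint is omitted here).
import hashlib

def bucket(token: str, key: str) -> int:
    """Deterministically assign each token to partition 0 or 1."""
    return int(hashlib.sha256((key + token).encode()).hexdigest(), 16) % 2

def boost_logits(logits: dict[str, float], bit: int, key: str,
                 delta: float = 2.0) -> dict[str, float]:
    """Bias sampling toward tokens whose partition encodes the payload bit."""
    return {tok: score + (delta if bucket(tok, key) == bit else 0.0)
            for tok, score in logits.items()}

def decode_bit(tokens: list[str], key: str) -> int:
    """Recover a bit by majority vote over the emitted tokens' partitions."""
    votes = sum(bucket(tok, key) for tok in tokens)
    return int(2 * votes >= len(tokens))

logits = {"if": 1.0, "for": 0.9, "while": 0.2, "return": 0.5}
print(boost_logits(logits, bit=1, key="secret"))
```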
arXiv Detail & Related papers (2024-04-24T04:25:04Z)
- Code Prompting Elicits Conditional Reasoning Abilities in Text+Code LLMs [65.2379940117181]
We introduce code prompting, a chain of prompts that transforms a natural language problem into code.
We find that code prompting provides a substantial performance boost for multiple LLMs.
Our analysis of GPT-3.5 reveals that the code formatting of the input problem is essential for the performance improvement.
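A toy template showing the flavor of code prompting (illustrative, not the paper's exact prompt): the problem's facts become variables and the question becomes a comment for the model to reason over.

```python
# Sketch of "code prompting": rewrite a natural-language conditional
# problem as code before asking the model to answer.
def to_code_prompt(question: str, conditions: dict[str, bool]) -> str:
    cond_lines = "\n".join(f"{name} = {value}"
                           for name, value in conditions.items())
    return (
        "# Facts from the problem, encoded as variables\n"
        f"{cond_lines}\n\n"
        f'question = "{question}"\n'
        "# Reason over the variables above and print the answer.\n"
    )

print(to_code_prompt(
    "Can the applicant get a visa waiver?",
    {"has_valid_passport": True, "stay_under_90_days": False},
))
```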
arXiv Detail & Related papers (2024-01-18T15:32:24Z)
- Exploring the Potential of ChatGPT in Automated Code Refinement: An Empirical Study [0.0]
ChatGPT, a cutting-edge language model, has demonstrated impressive performance in various natural language processing tasks.
We conduct the first empirical study to understand the capabilities of ChatGPT in code review tasks.
Our results show that ChatGPT achieves EM and BLEU scores of 22.78 and 76.44, respectively, on a high-quality code review dataset, while the state-of-the-art method achieves only 15.50 and 62.88.
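For reference, the two metrics can be sketched as follows. The BLEU here is a simplified single-reference variant with crude smoothing, a sketch rather than a substitute for standard tooling:

```python
# EM is exact (token-level) string match; BLEU combines modified
# n-gram precisions with a brevity penalty. Simplified sketch only.
import math
from collections import Counter

def exact_match(hyp: str, ref: str) -> bool:
    return hyp.split() == ref.split()

def bleu(hyp: str, ref: str, max_n: int = 4) -> float:
    h, r = hyp.split(), ref.split()
    if not h:
        return 0.0
    logs = []
    for n in range(1, max_n + 1):
        h_ngrams = Counter(tuple(h[i:i + n]) for i in range(len(h) - n + 1))
        r_ngrams = Counter(tuple(r[i:i + n]) for i in range(len(r) - n + 1))
        overlap = sum((h_ngrams & r_ngrams).values())
        total = max(sum(h_ngrams.values()), 1)
        logs.append(math.log(overlap / total) if overlap else -9.0)  # crude smoothing
    bp = min(1.0, math.exp(1 - len(r) / len(h)))   # brevity penalty
    return bp * math.exp(sum(logs) / max_n)

print(exact_match("return a + b", "return a + b"))      # True
print(round(bleu("return a + b", "return a + b"), 3))   # 1.0
```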
arXiv Detail & Related papers (2023-09-15T07:41:33Z)
- FacTool: Factuality Detection in Generative AI -- A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios [87.12753459582116]
A widening range of tasks now faces an increasing risk of containing factual errors when handled by generative models.
We propose FacTool, a task and domain agnostic framework for detecting factual errors of texts generated by large language models.
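The tool-augmented pipeline can be sketched with hypothetical stand-ins for the claim extractor, tool call, and verifier (see the FacTool paper and repository for the real framework):

```python
# Sketch of a tool-augmented factuality pipeline in the spirit of
# FacTool: extract claims, gather evidence via a tool, then verify.
# All three callables are hypothetical stand-ins for LLM/tool calls.
from typing import Callable

def check_factuality(text: str,
                     extract_claims: Callable[[str], list[str]],
                     search: Callable[[str], str],
                     verify: Callable[[str, str], bool]) -> dict[str, bool]:
    verdicts = {}
    for claim in extract_claims(text):
        evidence = search(claim)          # e.g. web search or code execution
        verdicts[claim] = verify(claim, evidence)
    return verdicts
```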
arXiv Detail & Related papers (2023-07-25T14:20:51Z)
- Refining ChatGPT-Generated Code: Characterizing and Mitigating Code Quality Issues [17.7880460531813]
We systematically study the quality of 4,066 ChatGPT-generated code snippets implemented in two popular programming languages.
We identify and characterize potential issues with the quality of ChatGPT-generated code.
We find that ChatGPT can partially address these challenges, improving code quality by more than 20%, but there are still limitations and opportunities for improvement.
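One illustrative way to surface such quality issues mechanically before asking for a fix is a tiny AST-based checker; this is a toy, not the analyzers used in the paper:

```python
# Toy static checker for generated Python: flags a couple of common
# quality issues that could be fed back to the model for refinement.
import ast

def quality_issues(code: str) -> list[str]:
    issues = []
    try:
        tree = ast.parse(code)
    except SyntaxError as err:
        return [f"syntax error: {err}"]
    for node in ast.walk(tree):
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            issues.append(f"line {node.lineno}: bare `except:` swallows errors")
        if isinstance(node, ast.FunctionDef) and ast.get_docstring(node) is None:
            issues.append(f"line {node.lineno}: `{node.name}` has no docstring")
    return issues

print(quality_issues(
    "def f(x):\n    try:\n        return 1 / x\n    except:\n        pass\n"))
```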
arXiv Detail & Related papers (2023-07-24T08:14:22Z)
- Think Outside the Code: Brainstorming Boosts Large Language Models in Code Generation [9.904734169174356]
In this paper, we introduce the Brainstorm framework for code generation.
It leverages a brainstorming step that generates and selects diverse thoughts on the problem.
Brainstorm significantly enhances the ability of LLMs to solve competition-level programming problems.
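The brainstorm-then-generate flow reduces to a small pipeline. Here `propose`, `score`, and `codegen` are hypothetical stand-ins for the model calls, not the paper's implementation:

```python
# Sketch of brainstorm-then-generate: sample several high-level ideas,
# pick the best with a scoring model, then generate code conditioned
# on the chosen idea.
from typing import Callable

def brainstorm_generate(problem: str,
                        propose: Callable[[str], str],
                        score: Callable[[str, str], float],
                        codegen: Callable[[str, str], str],
                        n_ideas: int = 5) -> str:
    ideas = [propose(problem) for _ in range(n_ideas)]     # diverse thoughts
    best = max(ideas, key=lambda idea: score(problem, idea))
    return codegen(problem, best)                          # condition on the idea
```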
arXiv Detail & Related papers (2023-05-18T03:32:54Z)
- Improving ChatGPT Prompt for Code Generation [13.303599826870705]
OpenAI's language model ChatGPT has emerged as a powerful tool for generating human-like responses to a wide range of textual inputs.
We evaluate ChatGPT's capabilities on two code generation tasks: text-to-code and code-to-code generation.
Our results show that by carefully designing prompts to guide ChatGPT, generation performance can be improved substantially.
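An illustrative prompt template in this spirit (not the paper's exact prompts) makes the signature, the I/O contract, and an example explicit rather than pasting raw problem text:

```python
# Sketch of a structured code-generation prompt: explicit signature,
# description, and worked example, with output constrained to code.
def code_prompt(description: str, signature: str,
                example_in: str, example_out: str) -> str:
    return (
        "Implement the following function.\n"
        f"Signature: {signature}\n"
        f"Description: {description}\n"
        f"Example input: {example_in}\n"
        f"Example output: {example_out}\n"
        "Return only the code, with no explanation.\n"
    )

print(code_prompt("Return the sum of a list of integers.",
                  "def list_sum(xs: list[int]) -> int",
                  "[1, 2, 3]", "6"))
```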
arXiv Detail & Related papers (2023-05-15T05:37:33Z)
- Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
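APPS-style scoring reduces to running a candidate program against stdin/stdout test cases. The sketch below is illustrative, not the official harness, and untrusted generated code should really be sandboxed rather than run like this:

```python
# Run a candidate program on each test's stdin and compare stdout.
import subprocess
import sys

def pass_rate(source_file: str, tests: list[tuple[str, str]]) -> float:
    """Fraction of (stdin, expected stdout) tests the program passes."""
    passed = 0
    for stdin, expected in tests:
        try:
            result = subprocess.run([sys.executable, source_file],
                                    input=stdin, capture_output=True,
                                    text=True, timeout=5)
        except subprocess.TimeoutExpired:
            continue                      # time limit exceeded counts as a fail
        passed += result.stdout.strip() == expected.strip()
    return passed / len(tests)
```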
arXiv Detail & Related papers (2021-05-20T17:58:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of the information it presents and is not responsible for any consequences of its use.