Copilot-in-the-Loop: Fixing Code Smells in Copilot-Generated Python Code using Copilot
- URL: http://arxiv.org/abs/2401.14176v2
- Date: Wed, 21 Aug 2024 14:57:43 GMT
- Title: Copilot-in-the-Loop: Fixing Code Smells in Copilot-Generated Python Code using Copilot
- Authors: Beiqi Zhang, Peng Liang, Qiong Feng, Yujia Fu, Zengyang Li,
- Abstract summary: Python code suffers reduced readability and maintainability when code smells are present.
Recent advancements in Large Language Models have sparked growing interest in AI-enabled tools for both code generation and understanding.
GitHub Copilot is one such tool that has gained widespread usage.
Copilot Chat, released in September 2023, functions as an interactive tool aimed at facilitating natural language-powered coding.
- Score: 2.3353795064263543
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As one of the most popular dynamic languages, Python experiences a decrease in readability and maintainability when code smells are present. Recent advancements in Large Language Models have sparked growing interest in AI-enabled tools for both code generation and refactoring. GitHub Copilot is one such tool that has gained widespread usage. Copilot Chat, released in September 2023, functions as an interactive tool aimed at facilitating natural language-powered coding. However, limited attention has been given to understanding code smells in Copilot-generated Python code and Copilot Chat's ability to fix the code smells. To this end, we built a dataset comprising 102 code smells in Copilot-generated Python code. Our aim is to first explore the occurrence of code smells in Copilot-generated Python code and then evaluate the effectiveness of Copilot Chat in fixing these code smells using different prompts. The results show that 8 out of 10 types of code smells can be detected in Copilot-generated Python code, among which Multiply-Nested Container is the most common one. For these code smells, Copilot Chat achieves its highest fixing rate of 87.1%, showing promise in fixing Python code smells generated by Copilot itself. In addition, the effectiveness of Copilot Chat in fixing these smells can be improved by providing more detailed prompts.
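To make the smell terminology concrete, here is a minimal, hypothetical Python sketch of the Multiply-Nested Container smell and one possible fix; the example is illustrative only and is not drawn from the paper's 102-smell dataset.

```python
from dataclasses import dataclass

# Smell: Multiply-Nested Container -- a dict of dicts of lists whose
# structure is implicit and easy to index incorrectly.
scores_smelly: dict[str, dict[str, list[int]]] = {
    "alice": {"math": [90, 85], "physics": [78]},
}

# One possible fix: replace the nesting with an explicit record type.
@dataclass
class ScoreRecord:
    student: str
    subject: str
    scores: list[int]

records = [
    ScoreRecord("alice", "math", [90, 85]),
    ScoreRecord("alice", "physics", [78]),
]

print(scores_smelly["alice"]["math"])                      # fragile nested indexing
print([r.scores for r in records if r.student == "alice"])  # explicit structure
```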
Related papers
- CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution [50.7413285637879]
The CRUXEval-X code reasoning benchmark covers 19 programming languages.
It comprises at least 600 subjects for each language, along with 19K content-consistent tests in total.
Even a model trained solely on Python can achieve at most 34.4% Pass@1 in other languages.
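For context on the Pass@1 figure quoted above, such values are typically computed with the standard unbiased pass@k estimator; the sketch below shows only that formula, since the abstract does not describe CRUXEval-X's actual evaluation harness.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn from n generations of which c are correct, passes the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=3, k=1))  # 0.3 -- with k=1 this reduces to c/n
```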
arXiv Detail & Related papers (2024-08-23T11:43:00Z) - Where Are Large Language Models for Code Generation on GitHub? [10.389763758883975]
ChatGPT and Copilot are the tools most frequently used for generating code on GitHub.
Most ChatGPT/Copilot-generated code snippets are relatively short and exhibit low complexity.
Modifications due to bugs are even fewer, ranging from just 3% to 8% across different languages.
arXiv Detail & Related papers (2024-06-27T21:47:27Z) - GitHub Copilot: the perfect Code compLeeter? [3.708656266586145]
This paper aims to evaluate GitHub Copilot's generated code quality based on the LeetCode problem set.
We evaluate Copilot's reliability in the code generation stage, the correctness of the generated code, and its dependence on the programming language.
arXiv Detail & Related papers (2024-06-17T08:38:29Z) - Exploring the Effect of Multiple Natural Languages on Code Suggestion Using GitHub Copilot [46.822148186169144]
GitHub Copilot is an AI-enabled tool that automates program synthesis.
Recent studies have extensively examined Copilot's capabilities in various programming tasks.
However, little is known about the effect of different natural languages on code suggestion.
arXiv Detail & Related papers (2024-02-02T14:30:02Z) - Demystifying Practices, Challenges and Expected Features of Using GitHub Copilot [3.655281304961642]
We conducted an empirical study by collecting and analyzing the data from Stack Overflow (SO) and GitHub Discussions.
We identified the programming languages, technologies used with Copilot, functions implemented, benefits, limitations, and challenges when using Copilot.
Our results suggest that using Copilot is a double-edged sword: developers must carefully consider various aspects when deciding whether or not to use it.
arXiv Detail & Related papers (2023-09-11T16:39:37Z) - InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback [50.725076393314964]
We introduce InterCode, a lightweight, flexible, and easy-to-use framework of interactive coding as a standard reinforcement learning environment.
Our framework is language- and platform-agnostic and uses self-contained Docker environments to provide safe and reproducible execution.
We demonstrate InterCode's viability as a testbed by evaluating multiple state-of-the-art LLMs configured with different prompting strategies.
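To illustrate the "interactive coding as a standard reinforcement learning environment" framing, here is a hypothetical reset/step interface in the Gym style; InterCode's real API executes actions inside Docker containers, which this toy version only simulates in memory.

```python
class ToyCodeEnv:
    """Toy stand-in for an InterCode-style environment: the agent emits a
    command, the environment executes it and returns execution feedback."""

    def __init__(self, goal_file: str = "hello.txt"):
        self.goal_file = goal_file
        self.files: set[str] = set()

    def reset(self) -> str:
        self.files.clear()
        return f"task: create a file named {self.goal_file}"

    def step(self, action: str):
        # Simulate a single toy command: "touch <name>".
        if action.startswith("touch "):
            self.files.add(action.split(" ", 1)[1])
            observation = ""
        else:
            observation = f"unknown command: {action}"
        done = self.goal_file in self.files
        reward = 1.0 if done else 0.0  # sparse, execution-based reward
        return observation, reward, done

env = ToyCodeEnv()
print(env.reset())
_, reward, done = env.step("touch hello.txt")
print(reward, done)  # 1.0 True
```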
arXiv Detail & Related papers (2023-06-26T17:59:50Z) - A Static Evaluation of Code Completion by Large Language Models [65.18008807383816]
Execution-based benchmarks have been proposed to evaluate the functional correctness of model-generated code on simple programming problems.
However, static analysis tools such as linters, which can detect errors without running the program, have not been well explored for evaluating code generation models.
We propose a static evaluation framework to quantify static errors in Python code completions by leveraging Abstract Syntax Trees.
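As a rough illustration of an AST-based static check like the one this abstract describes, here is a crude sketch built on Python's standard ast module; the paper's framework and error taxonomy are richer, and this version deliberately ignores scoping rules and function parameters.

```python
import ast
import builtins

def static_errors(code: str) -> list[str]:
    """Report errors detectable without executing the code."""
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return [f"SyntaxError: {exc.msg} (line {exc.lineno})"]
    # Collect names that the completion defines (very coarsely).
    defined = set(dir(builtins))
    for node in ast.walk(tree):
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            defined.add(node.id)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            defined.update(a.asname or a.name.split(".")[0] for a in node.names)
    # Flag loads of names that were never defined.
    return [f"undefined name '{n.id}' (line {n.lineno})"
            for n in ast.walk(tree)
            if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)
            and n.id not in defined]

print(static_errors("def f(:\n    pass"))    # SyntaxError
print(static_errors("x = 1\nprint(x + y)"))  # undefined name 'y'
```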
arXiv Detail & Related papers (2023-06-05T19:23:34Z) - Level 2 Autonomous Driving on a Single Device: Diving into the Devils of Openpilot [112.21008828205409]
Comma.ai claims that a single $999 aftermarket device, equipped with one camera and a board inside, is able to handle L2 driving scenarios.
Together with the open-sourced software for the entire system released by Comma.ai, the project is named Openpilot.
In this report, we share our latest findings and shed light on end-to-end autonomous driving from an industrial, product-level perspective.
arXiv Detail & Related papers (2022-06-16T13:43:52Z) - Is GitHub's Copilot as Bad as Humans at Introducing Vulnerabilities in Code? [12.350130201627186]
We perform a comparative empirical analysis of Copilot-generated code from a security perspective.
We investigate whether Copilot is just as likely to introduce the same software vulnerabilities as human developers.
arXiv Detail & Related papers (2022-04-10T18:32:04Z) - An Empirical Cybersecurity Evaluation of GitHub Copilot's Code Contributions [8.285068188878578]
GitHub Copilot is a language model trained over open-source GitHub code.
Code often contains bugs, so it is certain that the language model will have learned from exploitable, buggy code.
This raises concerns on the security of Copilot's code contributions.
arXiv Detail & Related papers (2021-08-20T17:30:33Z) - Break-It-Fix-It: Unsupervised Learning for Program Repair [90.55497679266442]
We propose a new training approach, Break-It-Fix-It (BIFI), which has two key ideas.
We use the critic to check a fixer's output on real bad inputs and add good (fixed) outputs to the training data.
Based on these ideas, we iteratively update the breaker and the fixer while using them in conjunction to generate more paired data.
BIFI outperforms existing methods, obtaining 90.5% repair accuracy on GitHub-Python and 71.7% on DeepFix.
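Here is a schematic sketch of one BIFI round as described in this abstract; critic, train, and the toy fixer/breaker below are placeholders, not the authors' implementation.

```python
def critic(code: str) -> bool:
    """Accept code iff it parses; BIFI's critic is likewise a cheap checker."""
    try:
        compile(code, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False

def train(model, pairs):
    """Placeholder: BIFI fine-tunes seq2seq models here; we return it unchanged."""
    return model

def bifi_round(fixer, breaker, real_bad, real_good):
    # 1. Run the fixer on real bad code; the critic keeps only genuine fixes.
    fixed = [(b, fixer(b)) for b in real_bad]
    fixed = [(b, f) for b, f in fixed if critic(f)]
    # 2. Retrain the breaker on (good -> bad) pairs so its breaks look realistic.
    breaker = train(breaker, [(f, b) for b, f in fixed])
    # 3. Break real good code to synthesize more paired data.
    broken = [(breaker(g), g) for g in real_good]
    broken = [(b, g) for b, g in broken if not critic(b)]
    # 4. Retrain the fixer on both sources of paired data.
    fixer = train(fixer, fixed + broken)
    return fixer, breaker

# Toy stand-ins: the breaker drops a trailing ')', the fixer restores it.
fixer, breaker = bifi_round(
    fixer=lambda bad: bad + ")",
    breaker=lambda good: good[:-1],
    real_bad=["print('hi'"],
    real_good=["print('hello')"],
)
```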
arXiv Detail & Related papers (2021-06-11T20:31:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.