GitHub Copilot: the perfect Code compLeeter?
- URL: http://arxiv.org/abs/2406.11326v1
- Date: Mon, 17 Jun 2024 08:38:29 GMT
- Title: GitHub Copilot: the perfect Code compLeeter?
- Authors: Ilja Siroš, Dave Singelée, Bart Preneel
- Abstract summary: This paper aims to evaluate GitHub Copilot's generated code quality based on the LeetCode problem set.
We evaluate Copilot's reliability in the code generation stage, the correctness of the generated code and its dependency on the programming language.
- Score: 3.708656266586145
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper aims to evaluate GitHub Copilot's generated code quality based on the LeetCode problem set using a custom automated framework. We evaluate the results of Copilot for 4 programming languages: Java, C++, Python3 and Rust. We aim to evaluate Copilot's reliability in the code generation stage, the correctness of the generated code and its dependency on the programming language, the problem's difficulty level and the problem's topic. In addition, we evaluate the code's time and memory efficiency and compare it to the average human results. In total, we generate solutions for 1760 problems for each programming language and evaluate all of Copilot's suggestions for each problem, resulting in over 50000 submissions to LeetCode spread over a 2-month period. We found that Copilot successfully solved most of the problems. However, Copilot was more successful in generating code in Java and C++ than in Python3 and Rust. Moreover, in the case of Python3, Copilot proved to be rather unreliable in the code generation phase. We also discovered that Copilot's top-ranked suggestions are not always the best. In addition, we analysed how the topic of the problem impacts the correctness rate. Finally, based on statistics from LeetCode, we can conclude that Copilot generates more efficient code than an average human.
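As a rough illustration of the evaluation pipeline the abstract describes, the sketch below shows one possible structure for such a framework. The helper functions (`generate_copilot_suggestions`, `submit_to_leetcode`) are hypothetical placeholders, not the authors' actual implementation, which drives the Copilot plugin and LeetCode submissions directly.

```python
# Minimal sketch of an automated Copilot-on-LeetCode evaluation loop.
# The helpers below are hypothetical stand-ins for the paper's framework.
from collections import defaultdict

LANGUAGES = ["java", "cpp", "python3", "rust"]

def generate_copilot_suggestions(problem_id: str, language: str) -> list[str]:
    """Hypothetical: return all Copilot suggestions for one problem/language."""
    raise NotImplementedError

def submit_to_leetcode(problem_id: str, language: str, code: str) -> str:
    """Hypothetical: submit code and return a verdict such as 'Accepted'."""
    raise NotImplementedError

def evaluate(problem_ids: list[str]) -> dict[str, float]:
    solved = defaultdict(int)
    for language in LANGUAGES:
        for pid in problem_ids:
            suggestions = generate_copilot_suggestions(pid, language)
            # A problem counts as solved if any suggestion is accepted; keeping
            # per-suggestion verdicts also allows comparing top-ranked versus
            # lower-ranked suggestions, as the paper does.
            if any(submit_to_leetcode(pid, language, s) == "Accepted"
                   for s in suggestions):
                solved[language] += 1
    return {lang: solved[lang] / len(problem_ids) for lang in LANGUAGES}
```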
Related papers
- Exploring the Effect of Multiple Natural Languages on Code Suggestion Using GitHub Copilot [46.822148186169144]
GitHub Copilot is an AI-enabled tool that automates program synthesis.
Recent studies have extensively examined Copilot's capabilities in various programming tasks.
However, little is known about the effect of different natural languages on code suggestion.
arXiv Detail & Related papers (2024-02-02T14:30:02Z)
- Copilot-in-the-Loop: Fixing Code Smells in Copilot-Generated Python Code using Copilot [2.3353795064263543]
Python code experiences a decrease in readability and maintainability when code smells are present.
Recent advancements in Large Language Models have sparked growing interest in AI-enabled tools for both code generation and understanding.
GitHub Copilot is one such tool that has gained widespread usage.
Copilot Chat, released in September 2023, functions as an interactive tool aimed at facilitating natural language-powered coding.
arXiv Detail & Related papers (2024-01-25T13:39:54Z)
- Exploring the Problems, their Causes and Solutions of AI Pair Programming: A Study on GitHub and Stack Overflow [6.724815667295355]
GitHub Copilot, the AI pair programmer, utilizes machine learning models trained on a large corpus of code snippets to generate code suggestions.
Despite its popularity in software development, there is limited empirical evidence on the actual experiences of practitioners who work with Copilot.
We collected data from 473 GitHub issues, 706 GitHub discussions, and 142 Stack Overflow posts.
arXiv Detail & Related papers (2023-11-02T06:24:38Z)
- A Static Evaluation of Code Completion by Large Language Models [65.18008807383816]
Execution-based benchmarks have been proposed to evaluate functional correctness of model-generated code on simple programming problems.
However, static analysis tools such as linters, which can detect errors without running the program, have not been well explored for evaluating code generation models.
We propose a static evaluation framework to quantify static errors in Python code completions, by leveraging Abstract Syntax Trees.
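A minimal sketch of the general idea (not the paper's actual framework): Python's built-in `ast` module can flag completions that are not even syntactically valid, without executing them. The paper's framework goes further (e.g. undefined names), but parsing into an Abstract Syntax Tree is the first step.

```python
# Sketch: detect static (syntax-level) errors in code completions without
# running them, using Python's built-in AST parser.
import ast

def has_static_error(completion: str) -> bool:
    """Return True if the completion fails to parse into an AST."""
    try:
        ast.parse(completion)
        return False
    except SyntaxError:
        return True

print(has_static_error("def add(a, b):\n    return a + b"))  # False
print(has_static_error("def add(a, b)\n    return a + b"))   # True (missing colon)
```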
arXiv Detail & Related papers (2023-06-05T19:23:34Z)
- Measuring the Runtime Performance of Code Produced with GitHub Copilot [1.6021036144262577]
We evaluate the runtime performance of code produced when developers use GitHub Copilot versus when they do not.
Our results suggest that using Copilot may produce code with a significantly slower runtime performance.
arXiv Detail & Related papers (2023-05-10T20:14:52Z)
- DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation [70.96868419971756]
DS-1000 is a code generation benchmark with a thousand data science problems spanning seven Python libraries.
First, our problems reflect diverse, realistic, and practical use cases since we collected them from StackOverflow.
Second, our automatic evaluation is highly specific (reliable): across all Codex-predicted solutions that our evaluation accepts, only 1.8% are incorrect.
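To illustrate what such an automatic, execution-based check can look like (a simplified sketch, not the DS-1000 harness itself), a predicted completion can be run against reference test cases and counted as correct only if every test passes; the `solve` entry-point name below is an assumption for the example.

```python
# Simplified sketch of execution-based scoring: a predicted solution is
# accepted only if all reference test cases pass. DS-1000 additionally
# applies surface-form constraints; those are omitted here.
def passes_all_tests(solution_code: str, test_cases: list[tuple]) -> bool:
    namespace: dict = {}
    exec(solution_code, namespace)        # define the candidate function
    solve = namespace["solve"]            # assumed entry-point name
    return all(solve(*args) == expected for args, expected in test_cases)

candidate = "def solve(xs):\n    return sorted(xs)"
tests = [(([3, 1, 2],), [1, 2, 3]), (([],), [])]
print(passes_all_tests(candidate, tests))  # True
```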
arXiv Detail & Related papers (2022-11-18T17:20:27Z)
- GitHub Copilot AI pair programmer: Asset or Liability? [14.572381978575182]
We study the capabilities of Copilot in two different programming tasks.
We compare Copilot's proposed solutions with those of human programmers on a set of programming tasks.
The results show that Copilot is capable of providing solutions for almost all fundamental algorithmic problems.
arXiv Detail & Related papers (2022-06-30T15:00:03Z)
- AVATAR: A Parallel Corpus for Java-Python Program Translation [77.86173793901139]
Program translation refers to migrating source code from one language to another.
We present AVATAR, a collection of 9,515 programming problems and their solutions written in two popular languages, Java and Python.
arXiv Detail & Related papers (2021-08-26T05:44:20Z)
- An Empirical Cybersecurity Evaluation of GitHub Copilot's Code Contributions [8.285068188878578]
GitHub Copilot is a language model trained over open-source GitHub code.
Code often contains bugs, and so it is certain that the language model will have learned from exploitable, buggy code.
This raises concerns about the security of Copilot's code contributions.
arXiv Detail & Related papers (2021-08-20T17:30:33Z)
- Break-It-Fix-It: Unsupervised Learning for Program Repair [90.55497679266442]
We propose a new training approach, Break-It-Fix-It (BIFI), which has two key ideas.
We use the critic to check a fixer's output on real bad inputs and add good (fixed) outputs to the training data.
Based on these ideas, we iteratively update the breaker and the fixer while using them in conjunction to generate more paired data.
BIFI outperforms existing methods, obtaining 90.5% repair accuracy on GitHub-Python and 71.7% on DeepFix.
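As a rough sketch of the BIFI idea summarized above (heavily simplified, with hypothetical `critic`, `fixer`, and `breaker` components rather than the paper's actual models): the critic filters the fixer's outputs on real bad code, the accepted pairs extend the training data, and the breaker and fixer are then retrained in alternation.

```python
# Heavily simplified sketch of one Break-It-Fix-It (BIFI) training round.
# `critic`, `fixer.fix/train` and `breaker.break_/train` are hypothetical
# stand-ins for the paper's components, not a real API.
def bifi_round(real_bad_code, critic, fixer, breaker, paired_data):
    # 1. Run the fixer on real bad inputs; keep only outputs the critic
    #    accepts (e.g. the code now parses or compiles).
    for bad in real_bad_code:
        fixed = fixer.fix(bad)
        if critic(fixed):
            paired_data.append((bad, fixed))

    # 2. Retrain the fixer on the grown (bad, good) corpus.
    fixer.train(paired_data)

    # 3. Use the breaker to corrupt good code into realistic bad code,
    #    keeping only outputs the critic judges as genuinely bad, then
    #    retrain breaker and fixer on the combined pairs.
    synthetic = []
    for _, good in paired_data:
        broken = breaker.break_(good)
        if not critic(broken):
            synthetic.append((broken, good))
    breaker.train(synthetic)
    fixer.train(paired_data + synthetic)
    return paired_data
```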
arXiv Detail & Related papers (2021-06-11T20:31:04Z)
- Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
arXiv Detail & Related papers (2021-05-20T17:58:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.