GitHub Copilot: the perfect Code compLeeter?
- URL: http://arxiv.org/abs/2406.11326v1
- Date: Mon, 17 Jun 2024 08:38:29 GMT
- Title: GitHub Copilot: the perfect Code compLeeter?
- Authors: Ilja Siroš, Dave Singelée, Bart Preneel
- Abstract summary: This paper aims to evaluate GitHub Copilot's generated code quality based on the LeetCode problem set.
We evaluate Copilot's reliability in the code generation stage, the correctness of the generated code and its dependency on the programming language.
- Score: 3.708656266586145
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper aims to evaluate GitHub Copilot's generated code quality based on the LeetCode problem set using a custom automated framework. We evaluate the results of Copilot for 4 programming languages: Java, C++, Python3 and Rust. We aim to evaluate Copilot's reliability in the code generation stage, the correctness of the generated code and its dependency on the programming language, the problem's difficulty level and the problem's topic. In addition, we evaluate the code's time and memory efficiency and compare it to the average human results. In total, we generate solutions for 1760 problems for each programming language and evaluate all of Copilot's suggestions for each problem, resulting in over 50000 submissions to LeetCode spread over a 2-month period. We found that Copilot successfully solved most of the problems. However, Copilot was more successful in generating code in Java and C++ than in Python3 and Rust. Moreover, in the case of Python3, Copilot proved to be rather unreliable in the code generation phase. We also discovered that Copilot's top-ranked suggestions are not always the best. In addition, we analysed how the topic of the problem impacts the correctness rate. Finally, based on statistics from LeetCode, we can conclude that Copilot generates more efficient code than an average human.
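As a rough illustration of the evaluation pipeline the abstract describes, the sketch below shows one possible structure for such a framework. The helper functions (`generate_copilot_suggestions`, `submit_to_leetcode`) are hypothetical placeholders, not the authors' actual implementation, which drives the Copilot plugin and LeetCode submissions directly.

```python
# Minimal sketch of an automated Copilot-on-LeetCode evaluation loop.
# The helpers below are hypothetical stand-ins for the paper's framework.
from collections import defaultdict

LANGUAGES = ["java", "cpp", "python3", "rust"]

def generate_copilot_suggestions(problem_id: str, language: str) -> list[str]:
    """Hypothetical: return all Copilot suggestions for one problem/language."""
    raise NotImplementedError

def submit_to_leetcode(problem_id: str, language: str, code: str) -> str:
    """Hypothetical: submit code and return a verdict such as 'Accepted'."""
    raise NotImplementedError

def evaluate(problem_ids: list[str]) -> dict[str, float]:
    solved = defaultdict(int)
    for language in LANGUAGES:
        for pid in problem_ids:
            suggestions = generate_copilot_suggestions(pid, language)
            # A problem counts as solved if any suggestion is accepted; keeping
            # per-suggestion verdicts also allows comparing top-ranked versus
            # lower-ranked suggestions, as the paper does.
            if any(submit_to_leetcode(pid, language, s) == "Accepted"
                   for s in suggestions):
                solved[language] += 1
    return {lang: solved[lang] / len(problem_ids) for lang in LANGUAGES}
```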
Related papers
- Exploring the Effect of Multiple Natural Languages on Code Suggestion Using GitHub Copilot [46.822148186169144]
GitHub Copilot is an AI-enabled tool that automates program synthesis.
Recent studies have extensively examined Copilot's capabilities in various programming tasks.
However, little is known about the effect of different natural languages on code suggestion.
arXiv Detail & Related papers (2024-02-02T14:30:02Z)
- Copilot-in-the-Loop: Fixing Code Smells in Copilot-Generated Python Code using Copilot [2.3353795064263543]
Python code experiences a decrease in readability and maintainability when code smells are present.
Recent advancements in Large Language Models have sparked growing interest in AI-enabled tools for both code generation and understanding.
GitHub Copilot is one such tool that has gained widespread usage.
Copilot Chat, released in September 2023, functions as an interactive tool aimed at facilitating natural language-powered coding.
arXiv Detail & Related papers (2024-01-25T13:39:54Z)
- Exploring the Problems, their Causes and Solutions of AI Pair Programming: A Study on GitHub and Stack Overflow [6.724815667295355]
GitHub Copilot, the AI pair programmer, utilizes machine learning models trained on a large corpus of code snippets to generate code suggestions.
Despite its popularity in software development, there is limited empirical evidence on the actual experiences of practitioners who work with Copilot.
We collected data from 473 GitHub issues, 706 GitHub discussions, and 142 Stack Overflow posts.
arXiv Detail & Related papers (2023-11-02T06:24:38Z)
- A Static Evaluation of Code Completion by Large Language Models [65.18008807383816]
Execution-based benchmarks have been proposed to evaluate functional correctness of model-generated code on simple programming problems.
However, static analysis tools such as linters, which can detect errors without running the program, have not been well explored for evaluating code generation models.
We propose a static evaluation framework to quantify static errors in Python code completions, by leveraging Abstract Syntax Trees.
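A minimal sketch of the general idea (not the paper's actual framework): Python's built-in `ast` module can flag completions that are not even syntactically valid, without executing them. The paper's framework goes further (e.g. undefined names), but parsing into an Abstract Syntax Tree is the first step.

```python
# Sketch: detect static (syntax-level) errors in code completions without
# running them, using Python's built-in AST parser.
import ast

def has_static_error(completion: str) -> bool:
    """Return True if the completion fails to parse into an AST."""
    try:
        ast.parse(completion)
        return False
    except SyntaxError:
        return True

print(has_static_error("def add(a, b):\n    return a + b"))  # False
print(has_static_error("def add(a, b)\n    return a + b"))   # True (missing colon)
```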
arXiv Detail & Related papers (2023-06-05T19:23:34Z)
- Measuring the Runtime Performance of Code Produced with GitHub Copilot [1.6021036144262577]
We evaluate the runtime performance of code produced when developers use GitHub Copilot versus when they do not.
Our results suggest that using Copilot may produce code with a significantly slower runtime performance.
arXiv Detail & Related papers (2023-05-10T20:14:52Z)
- DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation [70.96868419971756]
DS-1000 is a code generation benchmark with a thousand data science problems spanning seven Python libraries.
First, our problems reflect diverse, realistic, and practical use cases since we collected them from StackOverflow.
Second, our automatic evaluation is highly specific (reliable): across all Codex-predicted solutions that our evaluation accepts, only 1.8% are incorrect.
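To illustrate what such an automatic, execution-based check can look like (a simplified sketch, not the DS-1000 harness itself), a predicted completion can be run against reference test cases and counted as correct only if every test passes; the `solve` entry-point name below is an assumption for the example.

```python
# Simplified sketch of execution-based scoring: a predicted solution is
# accepted only if all reference test cases pass. DS-1000 additionally
# applies surface-form constraints; those are omitted here.
def passes_all_tests(solution_code: str, test_cases: list[tuple]) -> bool:
    namespace: dict = {}
    exec(solution_code, namespace)        # define the candidate function
    solve = namespace["solve"]            # assumed entry-point name
    return all(solve(*args) == expected for args, expected in test_cases)

candidate = "def solve(xs):\n    return sorted(xs)"
tests = [(([3, 1, 2],), [1, 2, 3]), (([],), [])]
print(passes_all_tests(candidate, tests))  # True
```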
arXiv Detail & Related papers (2022-11-18T17:20:27Z)
- GitHub Copilot AI pair programmer: Asset or Liability? [14.572381978575182]
We study the capabilities of Copilot in two different programming tasks.
We compare Copilot's proposed solutions with those of human programmers on a set of programming tasks.
The results show that Copilot is capable of providing solutions for almost all fundamental algorithmic problems.
arXiv Detail & Related papers (2022-06-30T15:00:03Z)
- AVATAR: A Parallel Corpus for Java-Python Program Translation [77.86173793901139]
Program translation refers to migrating source code from one language to another.
We present AVATAR, a collection of 9,515 programming problems and their solutions written in two popular languages, Java and Python.
arXiv Detail & Related papers (2021-08-26T05:44:20Z)
- An Empirical Cybersecurity Evaluation of GitHub Copilot's Code Contributions [8.285068188878578]
GitHub Copilot is a language model trained over open-source GitHub code.
Code often contains bugs, and so it is certain that the language model will have learned from exploitable, buggy code.
This raises concerns about the security of Copilot's code contributions.
arXiv Detail & Related papers (2021-08-20T17:30:33Z)
- Break-It-Fix-It: Unsupervised Learning for Program Repair [90.55497679266442]
We propose a new training approach, Break-It-Fix-It (BIFI), which has two key ideas.
We use the critic to check a fixer's output on real bad inputs and add good (fixed) outputs to the training data.
Based on these ideas, we iteratively update the breaker and the fixer while using them in conjunction to generate more paired data.
BIFI outperforms existing methods, obtaining 90.5% repair accuracy on GitHub-Python and 71.7% on DeepFix.
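As a rough sketch of the BIFI idea summarized above (heavily simplified, with hypothetical `critic`, `fixer`, and `breaker` components rather than the paper's actual models): the critic filters the fixer's outputs on real bad code, the accepted pairs extend the training data, and the breaker and fixer are then retrained in alternation.

```python
# Heavily simplified sketch of one Break-It-Fix-It (BIFI) training round.
# `critic`, `fixer.fix/train` and `breaker.break_/train` are hypothetical
# stand-ins for the paper's components, not a real API.
def bifi_round(real_bad_code, critic, fixer, breaker, paired_data):
    # 1. Run the fixer on real bad inputs; keep only outputs the critic
    #    accepts (e.g. the code now parses or compiles).
    for bad in real_bad_code:
        fixed = fixer.fix(bad)
        if critic(fixed):
            paired_data.append((bad, fixed))

    # 2. Retrain the fixer on the grown (bad, good) corpus.
    fixer.train(paired_data)

    # 3. Use the breaker to corrupt good code into realistic bad code,
    #    keeping only outputs the critic judges as genuinely bad, then
    #    retrain breaker and fixer on the combined pairs.
    synthetic = []
    for _, good in paired_data:
        broken = breaker.break_(good)
        if not critic(broken):
            synthetic.append((broken, good))
    breaker.train(synthetic)
    fixer.train(paired_data + synthetic)
    return paired_data
```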
arXiv Detail & Related papers (2021-06-11T20:31:04Z)
- Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
arXiv Detail & Related papers (2021-05-20T17:58:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.