Related papers: On the Generalizability of Deep Learning-based Code Completion Across Programming Language Versions

On the Generalizability of Deep Learning-based Code Completion Across Programming Language Versions

URL: http://arxiv.org/abs/2403.15149v1
Date: Fri, 22 Mar 2024 12:05:18 GMT
Title: On the Generalizability of Deep Learning-based Code Completion Across Programming Language Versions
Authors: Matteo Ciniselli, Alberto Martin-Lopez, Gabriele Bavota,
Abstract summary: Code completion is a key feature of Integrated Development Environments (IDEs) Modern code completion approaches are often powered by deep learning (DL) models. Can these models generalize across different language versions?
Score: 12.245958803682505
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Code completion is a key feature of Integrated Development Environments (IDEs), aimed at predicting the next tokens a developer is likely to write, helping them write code faster and with less effort. Modern code completion approaches are often powered by deep learning (DL) models. However, the swift evolution of programming languages poses a critical challenge to the performance of DL-based code completion models: Can these models generalize across different language versions? This paper delves into such a question. In particular, we assess the capabilities of a state-of-the-art model, CodeT5, to generalize across nine different Java versions, ranging from Java 2 to Java 17, while being exclusively trained on Java 8 code. Our evaluation spans three completion scenarios, namely, predicting tokens, constructs (e.g., the condition of an if statement) and entire code blocks. The results of our study reveal a noticeable disparity among language versions, with the worst performance being obtained in Java 2 and 17 - the most far apart versions compared to Java 8. We investigate possible causes for the performance degradation and show that the adoption of a limited version-specific fine-tuning can partially alleviate the problem. Our work raises awareness on the importance of continuous model refinement, and it can inform the design of alternatives to make code completion models more robust to language evolution.

Related papers

Evaluating Programming Language Confusion [6.462594894731934]
Large Language Models for code (Code LLMs) have gained significant traction in software engineering. These models have demonstrated remarkable capabilities in understanding programming concepts, implementing algorithms, and even bridging different programming languages. Despite these advances, Code LLMs often struggle with programming language confusion--producing code in unintended languages.
arXiv Detail & Related papers (2025-03-17T18:14:15Z)
CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution [50.7413285637879]
The CRUXEVAL-X code reasoning benchmark contains 19 programming languages. It comprises at least 600 subjects for each language, along with 19K content-consistent tests in total. Even a model trained solely on Python can achieve at most 34.4% Pass@1 in other languages.
arXiv Detail & Related papers (2024-08-23T11:43:00Z)
Multilingual Contrastive Decoding via Language-Agnostic Layers Skipping [60.458273797431836]
Decoding by contrasting layers (DoLa) is designed to improve the generation quality of large language models. We find that this approach does not work well on non-English tasks. Inspired by previous interpretability work on language transition during the model's forward pass, we propose an improved contrastive decoding algorithm.
arXiv Detail & Related papers (2024-07-15T15:14:01Z)
Can Large Language Models Write Parallel Code? [0.5317767988097261]
Large language models are increasingly becoming a popular tool for software development. In this paper, we study the capabilities of state-of-the-art language models to generate parallel code.
arXiv Detail & Related papers (2024-01-23T08:25:12Z)
L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs) We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods. In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z)
A Static Evaluation of Code Completion by Large Language Models [65.18008807383816]
Execution-based benchmarks have been proposed to evaluate functional correctness of model-generated code on simple programming problems. static analysis tools such as linters, which can detect errors without running the program, haven't been well explored for evaluating code generation models. We propose a static evaluation framework to quantify static errors in Python code completions, by leveraging Abstract Syntax Trees.
arXiv Detail & Related papers (2023-06-05T19:23:34Z)
Enriching Source Code with Contextual Data for Code Completion Models: An Empirical Study [4.438873396405334]
We aim to answer whether making code easier to understand through using contextual data improves the performance of pre-trained code language models for the task of code completion. For comments, we find that the models perform better in the presence of multi-line comments.
arXiv Detail & Related papers (2023-04-24T17:09:14Z)
Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation. We propose Self- Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
arXiv Detail & Related papers (2023-04-11T10:43:43Z)
AVATAR: A Parallel Corpus for Java-Python Program Translation [77.86173793901139]
Program translation refers to migrating source code from one language to another. We present AVATAR, a collection of 9,515 programming problems and their solutions written in two popular languages, Java and Python.
arXiv Detail & Related papers (2021-08-26T05:44:20Z)
Towards Full-line Code Completion with Neural Language Models [25.458883198815393]
We discuss the probability of directly completing a whole line of code instead of a single token. Recent neural language models have been adopted as a preferred approach for code completion.
arXiv Detail & Related papers (2020-09-18T03:12:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.