Syntax-Aware On-the-Fly Code Completion
- URL: http://arxiv.org/abs/2211.04673v2
- Date: Mon, 1 May 2023 05:07:35 GMT
- Title: Syntax-Aware On-the-Fly Code Completion
- Authors: Wannita Takerngsaksiri, Chakkrit Tantithamthavorn, and Yuan-Fang Li
- Abstract summary: We propose PyCoder to leverage token types, a kind of lightweight syntactic information.
Our PyCoder achieves the first rank on the CodeXGLUE leaderboard with an accuracy of 77.12% for the token-level predictions.
- Score: 13.268277642411974
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Code completion aims to help improve developers' productivity by suggesting
the next code tokens from a given context. Various approaches have been
proposed to incorporate abstract syntax tree (AST) information for model
training, ensuring that code completion is aware of the syntax of the
programming languages. However, existing syntax-aware code completion
approaches are not on-the-fly: we found that for two-thirds of the characters
that developers type, an AST cannot be extracted because it requires
syntactically correct source code, which limits the practicality of these
approaches in real-world scenarios. On the other hand, existing on-the-fly
code completion approaches do not yet consider syntactic information. In this
paper, we propose PyCoder to
leverage token types, a kind of lightweight syntactic information, which is
readily available and aligns with the natural order of source code. Our PyCoder
is trained in a multi-task manner: by learning the supporting task of
predicting token types during the training phase, the model achieves better
performance at predicting tokens and lines of code without needing token types
in the inference phase. Comprehensive experiments show that PyCoder
achieves the first rank on the CodeXGLUE leaderboard with an accuracy of 77.12%
for the token-level predictions, which is 0.43%-24.25% more accurate than
baselines. In addition, PyCoder achieves an exact match of 43.37% for the
line-level predictions, which is 3.63%-84.73% more accurate than baselines.
These results lead us to conclude that token type information (an alternative
to syntactic information), which has rarely been used in the past, can greatly
improve the performance of code completion approaches without requiring
syntactically correct source code as AST-based approaches do. Our PyCoder is
publicly available on HuggingFace and GitHub.
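To make the multi-task idea concrete, here is a minimal, hypothetical PyTorch sketch of a causal model with a main next-token head and a supporting token-type head. The architecture, layer sizes, and loss weight below are illustrative assumptions, not taken from the released PyCoder code.

```python
# Minimal sketch (not the released PyCoder code) of multi-task training:
# a shared causal body with a main next-token head and a supporting
# token-type head that is only used to shape the training loss.
import torch
import torch.nn as nn

class MultiTaskCompleter(nn.Module):
    def __init__(self, vocab_size, num_token_types, d_model=512, n_layers=6, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.body = nn.TransformerEncoder(layer, n_layers)
        self.token_head = nn.Linear(d_model, vocab_size)       # main task: next code token
        self.type_head = nn.Linear(d_model, num_token_types)   # supporting task: next token type

    def forward(self, input_ids):
        causal = nn.Transformer.generate_square_subsequent_mask(
            input_ids.size(1)).to(input_ids.device)
        hidden = self.body(self.embed(input_ids), mask=causal)
        return self.token_head(hidden), self.type_head(hidden)

def joint_loss(model, input_ids, next_tokens, next_types, alpha=0.5):
    """Type labels are consumed only here; inference uses the token head alone."""
    token_logits, type_logits = model(input_ids)
    ce = nn.CrossEntropyLoss()
    return (ce(token_logits.flatten(0, 1), next_tokens.flatten())
            + alpha * ce(type_logits.flatten(0, 1), next_types.flatten()))
```

Because the token-type head contributes only to the training loss, inference uses the token head alone, which is what keeps the approach on-the-fly.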
Related papers
- LongCoder: A Long-Range Pre-trained Language Model for Code Completion [56.813974784131624]
LongCoder employs a sliding window mechanism for self-attention and introduces two types of globally accessible tokens.
Bridge tokens are inserted throughout the input sequence to aggregate local information and facilitate global interaction.
Memory tokens are included to highlight important statements that may be invoked later and need to be memorized.
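As a rough illustration of how a causal sliding window can be combined with globally accessible tokens, here is a small NumPy sketch of such an attention mask; the window size and global positions are made-up parameters, not LongCoder's configuration.

```python
# Toy attention-mask sketch (my illustration, not LongCoder's code): each
# position attends to a causal local window, while designated global
# positions (e.g. bridge/memory tokens) are visible to, and see, everything.
import numpy as np

def sparse_causal_mask(seq_len, window, global_positions):
    """allowed[i, j] is True when position i may attend to position j."""
    allowed = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        allowed[i, max(0, i - window):i + 1] = True   # local sliding window
    for g in global_positions:
        allowed[:, g] = True                          # every position can reach g
        allowed[g, :] = True                          # g can reach every position
    return np.tril(allowed)                           # keep the whole mask causal

mask = sparse_causal_mask(seq_len=16, window=4, global_positions=[0, 8])
```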
arXiv Detail & Related papers (2023-06-26T17:59:24Z) - Enriching Source Code with Contextual Data for Code Completion Models:
An Empirical Study [4.438873396405334]
We aim to answer whether making code easier to understand by adding contextual data improves the performance of pre-trained code language models on the task of code completion.
For comments, we find that the models perform better in the presence of multi-line comments.
arXiv Detail & Related papers (2023-04-24T17:09:14Z) - A Syntax-Guided Multi-Task Learning Approach for Turducken-Style Code
Generation [19.489202790935902]
We propose TurduckenGen, a syntax-guided multi-task learning approach.
Specifically, we first explicitly append the type information to the code tokens to capture the representation of syntactic constraints.
Then we formalize code generation with syntactic constraint representation as an auxiliary task to enable the model to learn the syntactic constraints of the code.
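For Python, lightweight type information of this kind can be read straight from the lexer; the snippet below is a generic illustration using the standard tokenize module, not the preprocessing pipeline of TurduckenGen (or PyCoder).

```python
# Pair each code token with its lexical token type using Python's standard
# tokenize module (an illustration of the general idea, not any paper's code).
import io
import tokenize

def tokens_with_types(source: str):
    pairs = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
                        tokenize.DEDENT, tokenize.ENDMARKER):
            continue  # skip purely structural tokens for this illustration
        pairs.append((tok.string, tokenize.tok_name[tok.type]))
    return pairs

print(tokens_with_types("total = price * 1.07\n"))
# [('total', 'NAME'), ('=', 'OP'), ('price', 'NAME'), ('*', 'OP'), ('1.07', 'NUMBER')]
```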
arXiv Detail & Related papers (2023-03-09T06:22:07Z) - Interactive Code Generation via Test-Driven User-Intent Formalization [60.90035204567797]
Large language models (LLMs) produce code from informal natural language (NL) intent.
It is hard to define a notion of correctness since natural language can be ambiguous and lacks a formal semantics.
We describe a language-agnostic abstract algorithm and a concrete implementation, TiCoder.
arXiv Detail & Related papers (2022-08-11T17:41:08Z) - Enhancing Semantic Code Search with Multimodal Contrastive Learning and
Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z) - ReACC: A Retrieval-Augmented Code Completion Framework [53.49707123661763]
We propose a retrieval-augmented code completion framework, leveraging both lexical copying and referring to code with similar semantics by retrieval.
We evaluate our approach in the code completion task in Python and Java programming languages, achieving a state-of-the-art performance on CodeXGLUE benchmark.
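The retrieve-then-complete pattern can be shown with a deliberately simple lexical retriever; the sketch below is my own toy approximation with made-up names, not the ReACC implementation, which relies on trained retrievers.

```python
# Toy sketch of retrieval-augmented completion: pick the lexically most
# similar snippet from a code database and prepend it to the unfinished
# code before asking the completion model to continue.
def jaccard(a_tokens, b_tokens):
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / max(1, len(a | b))

def build_augmented_prompt(unfinished_code, codebase):
    query = unfinished_code.split()
    best = max(codebase, key=lambda snippet: jaccard(query, snippet.split()))
    return best + "\n# --- retrieved context above ---\n" + unfinished_code

codebase = [
    "def read_json(path):\n    import json\n    with open(path) as f:\n        return json.load(f)",
    "def read_csv(path):\n    import csv\n    with open(path) as f:\n        return list(csv.reader(f))",
]
prompt = build_augmented_prompt("def load_config(path):\n    import json\n", codebase)
# `prompt` is then fed to an ordinary left-to-right completion model.
```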
arXiv Detail & Related papers (2022-03-15T08:25:08Z) - CodeFill: Multi-token Code Completion by Jointly Learning from Structure
and Naming Sequences [7.661675959139121]
We present CodeFill, a language model for autocompletion that combines learned structure and naming information.
CodeFill is trained both for single-token and multi-token (statement) prediction.
To make the evaluation more realistic, we develop a method to automatically infer points in the source code at which completion matters.
arXiv Detail & Related papers (2022-02-14T13:26:54Z) - CLSEBERT: Contrastive Learning for Syntax Enhanced Code Pre-Trained
Model [23.947178895479464]
We propose CLSEBERT, a Contrastive Learning Framework for Syntax Enhanced Code Pre-Trained Models.
In the pre-training stage, we consider the code syntax and hierarchy contained in the Abstract Syntax Tree (AST).
We also introduce two novel pre-training objectives. One is to predict the edges between nodes in the abstract syntax tree, and the other is to predict the types of code tokens.
arXiv Detail & Related papers (2021-08-10T10:08:21Z) - GraphCodeBERT: Pre-training Code Representations with Data Flow [97.00641522327699]
We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code.
We use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables.
We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement.
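The "where-the-value-comes-from" relation can be approximated for straight-line Python with the standard ast module; the sketch below is a rough illustration of that idea, not GraphCodeBERT's actual data-flow extraction.

```python
# Link each variable read back to the latest assignment of that name
# (a deliberately simplified, straight-line approximation of value flow).
import ast

def value_flow_edges(source: str):
    last_def, edges = {}, []          # name -> line of latest assignment; (name, def_line, use_line)
    for stmt in ast.parse(source).body:
        if isinstance(stmt, ast.Assign):
            for node in ast.walk(stmt.value):          # reads on the right-hand side
                if isinstance(node, ast.Name) and node.id in last_def:
                    edges.append((node.id, last_def[node.id], stmt.lineno))
            for target in stmt.targets:                # then record the new definition
                if isinstance(target, ast.Name):
                    last_def[target.id] = stmt.lineno
        # control flow, calls, attributes, etc. are ignored in this sketch
    return edges

print(value_flow_edges("x = 1\ny = x + 2\nz = y * x\n"))
# [('x', 1, 2), ('y', 2, 3), ('x', 1, 3)]
```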
arXiv Detail & Related papers (2020-09-17T15:25:56Z) - Contrastive Code Representation Learning [95.86686147053958]
We show that the popular reconstruction-based BERT model is sensitive to source code edits, even when the edits preserve semantics.
We propose ContraCode: a contrastive pre-training task that learns code functionality, not form.
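A generic InfoNCE-style objective captures the intent of pulling together embeddings of semantically equivalent programs while pushing other programs apart; this is a standard formulation sketched in PyTorch, not the ContraCode training code.

```python
# Generic contrastive (InfoNCE-style) loss: embeddings of a program and a
# semantics-preserving transformation of it form the positive pair; all
# other programs in the batch act as negatives.
import torch
import torch.nn.functional as F

def contrastive_loss(anchor_emb, positive_emb, temperature=0.07):
    """anchor_emb, positive_emb: (batch, dim) embeddings of paired code variants."""
    a = F.normalize(anchor_emb, dim=-1)
    p = F.normalize(positive_emb, dim=-1)
    logits = a @ p.t() / temperature                            # anchor-to-positive similarities
    targets = torch.arange(a.size(0), device=anchor_emb.device)  # matching pairs lie on the diagonal
    return F.cross_entropy(logits, targets)
```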
arXiv Detail & Related papers (2020-07-09T17:59:06Z)