Related papers: Sequence Model Design for Code Completion in the Modern IDE

Sequence Model Design for Code Completion in the Modern IDE

URL: http://arxiv.org/abs/2004.05249v1
Date: Fri, 10 Apr 2020 22:40:49 GMT
Title: Sequence Model Design for Code Completion in the Modern IDE
Authors: Gareth Ari Aye and Gail E. Kaiser
Abstract summary: We propose a novel design for predicting top-k next tokens that combines static analysis' ability to enumerate all valid keywords and in-scope identifiers with the ability of a language model to place a probability distribution over them. Our model mixes character-level input representation with token output to represent out-of-vocabulary (OOV) tokens meaningfully and minimize prediction latency.
Score: 3.4824234779710452
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Code completion plays a prominent role in modern integrated development environments (IDEs). Machine learning has become ubiquitous in analogous natural language writing and search software, surfacing more relevant autocompletions and search suggestions in fewer keystrokes. Prior research has reported training high-accuracy, deep neural networks for modeling source code, but little attention has been given to the practical constraints imposed by interactive developer tools. In particular, neural language models for source code modeling like the one described in Maybe Deep Neural Networks are the Best Choice for Modeling Source Code are framed around code completion, but only report accuracy of next-token prediction. However, in order for a language model (LM) to work well within real-world code completion systems, it must also always make suggestions that produce valid code that typechecks to support code completion's role in correctness-checking; return instantaneous results to help programmers code more efficiently in fewer keystrokes; and be small enough to fit comfortably on disk and in memory on developer workstations, since virtually all modern IDEs run locally and support offline usage. To meet these additional requirements, we propose a novel design for predicting top-k next tokens that combines static analysis' ability to enumerate all valid keywords and in-scope identifiers with the ability of a language model to place a probability distribution over them. Our model mixes character-level input representation with token output to represent out-of-vocabulary (OOV) tokens meaningfully and minimize prediction latency. OOV tokens can be predicted through detection of local repetition common in software. This design achieves state-of-art accuracy in source code modeling and fits the constraints imposed by real-world code completion implementations in modern IDEs.

Related papers

Text to Band Gap: Pre-trained Language Models as Encoders for Semiconductor Band Gap Prediction [6.349503549199403]
In this study, we explore the use of a transformer-based language model as an encoder to predict the band gaps of semiconductor materials. We generate material descriptions in two formats: formatted strings combining features and natural language text generated using the ChatGPT API. We demonstrate that the RoBERTa model, pre-trained on natural language processing tasks, performs effectively as an encoder for prediction tasks.
arXiv Detail & Related papers (2025-01-07T00:56:26Z)
Training Neural Networks as Recognizers of Formal Languages [87.06906286950438]
Formal language theory pertains specifically to recognizers. It is common to instead use proxy tasks that are similar in only an informal sense. We correct this mismatch by training and evaluating neural networks directly as binary classifiers of strings.
arXiv Detail & Related papers (2024-11-11T16:33:25Z)
CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation [58.84212778960507]
We propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework to enhance the performance of LLMs. CodeGRAG builds the graphical view of code blocks based on the control flow and data flow of them to fill the gap between programming languages and natural language. Various experiments and ablations are done on four datasets including both the C++ and python languages to validate the hard meta-graph prompt, the soft prompting technique, and the effectiveness of the objectives for pretrained GNN expert.
arXiv Detail & Related papers (2024-05-03T02:48:55Z)
CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code [56.019447113206006]
Large Language Models (LLMs) have achieved remarkable progress in code generation. CodeIP is a novel multi-bit watermarking technique that embeds additional information to preserve provenance details. Experiments conducted on a real-world dataset across five programming languages demonstrate the effectiveness of CodeIP.
arXiv Detail & Related papers (2024-04-24T04:25:04Z)
Does Your Neural Code Completion Model Use My Code? A Membership Inference Approach [66.51005288743153]
We investigate the legal and ethical issues of current neural code completion models. We tailor a membership inference approach (termed CodeMI) that was originally crafted for classification tasks. We evaluate the effectiveness of this adapted approach across a diverse array of neural code completion models.
arXiv Detail & Related papers (2024-04-22T15:54:53Z)
On-the-Fly Syntax Highlighting: Generalisation and Speed-ups [2.208443815105053]
On-the-fly syntax highlighting is the task of rapidly associating visual secondary notation values with each character of a language derivation. Speed constraints are essential to ensure tool usability, manifesting as responsiveness for end users accessing online source code. achieving precise highlighting is critical for enhancing code comprehensibility. addressing the development costs of such resolvers is imperative, given the multitude of programming language versions.
arXiv Detail & Related papers (2024-02-13T19:43:22Z)
INSPECT: Intrinsic and Systematic Probing Evaluation for Code Transformers [7.255653248042546]
We use a framework to define 15 probing tasks that exercise surface, syntactic, structural and semantic characteristics of source code. We probe 8 pre-trained source code models, as well as a natural language model (BERT) as our baseline. We find that models that incorporate some structural information (such as GraphCodeBERT) have a better representation of source code characteristics.
arXiv Detail & Related papers (2023-12-08T15:21:54Z)
LongCoder: A Long-Range Pre-trained Language Model for Code Completion [56.813974784131624]
LongCoder employs a sliding window mechanism for self-attention and introduces two types of globally accessible tokens. Bridge tokens are inserted throughout the input sequence to aggregate local information and facilitate global interaction. memory tokens are included to highlight important statements that may be invoked later and need to be memorized.
arXiv Detail & Related papers (2023-06-26T17:59:24Z)
Emergent Representations of Program Semantics in Language Models Trained on Programs [3.376269351435396]
We present evidence that language models (LMs) of code can learn to represent the formal semantics of programs. We train a Transformer model on a synthetic corpus of programs written in a domain-specific language for navigating 2D grid world environments.
arXiv Detail & Related papers (2023-05-18T17:58:08Z)
Enriching Source Code with Contextual Data for Code Completion Models: An Empirical Study [4.438873396405334]
We aim to answer whether making code easier to understand through using contextual data improves the performance of pre-trained code language models for the task of code completion. For comments, we find that the models perform better in the presence of multi-line comments.
arXiv Detail & Related papers (2023-04-24T17:09:14Z)
Toward a Theory of Causation for Interpreting Neural Code Models [49.906221295459275]
This paper introduces $do_code$, a post hoc interpretability method specific to Neural Code Models (NCMs) $do_code$ is based upon causal inference to enable language-oriented explanations. Results show that our studied NCMs are sensitive to changes in code syntax.
arXiv Detail & Related papers (2023-02-07T22:56:58Z)
ReACC: A Retrieval-Augmented Code Completion Framework [53.49707123661763]
We propose a retrieval-augmented code completion framework, leveraging both lexical copying and referring to code with similar semantics by retrieval. We evaluate our approach in the code completion task in Python and Java programming languages, achieving a state-of-the-art performance on CodeXGLUE benchmark.
arXiv Detail & Related papers (2022-03-15T08:25:08Z)
Towards Full-line Code Completion with Neural Language Models [25.458883198815393]
We discuss the probability of directly completing a whole line of code instead of a single token. Recent neural language models have been adopted as a preferred approach for code completion.
arXiv Detail & Related papers (2020-09-18T03:12:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.