Sequence Model Design for Code Completion in the Modern IDE
- URL: http://arxiv.org/abs/2004.05249v1
- Date: Fri, 10 Apr 2020 22:40:49 GMT
- Title: Sequence Model Design for Code Completion in the Modern IDE
- Authors: Gareth Ari Aye and Gail E. Kaiser
- Abstract summary: We propose a novel design for predicting top-k next tokens that combines static analysis' ability to enumerate all valid keywords and in-scope identifiers with the ability of a language model to place a probability distribution over them.
Our model mixes character-level input representation with token output to represent out-of-vocabulary (OOV) tokens meaningfully and minimize prediction latency.
- Score: 3.4824234779710452
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Code completion plays a prominent role in modern integrated development
environments (IDEs). Machine learning has become ubiquitous in analogous
natural language writing and search software, surfacing more relevant
autocompletions and search suggestions in fewer keystrokes. Prior research has
reported training high-accuracy, deep neural networks for modeling source code,
but little attention has been given to the practical constraints imposed by
interactive developer tools. In particular, neural language models for source
code modeling like the one described in "Maybe Deep Neural Networks are the Best
Choice for Modeling Source Code" are framed around code completion, but only
report accuracy of next-token prediction. However, in order for a language
model (LM) to work well within real-world code completion systems, it must also
always make suggestions that produce valid code that typechecks to support code
completion's role in correctness-checking; return instantaneous results to help
programmers code more efficiently in fewer keystrokes; and be small enough to
fit comfortably on disk and in memory on developer workstations, since
virtually all modern IDEs run locally and support offline usage. To meet these
additional requirements, we propose a novel design for predicting top-k next
tokens that combines static analysis' ability to enumerate all valid keywords
and in-scope identifiers with the ability of a language model to place a
probability distribution over them. Our model mixes character-level input
representation with token output to represent out-of-vocabulary (OOV) tokens
meaningfully and minimize prediction latency. OOV tokens can be predicted
through detection of local repetition common in software. This design achieves
state-of-the-art accuracy in source code modeling and fits the constraints imposed
by real-world code completion implementations in modern IDEs.
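Below is a minimal sketch of the design described in the abstract, not the authors' implementation: static analysis enumerates every valid keyword and in-scope identifier, a language model places a probability distribution over exactly that candidate set, and locally repeated identifiers receive a boost so that out-of-vocabulary tokens can still be suggested. The regex-based scope harvesting, the toy unigram scorer `toy_lm_logprob`, and the repetition bonus are illustrative stand-ins; the paper's actual system uses a character-level neural LM behind a real static analyzer.

```python
import math
import re
from collections import Counter

PYTHON_KEYWORDS = {"def", "return", "if", "else", "for", "while", "import"}

def in_scope_identifiers(source: str) -> set:
    # Stand-in for static analysis: harvest identifiers already present in the
    # file. A real IDE integration would query the resolver so that every
    # candidate is guaranteed to be valid and to typecheck.
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source))

def toy_lm_logprob(candidate: str, counts: Counter) -> float:
    # Placeholder for the neural LM: an add-one-smoothed unigram model over the
    # local file. The paper's model instead consumes a character-level encoding
    # of the prefix and emits a token-level distribution.
    total = sum(counts.values()) + len(counts) + 1
    return math.log((counts[candidate] + 1) / total)

def complete(prefix: str, source: str, k: int = 5) -> list:
    counts = Counter(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source))
    candidates = PYTHON_KEYWORDS | in_scope_identifiers(source)
    # Only rank candidates that extend whatever partial token was just typed.
    partial = re.findall(r"[A-Za-z_][A-Za-z0-9_]*$", prefix)
    stem = partial[0] if partial else ""
    scored = []
    for tok in candidates:
        if not tok.startswith(stem):
            continue
        score = toy_lm_logprob(tok, counts)
        # Local-repetition bonus: identifiers that recur nearby are likely to
        # recur again, which is how OOV tokens can still be predicted even if
        # they never appear in the model's output vocabulary.
        if counts[tok] > 1:
            score += math.log(counts[tok])
        scored.append((score, tok))
    return [tok for _, tok in sorted(scored, reverse=True)[:k]]

if __name__ == "__main__":
    file_so_far = "def total_price(items):\n    subtotal = sum(items)\n    return sub"
    print(complete("return sub", file_so_far, k=3))  # e.g. ['subtotal', 'sub']
```

Because every candidate comes from the statically enumerated valid set, the top-k list can never suggest a token that fails to typecheck; the language model only decides the ordering.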
Related papers
- CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation [58.84212778960507]
We propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework to enhance the performance of LLMs.
CodeGRAG builds a graphical view of code blocks from their control flow and data flow to bridge the gap between programming languages and natural language.
Experiments and ablations on four datasets spanning both C++ and Python validate the hard meta-graph prompt, the soft prompting technique, and the effectiveness of the objectives for the pretrained GNN expert.
arXiv Detail & Related papers (2024-05-03T02:48:55Z)
- CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code [56.019447113206006]
Large Language Models (LLMs) have achieved remarkable progress in code generation.
CodeIP is a novel multi-bit watermarking technique that embeds additional information to preserve provenance details.
Experiments conducted on a real-world dataset across five programming languages demonstrate the effectiveness of CodeIP.
arXiv Detail & Related papers (2024-04-24T04:25:04Z)
- Does Your Neural Code Completion Model Use My Code? A Membership Inference Approach [66.51005288743153]
We investigate the legal and ethical issues of current neural code completion models.
We tailor a membership inference approach (termed CodeMI) that was originally crafted for classification tasks.
We evaluate the effectiveness of this adapted approach across a diverse array of neural code completion models.
arXiv Detail & Related papers (2024-04-22T15:54:53Z)
- On-the-Fly Syntax Highlighting: Generalisation and Speed-ups [2.208443815105053]
On-the-fly syntax highlighting is the task of rapidly associating visual secondary notation values with each character of a language derivation.
Speed constraints are essential to ensure tool usability, manifesting as responsiveness for end users accessing online source code.
Achieving precise highlighting is critical for enhancing code comprehensibility.
Addressing the development costs of such resolvers is imperative, given the multitude of programming language versions.
arXiv Detail & Related papers (2024-02-13T19:43:22Z)
- INSPECT: Intrinsic and Systematic Probing Evaluation for Code Transformers [7.255653248042546]
We use a framework to define 15 probing tasks that exercise surface, syntactic, structural and semantic characteristics of source code.
We probe 8 pre-trained source code models, as well as a natural language model (BERT) as our baseline.
We find that models that incorporate some structural information (such as GraphCodeBERT) have a better representation of source code characteristics.
arXiv Detail & Related papers (2023-12-08T15:21:54Z)
- LongCoder: A Long-Range Pre-trained Language Model for Code Completion [56.813974784131624]
LongCoder employs a sliding window mechanism for self-attention and introduces two types of globally accessible tokens.
Bridge tokens are inserted throughout the input sequence to aggregate local information and facilitate global interaction.
Memory tokens are included to highlight important statements that may be invoked later and need to be memorized; a minimal sketch of this windowed-plus-global attention pattern appears after this list.
arXiv Detail & Related papers (2023-06-26T17:59:24Z)
- Emergent Representations of Program Semantics in Language Models Trained on Programs [3.376269351435396]
We present evidence that language models (LMs) of code can learn to represent the formal semantics of programs.
We train a Transformer model on a synthetic corpus of programs written in a domain-specific language for navigating 2D grid world environments.
arXiv Detail & Related papers (2023-05-18T17:58:08Z)
- Enriching Source Code with Contextual Data for Code Completion Models: An Empirical Study [4.438873396405334]
We aim to answer whether making code easier to understand through using contextual data improves the performance of pre-trained code language models for the task of code completion.
For comments, we find that the models perform better in the presence of multi-line comments.
arXiv Detail & Related papers (2023-04-24T17:09:14Z)
- Toward a Theory of Causation for Interpreting Neural Code Models [49.906221295459275]
This paper introduces $do_{code}$, a post hoc interpretability method specific to Neural Code Models (NCMs).
$do_{code}$ is based upon causal inference to enable language-oriented explanations.
Results show that our studied NCMs are sensitive to changes in code syntax.
arXiv Detail & Related papers (2023-02-07T22:56:58Z)
- ReACC: A Retrieval-Augmented Code Completion Framework [53.49707123661763]
We propose a retrieval-augmented code completion framework, leveraging both lexical copying and referring to code with similar semantics by retrieval.
We evaluate our approach in the code completion task in Python and Java programming languages, achieving a state-of-the-art performance on CodeXGLUE benchmark.
arXiv Detail & Related papers (2022-03-15T08:25:08Z)
- Towards Full-line Code Completion with Neural Language Models [25.458883198815393]
We discuss the possibility of directly completing a whole line of code instead of a single token.
Recent neural language models have been adopted as a preferred approach for code completion.
arXiv Detail & Related papers (2020-09-18T03:12:13Z)
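The LongCoder entry above describes sliding-window self-attention combined with globally accessible bridge and memory tokens. The sketch below is a minimal illustration of that kind of attention mask, not the paper's implementation; the window radius and the choice of global positions are assumptions made only for the example.

```python
import numpy as np

def windowed_global_mask(seq_len: int, window: int, global_positions: set) -> np.ndarray:
    """Build an attention mask where mask[i, j] == 1 means token i may attend to token j."""
    mask = np.zeros((seq_len, seq_len), dtype=np.int8)
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = 1        # local sliding-window attention
    for g in global_positions:
        mask[:, g] = 1            # every token may attend to a global (bridge/memory) token
        mask[g, :] = 1            # and a global token may attend to the whole sequence
    return mask

if __name__ == "__main__":
    # 12 tokens, window radius 2, positions 0 and 6 treated as bridge/memory tokens (illustrative).
    print(windowed_global_mask(12, 2, {0, 6}))
```

In LongCoder itself such connectivity is applied inside the Transformer's attention layers; the sketch only shows the sparsity pattern the summary describes.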
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.