Neural Models for Source Code Synthesis and Completion
- URL: http://arxiv.org/abs/2402.06690v1
- Date: Thu, 8 Feb 2024 17:10:12 GMT
- Title: Neural Models for Source Code Synthesis and Completion
- Authors: Mitodru Niyogi
- Abstract summary: Natural language (NL) to code suggestion systems assist developers in Integrated Development Environments (IDEs) by translating NL utterances into compilable code snippets.
Current approaches mainly involve hard-coded, rule-based systems based on semantic parsing.
We present sequence-to-sequence deep learning models and training paradigms to map NL to general-purpose programming languages.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Natural language (NL) to code suggestion systems assist developers in
Integrated Development Environments (IDEs) by translating NL utterances into
compilable code snippets. The current approaches mainly involve hard-coded,
rule-based systems based on semantic parsing. These systems make heavy use of
hand-crafted rules that map patterns in NL or elements in its syntax parse tree
to various query constructs and can only work on a limited subset of NL with a
restricted NL syntax. These systems are unable to extract semantic information
from the coding intents of the developer, and often fail to infer types, names,
and the context of the source code to get accurate system-level code
suggestions. In this master's thesis, we present sequence-to-sequence deep
learning models and training paradigms to map NL to general-purpose programming
languages that can assist users with suggestions of source code snippets, given
an NL intent, and also extend auto-completion of source code while users are
writing it. The developed architecture
incorporates contextual awareness into neural models which generate source code
tokens directly instead of generating parse trees/abstract meaning
representations from the source code and converting them back to source code.
The proposed pretraining strategy and the data augmentation techniques improve
the performance of the proposed architecture, which has been found to
outperform the neural semantic parser TranX by 10.82% on the BLEU-4 metric.
Thereafter, a finer analysis of the parsable code translations from NL intents
on the CoNaLa challenge was introduced. The
proposed system is bidirectional, as it can also be used to generate NL code
documentation given source code. Lastly, a RoBERTa masked language model for
Python was proposed to extend the developed system for code completion.
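As a rough illustration of the code-completion component, the sketch below runs a fill-mask pipeline over a RoBERTa checkpoint. The thesis's Python-pretrained model is not publicly named, so the stock "roberta-base" stands in for it here.

```python
# Minimal sketch of masked-token code completion in the spirit of the thesis.
# "roberta-base" is a general-domain placeholder; swap in a RoBERTa model
# pretrained on Python source, as the thesis proposes.
from transformers import pipeline

fill = pipeline("fill-mask", model="roberta-base")

# Ask the model to complete the masked token of a partially written line.
snippet = "import numpy as <mask>"
for candidate in fill(snippet, top_k=3):
    print(candidate["token_str"].strip(), candidate["score"])
```

A model pretrained on Python source would rank a completion like `np` highly here; the general-domain checkpoint is only a stand-in.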
Related papers
- CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation [58.84212778960507]
We propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework to enhance the performance of LLMs.
CodeGRAG builds the graphical view of code blocks based on their control flow and data flow to fill the gap between programming languages and natural language.
Various experiments and ablations on four datasets, covering both the C++ and Python languages, validate the hard meta-graph prompt, the soft prompting technique, and the effectiveness of the objectives for the pretrained GNN expert.
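As a hedged sketch of the graphical-view idea (not CodeGRAG's actual extractor), the snippet below derives simple definition-to-use data-flow edges with Python's standard ast module and serializes them alongside the code as prompt context:

```python
# Toy data-flow extraction: link each variable's assignment to its later
# reads. CodeGRAG's real graphs (control + data flow, C++ and Python) are
# much richer; this only illustrates the "graph view in the prompt" idea.
import ast

code = """
def mean(xs):
    total = sum(xs)
    n = len(xs)
    return total / n
"""

tree = ast.parse(code)
defs, edges = {}, []
for node in ast.walk(tree):
    if isinstance(node, ast.Assign):
        for t in node.targets:
            if isinstance(t, ast.Name):
                defs[t.id] = node.lineno
    elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
        if node.id in defs:
            edges.append((node.id, defs[node.id], node.lineno))

graph_view = "\n".join(f"{name}: defined@L{d} -> used@L{u}" for name, d, u in edges)
prompt = f"Code:\n{code}\nData-flow edges:\n{graph_view}\nExplain what this function computes."
print(prompt)
```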
arXiv Detail & Related papers (2024-05-03T02:48:55Z) - Bridging Code Semantic and LLMs: Semantic Chain-of-Thought Prompting for
Code Generation [22.219645213202178]
This paper proposes the "Semantic Chain-of-Thought" approach, named SeCoT, to introduce semantic information of code.
We show that SeCoT achieves state-of-the-art performance, greatly improving the potential of large models for code generation.
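A minimal sketch of the prompting idea, using a generic template rather than SeCoT's published one:

```python
# Hedged sketch of a semantic chain-of-thought style prompt: elicit semantic
# facts (types, data flow) before the final code is written. The template
# below is illustrative, not SeCoT's actual prompt.
def secot_style_prompt(intent: str) -> str:
    return (
        f"Task: {intent}\n"
        "Step 1: List the input and output types the code must handle.\n"
        "Step 2: Describe the data flow between intermediate values.\n"
        "Step 3: Only then, write the final Python function.\n"
    )

print(secot_style_prompt("sort a list of (name, score) pairs by score, descending"))
```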
arXiv Detail & Related papers (2023-10-16T05:09:58Z) - Neural Machine Translation for Code Generation [0.7607163273993514]
In NMT for code generation, the task is to generate source code that satisfies constraints expressed in the input.
In this paper we survey the NMT for code generation literature, cataloging the variety of methods that have been explored.
We discuss the limitations of existing methods and future research directions.
arXiv Detail & Related papers (2023-05-22T21:43:12Z) - DocCoder: Generating Code by Retrieving and Reading Docs [87.88474546826913]
We introduce DocCoder, an approach that explicitly leverages code manuals and documentation.
Our approach is general, can be applied to any programming language, and is agnostic to the underlying neural model.
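A minimal sketch of the retrieve-then-read pipeline, with toy documentation entries and TF-IDF retrieval standing in for DocCoder's retriever:

```python
# Rank documentation entries against the NL intent, then hand the top hit
# plus the intent to any generator model. Docs below are toy stand-ins.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "os.path.join(path, *paths): join one or more path components.",
    "re.sub(pattern, repl, string): return string with pattern replaced.",
    "json.dumps(obj): serialize obj to a JSON formatted str.",
]
intent = "replace all digits in a string with '#'"

vec = TfidfVectorizer()
matrix = vec.fit_transform(docs + [intent])
scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
top = scores.argmax()

prompt = f"Documentation:\n{docs[top]}\n\nIntent: {intent}\nCode:"
print(prompt)  # feed to any seq2seq / LLM generator
```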
arXiv Detail & Related papers (2022-07-13T06:47:51Z) - Deep Graph Matching and Searching for Semantic Code Retrieval [76.51445515611469]
We propose an end-to-end deep graph matching and searching model, DGMS, based on graph neural networks.
We first represent both natural language query texts and programming language code snippets as unified graph-structured data.
In particular, DGMS not only captures more structural information for individual query texts or code snippets but also learns the fine-grained similarity between them.
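A rough sketch of the fine-grained cross-graph matching step, with random vectors standing in for the GNN-learned node embeddings:

```python
# Score every (query node, code node) pair and pool into one similarity.
# DGMS learns the embeddings with GNNs; the max-mean pooling below is a
# simple stand-in for its fine-grained matching.
import numpy as np

rng = np.random.default_rng(0)
query_nodes = rng.normal(size=(5, 16))   # 5 query-graph node embeddings
code_nodes = rng.normal(size=(8, 16))    # 8 code-graph node embeddings

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

pairwise = normalize(query_nodes) @ normalize(code_nodes).T  # (5, 8) cosine grid
score = pairwise.max(axis=1).mean()  # best code node per query node, averaged
print(f"query-code similarity: {score:.3f}")
```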
arXiv Detail & Related papers (2020-10-24T14:16:50Z) - Retrieve and Refine: Exemplar-based Neural Comment Generation [27.90756259321855]
Comments of similar code snippets are helpful for comment generation.
We design a novel seq2seq neural network that takes the given code, its AST, its similar code, and its exemplar as input.
We evaluate our approach on a large-scale Java corpus, which contains about 2M samples.
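A minimal sketch of the retrieve step and the combined model input, using Jaccard token overlap as a stand-in for the paper's retriever:

```python
# Fetch the most similar snippet and its comment (the exemplar), then bundle
# code, AST, similar code, and exemplar as the seq2seq input the paper
# describes. Retrieval here is plain token overlap, a toy stand-in.
import ast

corpus = [
    ("def add(a, b):\n    return a + b", "Add two numbers."),
    ("def read_lines(p):\n    return open(p).readlines()", "Read all lines of a file."),
]
query = "def plus(x, y):\n    return x + y"

def jaccard(a, b):
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

similar_code, exemplar = max(corpus, key=lambda pair: jaccard(query, pair[0]))
model_input = {
    "code": query,
    "ast": ast.dump(ast.parse(query)),
    "similar_code": similar_code,
    "exemplar": exemplar,  # the decoder refines this comment for the query
}
print(model_input["exemplar"])
```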
arXiv Detail & Related papers (2020-10-09T09:33:10Z) - Incorporating External Knowledge through Pre-training for Natural
Language to Code Generation [97.97049697457425]
Open-domain code generation aims to generate code in a general-purpose programming language from natural language (NL) intents.
We explore the effectiveness of incorporating two varieties of external knowledge into NL-to-code generation: automatically mined NL-code pairs from the online programming QA forum StackOverflow and programming language API documentation.
Our evaluations show that combining the two sources with data augmentation and retrieval-based data re-sampling improves the current state-of-the-art by up to 2.2% absolute BLEU score on the code generation testbed CoNaLa.
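A toy sketch of retrieval-based data re-sampling: mined NL-code pairs are weighted by word overlap with in-domain intents and sampled accordingly (the paper's actual scoring differs):

```python
# Weight mined pairs by similarity to in-domain intents, then sample a
# pretraining batch proportionally. Pairs and scoring are toy stand-ins.
import random

mined = [
    ("open a file and read it", "open(f).read()"),
    ("plot a sine wave", "plt.plot(np.sin(x))"),
    ("sort dict by value", "sorted(d.items(), key=lambda kv: kv[1])"),
]
in_domain = ["sort a dictionary by its values", "read a text file"]

def overlap(a, b):
    return len(set(a.split()) & set(b.split()))

weights = [max(overlap(nl, t) for t in in_domain) + 1 for nl, _ in mined]
random.seed(0)
for nl, code in random.choices(mined, weights=weights, k=4):
    print(nl, "->", code)
```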
arXiv Detail & Related papers (2020-04-20T01:45:27Z) - Improved Code Summarization via a Graph Neural Network [96.03715569092523]
In general, source code summarization techniques take the source code as input and output a natural language description.
We present an approach that uses a graph-based neural architecture that better matches the default structure of the AST to generate these summaries.
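A minimal sketch of the AST-derived graph such an encoder consumes, built with Python's standard ast module:

```python
# Build a node list and parent->child edge list from a Python AST: exactly
# the kind of structure a graph-based summarization encoder message-passes
# over. The GNN itself is omitted here.
import ast

code = "def square(x):\n    return x * x"
tree = ast.parse(code)

nodes, edges, index = [], [], {}
for node in ast.walk(tree):
    index[node] = len(nodes)
    nodes.append(type(node).__name__)
for node in ast.walk(tree):
    for child in ast.iter_child_nodes(node):
        edges.append((index[node], index[child]))

print(nodes)   # e.g. ['Module', 'FunctionDef', 'arguments', ...]
print(edges)   # parent -> child index pairs for the GNN adjacency
```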
arXiv Detail & Related papers (2020-04-06T17:36:42Z) - CodeBERT: A Pre-Trained Model for Programming and Natural Languages [117.34242908773061]
CodeBERT is a pre-trained model for programming language (PL) and natural language (NL).
We develop CodeBERT with Transformer-based neural architecture.
We evaluate CodeBERT on two NL-PL applications by fine-tuning model parameters.
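A minimal sketch of using the released microsoft/codebert-base checkpoint as a bi-encoder for NL-code matching; fine-tuning, as in the paper, would update these weights:

```python
# Embed an NL query and a code snippet with CodeBERT and compare them by
# cosine similarity. This is an inference-only sketch; the paper fine-tunes
# the model parameters on downstream NL-PL tasks.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

def embed(text: str) -> torch.Tensor:
    batch = tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**batch).last_hidden_state[:, 0]  # first-token embedding
    return torch.nn.functional.normalize(out, dim=-1)

nl = embed("reverse a string")
code = embed("def rev(s): return s[::-1]")
print(float(nl @ code.T))  # cosine similarity, higher = better match
```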
arXiv Detail & Related papers (2020-02-19T13:09:07Z)