Neural Models for Source Code Synthesis and Completion
- URL: http://arxiv.org/abs/2402.06690v1
- Date: Thu, 8 Feb 2024 17:10:12 GMT
- Title: Neural Models for Source Code Synthesis and Completion
- Authors: Mitodru Niyogi
- Abstract summary: Natural language (NL) to code suggestion systems assist developers in Integrated Development Environments (IDEs) by translating NL utterances into compilable code snippets.
Current approaches mainly involve hard-coded, rule-based systems based on semantic parsing.
We present sequence-to-sequence deep learning models and training paradigms to map NL to general-purpose programming languages.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Natural language (NL) to code suggestion systems assist developers in
Integrated Development Environments (IDEs) by translating NL utterances into
compilable code snippets. Current approaches mainly involve hard-coded,
rule-based systems based on semantic parsing. These systems make heavy use of
hand-crafted rules that map patterns in NL or elements in its syntax parse tree
to various query constructs and can only work on a limited subset of NL with a
restricted NL syntax. These systems are unable to extract semantic information
from the coding intents of the developer, and often fail to infer types, names,
and the context of the source code to get accurate system-level code
suggestions. In this master's thesis, we present sequence-to-sequence deep
learning models and training paradigms to map NL to general-purpose programming
languages that can assist users with suggestions of source code snippets, given
an NL intent, and also extend auto-completion functionality of the source code
to users while they are writing source code. The developed architecture
incorporates contextual awareness into neural models which generate source code
tokens directly instead of generating parse trees/abstract meaning
representations from the source code and converting them back to source code.
The proposed pretraining strategy and the data augmentation techniques improve
the performance of the proposed architecture. The proposed architecture has
been found to exceed the performance of a neural semantic parser, TranX, based
on the BLEU-4 metric by 10.82%. Thereafter, a finer analysis for the parsable
code translations from the NL intent for the CoNaLa challenge was introduced. The
proposed system is bidirectional, as it can also be used to generate NL code
documentation given source code. Lastly, a RoBERTa masked language model for
Python was proposed to extend the developed system for code completion.
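The abstract mentions a finer analysis of parsable code translations but does not describe how parsability was checked. A minimal sketch of one way to do it, assuming the generated snippets are Python and using the standard-library `ast` module (the helper names `is_parsable` and `parsable_rate` and the sample predictions are hypothetical, not taken from the thesis):

```python
import ast


def is_parsable(snippet: str) -> bool:
    """Return True if `snippet` is syntactically valid Python source."""
    try:
        ast.parse(snippet)
        return True
    except (SyntaxError, ValueError):
        # SyntaxError: malformed code; ValueError: e.g. null bytes on some versions.
        return False


def parsable_rate(snippets: list[str]) -> float:
    """Fraction of snippets that parse; 0.0 for an empty list."""
    if not snippets:
        return 0.0
    return sum(is_parsable(s) for s in snippets) / len(snippets)


# Hypothetical model outputs, for illustration only.
predictions = [
    "sorted(d.items(), key=lambda kv: kv[1])",
    "print('hello'",  # unbalanced parenthesis, does not parse
    "x = [i for i in range(10)]",
]

if __name__ == "__main__":
    print(f"parsable: {parsable_rate(predictions):.2%}")
```

This only tests syntax: a snippet that parses may still be semantically wrong, so a check like this complements a token-overlap metric such as BLEU-4 rather than replacing it.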
Related papers
- CodeGRAG: Extracting Composed Syntax Graphs for Retrieval Augmented Cross-Lingual Code Generation [60.799992690487336]
We propose Syntax Graph Retrieval Augmented Code Generation (CodeGRAG) to enhance the performance of LLMs in single-round code generation tasks.
CodeGRAG significantly improves the code generation ability of LLMs and can even offer performance gain for cross-lingual code generation.
arXiv Detail & Related papers (2024-05-03T02:48:55Z) - CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code [59.32609948217718]
We present CodeIP, a new watermarking technique for Large Language Models (LLMs)-based code generation.
CodeIP enables the insertion of multi-bit information while preserving the semantics of the generated code.
arXiv Detail & Related papers (2024-04-24T04:25:04Z) - Source Code Vulnerability Detection: Combining Code Language Models and Code Property Graphs [5.953617559607503]
Vul-LMGNN is a unified model that combines pre-trained code language models with code property graphs.
Vul-LMGNN constructs a code property graph that integrates various code attributes into a unified graph structure.
To effectively retain dependency information among various attributes, we introduce a gated code Graph Neural Network.
arXiv Detail & Related papers (2024-04-23T03:48:18Z) - Bridging Code Semantic and LLMs: Semantic Chain-of-Thought Prompting for
Code Generation [22.219645213202178]
This paper proposes the "Semantic Chain-of-Thought" approach, named SeCoT, to introduce semantic information of code.
We show that SeCoT achieves state-of-the-art performance, greatly improving the potential for large models and code generation.
arXiv Detail & Related papers (2023-10-16T05:09:58Z) - Neural Machine Translation for Code Generation [0.7607163273993514]
In NMT for code generation, the task is to generate source code that satisfies constraints expressed in the input.
In this paper we survey the NMT for code generation literature, cataloging the variety of methods that have been explored.
We discuss the limitations of existing methods and future research directions.
arXiv Detail & Related papers (2023-05-22T21:43:12Z) - Deep Graph Matching and Searching for Semantic Code Retrieval [76.51445515611469]
We propose an end-to-end deep graph matching and searching model based on graph neural networks.
We first represent both natural language query texts and programming language code snippets with the unified graph-structured data.
In particular, DGMS not only captures more structural information for individual query texts or code snippets but also learns the fine-grained similarity between them.
arXiv Detail & Related papers (2020-10-24T14:16:50Z) - Retrieve and Refine: Exemplar-based Neural Comment Generation [27.90756259321855]
Comments of similar code snippets are helpful for comment generation.
We design a novel seq2seq neural network that takes the given code, its AST, its similar code, and its exemplar as input.
We evaluate our approach on a large-scale Java corpus, which contains about 2M samples.
arXiv Detail & Related papers (2020-10-09T09:33:10Z) - Incorporating External Knowledge through Pre-training for Natural Language to Code Generation [97.97049697457425]
Open-domain code generation aims to generate code in a general-purpose programming language from natural language (NL) intents.
We explore the effectiveness of incorporating two varieties of external knowledge into NL-to-code generation: automatically mined NL-code pairs from the online programming QA forum StackOverflow and programming language API documentation.
Our evaluations show that combining the two sources with data augmentation and retrieval-based data re-sampling improves the current state-of-the-art by up to 2.2% absolute BLEU score on the code generation testbed CoNaLa.
arXiv Detail & Related papers (2020-04-20T01:45:27Z) - Improved Code Summarization via a Graph Neural Network [96.03715569092523]
In general, source code summarization techniques take source code as input and output a natural language description.
We present an approach that uses a graph-based neural architecture that better matches the default structure of the AST to generate these summaries.
arXiv Detail & Related papers (2020-04-06T17:36:42Z) - CodeBERT: A Pre-Trained Model for Programming and Natural Languages [117.34242908773061]
CodeBERT is a pre-trained model for programming language (PL) and natural language (NL).
We develop CodeBERT with Transformer-based neural architecture.
We evaluate CodeBERT on two NL-PL applications by fine-tuning model parameters.
arXiv Detail & Related papers (2020-02-19T13:09:07Z)