Incorporating External Knowledge through Pre-training for Natural
Language to Code Generation
- URL: http://arxiv.org/abs/2004.09015v1
- Date: Mon, 20 Apr 2020 01:45:27 GMT
- Title: Incorporating External Knowledge through Pre-training for Natural
Language to Code Generation
- Authors: Frank F. Xu, Zhengbao Jiang, Pengcheng Yin, Bogdan Vasilescu, Graham
Neubig
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-domain code generation aims to generate code in a general-purpose
programming language (such as Python) from natural language (NL) intents.
Motivated by the intuition that developers usually retrieve resources on the
web when writing code, we explore the effectiveness of incorporating two
varieties of external knowledge into NL-to-code generation: automatically mined
NL-code pairs from the online programming QA forum StackOverflow and
programming language API documentation. Our evaluations show that combining the
two sources with data augmentation and retrieval-based data re-sampling
improves the current state-of-the-art by up to 2.2% absolute BLEU score on the
code generation testbed CoNaLa. The code and resources are available at
https://github.com/neulab/external-knowledge-codegen.
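The abstract describes retrieval-based data re-sampling only at a high level, so the following Python sketch is an illustrative assumption rather than the authors' implementation: mined StackOverflow pairs are weighted by bag-of-words similarity to the annotated CoNaLa intents and then sampled with replacement. All names here (bag_of_words, resample, the toy data) are invented for the sketch.

    # Minimal sketch of retrieval-based data re-sampling (illustrative only;
    # the paper's exact scoring and sampling scheme is not given in the abstract).
    import math
    import random
    from collections import Counter

    def bag_of_words(text):
        return Counter(text.lower().split())

    def cosine(c1, c2):
        dot = sum(c1[w] * c2[w] for w in c1 if w in c2)
        n1 = math.sqrt(sum(v * v for v in c1.values()))
        n2 = math.sqrt(sum(v * v for v in c2.values()))
        return dot / (n1 * n2) if n1 and n2 else 0.0

    def resample(mined_pairs, target_intents, k=1000):
        """Weight mined (intent, snippet) pairs by their best similarity to
        the annotated intents, then sample with replacement for training."""
        target_bags = [bag_of_words(t) for t in target_intents]
        weights = [max(cosine(bag_of_words(intent), tb) for tb in target_bags)
                   for intent, _ in mined_pairs]
        return random.choices(mined_pairs, weights=weights, k=k)

    mined = [("how to sort a list in reverse", "sorted(x, reverse=True)"),
             ("open a file for reading", "open('f.txt')")]
    target = ["sort a list of tuples by the second element"]
    print(resample(mined, target, k=2))

The takeaway is the weight-then-sample pattern: external data closest to the target distribution is seen more often during training.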
Related papers
- CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation
We propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework to enhance the performance of LLMs.
CodeGRAG builds a graphical view of code blocks from their control flow and data flow to bridge the gap between programming languages and natural language.
Experiments and ablations on four datasets covering both C++ and Python validate the hard meta-graph prompt, the soft prompting technique, and the training objectives for the pretrained GNN expert.
arXiv Detail & Related papers (2024-05-03T02:48:55Z)
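As a rough illustration of the kind of graphical view CodeGRAG describes, the sketch below walks a Python AST and emits parent-child edges (a crude stand-in for control structure) plus def-use edges for variables (data flow). CodeGRAG's actual meta-graphs and GNN encoding are far richer; everything here is a toy assumption built on the standard ast module.

    # Toy graph view of a Python snippet; CodeGRAG's real construction differs.
    import ast

    def flow_edges(source):
        """Return (parent, child) edges over AST nodes plus def-use edges
        for variable names, as a crude control-/data-flow proxy."""
        tree = ast.parse(source)
        edges, last_def = [], {}
        for node in ast.walk(tree):
            for child in ast.iter_child_nodes(node):
                edges.append((type(node).__name__, type(child).__name__))
            if isinstance(node, ast.Name):
                if isinstance(node.ctx, ast.Store):
                    last_def[node.id] = node
                elif isinstance(node.ctx, ast.Load) and node.id in last_def:
                    edges.append((f"def:{node.id}", f"use:{node.id}"))
        return edges

    snippet = "total = 0\nfor x in items:\n    total += x"
    for edge in flow_edges(snippet):
        print(edge)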
- IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators
This work investigates the prospect of leveraging compiler intermediate representations (IR) to improve the multilingual capabilities of Code-LMs.
We first compile SLTrans, a parallel dataset consisting of nearly 4M self-contained source code files.
Next, we carry out continued causal language modelling training on SLTrans, forcing the Code-LMs to learn the IR language.
Our resulting models, dubbed IRCoder, display sizeable and consistent gains across a wide variety of code generation tasks and metrics.
arXiv Detail & Related papers (2024-03-06T17:52:08Z)
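A hedged sketch of how one might produce the kind of (source, IR) pairs SLTrans contains: compile a self-contained C file to textual LLVM IR with clang. This assumes clang is on PATH; the dataset's real pipeline and compiler flags are not specified in the summary above.

    # Sketch: build a (source, IR) pair in the spirit of SLTrans.
    import os
    import subprocess
    import tempfile

    def source_to_llvm_ir(c_source):
        with tempfile.TemporaryDirectory() as tmp:
            src = os.path.join(tmp, "unit.c")
            out = os.path.join(tmp, "unit.ll")
            with open(src, "w") as f:
                f.write(c_source)
            # -S -emit-llvm produces textual IR; -O0 keeps it near the source.
            subprocess.run(["clang", "-S", "-emit-llvm", "-O0", src, "-o", out],
                           check=True)
            with open(out) as f:
                return f.read()

    c_code = "int add(int a, int b) { return a + b; }\n"
    pair = {"source": c_code, "ir": source_to_llvm_ir(c_code)}
    print(pair["ir"][:200])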
- Neural Models for Source Code Synthesis and Completion
Natural language (NL) to code suggestion systems assist developers in Integrated Development Environments (IDEs) by translating NL utterances into compilable code snippets.
Current approaches mainly involve hard-coded, rule-based systems based on semantic parsing.
We present sequence-to-sequence deep learning models and training paradigms to map NL to general-purpose programming languages.
arXiv Detail & Related papers (2024-02-08T17:10:12Z)
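To make "sequence-to-sequence models that map NL to code" concrete, here is a minimal GRU encoder-decoder in PyTorch. It is an illustrative skeleton only, not the architecture or training paradigm from the paper.

    # Minimal GRU encoder-decoder mapping NL tokens to code tokens.
    import torch
    import torch.nn as nn

    class Seq2Seq(nn.Module):
        def __init__(self, src_vocab, tgt_vocab, dim=128):
            super().__init__()
            self.src_emb = nn.Embedding(src_vocab, dim)
            self.tgt_emb = nn.Embedding(tgt_vocab, dim)
            self.encoder = nn.GRU(dim, dim, batch_first=True)
            self.decoder = nn.GRU(dim, dim, batch_first=True)
            self.out = nn.Linear(dim, tgt_vocab)

        def forward(self, src_ids, tgt_ids):
            _, h = self.encoder(self.src_emb(src_ids))   # h: (1, batch, dim)
            dec, _ = self.decoder(self.tgt_emb(tgt_ids), h)
            return self.out(dec)                         # (batch, T, tgt_vocab)

    model = Seq2Seq(src_vocab=1000, tgt_vocab=1200)
    src = torch.randint(0, 1000, (2, 7))   # NL token ids
    tgt = torch.randint(0, 1200, (2, 9))   # code token ids (real training
                                           # would shift these by one step)
    logits = model(src, tgt)
    loss = nn.functional.cross_entropy(logits.reshape(-1, 1200), tgt.reshape(-1))
    loss.backward()
    print(float(loss))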
- Neural Machine Translation for Code Generation
In NMT for code generation, the task is to generate source code that satisfies constraints expressed in the input.
In this paper we survey the NMT for code generation literature, cataloging the variety of methods that have been explored.
We discuss the limitations of existing methods and future research directions.
arXiv Detail & Related papers (2023-05-22T21:43:12Z)
- Knowledge Transfer for Pseudo-code Generation from Low Resource Programming Language
We focus on transferring the knowledge acquired by a code-to-pseudocode neural model trained on a high-resource PL (C++) using parallel code-pseudocode data.
Back translation yields a 23.27% improvement in the success rate of the generated C code.
arXiv Detail & Related papers (2023-03-16T03:38:08Z)
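The back-translation step can be sketched as a loop: a pseudocode-to-code model turns monolingual pseudocode into synthetic pairs, on which the code-to-pseudocode model is then trained. The StubTranslator class and its translate/fit interface are placeholders invented for this sketch, not the paper's models.

    # Schematic back-translation loop for code <-> pseudocode.
    class StubTranslator:
        """Placeholder with the interface assumed below, not a real model."""
        def __init__(self, name):
            self.name = name
        def translate(self, text):
            return f"/* {self.name} output for: {text} */"
        def fit(self, pairs):
            print(f"{self.name}: training on {len(pairs)} pairs")

    def back_translate(pseudo_corpus, pseudo_to_code):
        # Monolingual pseudocode -> synthetic (code, pseudocode) pairs.
        return [(pseudo_to_code.translate(p), p) for p in pseudo_corpus]

    def train_loop(code_to_pseudo, pseudo_to_code, parallel, mono, rounds=2):
        for _ in range(rounds):
            synthetic = back_translate(mono, pseudo_to_code)
            # Synthetic targets are human-written pseudocode, so the learned
            # code -> pseudocode mapping stays fluent.
            code_to_pseudo.fit(parallel + synthetic)
            pseudo_to_code.fit([(p, c) for c, p in parallel + synthetic])
        return code_to_pseudo

    parallel = [("x = x + 1", "increment x by one")]
    mono = ["set y to zero", "print each element of the list"]
    train_loop(StubTranslator("code2pseudo"), StubTranslator("pseudo2code"),
               parallel, mono)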
- DocCoder: Generating Code by Retrieving and Reading Docs
We introduce DocCoder, an approach that explicitly leverages code manuals and documentation.
Our approach is general, can be applied to any programming language, and is agnostic to the underlying neural model.
arXiv Detail & Related papers (2022-07-13T06:47:51Z)
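A minimal retrieve-then-read sketch in the spirit of DocCoder: pick the documentation entries that best match the NL intent and prepend them to the generator's input. The token-overlap scorer stands in for DocCoder's learned retriever, and all names are illustrative.

    # Retrieve the most relevant doc entries, then assemble the model input.
    def retrieve_docs(intent, docs, k=2):
        q = set(intent.lower().split())
        scored = sorted(docs, key=lambda d: -len(q & set(d.lower().split())))
        return scored[:k]

    def build_model_input(intent, docs):
        context = "\n".join(retrieve_docs(intent, docs))
        return f"{context}\n# intent: {intent}\n# code:"

    manuals = [
        "os.makedirs(name, exist_ok=False) creates a directory recursively",
        "json.dumps(obj) serializes obj to a JSON formatted str",
    ]
    print(build_model_input("create a directory recursively", manuals))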
- MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages
We benchmark code generation from natural language commands extending beyond English.
We annotated a total of 896 NL-code pairs in three languages: Spanish, Japanese, and Russian.
While the difficulties vary across these three languages, all systems lag significantly behind their English counterparts.
arXiv Detail & Related papers (2022-03-16T04:21:50Z)
- Retrieve and Refine: Exemplar-based Neural Comment Generation
Comments of similar code snippets are helpful for comment generation.
We design a novel seq2seq neural network that takes the given code, its AST, its similar code, and its exemplar as input.
We evaluate our approach on a large-scale Java corpus, which contains about 2M samples.
arXiv Detail & Related papers (2020-10-09T09:33:10Z)
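A toy version of assembling the four inputs the paper names (the code, its AST, similar code, and the retrieved exemplar comment). The original system targets Java with a learned retriever; this sketch uses Python's ast module and token overlap purely for illustration.

    # Assemble code + AST + similar code + exemplar into one model input.
    import ast

    corpus = [
        ("def add(a, b):\n    return a + b", "Add two numbers."),
        ("def read(p):\n    return open(p).read()", "Read a file into a string."),
    ]

    def most_similar(code):
        toks = set(code.split())
        return max(corpus, key=lambda pair: len(toks & set(pair[0].split())))

    def build_input(code):
        similar_code, exemplar = most_similar(code)
        return "\n".join([
            code,
            ast.dump(ast.parse(code)),   # flattened AST of the input code
            similar_code,
            exemplar,                    # retrieved comment to be refined
        ])

    print(build_input("def mul(a, b):\n    return a * b"))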
- CodeBERT: A Pre-Trained Model for Programming and Natural Languages
CodeBERT is a pre-trained model for programming language (PL) and natural language (NL).
We develop CodeBERT with a Transformer-based neural architecture.
We evaluate CodeBERT on two NL-PL applications by fine-tuning model parameters.
arXiv Detail & Related papers (2020-02-19T13:09:07Z)
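Fine-tuning CodeBERT on an NL-PL application can be sketched with the public microsoft/codebert-base checkpoint from Hugging Face transformers. The binary matching head below is our own illustrative addition, not part of the released model, and the single-example "batch" is for demonstration only.

    # Sketch: fine-tune CodeBERT for NL-code matching (e.g. code search).
    import torch
    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
    encoder = AutoModel.from_pretrained("microsoft/codebert-base")
    head = nn.Linear(encoder.config.hidden_size, 2)   # match / no-match

    nl = "return the maximum of two numbers"
    code = "def mx(a, b):\n    return a if a > b else b"
    batch = tok(nl, code, return_tensors="pt", truncation=True)

    cls = encoder(**batch).last_hidden_state[:, 0]    # [CLS] representation
    logits = head(cls)
    loss = nn.functional.cross_entropy(logits, torch.tensor([1]))
    loss.backward()                                   # updates encoder + head
    print(float(loss))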