CERT: Continual Pre-Training on Sketches for Library-Oriented Code
Generation
- URL: http://arxiv.org/abs/2206.06888v1
- Date: Tue, 14 Jun 2022 14:44:34 GMT
- Title: CERT: Continual Pre-Training on Sketches for Library-Oriented Code
Generation
- Authors: Daoguang Zan, Bei Chen, Dejian Yang, Zeqi Lin, Minsu Kim, Bei Guan,
Yongji Wang, Weizhu Chen, Jian-Guang Lou
- Abstract summary: We show how to leverage an unlabelled code corpus to train a model for library-oriented code generation.
We craft two benchmarks named PandasEval and NumpyEval to evaluate library-oriented code generation.
- Score: 46.45445767488915
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Code generation is a longstanding challenge, aiming to generate a code
snippet based on a natural language description. Usually, expensive text-code
paired data is essential for training a code generation model. Recently, thanks
to the success of pre-training techniques, large language models are trained on
large-scale unlabelled code corpora and perform well in code generation. In
this paper, we investigate how to leverage an unlabelled code corpus to train a
model for library-oriented code generation. Because it is common practice for
programmers to reuse third-party libraries, and the number of libraries is huge,
text-code paired data for them are even harder to obtain. We observe that
library-oriented code snippets are more likely to share similar code sketches.
Hence, we present CERT with two steps: a sketcher generates the sketch, then a
generator fills the details in the sketch. Both the sketcher and the generator
are continually pre-trained upon a base model using unlabelled data.
Furthermore, we craft two benchmarks named PandasEval and NumpyEval to evaluate
library-oriented code generation. Experimental results demonstrate the
impressive performance of CERT. For example, it surpasses the base model by an
absolute 15.67% improvement in terms of pass@1 on PandasEval. Our work is
available at https://github.com/microsoft/PyCodeGPT.
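To make the two-step recipe concrete, here is a minimal Python sketch of the sketcher-then-generator composition. The function name, prompt layout, and toy stand-in callables are illustrative assumptions, not the released PyCodeGPT/CERT implementation; in a real setup both callables would wrap causal language models continually pre-trained from the same base model.

```python
# A minimal sketch of CERT's sketcher-then-generator composition, assuming each
# stage is an arbitrary text-to-text callable (in practice, a causal LM
# continually pre-trained from the same base model). The prompt layout and the
# toy callables below are illustrative assumptions, not the released code.
from typing import Callable

def cert_generate(
    description: str,
    sketcher: Callable[[str], str],
    generator: Callable[[str], str],
) -> str:
    """Two-stage library-oriented code generation.

    1. The sketcher produces a code sketch: the library API skeleton with
       user-defined names and constants anonymized (here as PAD placeholders).
    2. The generator fills in those details, conditioned on both the
       natural-language description and the sketch.
    """
    sketch = sketcher(description)
    prompt = f"{description}\n# sketch:\n{sketch}\n# completed code:\n"
    return generator(prompt)

# Usage with stand-in callables; a real setup would wrap two fine-tuned LMs.
toy_sketcher = lambda d: "df['PAD'] = df['PAD'].fillna(PAD)"
toy_generator = lambda p: "df['age'] = df['age'].fillna(0)"
print(cert_generate("Fill missing ages with 0 in a pandas DataFrame.",
                    toy_sketcher, toy_generator))
```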
Related papers
- Code Execution with Pre-trained Language Models [88.04688617516827]
Most pre-trained models for code intelligence ignore the execution trace and only rely on source code and syntactic structures.
We develop a mutation-based data augmentation technique to create a large-scale and realistic Python dataset and task for code execution.
We then present CodeExecutor, a Transformer model that leverages code execution pre-training and curriculum learning to enhance its semantic comprehension.
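As a rough illustration of what mutation-based augmentation for execution data can look like, the sketch below perturbs integer literals in a snippet and records which lines run and the resulting variables. The mutation rule and trace format are assumptions for illustration, not the CodeExecutor pipeline itself.

```python
# A hedged sketch of mutation-based augmentation for execution data: perturb
# integer literals in a Python snippet, then record which lines execute and the
# final variable values.
import ast
import random
import sys

class IntMutator(ast.NodeTransformer):
    """Replace each integer literal with a nearby value to create a variant."""
    def visit_Constant(self, node: ast.Constant) -> ast.AST:
        if isinstance(node.value, int) and not isinstance(node.value, bool):
            return ast.copy_location(ast.Constant(node.value + random.randint(1, 3)), node)
        return node

def execute_with_trace(source: str) -> tuple[list[int], dict]:
    executed_lines: list[int] = []
    env: dict = {}
    def tracer(frame, event, arg):
        if event == "line" and frame.f_code.co_filename == "<mutant>":
            executed_lines.append(frame.f_lineno)
        return tracer
    code = compile(source, "<mutant>", "exec")
    sys.settrace(tracer)
    try:
        exec(code, env)
    finally:
        sys.settrace(None)
    return executed_lines, {k: v for k, v in env.items() if not k.startswith("__")}

snippet = "x = 2\nif x > 1:\n    y = x * 10\nelse:\n    y = 0\n"
mutant = ast.unparse(IntMutator().visit(ast.parse(snippet)))
print(mutant)
print(execute_with_trace(mutant))
```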
arXiv Detail & Related papers (2023-05-08T10:00:05Z)
- Knowledge Transfer for Pseudo-code Generation from Low Resource Programming Language [13.716669765394293]
We focus on transferring the knowledge acquired by a code-to-pseudocode neural model trained on a high-resource PL (C++) with parallel code-pseudocode data.
We observe a 23.27% improvement in the success rate of the generated C code through back translation.
arXiv Detail & Related papers (2023-03-16T03:38:08Z)
- CodeExp: Explanatory Code Document Generation [94.43677536210465]
Existing code-to-text generation models produce only high-level summaries of code.
We conduct a human study to identify the criteria for high-quality explanatory docstrings for code.
We present a multi-stage fine-tuning strategy and baseline models for the task.
arXiv Detail & Related papers (2022-11-25T18:05:44Z)
- DocCoder: Generating Code by Retrieving and Reading Docs [87.88474546826913]
We introduce DocCoder, an approach that explicitly leverages code manuals and documentation.
Our approach is general, can be applied to any programming language, and is agnostic to the underlying neural model.
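The retrieve-then-read idea can be sketched in a few lines: score documentation entries against the natural-language intent and prepend the best matches to the generation prompt. The toy corpus and token-overlap scorer below are placeholders for the paper's neural retriever and reader.

```python
# A hedged sketch of retrieve-then-read code generation: rank documentation
# entries by overlap with the intent and prepend the top matches to the prompt.
def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def build_prompt(intent: str, doc_corpus: dict[str, str], k: int = 2) -> str:
    ranked = sorted(doc_corpus.items(), key=lambda kv: jaccard(intent, kv[1]), reverse=True)
    docs = "\n".join(f"# {name}: {text}" for name, text in ranked[:k])
    return f"{docs}\n# task: {intent}\n"

corpus = {
    "pandas.DataFrame.fillna": "Fill NA/NaN values using the specified method or value.",
    "pandas.DataFrame.dropna": "Remove missing values from the DataFrame.",
    "numpy.mean": "Compute the arithmetic mean along the specified axis.",
}
print(build_prompt("fill missing values in a DataFrame with zero", corpus))
```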
arXiv Detail & Related papers (2022-07-13T06:47:51Z)
- InCoder: A Generative Model for Code Infilling and Synthesis [88.46061996766348]
We introduce InCoder, a unified generative model that can perform program synthesis (via left-to-right generation) and editing (via infilling).
InCoder is trained to generate code files from a large corpus of permissively licensed code.
Our model is the first generative model that is able to directly perform zero-shot code infilling.
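A common way to get infilling out of a left-to-right model is causal masking: cut a span out of the code, mark the hole with a sentinel, and append the span at the end so the model can regenerate it. The sketch below shows that data format; the sentinel strings are illustrative, not InCoder's actual special tokens.

```python
# A hedged sketch of causal-masking infilling data: mask a span in the code and
# move it to the end so an ordinary autoregressive LM can learn to fill it in.
def to_infilling_example(code: str, start: int, end: int) -> tuple[str, str]:
    masked = code[:start] + "<MASK_0>" + code[end:]
    target = code[start:end] + "<EOM>"
    # training sequence: masked context, the sentinel again, then the missing span
    return masked, masked + "<MASK_0>" + target

code = "def add(a, b):\n    return a + b\n"
start = code.index("a + b")
masked, training_sequence = to_infilling_example(code, start, start + len("a + b"))
print(masked)
print(training_sequence)
```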
arXiv Detail & Related papers (2022-04-12T16:25:26Z)
- Code Generation for Unknown Libraries via Reading API Documentations [10.122354606820416]
We consider the challenge of code generation for unknown libraries without additional training.
We implement a model that can extract relevant code signatures from API documentations based on a natural language intent.
arXiv Detail & Related papers (2022-02-16T00:36:33Z)
- CodeRetriever: Unimodal and Bimodal Contrastive Learning [128.06072658302165]
We propose the CodeRetriever model, which combines unimodal and bimodal contrastive learning to train function-level code semantic representations.
For unimodal contrastive learning, we design a semantic-guided method to build positive code pairs based on the documentation and function name.
For bimodal contrastive learning, we leverage the documentation and in-line comments of code to build text-code pairs.
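Both objectives reduce to in-batch contrastive learning over paired embeddings: the i-th code vector should match its own positive (documentation text, or a semantically related snippet) better than every other item in the batch. The NumPy sketch below computes a standard InfoNCE-style loss with random vectors standing in for the paper's encoders; the temperature and shapes are arbitrary illustrative choices.

```python
# A hedged sketch of in-batch contrastive learning over paired embeddings.
import numpy as np

def info_nce_loss(anchor: np.ndarray, positive: np.ndarray, temperature: float = 0.05) -> float:
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    p = positive / np.linalg.norm(positive, axis=1, keepdims=True)
    logits = a @ p.T / temperature                      # (batch, batch) cosine similarities
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))          # pull matched pairs together

rng = np.random.default_rng(0)
code_emb, text_emb = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(info_nce_loss(code_emb, text_emb))
```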
arXiv Detail & Related papers (2022-01-26T10:54:30Z)
- What do pre-trained code models know about code? [9.60966128833701]
We use diagnostic tasks called probes to investigate pre-trained code models.
We investigate BERT (pre-trained on English), CodeBERT and CodeBERTa (pre-trained on source code and natural language documentation), and GraphCodeBERT (pre-trained on source code with data flow).
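A probe is typically just a small classifier trained on frozen activations: if it recovers a property of the code, that property is (at least linearly) encoded in the representation. The sketch below uses random vectors and a least-squares linear probe as stand-ins for real model activations and the paper's probing tasks.

```python
# A hedged sketch of a diagnostic probe on frozen embeddings: the encoder is
# never updated; only a small linear classifier is fit and evaluated.
import numpy as np

def linear_probe_accuracy(train_x, train_y, test_x, test_y) -> float:
    onehot = np.eye(2)[train_y]
    weights, *_ = np.linalg.lstsq(train_x, onehot, rcond=None)
    predictions = (test_x @ weights).argmax(axis=1)
    return float((predictions == test_y).mean())

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
embeddings = rng.normal(size=(200, 32)) + labels[:, None]   # property weakly encoded
print(linear_probe_accuracy(embeddings[:150], labels[:150],
                            embeddings[150:], labels[150:]))
```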
arXiv Detail & Related papers (2021-08-25T16:20:17Z)
- InferCode: Self-Supervised Learning of Code Representations by Predicting Subtrees [17.461451218469062]
This paper proposes InferCode to overcome this limitation by adapting a self-supervised learning mechanism to build source code models.
InferCode treats subtrees in ASTs as labels for training code representations, without any human labeling effort or the overhead of expensive graph construction.
Compared to previous code learning techniques applied to the same downstream tasks, such as Code2Vec, Code2Seq, and ASTNN, the pre-trained InferCode model achieves higher performance.
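The self-supervision signal can be sketched with Python's own ast module: enumerate shallow subtrees of a snippet's AST and use them as prediction targets, so no human labels are required. The serialization format and depth cutoff below are illustrative choices, not the paper's exact encoding.

```python
# A hedged sketch of subtree labels extracted from a Python AST.
import ast
from collections import Counter

def subtree_labels(source: str, max_depth: int = 2) -> Counter:
    def serialize(node: ast.AST, depth: int) -> str:
        name = type(node).__name__
        children = list(ast.iter_child_nodes(node))
        if depth == 0 or not children:
            return name
        return f"{name}({','.join(serialize(c, depth - 1) for c in children)})"
    tree = ast.parse(source)
    return Counter(serialize(node, max_depth) for node in ast.walk(tree))

labels = subtree_labels("def mean(xs):\n    return sum(xs) / len(xs)\n")
print(labels.most_common(5))
```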
arXiv Detail & Related papers (2020-12-13T10:33:41Z)