DocCoder: Generating Code by Retrieving and Reading Docs
- URL: http://arxiv.org/abs/2207.05987v1
- Date: Wed, 13 Jul 2022 06:47:51 GMT
- Title: DocCoder: Generating Code by Retrieving and Reading Docs
- Authors: Shuyan Zhou and Uri Alon and Frank F. Xu and Zhengbao Jiang and Graham Neubig
- Abstract summary: We introduce DocCoder, an approach that explicitly leverages code manuals and documentation.
Our approach is general, can be applied to any programming language, and is agnostic to the underlying neural model.
- Score: 87.88474546826913
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Natural-language-to-code models learn to generate a code snippet given a
natural language (NL) intent. However, the rapid growth of both publicly
available and proprietary libraries and functions makes it impossible to cover
all APIs using training examples, as new libraries and functions are introduced
daily. Thus, existing models inherently cannot generalize to using unseen
functions and libraries, as these never appear in the training data. In
contrast, when human programmers write programs, they frequently refer
to textual resources such as code manuals, documentation, and tutorials, to
explore and understand available library functionality. Inspired by this
observation, we introduce DocCoder: an approach that explicitly leverages code
manuals and documentation by (1) retrieving the relevant documentation given
the NL intent, and (2) generating the code based on the NL intent and the
retrieved documentation. Our approach is general, can be applied to any
programming language, and is agnostic to the underlying neural model. We
demonstrate that DocCoder consistently improves NL-to-code models: DocCoder
achieves 11x higher exact match accuracy than strong baselines on a new Bash
dataset tldr; on the popular Python CoNaLa benchmark, DocCoder improves over
strong baselines by 1.65 BLEU.
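To make the two-step recipe concrete, here is a minimal Python sketch of a retrieve-then-generate pipeline in the spirit of DocCoder. The documentation pool, the TF-IDF retriever, and the `generate_code` stub are illustrative stand-ins chosen for this sketch; the paper itself is agnostic to the retriever and the underlying neural model.

```python
# Minimal sketch of a retrieve-then-generate pipeline (assumptions: a toy
# documentation pool, TF-IDF retrieval, and a stubbed-out generator).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

doc_pool = [
    "os.listdir(path): Return a list containing the names of the entries in path.",
    "shutil.copy(src, dst): Copy the file src to the file or directory dst.",
    "subprocess.run(args): Run the command described by args and wait for it.",
]

def retrieve(intent: str, docs: list[str], k: int = 2) -> list[str]:
    """Step 1: rank every manual entry against the NL intent, keep the top k."""
    vectorizer = TfidfVectorizer().fit(docs)
    scores = cosine_similarity(vectorizer.transform([intent]),
                               vectorizer.transform(docs)).ravel()
    return [docs[i] for i in scores.argsort()[::-1][:k]]

def generate_code(intent: str, retrieved: list[str]) -> str:
    """Step 2: condition a generator on the intent plus the retrieved docs.
    A real system would call a trained NL-to-code model here; this stub
    only shows the input layout."""
    prompt = "\n".join(retrieved) + "\n# intent: " + intent + "\n"
    return prompt  # placeholder for model.generate(prompt)

intent = "list all files in a directory"
print(generate_code(intent, retrieve(intent, doc_pool)))
```

Because the two steps are decoupled, a dense retriever or a stronger generator can be swapped in without changing the pipeline's shape, which is what makes the approach model-agnostic.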
Related papers
- DocCGen: Document-based Controlled Code Generation [33.19206322891497]
DocCGen is a framework that leverages rich documentation knowledge by breaking the NL-to-Code generation task for structured code languages into a two-step process.
Our experiments show that DocCGen consistently improves different-sized language models across all six evaluation metrics.
arXiv Detail & Related papers (2024-06-17T08:34:57Z)
- Towards Summarizing Code Snippets Using Pre-Trained Transformers [20.982048349530483]
In this work, we take all the steps needed to train a DL model to document code snippets.
Our model identifies code summaries with 84% accuracy and is able to link them to the documented lines of code.
This unlocks the possibility of building a large-scale dataset of documented code snippets.
arXiv Detail & Related papers (2024-02-01T11:39:19Z)
- Generation-Augmented Query Expansion For Code Retrieval [51.20943646688115]
We propose a generation-augmented query expansion framework, inspired by the human retrieval process of sketching an answer before searching.
We achieve new state-of-the-art results on the CodeSearchNet benchmark.
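For intuition, here is a hypothetical minimal sketch of that idea: draft a code answer for the NL query first, then search with the query plus the draft. Both `draft_answer` and the token-overlap `search` below are invented stand-ins, not the paper's components.

```python
# Illustrative query expansion: generate a draft answer, then retrieve with
# the expanded query (all components here are toy stand-ins).
def draft_answer(query: str) -> str:
    """Stand-in for a generator that sketches plausible code for the query."""
    return "for f in os.listdir(path): print(f)"  # hypothetical draft

def search(expanded_query: str, corpus: list[str]) -> str:
    """Stand-in for any code retriever; here, naive token overlap."""
    tokens = set(expanded_query.split())
    return max(corpus, key=lambda code: len(tokens & set(code.split())))

query = "iterate over files in a folder"
expanded = query + " " + draft_answer(query)  # the query expansion step
corpus = ["for f in os.listdir(d): print(f)", "with open(p) as fh: fh.read()"]
print(search(expanded, corpus))
```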
arXiv Detail & Related papers (2022-12-20T23:49:37Z)
- CodeExp: Explanatory Code Document Generation [94.43677536210465]
Existing code-to-text generation models produce only high-level summaries of code.
We conduct a human study to identify the criteria for high-quality explanatory docstrings for code.
We present a multi-stage fine-tuning strategy and baseline models for the task.
arXiv Detail & Related papers (2022-11-25T18:05:44Z)
- CERT: Continual Pre-Training on Sketches for Library-Oriented Code Generation [46.45445767488915]
We show how to leverage an unlabelled code corpus to train a model for library-oriented code generation.
We craft two benchmarks named PandasEval and NumpyEval to evaluate library-oriented code generation.
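As a rough, assumed illustration, a "sketch" for this kind of pre-training might anonymize user-specific details while keeping library calls intact; the masking rules below are invented for brevity and are not necessarily CERT's.

```python
# Toy sketcher: anonymize user-specific literals, keep API calls (these
# specific masking rules are assumptions, not CERT's actual ones).
import re

def to_sketch(code: str) -> str:
    code = re.sub(r'"[^"]*"', '"<STR>"', code)        # anonymize string literals
    code = re.sub(r"\b\d+(\.\d+)?\b", "<NUM>", code)  # anonymize numeric literals
    return code

snippet = 'df = pd.read_csv("sales_2021.csv"); df.head(10)'
print(to_sketch(snippet))  # df = pd.read_csv("<STR>"); df.head(<NUM>)
```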
arXiv Detail & Related papers (2022-06-14T14:44:34Z)
- StructCoder: Structure-Aware Transformer for Code Generation [13.797842927671846]
We introduce a structure-aware Transformer decoder that models both syntax and data flow to enhance the quality of generated code.
The proposed StructCoder model achieves state-of-the-art performance on code translation and text-to-code generation tasks.
arXiv Detail & Related papers (2022-06-10T17:26:31Z)
- InCoder: A Generative Model for Code Infilling and Synthesis [88.46061996766348]
We introduce InCoder, a unified generative model that can perform program synthesis (via left-to-right generation) and editing (via infilling).
InCoder is trained to generate code files from a large corpus of permissively licensed code.
Our model is the first generative model that is able to directly perform zero-shot code infilling.
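A hedged sketch of the general causal-masking format that lets a left-to-right model infill: cut a span out, mark the hole with a sentinel, and move the span to the end of the sequence. The sentinel strings below are illustrative, not InCoder's actual special tokens.

```python
# Build one infilling training example via causal masking (sentinel names
# are illustrative assumptions).
def to_infill_example(code: str, start: int, end: int) -> str:
    prefix, span, suffix = code[:start], code[start:end], code[end:]
    # The model sees prefix + sentinel + suffix, then learns to produce the
    # missing span after the <INFILL> marker, still left to right.
    return prefix + "<MASK:0>" + suffix + "<INFILL>" + span

src = "def add(a, b):\n    return a + b\n"
print(to_infill_example(src, start=19, end=31))
```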
arXiv Detail & Related papers (2022-04-12T16:25:26Z)
- Code Generation for Unknown Libraries via Reading API Documentations [10.122354606820416]
We consider the challenge of code generation for unknown libraries without additional training.
We implement a model that can extract relevant code signatures from API documentation based on a natural language intent.
arXiv Detail & Related papers (2022-02-16T00:36:33Z)
- CodeRetriever: Unimodal and Bimodal Contrastive Learning [128.06072658302165]
We propose the CodeRetriever model, which combines unimodal and bimodal contrastive learning to train function-level code semantic representations.
For unimodal contrastive learning, we design a semantic-guided method to build positive code pairs based on the documentation and function name.
For bimodal contrastive learning, we leverage the documentation and in-line comments of code to build text-code pairs.
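For intuition, the sketch below shows a standard InfoNCE-style contrastive loss over paired embeddings with in-batch negatives; this generic formulation is an assumption here, not necessarily CodeRetriever's exact objective.

```python
# Generic InfoNCE loss over paired embeddings (a common formulation,
# assumed for illustration).
import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor,
             temperature: float = 0.05) -> torch.Tensor:
    """anchor/positive: (batch, dim). Row i of `positive` is the positive for
    row i of `anchor`; every other row acts as an in-batch negative. For
    unimodal learning both sides would be code embeddings; for bimodal
    learning one side is code and the other its documentation or comments."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.T / temperature  # (batch, batch) similarities
    labels = torch.arange(anchor.size(0))       # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

code_emb, text_emb = torch.randn(8, 256), torch.randn(8, 256)
loss = info_nce(code_emb, text_emb)  # bimodal text-code objective
```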
arXiv Detail & Related papers (2022-01-26T10:54:30Z)
- Incorporating External Knowledge through Pre-training for Natural Language to Code Generation [97.97049697457425]
Open-domain code generation aims to generate code in a general-purpose programming language from natural language (NL) intents.
We explore the effectiveness of incorporating two varieties of external knowledge into NL-to-code generation: automatically mined NL-code pairs from the online programming QA forum StackOverflow and programming language API documentation.
Our evaluations show that combining the two sources with data augmentation and retrieval-based data re-sampling improves the current state-of-the-art by up to 2.2% absolute BLEU score on the code generation testbed CoNaLa.
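As a toy illustration of retrieval-based re-sampling, the sketch below up-weights mined NL-code pairs by their similarity to a set of target intents before sampling a training set; the similarity measure and sampling scheme are assumptions, not the paper's exact procedure.

```python
# Toy retrieval-based data re-sampling (the overlap-based relevance score
# is an invented stand-in).
import random

def resample(mined_pairs: list[tuple[str, str]],
             target_intents: list[str], n: int) -> list[tuple[str, str]]:
    def relevance(nl: str) -> float:
        # Token overlap with the target intents, +1 so no weight is zero.
        tokens = set(nl.split())
        return 1 + max(len(tokens & set(t.split())) for t in target_intents)
    weights = [relevance(nl) for nl, _ in mined_pairs]
    return random.choices(mined_pairs, weights=weights, k=n)

mined = [("sort a list", "sorted(xs)"), ("open a socket", "socket.socket()")]
print(resample(mined, target_intents=["sort values in a list"], n=2))
```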
arXiv Detail & Related papers (2020-04-20T01:45:27Z)