CodeExp: Explanatory Code Document Generation
- URL: http://arxiv.org/abs/2211.15395v1
- Date: Fri, 25 Nov 2022 18:05:44 GMT
- Title: CodeExp: Explanatory Code Document Generation
- Authors: Haotian Cui, Chenglong Wang, Junjie Huang, Jeevana Priya Inala, Todd Mytkowicz, Bo Wang, Jianfeng Gao, Nan Duan
- Abstract summary: Existing code-to-text generation models produce only high-level summaries of code.
We conduct a human study to identify the criteria for high-quality explanatory docstrings for code.
We present a multi-stage fine-tuning strategy and baseline models for the task.
- Score: 94.43677536210465
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Developing models that can automatically generate detailed code explanations
can greatly benefit software maintenance and programming education. However,
existing code-to-text generation models often produce only high-level summaries
of code that do not capture implementation-level choices essential for these
scenarios. To fill in this gap, we propose the code explanation generation
task. We first conducted a human study to identify the criteria for
high-quality explanatory docstrings for code. Based on that, we collected and
refined a large-scale code docstring corpus and formulated automatic evaluation
metrics that best match human assessments. Finally, we present a multi-stage
fine-tuning strategy and baseline models for the task. Our experiments show
that (1) our refined training dataset lets models achieve better performance on
the explanation generation task than a 15x larger unrefined dataset, and (2)
fine-tuned models can generate well-structured long docstrings
comparable to human-written ones. We envision that our training dataset,
human-evaluation protocol, recommended metrics, and fine-tuning strategy can
boost future code explanation research. The code and annotated data are
available at https://github.com/subercui/CodeExp.
Related papers
- Enriching Source Code with Contextual Data for Code Completion Models: An Empirical Study [4.438873396405334]
We aim to answer whether making code easier to understand through using contextual data improves the performance of pre-trained code language models for the task of code completion.
For comments, we find that the models perform better in the presence of multi-line comments.
arXiv Detail & Related papers (2023-04-24T17:09:14Z)
- Stochastic Code Generation [1.7205106391379026]
Large language models pre-trained for code generation can generate high-quality short code but often struggle with generating coherent long code.
This issue is also observed in language modeling for long text generation.
In this study, we investigate whether this technique can be applied to code generation to improve coherence.
arXiv Detail & Related papers (2023-04-14T00:01:05Z)
- Generation-Augmented Query Expansion For Code Retrieval [51.20943646688115]
We propose a generation-augmented query expansion framework, inspired by the human retrieval process of sketching an answer before searching.
We achieve new state-of-the-art results on the CodeSearchNet benchmark.
arXiv Detail & Related papers (2022-12-20T23:49:37Z)
- Execution-based Evaluation for Data Science Code Generation Models [97.96608263010913]
We introduce ExeDS, a dataset for execution-based evaluation of data science code generation tasks.
ExeDS contains a set of 534 problems from Jupyter Notebooks, each consisting of code context, task description, reference program, and desired execution output.
We evaluate the execution performance of five state-of-the-art code generation models that have achieved high surface-form evaluation scores.
arXiv Detail & Related papers (2022-11-17T07:04:11Z)
- Incorporating Domain Knowledge through Task Augmentation for Front-End JavaScript Code Generation [10.75138604869187]
In some domain-specific scenarios, building a large paired corpus for code generation is difficult because there is no directly available pairing data.
We propose a task augmentation method that incorporates domain knowledge into code generation models through auxiliary tasks and a Subtoken-TranX model.
Our experimental results demonstrate that the subtoken-level TranX model outperforms the original TranX model and the Transformer model on our dataset.
arXiv Detail & Related papers (2022-08-22T06:57:51Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- CodeRetriever: Unimodal and Bimodal Contrastive Learning [128.06072658302165]
We propose the CodeRetriever model, which combines unimodal and bimodal contrastive learning to train function-level code semantic representations.
For unimodal contrastive learning, we design a semantic-guided method to build positive code pairs based on the documentation and function name.
For bimodal contrastive learning, we leverage the documentation and in-line comments of code to build text-code pairs.
arXiv Detail & Related papers (2022-01-26T10:54:30Z)
- What do pre-trained code models know about code? [9.60966128833701]
We use diagnostic tasks called probes to investigate pre-trained code models.
BERT (pre-trained on English), CodeBERT and CodeBERTa (pre-trained on source code and natural language documentation), and GraphCodeBERT (pre-trained on source code with dataflow) are investigated.
arXiv Detail & Related papers (2021-08-25T16:20:17Z)
- GraphCodeBERT: Pre-training Code Representations with Data Flow [97.00641522327699]
We present GraphCodeBERT, a pre-trained model for programming languages that considers the inherent structure of code.
We use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables.
We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement.
arXiv Detail & Related papers (2020-09-17T15:25:56Z)