SkCoder: A Sketch-based Approach for Automatic Code Generation
- URL: http://arxiv.org/abs/2302.06144v4
- Date: Thu, 7 Sep 2023 11:26:46 GMT
- Title: SkCoder: A Sketch-based Approach for Automatic Code Generation
- Authors: Jia Li, Yongmin Li, Ge Li, Zhi Jin, Yiyang Hao, Xing Hu
- Abstract summary: We propose a sketch-based code generation approach named SkCoder to mimic developers' code reuse behavior.
Given a natural language requirement, SkCoder retrieves a similar code snippet, extracts relevant parts as a code sketch, and edits the sketch into the desired code.
Experimental results show that SkCoder can generate more correct programs and outperforms the state-of-the-art CodeT5-base by 30.30%, 35.39%, and 29.62% on three datasets.
- Score: 44.39900916450189
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, deep learning techniques have shown great success in automatic code
generation. Inspired by code reuse, some researchers have proposed copy-based
approaches that copy content from similar code snippets to obtain better
performance. In practice, human developers recognize the content in a similar
snippet that is relevant to their needs, which can be viewed as a code sketch,
and then edit the sketch into the desired code. However, existing copy-based
approaches ignore code sketches and tend to repeat the similar code without the
necessary modifications, which leads to incorrect results.
In this paper, we propose a sketch-based code generation approach named
SkCoder to mimic developers' code reuse behavior. Given a natural language
requirement, SkCoder retrieves a similar code snippet, extracts relevant parts
as a code sketch, and edits the sketch into the desired code. Our motivation
is that the extracted sketch provides a well-formed pattern that tells the model
"how to write", while post-editing adds requirement-specific details to
the sketch and outputs the complete code. We conduct experiments on two public
datasets and a new dataset collected in this work, comparing our approach to
20 baselines using 5 widely used metrics. Experimental results show that (1)
SkCoder generates more correct programs and outperforms the
state-of-the-art CodeT5-base by 30.30%, 35.39%, and 29.62% on the three datasets.
(2) Our approach is effective for multiple code generation models and improves
them by up to 120.1% in Pass@1. (3) We investigate three plausible code
sketches and discuss the importance of sketches. (4) We manually evaluate the
generated code and demonstrate the superiority of SkCoder in three aspects.
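The retrieve-extract-edit pipeline described in the abstract can be summarized in a short sketch. The snippet below is a minimal illustration, assuming a simple token-overlap retriever and two seq2seq models (a sketcher and an editor) exposed through a hypothetical `generate(text) -> str` wrapper; all names are illustrative and are not SkCoder's released implementation.

```python
# A minimal sketch of the retrieve-extract-edit pipeline described in the
# abstract. The retriever is a token-overlap stand-in, and `sketcher` /
# `editor` are assumed to be any seq2seq models behind a hypothetical
# generate(text) -> str wrapper; this is not SkCoder's released code.
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    requirement: str  # natural language requirement
    code: str         # paired reference code snippet


def retrieve_similar(requirement: str, corpus: List[Example]) -> Example:
    """Return the corpus example whose requirement shares the most tokens with the query."""
    query = set(requirement.lower().split())
    return max(corpus, key=lambda ex: len(query & set(ex.requirement.lower().split())))


def extract_sketch(requirement: str, similar_code: str, sketcher) -> str:
    """Keep the parts of the similar code relevant to the requirement (the code sketch)."""
    return sketcher.generate(f"{requirement} <sep> {similar_code}")


def edit_sketch(requirement: str, sketch: str, editor) -> str:
    """Add requirement-specific details to the sketch and output the complete code."""
    return editor.generate(f"{requirement} <sep> {sketch}")


def skcoder_generate(requirement: str, corpus: List[Example], sketcher, editor) -> str:
    similar = retrieve_similar(requirement, corpus)
    sketch = extract_sketch(requirement, similar.code, sketcher)
    return edit_sketch(requirement, sketch, editor)
```

In the paper, the sketcher and editor are trained neural models; the token-overlap retriever here only stands in for a real code-retrieval component.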
Related papers
- CodeS: Natural Language to Code Repository via Multi-Layer Sketch [33.29324601674667]
We introduce a new software engineering task, namely Natural Language to code Repository (NL2Repo).
This task aims to generate an entire code repository from its natural language requirements.
We propose a simple yet effective framework, CodeS, which decomposes NL2Repo into multiple sub-tasks via a multi-layer sketch.
arXiv Detail & Related papers (2024-03-25T06:09:55Z)
- SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code Summarization [51.67317895094664]
This paper studies file-level code summarization, which can assist programmers in understanding and maintaining large source code projects.
We propose SparseCoder, an identifier-aware sparse transformer for effectively handling long code sequences.
arXiv Detail & Related papers (2024-01-26T09:23:27Z)
- Soft-Labeled Contrastive Pre-training for Function-level Code Representation [127.71430696347174]
We present SCodeR, a soft-labeled contrastive pre-training framework with two positive sample construction methods.
Considering the relevance between code snippets in a large-scale corpus, the soft-labeled contrastive pre-training can obtain fine-grained soft labels.
SCodeR achieves new state-of-the-art performance on four code-related tasks over seven datasets.
arXiv Detail & Related papers (2022-10-18T05:17:37Z)
- CERT: Continual Pre-Training on Sketches for Library-Oriented Code Generation [46.45445767488915]
We show how to leverage an unlabelled code corpus to train a model for library-oriented code generation.
We craft two benchmarks named PandasEval and NumpyEval to evaluate library-oriented code generation.
arXiv Detail & Related papers (2022-06-14T14:44:34Z)
- CODE-MVP: Learning to Represent Source Code from Multiple Views with Contrastive Pre-Training [26.695345034376388]
We propose to integrate different views with the natural-language description of source code into a unified framework with Multi-View contrastive Pre-training.
Specifically, we first extract multiple code views using compiler tools, and learn the complementary information among them under a contrastive learning framework.
Experiments on three downstream tasks over five datasets demonstrate the superiority of CODE-MVP when compared with several state-of-the-art baselines.
arXiv Detail & Related papers (2022-05-04T12:40:58Z)
- InCoder: A Generative Model for Code Infilling and Synthesis [88.46061996766348]
We introduce InCoder, a unified generative model that can perform program synthesis (via left-to-right generation) and editing (via infilling).
InCoder is trained to generate code files from a large corpus of permissively licensed code.
Our model is the first generative model that is able to directly perform zero-shot code infilling.
arXiv Detail & Related papers (2022-04-12T16:25:26Z)
- CodeRetriever: Unimodal and Bimodal Contrastive Learning [128.06072658302165]
We propose the CodeRetriever model, which combines unimodal and bimodal contrastive learning to train function-level code semantic representations.
For unimodal contrastive learning, we design a semantic-guided method to build positive code pairs based on the documentation and function name.
For bimodal contrastive learning, we leverage the documentation and in-line comments of code to build text-code pairs; a minimal sketch of this kind of objective follows.
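As a generic illustration of the in-batch contrastive objective underlying this kind of pre-training, the sketch below computes an InfoNCE-style loss over paired embeddings. It is an assumed stand-in rather than CodeRetriever's released code; any encoder that maps code or text to a fixed-size vector can produce the inputs.

```python
# Generic in-batch contrastive (InfoNCE) loss over paired embeddings.
# `anchor` and `positive` hold embeddings of matched pairs: code/code pairs
# for a unimodal objective, or text/code pairs for a bimodal objective.
# Illustrative sketch only, not CodeRetriever's implementation.
import torch
import torch.nn.functional as F


def info_nce(anchor: torch.Tensor, positive: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """anchor, positive: (batch, dim); every other in-batch item acts as a negative."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature            # (batch, batch) cosine similarities
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)                 # diagonal entries are the positives
```

For the bimodal case, `anchor` would hold docstring or comment embeddings and `positive` the embeddings of the corresponding function bodies.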
arXiv Detail & Related papers (2022-01-26T10:54:30Z)
- LLC: Accurate, Multi-purpose Learnt Low-dimensional Binary Codes [55.32790803903619]
We propose a novel method for Learning Low-dimensional binary Codes (LLC) for instances as well as classes.
Our method does not require any side-information, like annotated attributes or label meta-data.
We demonstrate that the learnt codes capture intrinsically important features in the data, by discovering an intuitive taxonomy over classes.
arXiv Detail & Related papers (2021-06-02T21:57:52Z)
- Self-Supervised Contrastive Learning for Code Retrieval and Summarization via Semantic-Preserving Transformations [28.61567319928316]
Corder is a self-supervised contrastive learning framework for source code models.
The key innovation is that we train the source code model by asking it to recognize similar and dissimilar code snippets produced by semantic-preserving transformations; one such transformation is sketched below.
We have shown that the code models pretrained by Corder substantially outperform the other baselines on code-to-code retrieval, text-to-code retrieval, and code-to-text summarization tasks.
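The snippet below shows one generic example of a semantic-preserving transformation, systematic renaming of locally bound variables, which yields a behavior-equivalent "view" of a program for positive-pair construction. It is illustrative only and not part of Corder's released transformation set.

```python
# Generic semantic-preserving transformation: rename locally bound variables
# to v0, v1, ... without changing program behavior. Illustrative only.
import ast


class RenameVariables(ast.NodeTransformer):
    """Rename variables that are assigned in the snippet; leave externals (e.g. print) alone."""

    def __init__(self) -> None:
        self.mapping: dict[str, str] = {}

    def visit_Name(self, node: ast.Name) -> ast.Name:
        # Only introduce fresh names for variables in Store context, so builtins
        # and names defined outside the snippet are left untouched.
        if isinstance(node.ctx, ast.Store) and node.id not in self.mapping:
            self.mapping[node.id] = f"v{len(self.mapping)}"
        if node.id in self.mapping:
            node.id = self.mapping[node.id]
        return node


def rename_view(source: str) -> str:
    """Produce a semantically equivalent view of `source` for a positive pair."""
    tree = RenameVariables().visit(ast.parse(source))
    return ast.unparse(tree)  # ast.unparse requires Python 3.9+


print(rename_view("total = price * qty\nprint(total)"))
# -> v0 = price * qty
#    print(v0)
```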
arXiv Detail & Related papers (2020-09-06T13:31:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.