StructCoder: Structure-Aware Transformer for Code Generation
- URL: http://arxiv.org/abs/2206.05239v3
- Date: Tue, 30 Jan 2024 22:21:04 GMT
- Title: StructCoder: Structure-Aware Transformer for Code Generation
- Authors: Sindhu Tipirneni, Ming Zhu, Chandan K. Reddy
- Abstract summary: We introduce a structure-aware Transformer decoder that models both syntax and data flow to enhance the quality of generated code.
The proposed StructCoder model achieves state-of-the-art performance on code translation and text-to-code generation tasks.
- Score: 13.797842927671846
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: There has been a recent surge of interest in automating software engineering
tasks using deep learning. This paper addresses the problem of code generation,
where the goal is to generate target code given source code in a different
language or a natural language description. Most state-of-the-art deep learning
models for code generation use training strategies primarily designed for
natural language. However, understanding and generating code requires a more
rigorous comprehension of the code syntax and semantics. With this motivation,
we develop an encoder-decoder Transformer model where both the encoder and
decoder are explicitly trained to recognize the syntax and data flow in the
source and target codes, respectively. We not only make the encoder
structure-aware by leveraging the source code's syntax tree and data flow
graph, but we also support the decoder in preserving the syntax and data flow
of the target code by introducing two novel auxiliary tasks: AST (Abstract
Syntax Tree) paths prediction and data flow prediction. To the best of our
knowledge, this is the first work to introduce a structure-aware Transformer
decoder that models both syntax and data flow to enhance the quality of
generated code. The proposed StructCoder model achieves state-of-the-art
performance on code translation and text-to-code generation tasks in the
CodeXGLUE benchmark, and improves over baselines of similar size on the APPS
code generation benchmark. Our code is publicly available at
https://github.com/reddy-lab-code-research/StructCoder/.
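As a rough illustration of the structural targets involved, the sketch below extracts root-to-leaf node-type paths from a Python AST, which is the kind of information the decoder's AST paths auxiliary task is trained to predict. It is a minimal sketch that assumes Python's built-in ast module rather than the parser used in the paper; the function name is illustrative, not StructCoder's API.

```python
import ast

def root_to_leaf_paths(code):
    """Return the node-type path from the AST root to every leaf.
    Illustrative only: StructCoder trains its decoder to predict such
    paths for the code tokens it generates (details differ in the paper)."""
    tree = ast.parse(code)
    paths = []

    def walk(node, prefix):
        prefix = prefix + [type(node).__name__]
        children = list(ast.iter_child_nodes(node))
        if not children:              # leaf node: record the full path
            paths.append(prefix)
        for child in children:
            walk(child, prefix)

    walk(tree, [])
    return paths

for path in root_to_leaf_paths("y = x + 1"):
    print(" -> ".join(path))
# Module -> Assign -> Name -> Store
# Module -> Assign -> BinOp -> Name -> Load
# Module -> Assign -> BinOp -> Add
# Module -> Assign -> BinOp -> Constant
```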
Related papers
- SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code Summarization [51.67317895094664]
This paper studies file-level code summarization, which can assist programmers in understanding and maintaining large source code projects.
We propose SparseCoder, an identifier-aware sparse transformer for effectively handling long code sequences.
arXiv Detail & Related papers (2024-01-26T09:23:27Z)
- Statement-based Memory for Neural Source Code Summarization [4.024850952459758]
Code summarization underpins software documentation for programmers.
Lately, neural source code summarization has emerged as the frontier of research into automated code summarization techniques.
We present a statement-based memory encoder that learns the important elements of flow during training, leading to a statement-based subroutine representation.
arXiv Detail & Related papers (2023-07-21T17:04:39Z)
- Outline, Then Details: Syntactically Guided Coarse-To-Fine Code Generation [61.50286000143233]
ChainCoder is a program synthesis language model that generates Python code progressively.
A tailored transformer architecture is leveraged to jointly encode the natural language descriptions and syntactically aligned I/O data samples.
arXiv Detail & Related papers (2023-04-28T01:47:09Z)
- Knowledge Transfer for Pseudo-code Generation from Low Resource Programming Language [13.716669765394293]
We focus on transferring the knowledge acquired by a code-to-pseudocode neural model trained on a high-resource PL (C++) using parallel code-pseudocode data.
We observe an improvement of 23.27% in the success rate of the generated C code through back translation.
arXiv Detail & Related papers (2023-03-16T03:38:08Z)
- GypSum: Learning Hybrid Representations for Code Summarization [21.701127410434914]
GypSum is a new deep learning model that learns hybrid representations using graph attention neural networks and a pre-trained programming and natural language model.
We modify the encoder-decoder sublayer in the Transformer's decoder to fuse the representations and propose a dual-copy mechanism to facilitate summary generation.
arXiv Detail & Related papers (2022-04-26T07:44:49Z)
- ReACC: A Retrieval-Augmented Code Completion Framework [53.49707123661763]
We propose a retrieval-augmented code completion framework that leverages both lexical copying and retrieval of semantically similar code.
We evaluate our approach on the code completion task in the Python and Java programming languages, achieving state-of-the-art performance on the CodeXGLUE benchmark.
arXiv Detail & Related papers (2022-03-15T08:25:08Z)
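A minimal sketch of the retrieval-augmented completion idea summarized above: fetch the most lexically similar snippet from a code database and prepend it to the unfinished code before handing the result to a generator. ReACC combines a lexical retriever and a dense semantic retriever with a pre-trained completion model; the Jaccard retriever, function names, and toy codebase below are illustrative assumptions.

```python
import re

def tokenize(code):
    """Crude lexical tokenizer: identifiers, numbers, and single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)

def retrieve(query_code, codebase):
    """Return the snippet with the highest token-level Jaccard similarity
    to the query (a stand-in for ReACC's lexical retriever)."""
    q = set(tokenize(query_code))
    def jaccard(snippet):
        s = set(tokenize(snippet))
        return len(q & s) / max(len(q | s), 1)
    return max(codebase, key=jaccard)

def build_prompt(unfinished_code, codebase):
    """Prepend the retrieved snippet so a completion model can copy from it."""
    similar = retrieve(unfinished_code, codebase)
    return similar + "\n# --- retrieved context above ---\n" + unfinished_code

codebase = [
    "def mean(xs):\n    return sum(xs) / len(xs)",
    "def read_lines(path):\n    with open(path) as f:\n        return f.readlines()",
]
print(build_prompt("def average(values):\n    return", codebase))
```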
- UniXcoder: Unified Cross-Modal Pre-training for Code Representation [65.6846553962117]
We present UniXcoder, a unified cross-modal pre-trained model for programming languages.
We propose a one-to-one mapping method that transforms an AST into a sequence structure while retaining all structural information from the tree.
We evaluate UniXcoder on five code-related tasks over nine datasets.
arXiv Detail & Related papers (2022-03-08T04:48:07Z)
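The one-to-one AST-to-sequence mapping described above can be pictured as a bracketed traversal in which every subtree is opened and closed explicitly, so the flat token sequence can be mapped back to the tree without losing structure. The sketch below uses Python's ast module and made-up bracket tokens; UniXcoder defines its own traversal order and special tokens.

```python
import ast

def flatten(node):
    """Serialize an AST subtree into a flat token sequence whose explicit
    open/close brackets make the mapping invertible (illustrative only)."""
    label = type(node).__name__
    children = list(ast.iter_child_nodes(node))
    if not children:                  # leaf: emit the bare node type
        return [label]
    tokens = [f"<{label}>"]           # open the subtree
    for child in children:
        tokens += flatten(child)
    tokens.append(f"</{label}>")      # close the subtree
    return tokens

print(" ".join(flatten(ast.parse("a = b + 1"))))
# <Module> <Assign> <Name> Store </Name> <BinOp> <Name> Load </Name> Add Constant </BinOp> </Assign> </Module>
```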
- Contrastive Learning for Source Code with Structural and Functional Properties [66.10710134948478]
We present BOOST, a novel self-supervised model whose pre-training focuses on the characteristics of source code.
We employ automated, structure-guided code transformation algorithms that generate functionally equivalent code which looks drastically different from the original.
We train the model with a contrastive learning objective that pulls functionally equivalent code closer together and pushes distinct code further apart.
arXiv Detail & Related papers (2021-10-08T02:56:43Z)
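The contrastive objective described above can be sketched with a standard in-batch InfoNCE loss: the embeddings of a snippet and its functionally equivalent transform form a positive pair, and every other snippet in the batch serves as a negative. The NumPy implementation below is a generic illustration under that assumption, not BOOST's exact loss.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """In-batch contrastive loss: row i of `anchors` should be most similar
    to row i of `positives` (its functionally equivalent transform) and
    dissimilar to every other row."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                 # cosine similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))             # positives on the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8))
print(info_nce(z, z + 0.01 * rng.normal(size=(4, 8))))  # near-duplicates -> low loss
```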
- Project-Level Encoding for Neural Source Code Summarization of Subroutines [6.939768185086755]
We present a project-level encoder to improve models of code summarization.
We use the resulting project-level representation to augment the encoder of state-of-the-art neural code summarization techniques.
arXiv Detail & Related papers (2021-03-22T06:01:07Z)
- GraphCodeBERT: Pre-training Code Representations with Data Flow [97.00641522327699]
We present GraphCodeBERT, a pre-trained model for programming languages that considers the inherent structure of code.
We use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables.
We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement.
arXiv Detail & Related papers (2020-09-17T15:25:56Z)
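The "where-the-value-comes-from" relation used by GraphCodeBERT (and by StructCoder above) can be approximated for straight-line Python code by linking each variable read to the most recent assignment of that name. The helper below is a simplified sketch using Python's ast module, not the extraction pipeline used by either paper; it ignores branching, loops, and aliasing.

```python
import ast

def data_flow_edges(code):
    """Rough 'where-the-value-comes-from' edges for straight-line code:
    link each variable read to the line of its most recent prior assignment."""
    tree = ast.parse(code)
    last_def, edges = {}, []
    for stmt in tree.body:                         # statements in source order
        for node in ast.walk(stmt):                # first collect the reads
            if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
                if node.id in last_def:
                    edges.append((node.id, last_def[node.id], stmt.lineno))
        if isinstance(stmt, ast.Assign):           # then record new definitions
            for target in stmt.targets:
                if isinstance(target, ast.Name):
                    last_def[target.id] = stmt.lineno
    return edges

print(data_flow_edges("x = 1\ny = x + 2\nz = y * x"))
# [('x', 1, 2), ('y', 2, 3), ('x', 1, 3)]  (variable, defined-on-line, used-on-line)
```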