GypSum: Learning Hybrid Representations for Code Summarization
- URL: http://arxiv.org/abs/2204.12916v1
- Date: Tue, 26 Apr 2022 07:44:49 GMT
- Title: GypSum: Learning Hybrid Representations for Code Summarization
- Authors: Yu Wang, Yu Dong, Xuesong Lu, Aoying Zhou
- Abstract summary: GypSum is a new deep learning model that learns hybrid representations using graph attention neural networks and a pre-trained programming and natural language model.
We modify the encoder-decoder sublayer in the Transformer's decoder to fuse the representations and propose a dual-copy mechanism to facilitate summary generation.
- Score: 21.701127410434914
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Code summarization with deep learning has been widely studied in recent
years. Current deep learning models for code summarization generally follow the
principle of neural machine translation and adopt the encoder-decoder
framework, where the encoder learns the semantic representations from source
code and the decoder transforms the learnt representations into human-readable
text that describes the functionality of code snippets. Although they achieve
new state-of-the-art performance, we notice that current models often either
generate less fluent summaries or fail to capture the core functionality,
since they usually focus on a single type of code representation. As such, we
propose GypSum, a new deep learning model that
learns hybrid representations using graph attention neural networks and a
pre-trained programming and natural language model. We introduce particular
edges related to the control flow of a code snippet into the abstract syntax
tree for graph construction, and design two encoders to learn from the graph
and the token sequence of source code, respectively. We modify the
encoder-decoder sublayer in the Transformer's decoder to fuse the
representations and propose a dual-copy mechanism to facilitate summary
generation. Experimental results demonstrate the superior performance of GypSum
over existing code summarization models.
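The two ideas highlighted in the abstract, a decoder sublayer that fuses the graph and token encoder outputs, and a dual-copy mechanism for summary generation, can be pictured with a short sketch. The PyTorch-style code below is a minimal, hypothetical illustration only: the names (HybridFusionDecoderLayer, dual_copy_distribution, gate_proj), the concatenation-based fusion, and the three-way copy gate are assumptions of ours, not the authors' implementation.

```python
# Hypothetical sketch, not GypSum's actual code:
# (1) a decoder layer with two cross-attention sublayers, one over the
#     AST/control-flow graph encoder output and one over the pre-trained
#     token encoder output, fused by concatenation + projection;
# (2) a pointer-generator-style dual-copy mix over vocabulary, code tokens,
#     and graph nodes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridFusionDecoderLayer(nn.Module):
    def __init__(self, d_model=256, nhead=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        # One cross-attention per encoder memory.
        self.graph_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.token_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.fuse = nn.Linear(2 * d_model, d_model)  # concat-then-project fusion (assumed)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, tgt, graph_mem, token_mem):
        x = self.norm1(tgt + self.self_attn(tgt, tgt, tgt, need_weights=False)[0])
        g, g_w = self.graph_attn(x, graph_mem, graph_mem)   # context over graph nodes
        t, t_w = self.token_attn(x, token_mem, token_mem)   # context over code tokens
        x = self.norm2(x + self.fuse(torch.cat([g, t], dim=-1)))
        x = self.norm3(x + self.ffn(x))
        return x, g_w, t_w  # attention weights feed the dual-copy step below

def dual_copy_distribution(dec_state, vocab_logits, tok_attn, tok_ids,
                           graph_attn, graph_ids, gate_proj):
    """Mix generating from the vocabulary with copying from the code token
    sequence and from the graph nodes (pointer-generator style sketch)."""
    p_vocab = F.softmax(vocab_logits, dim=-1)             # (B, L, V)
    gates = F.softmax(gate_proj(dec_state), dim=-1)       # 3-way gate: gen / copy-token / copy-node
    p_gen, p_tok, p_node = gates.unbind(dim=-1)
    # Scatter attention mass onto the vocabulary ids of the copied items.
    copy_tok = torch.zeros_like(p_vocab).scatter_add_(
        -1, tok_ids.unsqueeze(1).expand_as(tok_attn), tok_attn)
    copy_node = torch.zeros_like(p_vocab).scatter_add_(
        -1, graph_ids.unsqueeze(1).expand_as(graph_attn), graph_attn)
    return (p_gen.unsqueeze(-1) * p_vocab
            + p_tok.unsqueeze(-1) * copy_tok
            + p_node.unsqueeze(-1) * copy_node)
```

Here gate_proj would be something like nn.Linear(d_model, 3), and tok_ids/graph_ids map copied items into the summary vocabulary; these details, like the fusion scheme itself, are one plausible reading of "modify the encoder-decoder sublayer ... and propose a dual-copy mechanism" rather than the paper's exact design.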
Related papers
- ESALE: Enhancing Code-Summary Alignment Learning for Source Code Summarization [21.886950861445122]
Code summarization aims to automatically generate succinct natural language summaries for given code snippets.
This paper proposes a novel approach to improve code summarization based on summary-focused tasks.
arXiv Detail & Related papers (2024-07-01T03:06:51Z)
- SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code Summarization [51.67317895094664]
This paper studies file-level code summarization, which can assist programmers in understanding and maintaining large source code projects.
We propose SparseCoder, an identifier-aware sparse transformer for effectively handling long code sequences.
arXiv Detail & Related papers (2024-01-26T09:23:27Z)
- Statement-based Memory for Neural Source Code Summarization [4.024850952459758]
Code summarization underpins software documentation for programmers.
Lately, neural source code summarization has emerged as the frontier of research into automated code summarization techniques.
We present a statement-based memory encoder that learns the important elements of flow during training, leading to a statement-based subroutine representation.
arXiv Detail & Related papers (2023-07-21T17:04:39Z)
- Tram: A Token-level Retrieval-augmented Mechanism for Source Code Summarization [76.57699934689468]
We propose a fine-grained Token-level retrieval-augmented mechanism (Tram) on the decoder side to enhance the performance of neural models.
To overcome the challenge of token-level retrieval in capturing contextual code semantics, we also propose integrating code semantics into individual summary tokens.
arXiv Detail & Related papers (2023-05-18T16:02:04Z)
- Real-World Compositional Generalization with Disentangled Sequence-to-Sequence Learning [81.24269148865555]
A recently proposed Disentangled sequence-to-sequence model (Dangle) shows promising generalization capability.
We introduce two key modifications to this model which encourage more disentangled representations and improve its compute and memory efficiency.
Specifically, instead of adaptively re-encoding source keys and values at each time step, we disentangle their representations and only re-encode keys periodically.
arXiv Detail & Related papers (2022-12-12T15:40:30Z)
- StructCoder: Structure-Aware Transformer for Code Generation [13.797842927671846]
We introduce a structure-aware Transformer decoder that models both syntax and data flow to enhance the quality of generated code.
The proposed StructCoder model achieves state-of-the-art performance on code translation and text-to-code generation tasks.
arXiv Detail & Related papers (2022-06-10T17:26:31Z)
- What Do They Capture? -- A Structural Analysis of Pre-Trained Language Models for Source Code [32.345301158791045]
Pre-trained language models for source code have been proposed to model the context of code.
These models leverage masked pre-training and the Transformer architecture.
It is not clear why these models work and what feature correlations they can capture.
arXiv Detail & Related papers (2022-02-14T16:22:10Z)
- Project-Level Encoding for Neural Source Code Summarization of Subroutines [6.939768185086755]
We present a project-level encoder to improve models of code summarization.
We use that representation to augment the encoder of state-of-the-art neural code summarization techniques.
arXiv Detail & Related papers (2021-03-22T06:01:07Z)
- GraphCodeBERT: Pre-training Code Representations with Data Flow [97.00641522327699]
We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code.
We use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables.
We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement.
arXiv Detail & Related papers (2020-09-17T15:25:56Z)
- A Transformer-based Approach for Source Code Summarization [86.08359401867577]
We learn code representation for summarization by modeling the pairwise relationship between code tokens.
We show that, although the approach is simple, it outperforms state-of-the-art techniques by a significant margin.
arXiv Detail & Related papers (2020-05-01T23:29:36Z)
- Auto-Encoding Twin-Bottleneck Hashing [141.5378966676885]
This paper proposes an efficient and adaptive code-driven graph.
It is updated by decoding in the context of an auto-encoder.
Experiments on benchmarked datasets clearly show the superiority of our framework over the state-of-the-art hashing methods.
arXiv Detail & Related papers (2020-02-27T05:58:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.