Project-Level Encoding for Neural Source Code Summarization of
Subroutines
- URL: http://arxiv.org/abs/2103.11599v1
- Date: Mon, 22 Mar 2021 06:01:07 GMT
- Title: Project-Level Encoding for Neural Source Code Summarization of
Subroutines
- Authors: Aakash Bansal, Sakib Haque, Collin McMillan
- Abstract summary: We present a project-level encoder to improve models of code summarization.
We create a vectorized representation of selected code files in a software project and use it to augment the encoder of state-of-the-art neural code summarization techniques.
- Score: 6.939768185086755
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Source code summarization of a subroutine is the task of writing a short,
natural language description of that subroutine. The description usually serves
in documentation aimed at programmers, where even a brief phrase (e.g.
"compresses data to a zip file") can help readers rapidly comprehend what a
subroutine does without resorting to reading the code itself. Techniques based
on neural networks (and encoder-decoder model designs in particular) have
established themselves as the state-of-the-art. Yet a problem widely recognized
with these models is that they assume the information needed to create a
summary is present within the code being summarized itself - an assumption
which is at odds with program comprehension literature. Thus a current research
frontier lies in the question of encoding source code context into neural
models of summarization. In this paper, we present a project-level encoder to
improve models of code summarization. By project-level, we mean that we create
a vectorized representation of selected code files in a software project, and
use that representation to augment the encoder of state-of-the-art neural code
summarization techniques. We demonstrate how our encoder improves several
existing models, and provide guidelines for maximizing improvement while
controlling time and resource costs in model size.
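As a rough illustration of the idea described above, the PyTorch sketch below encodes a set of project files into a single context vector and concatenates that vector onto every per-subroutine encoder state before decoding. It is an assumption-laden sketch, not the authors' implementation; all module, parameter, and tensor names (ProjectEncoder, file_token_ids, and so on) are hypothetical.

```python
# Hypothetical sketch: a project-level encoder whose output vector augments
# a per-subroutine encoder in a plain GRU encoder-decoder summarizer.
# Not the paper's implementation; all names are illustrative assumptions.
import torch
import torch.nn as nn

class ProjectEncoder(nn.Module):
    """Encodes selected code files of a project into one context vector."""
    def __init__(self, vocab_size, emb_dim=256, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.file_gru = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, file_token_ids):
        # file_token_ids: (batch, num_files, tokens_per_file)
        b, f, t = file_token_ids.shape
        flat = file_token_ids.reshape(b * f, t)
        _, h = self.file_gru(self.embed(flat))          # h: (1, b*f, hid)
        file_vecs = h.squeeze(0).reshape(b, f, -1)      # one vector per file
        return file_vecs.mean(dim=1)                    # (batch, hid) project vector

class ProjectAugmentedSummarizer(nn.Module):
    """Subroutine encoder-decoder; the project vector is concatenated onto
    every encoder state before the decoder is initialized."""
    def __init__(self, vocab_size, emb_dim=256, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.code_enc = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.project_enc = ProjectEncoder(vocab_size, emb_dim, hid_dim)
        self.decoder = nn.GRU(emb_dim, 2 * hid_dim, batch_first=True)
        self.out = nn.Linear(2 * hid_dim, vocab_size)

    def forward(self, code_ids, file_ids, summary_ids):
        enc_states, _ = self.code_enc(self.embed(code_ids))      # (b, n, hid)
        proj_vec = self.project_enc(file_ids)                    # (b, hid)
        proj_rep = proj_vec.unsqueeze(1).expand_as(enc_states)   # broadcast over tokens
        augmented = torch.cat([enc_states, proj_rep], dim=-1)    # (b, n, 2*hid)
        # collapse the augmented states into the decoder's initial hidden state
        dec_h0 = augmented.mean(dim=1).unsqueeze(0).contiguous() # (1, b, 2*hid)
        dec_out, _ = self.decoder(self.embed(summary_ids), dec_h0)
        return self.out(dec_out)                                 # next-token logits
```

Teacher-forced training would apply a cross-entropy loss between these logits and the shifted reference summary. The paper's models additionally use attention, and they address the questions this sketch sidesteps: which project files to select and how large the project encoder may grow before time and resource costs outweigh the gain.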
Related papers
- ESALE: Enhancing Code-Summary Alignment Learning for Source Code Summarization [21.886950861445122]
Code summarization aims to automatically generate succinct natural language summaries for given code snippets.
This paper proposes a novel approach to improve code summarization based on summary-focused tasks.
arXiv Detail & Related papers (2024-07-01T03:06:51Z)
- Decoding at the Speed of Thought: Harnessing Parallel Decoding of Lexical Units for LLMs [57.27982780697922]
Large language models have demonstrated exceptional capability in natural language understanding and generation.
However, their generation speed is limited by the inherently sequential nature of their decoding process.
This paper introduces Lexical Unit Decoding, a novel decoding methodology implemented in a data-driven manner.
arXiv Detail & Related papers (2024-05-24T04:35:13Z)
- SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code Summarization [51.67317895094664]
This paper studies file-level code summarization, which can assist programmers in understanding and maintaining large source code projects.
We propose SparseCoder, an identifier-aware sparse transformer for effectively handling long code sequences.
arXiv Detail & Related papers (2024-01-26T09:23:27Z)
- Statement-based Memory for Neural Source Code Summarization [4.024850952459758]
Code summarization underpins software documentation for programmers.
Lately, neural source code summarization has emerged as the frontier of research into automated code summarization techniques.
We present a statement-based memory encoder that learns the important elements of flow during training, leading to a statement-based subroutine representation.
arXiv Detail & Related papers (2023-07-21T17:04:39Z)
- Tram: A Token-level Retrieval-augmented Mechanism for Source Code Summarization [76.57699934689468]
We propose a fine-grained Token-level retrieval-augmented mechanism (Tram) on the decoder side to enhance the performance of neural models.
To overcome the challenge of token-level retrieval in capturing contextual code semantics, we also propose integrating code semantics into individual summary tokens.
arXiv Detail & Related papers (2023-05-18T16:02:04Z)
- Decoder-Only or Encoder-Decoder? Interpreting Language Model as a Regularized Encoder-Decoder [75.03283861464365]
The seq2seq task aims at generating the target sequence based on the given input source sequence.
Traditionally, the seq2seq task is resolved by an encoder that encodes the source sequence and a decoder that generates the target text.
Recently, several new approaches have emerged that apply decoder-only language models directly to the seq2seq task.
arXiv Detail & Related papers (2023-04-08T15:44:29Z)
- StructCoder: Structure-Aware Transformer for Code Generation [13.797842927671846]
We introduce a structure-aware Transformer decoder that models both syntax and data flow to enhance the quality of generated code.
The proposed StructCoder model achieves state-of-the-art performance on code translation and text-to-code generation tasks.
arXiv Detail & Related papers (2022-06-10T17:26:31Z)
- GypSum: Learning Hybrid Representations for Code Summarization [21.701127410434914]
GypSum is a new deep learning model that learns hybrid representations using graph attention neural networks and a pre-trained programming and natural language model.
We modify the encoder-decoder sublayer in the Transformer's decoder to fuse the representations and propose a dual-copy mechanism to facilitate summary generation.
arXiv Detail & Related papers (2022-04-26T07:44:49Z)
- GraphCodeBERT: Pre-training Code Representations with Data Flow [97.00641522327699]
We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code.
We use data flow in the pre-training stage: a semantic-level structure of code that encodes the "where-the-value-comes-from" relation between variables (a toy illustration of such edges appears after this list).
We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement.
arXiv Detail & Related papers (2020-09-17T15:25:56Z)
- A Transformer-based Approach for Source Code Summarization [86.08359401867577]
We learn code representation for summarization by modeling the pairwise relationship between code tokens.
We show that, despite its simplicity, the approach outperforms state-of-the-art techniques by a significant margin.
arXiv Detail & Related papers (2020-05-01T23:29:36Z)
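The GraphCodeBERT entry above describes data-flow edges recording where a variable's value comes from. As a toy illustration of that notion only (not GraphCodeBERT's actual preprocessing; the function name and edge format are assumptions), the snippet below extracts def-use edges from a flat sequence of Python assignments using the standard ast module.

```python
# Toy "where-the-value-comes-from" edges: link each variable read on the
# right-hand side of an assignment to the line that most recently defined it.
# Handles only a flat sequence of simple assignments; illustrative only.
import ast

def dataflow_edges(source: str):
    """Return (variable, use_line, def_line) triples."""
    last_def = {}   # variable name -> line number of its latest definition
    edges = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign):
            # variables read on the right-hand side
            for used in ast.walk(node.value):
                if isinstance(used, ast.Name) and used.id in last_def:
                    edges.append((used.id, node.lineno, last_def[used.id]))
            # variables written on the left-hand side
            for target in node.targets:
                if isinstance(target, ast.Name):
                    last_def[target.id] = node.lineno
    return edges

print(dataflow_edges("x = 1\ny = x + 2\nz = x + y\n"))
# [('x', 2, 1), ('x', 3, 1), ('y', 3, 2)]
```

Each triple says that the value of the named variable used on one line comes from its definition on an earlier line; models of this kind incorporate such edges alongside the token sequence during pre-training.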
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.