SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code
Summarization
- URL: http://arxiv.org/abs/2401.14727v1
- Date: Fri, 26 Jan 2024 09:23:27 GMT
- Title: SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code
Summarization
- Authors: Yanlin Wang, Yanxian Huang, Daya Guo, Hongyu Zhang and Zibin Zheng
- Abstract summary: This paper studies file-level code summarization, which can assist programmers in understanding and maintaining large source code projects.
We propose SparseCoder, an identifier-aware sparse transformer for effectively handling long code sequences.
- Score: 51.67317895094664
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Code summarization aims to generate natural language descriptions of source
code, helping programmers understand and maintain it quickly. While previous
code summarization efforts have predominantly focused on the method level, this
paper studies file-level code summarization, which can assist programmers in
understanding and maintaining large source code projects. Unlike method-level
code summarization, file-level code summarization typically involves long
source code within a single file. This is challenging for Transformer-based
models: because computational complexity scales quadratically with input
sequence length, their maximum input length cannot easily be set large enough
to handle long code input well. To address this challenge, we propose
SparseCoder, an identifier-aware sparse transformer for effectively handling
long code sequences. Specifically, SparseCoder employs a sliding window
mechanism for self-attention to model short-term dependencies, and leverages
the structural information of code to capture long-term dependencies among
source code identifiers by introducing two types of sparse attention patterns:
global attention and identifier attention. To evaluate the performance of
SparseCoder, we construct FILE-CS, a new dataset for file-level code
summarization in Python. Experimental results show that SparseCoder achieves
state-of-the-art performance compared with other pre-trained models, including
full self-attention and sparse-attention models. Additionally, our model has
low memory overhead and achieves performance comparable to models using the
full self-attention mechanism.
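As a rough illustration of the attention patterns described in the abstract, the sketch below builds a boolean attention mask that combines sliding-window attention with global and identifier attention. The function name, tensor layout, and index arguments are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def sparse_attention_mask(seq_len, window, global_idx, identifier_idx):
    """Build a boolean mask where mask[q, k] = True lets query q attend to key k.

    Illustrative sketch only: combines the sliding-window, global, and identifier
    attention patterns described in the SparseCoder abstract; names and layout
    are assumptions, not the authors' code.
    """
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)

    # Sliding-window attention: each token attends to its local neighborhood.
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True

    # Global attention: selected tokens attend to, and are attended by, all tokens.
    if global_idx:
        g = torch.tensor(global_idx)
        mask[g, :] = True
        mask[:, g] = True

    # Identifier attention: identifier tokens attend to each other across the file.
    if identifier_idx:
        ids = torch.tensor(identifier_idx)
        mask[ids.unsqueeze(1), ids.unsqueeze(0)] = True

    return mask

# Hypothetical usage: a 4096-token file with one global token and a few identifier positions.
mask = sparse_attention_mask(seq_len=4096, window=128,
                             global_idx=[0], identifier_idx=[12, 57, 301])
# The mask would be applied to raw attention scores before softmax, e.g.:
# scores = scores.masked_fill(~mask, float("-inf"))
```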
Related papers
- Bridging Code Semantic and LLMs: Semantic Chain-of-Thought Prompting for
Code Generation [22.219645213202178]
This paper proposes the "Semantic Chain-of-Thought" approach, named SeCoT, which introduces semantic information of code into prompting.
We show that SeCoT achieves state-of-the-art performance, greatly improving the potential of large models for code generation.
arXiv Detail & Related papers (2023-10-16T05:09:58Z) - LongCoder: A Long-Range Pre-trained Language Model for Code Completion [56.813974784131624]
LongCoder employs a sliding window mechanism for self-attention and introduces two types of globally accessible tokens.
Bridge tokens are inserted throughout the input sequence to aggregate local information and facilitate global interaction.
Memory tokens are included to highlight important statements that may be invoked later and need to be memorized.
arXiv Detail & Related papers (2023-06-26T17:59:24Z) - ReACC: A Retrieval-Augmented Code Completion Framework [53.49707123661763]
We propose a retrieval-augmented code completion framework, leveraging both lexical copying and referring to code with similar semantics by retrieval.
We evaluate our approach in the code completion task in Python and Java programming languages, achieving a state-of-the-art performance on CodeXGLUE benchmark.
arXiv Detail & Related papers (2022-03-15T08:25:08Z) - UniXcoder: Unified Cross-Modal Pre-training for Code Representation [65.6846553962117]
We present UniXcoder, a unified cross-modal pre-trained model for programming language.
We propose a one-to-one mapping method to transform AST in a sequence structure that retains all structural information from the tree.
We evaluate UniXcoder on five code-related tasks over nine datasets.
arXiv Detail & Related papers (2022-03-08T04:48:07Z) - Project-Level Encoding for Neural Source Code Summarization of
Subroutines [6.939768185086755]
We present a project-level encoder to improve models of code summarization.
We use that representation to augment the encoder of state-of-the-art neural code summarization techniques.
arXiv Detail & Related papers (2021-03-22T06:01:07Z) - GraphCodeBERT: Pre-training Code Representations with Data Flow [97.00641522327699]
We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code.
We use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables.
We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement.
arXiv Detail & Related papers (2020-09-17T15:25:56Z) - A Transformer-based Approach for Source Code Summarization [86.08359401867577]
We learn code representation for summarization by modeling the pairwise relationship between code tokens.
We show that, despite its simplicity, the approach outperforms state-of-the-art techniques by a significant margin.
arXiv Detail & Related papers (2020-05-01T23:29:36Z)