Related papers: Tram: A Token-level Retrieval-augmented Mechanism for Source Code Summarization

Tram: A Token-level Retrieval-augmented Mechanism for Source Code Summarization

URL: http://arxiv.org/abs/2305.11074v3
Date: Sat, 30 Mar 2024 10:45:22 GMT
Title: Tram: A Token-level Retrieval-augmented Mechanism for Source Code Summarization
Authors: Tong Ye, Lingfei Wu, Tengfei Ma, Xuhong Zhang, Yangkai Du, Peiyu Liu, Shouling Ji, Wenhai Wang,
Abstract summary: We propose a fine-grained Token-level retrieval-augmented mechanism (Tram) on the decoder side to enhance the performance of neural models. To overcome the challenge of token-level retrieval in capturing contextual code semantics, we also propose integrating code semantics into individual summary tokens.
Score: 76.57699934689468
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Automatically generating human-readable text describing the functionality of a program is the intent of source code summarization. Although neural language models achieve significant performance in this field, they are limited by their inability to access external knowledge. To address this limitation, an emerging trend is combining neural models with external knowledge through retrieval methods. Previous methods have relied on the sentence-level retrieval paradigm on the encoder side. However, this paradigm is coarse-grained, noise-filled and cannot directly take advantage of the high-quality retrieved summary tokens on the decoder side. In this paper, we propose a fine-grained Token-level retrieval-augmented mechanism (Tram) on the decoder side rather than the encoder side to enhance the performance of neural models and produce more low-frequency tokens in generating summaries. Furthermore, to overcome the challenge of token-level retrieval in capturing contextual code semantics, we also propose integrating code semantics into individual summary tokens. The results of extensive experiments and human evaluation show that our token-level retrieval-augmented approach significantly improves performance and is more interpretable.

Related papers

Dense Retrievers Can Fail on Simple Queries: Revealing The Granularity Dilemma of Embeddings [78.05609552686053]
This work focuses on an observed limitation of text encoders: embeddings may not be able to recognize fine-grained entities or events within the semantics.<n>We introduce a new evaluation dataset in Chinese, named CapRetrieval, whose passages are image captions, and queries are phrases inquiring entities or events in various forms.<n>Zero-shot evaluation suggests that encoders may fail on these fine-grained matching, regardless of training sources or model sizes.
arXiv Detail & Related papers (2025-06-10T09:00:33Z)
$ε$-VAE: Denoising as Visual Decoding [61.29255979767292]
In generative modeling, tokenization simplifies complex data into compact, structured representations, creating a more efficient, learnable space. Current visual tokenization methods rely on a traditional autoencoder framework, where the encoder compresses data into latent representations, and the decoder reconstructs the original input. We propose denoising as decoding, shifting from single-step reconstruction to iterative refinement. Specifically, we replace the decoder with a diffusion process that iteratively refines noise to recover the original image, guided by the latents provided by the encoder. We evaluate our approach by assessing both reconstruction (rFID) and generation quality (
arXiv Detail & Related papers (2024-10-05T08:27:53Z)
Challenging Decoder helps in Masked Auto-Encoder Pre-training for Dense Passage Retrieval [10.905033385938982]
Masked auto-encoder (MAE) pre-training architecture has emerged as the most promising. We propose a novel token importance aware masking strategy based on pointwise mutual information to intensify the challenge of the decoder.
arXiv Detail & Related papers (2023-05-22T16:27:10Z)
GypSum: Learning Hybrid Representations for Code Summarization [21.701127410434914]
GypSum is a new deep learning model that learns hybrid representations using graph attention neural networks and a pre-trained programming and natural language model. We modify the encoder-decoder sublayer in the Transformer's decoder to fuse the representations and propose a dual-copy mechanism to facilitate summary generation.
arXiv Detail & Related papers (2022-04-26T07:44:49Z)
Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search. We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
Hierarchical Sketch Induction for Paraphrase Generation [79.87892048285819]
We introduce Hierarchical Refinement Quantized Variational Autoencoders (HRQ-VAE), a method for learning decompositions of dense encodings. We use HRQ-VAE to encode the syntactic form of an input sentence as a path through the hierarchy, allowing us to more easily predict syntactic sketches at test time.
arXiv Detail & Related papers (2022-03-07T15:28:36Z)
TEACHTEXT: CrossModal Generalized Distillation for Text-Video Retrieval [103.85002875155551]
We propose a novel generalized distillation method, TeachText, for exploiting large-scale language pretraining. We extend our method to video side modalities and show that we can effectively reduce the number of used modalities at test time. Our approach advances the state of the art on several video retrieval benchmarks by a significant margin and adds no computational overhead at test time.
arXiv Detail & Related papers (2021-04-16T17:55:28Z)
Project-Level Encoding for Neural Source Code Summarization of Subroutines [6.939768185086755]
We present a project-level encoder to improve models of code summarization. We use that representation to augment the encoder of state-of-the-art neural code summarization techniques.
arXiv Detail & Related papers (2021-03-22T06:01:07Z)
Cross-Thought for Sentence Encoder Pre-training [89.32270059777025]
Cross-Thought is a novel approach to pre-training sequence encoder. We train a Transformer-based sequence encoder over a large set of short sequences. Experiments on question answering and textual entailment tasks demonstrate that our pre-trained encoder can outperform state-of-the-art encoders.
arXiv Detail & Related papers (2020-10-07T21:02:41Z)
A Transformer-based Approach for Source Code Summarization [86.08359401867577]
We learn code representation for summarization by modeling the pairwise relationship between code tokens. We show that despite the approach is simple, it outperforms the state-of-the-art techniques by a significant margin.
arXiv Detail & Related papers (2020-05-01T23:29:36Z)
Learning Syntactic and Dynamic Selective Encoding for Document Summarization [17.666036645395845]
We propose a novel neural architecture for document summarization. We incorporate syntactic information such as constituency parsing trees into the encoding sequence. We propose a dynamic gate network to select the salient information based on the context of the decoder state.
arXiv Detail & Related papers (2020-03-25T01:29:38Z)

This list is automatically generated from the titles and abstracts of the papers in this site.