M2TS: Multi-Scale Multi-Modal Approach Based on Transformer for Source
Code Summarization
- URL: http://arxiv.org/abs/2203.09707v1
- Date: Fri, 18 Mar 2022 02:54:06 GMT
- Title: M2TS: Multi-Scale Multi-Modal Approach Based on Transformer for Source
Code Summarization
- Authors: Yuexiu Gao, Chen Lyu
- Abstract summary: Source code summarization aims to generate natural language descriptions of code snippets.
We propose M2TS, a Multi-scale Multi-modal approach based on Transformer for source code Summarization.
We conduct experiments on two Java datasets and one Python dataset, and the experimental results demonstrate that M2TS outperforms current state-of-the-art methods.
- Score: 0.4061135251278187
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Source code summarization aims to generate natural language descriptions of
code snippets. Many existing studies learn the syntactic and semantic knowledge
of code snippets from their token sequences and Abstract Syntax Trees (ASTs).
They use the learned code representations as input to code summarization
models, which can accordingly generate summaries describing source code.
Traditional models traverse ASTs as sequences or split ASTs into paths as
input. However, the former loses the structural properties of ASTs, and the
latter destroys the overall structure of ASTs. Therefore, comprehensively
capturing the structural features of ASTs in learning code representations for
source code summarization remains a challenging problem to be solved. In this
paper, we propose M2TS, a Multi-scale Multi-modal approach based on Transformer
for source code Summarization. M2TS uses a multi-scale AST feature extraction
method, which can extract the structures of ASTs more completely and accurately
at multiple local and global levels. To complement missing semantic information
in ASTs, we also obtain code token features, and further combine them with the
extracted AST features using a cross modality fusion method that not only fuses
the syntactic and contextual semantic information of source code, but also
highlights the key features of each modality. We conduct experiments on two
Java datasets and one Python dataset, and the results demonstrate that
M2TS outperforms current state-of-the-art methods. We release our code at
https://github.com/TranSMS/M2TS.
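
The abstract names two components, multi-scale AST feature extraction and cross-modality fusion, but gives no implementation details. The PyTorch sketch below is one plausible, minimal reading of the fusion step only; the module name, the cross-attention layout, and the sigmoid gate are assumptions for illustration, not the architecture from the paper (that is available in the repository linked above).

    import torch
    import torch.nn as nn

    class CrossModalityFusion(nn.Module):
        # Hypothetical sketch of M2TS-style cross-modality fusion: token
        # features and AST features attend to each other, then a learned
        # gate decides how much each modality contributes per position.
        # All names and dimensions here are illustrative assumptions.
        def __init__(self, d_model=512, n_heads=8):
            super().__init__()
            self.tok_to_ast = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ast_to_tok = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.gate = nn.Linear(2 * d_model, d_model)

        def forward(self, tok, ast):
            # tok: (batch, n_tokens, d_model) from a token encoder
            # ast: (batch, n_nodes, d_model) from an AST encoder
            tok_ctx, _ = self.tok_to_ast(tok, ast, ast)  # tokens enriched with AST structure
            ast_ctx, _ = self.ast_to_tok(ast, tok, tok)  # AST nodes enriched with token semantics
            # Pool the AST side and broadcast so both views align per token.
            ast_summary = ast_ctx.mean(dim=1, keepdim=True).expand_as(tok_ctx)
            g = torch.sigmoid(self.gate(torch.cat([tok_ctx, ast_summary], dim=-1)))
            return g * tok_ctx + (1 - g) * ast_summary  # fused input for the summary decoder

    fusion = CrossModalityFusion()
    fused = fusion(torch.randn(2, 30, 512), torch.randn(2, 50, 512))
    print(fused.shape)  # torch.Size([2, 30, 512])

The gating is one simple way to realize the abstract's claim of "highlighting the key features of each modality"; the paper's released code is the authoritative reference.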
Related papers
- Cool-Fusion: Fuse Large Language Models without Training [73.17551121242602]
Cool-Fusion is a method that, unlike ensemble approaches, requires no training of any kind.
Cool-Fusion increases accuracy over three strong source LLMs by a significant 8%-17.8%.
arXiv Detail & Related papers (2024-07-29T09:02:19Z)
- Nearest Neighbor Speculative Decoding for LLM Generation and Attribution [87.3259169631789]
Nearest Neighbor Speculative Decoding (NEST) is capable of incorporating real-world text spans of arbitrary length into LM generations and providing attribution to their sources.
NEST significantly enhances the generation quality and attribution rate of the base LM across a variety of knowledge-intensive tasks.
In addition, NEST substantially improves the generation speed, achieving a 1.8x speedup in inference time when applied to Llama-2-Chat 70B.
arXiv Detail & Related papers (2024-05-29T17:55:03Z)
- SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code Summarization [51.67317895094664]
This paper studies file-level code summarization, which can assist programmers in understanding and maintaining large source code projects.
We propose SparseCoder, an identifier-aware sparse transformer for effectively handling long code sequences.
arXiv Detail & Related papers (2024-01-26T09:23:27Z)
- Abstract Syntax Tree for Programming Language Understanding and Representation: How Far Are We? [23.52632194060246]
Programming language understanding and representation (a.k.a code representation learning) has always been a hot and challenging task in software engineering.
The abstract syntax tree (AST), a fundamental code feature, illustrates the syntactic information of the source code and has been widely used in code representation learning.
We compare the performance of models trained with code token sequence (Token for short) based representations against models trained with AST-based representations on three popular types of code-related tasks.
arXiv Detail & Related papers (2023-12-01T08:37:27Z)
- AST-MHSA: Code Summarization using Multi-Head Self-Attention [1.588193964339148]
We present a model, AST-MHSA, that uses multi-head attention to extract semantic information from the abstract syntax tree (AST) of the code.
The model is trained on a dataset of code and summaries, and the parameters are optimized to minimize the loss between the generated summaries and the ground-truth summaries.
arXiv Detail & Related papers (2023-08-10T15:43:46Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- UniXcoder: Unified Cross-Modal Pre-training for Code Representation [65.6846553962117]
We present UniXcoder, a unified cross-modal pre-trained model for programming languages.
We propose a one-to-one mapping method that transforms an AST into a sequence structure while retaining all structural information from the tree (see the sketch after this list).
We evaluate UniXcoder on five code-related tasks over nine datasets.
arXiv Detail & Related papers (2022-03-08T04:48:07Z)
- AST-Transformer: Encoding Abstract Syntax Trees Efficiently for Code Summarization [14.225206904493627]
We propose AST-Transformer to efficiently encode tree-structured ASTs.
Experiments show that AST-Transformer outperforms the state of the art by a substantial margin.
arXiv Detail & Related papers (2021-12-02T12:57:22Z)
- Improving Code Summarization with Block-wise Abstract Syntax Tree Splitting [15.28941592388958]
Abstract Syntax Tree (AST), which depicts the source code's syntactic structure, has been incorporated to guide the generation of code summaries.
Existing AST-based methods suffer from training difficulties and generate inadequate code summaries.
We present the Block-wise Abstract Syntax Tree Splitting method (BASTS), which fully utilizes the rich tree-form syntax structure in ASTs.
arXiv Detail & Related papers (2021-03-14T05:04:06Z)
- Improved Code Summarization via a Graph Neural Network [96.03715569092523]
In general, source code summarization techniques take source code as input and output a natural language description.
We present an approach that uses a graph-based neural architecture that better matches the default structure of the AST to generate these summaries.
arXiv Detail & Related papers (2020-04-06T17:36:42Z)
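
On the UniXcoder entry's one-to-one AST-to-sequence mapping: a bracketed pre-order serialization is a minimal example of such a mapping, since the brackets make it invertible and so no structural information is lost. The Python sketch below, using the standard ast module, is a generic illustration under that assumption, not UniXcoder's actual transformation.

    import ast

    def flatten(node):
        # Serialize an AST to a token sequence with explicit brackets.
        # The brackets make the mapping one-to-one: the tree can be
        # rebuilt from the sequence, so no structure is lost. Generic
        # illustration only, not UniXcoder's exact mapping.
        parts = [type(node).__name__, "("]
        for child in ast.iter_child_nodes(node):
            parts.extend(flatten(child))
        parts.append(")")
        return parts

    tree = ast.parse("def add(a, b):\n    return a + b")
    print(" ".join(flatten(tree)))
    # Module ( FunctionDef ( arguments ( arg ( ) arg ( ) ) Return (
    #   BinOp ( Name ( Load ( ) ) Add ( ) Name ( Load ( ) ) ) ) ) )

Compare this with a plain pre-order traversal without brackets, which maps many different trees to the same sequence; that ambiguity is exactly the structural loss the M2TS abstract attributes to traversal-based methods.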