CSSAM: Code Search via Attention Matching of Code Semantics and
Structures
- URL: http://arxiv.org/abs/2208.03922v1
- Date: Mon, 8 Aug 2022 05:45:40 GMT
- Title: CSSAM: Code Search via Attention Matching of Code Semantics and
Structures
- Authors: Yi Hu, Bo Cai, Yaoxiang Yu
- Abstract summary: This paper introduces a code search model named CSSAM (Code Semantics and Structures Attention Matching).
By introducing semantic and structural matching mechanisms, CSSAM effectively extracts and fuses multidimensional code features.
By leveraging the residual interaction, a matching module is designed to preserve more code semantics and descriptive features.
- Score: 8.547332796736107
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the continuous efforts in improving both the effectiveness and
efficiency of code search, two issues remain unsolved. First, programming
languages have inherently strong structural linkages, and mining features from code
as plain text omits the structural information it contains. Second,
although there is a potential semantic relationship between code and query, it is
challenging to align code and text across sequences so that vectors are
spatially consistent during similarity matching. To tackle both issues, in this
paper, a code search model named CSSAM (Code Semantics and Structures Attention
Matching) is proposed. By introducing semantic and structural matching
mechanisms, CSSAM effectively extracts and fuses multidimensional code
features. Specifically, a cross and residual layer is developed to
facilitate high-dimensional spatial alignment of code and query at the token
level. By leveraging the residual interaction, a matching module is designed to
preserve more code semantics and descriptive features, which enhances the
adhesion between the code and its corresponding query text. Besides, to improve
the model's comprehension of the code's inherent structure, a code
representation structure named CSRG (Code Semantic Representation Graph) is
proposed for jointly representing abstract syntax tree nodes and the data flow
of the code. According to the experimental results on two publicly available
datasets containing 540k and 330k code segments, CSSAM significantly
outperforms the baselines, achieving the highest SR@1/5/10, MRR, and
NDCG@50 on both datasets. Moreover, an ablation study is
conducted to quantitatively measure the impact of each key component of CSSAM
on the efficiency and effectiveness of code search, which offers insights
into the improvement of advanced code search solutions.
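As an illustration of two of the metrics named in the abstract, the following minimal sketch computes SR@k (the fraction of queries whose correct snippet appears in the top k results) and MRR (the mean of the reciprocal rank of the correct snippet). It uses the standard definitions and assumes exactly one correct snippet per query; the function names and the sample ranks are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def success_rate_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of queries whose correct snippet is ranked within the top k.

    `ranks` holds the 1-based rank of the correct snippet for each query
    (assumption: exactly one correct snippet per query).
    """
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Mean of 1/rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Hypothetical ranks of the correct snippet for four queries.
sample_ranks = [1, 3, 12, 2]

sr_at_1 = success_rate_at_k(sample_ranks, 1)
sr_at_5 = success_rate_at_k(sample_ranks, 5)
sr_at_10 = success_rate_at_k(sample_ranks, 10)
mrr = mean_reciprocal_rank(sample_ranks)

print(f"SR@1={sr_at_1:.2f} SR@5={sr_at_5:.2f} SR@10={sr_at_10:.2f} MRR={mrr:.4f}")
```

Both metrics depend only on the rank of the correct snippet, so they can be computed from any model's ranked output without access to the similarity scores themselves.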
Related papers
- Line-level Semantic Structure Learning for Code Vulnerability Detection [44.29771620061153]
We introduce the Code Structure-Aware Network through Line-level Semantic Learning.
It comprises four components: code preprocessing, global semantic awareness, line semantic awareness, and line semantic structure awareness.
The CSLS model outperforms the state-of-the-art baselines in code vulnerability detection, achieving 70.57% accuracy on the Devign dataset and a 49.59% F1 score on the Reveal dataset.
arXiv Detail & Related papers (2024-07-26T17:15:58Z)
- When simplicity meets effectiveness: Detecting code comments coherence with word embeddings and LSTM [6.417777780911223]
Code comments play a crucial role in software development, as they provide programmers with practical information.
Developers tend to leave comments unchanged after updating the code, resulting in a discrepancy between the two artifacts.
It is crucial to identify if, given a code snippet, its corresponding comment is coherent and reflects well the intent behind the code.
arXiv Detail & Related papers (2024-05-25T15:21:27Z)
- ConTextual Mask Auto-Encoder for Dense Passage Retrieval [49.49460769701308]
CoT-MAE is a simple yet effective generative pre-training method for dense passage retrieval.
It learns to compress the sentence semantics into a dense vector through self-supervised and context-supervised masked auto-encoding.
We conduct experiments on large-scale passage retrieval benchmarks and show considerable improvements over strong baselines.
arXiv Detail & Related papers (2022-08-16T11:17:22Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- CodeRetriever: Unimodal and Bimodal Contrastive Learning [128.06072658302165]
We propose the CodeRetriever model, which combines the unimodal and bimodal contrastive learning to train function-level code semantic representations.
For unimodal contrastive learning, we design a semantic-guided method to build positive code pairs based on the documentation and function name.
For bimodal contrastive learning, we leverage the documentation and in-line comments of code to build text-code pairs.
arXiv Detail & Related papers (2022-01-26T10:54:30Z)
- Deep Graph Matching and Searching for Semantic Code Retrieval [76.51445515611469]
We propose an end-to-end deep graph matching and searching model based on graph neural networks.
We first represent both natural language query texts and programming language code snippets with the unified graph-structured data.
In particular, DGMS not only captures more structural information for individual query texts or code snippets but also learns the fine-grained similarity between them.
arXiv Detail & Related papers (2020-10-24T14:16:50Z)
- COSEA: Convolutional Code Search with Layer-wise Attention [90.35777733464354]
We propose a new deep learning architecture, COSEA, which leverages convolutional neural networks with layer-wise attention to capture the code's intrinsic structural logic.
COSEA can achieve significant improvements over state-of-the-art methods on code search tasks.
arXiv Detail & Related papers (2020-10-19T13:53:38Z)
- GraphCodeBERT: Pre-training Code Representations with Data Flow [97.00641522327699]
We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code.
We use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables.
We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement.
arXiv Detail & Related papers (2020-09-17T15:25:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.