Deep Graph Matching and Searching for Semantic Code Retrieval
- URL: http://arxiv.org/abs/2010.12908v2
- Date: Fri, 22 Jan 2021 16:38:09 GMT
- Title: Deep Graph Matching and Searching for Semantic Code Retrieval
- Authors: Xiang Ling, Lingfei Wu, Saizhuo Wang, Gaoning Pan, Tengfei Ma, Fangli
Xu, Alex X. Liu, Chunming Wu, Shouling Ji
- Abstract summary: We propose an end-to-end deep graph matching and searching model based on graph neural networks.
We first represent both natural language query texts and programming language code snippets with the unified graph-structured data.
In particular, DGMS not only captures more structural information for individual query texts or code snippets but also learns the fine-grained similarity between them.
- Score: 76.51445515611469
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Code retrieval aims to find, from a large corpus of source code
repositories, the code snippet that best matches a natural language query.
Recent work mainly applies natural language processing techniques to both
query texts (i.e., human natural language) and code snippets (i.e., machine
programming language), but neglects the deep structured features of query
texts and source code, both of which contain rich semantic information. In
this paper, we propose an end-to-end deep graph matching and searching (DGMS)
model based on graph neural networks for the task of semantic code retrieval.
To this end, we first represent both natural language query texts and
programming language code snippets with the unified graph-structured data, and
then use the proposed graph matching and searching model to retrieve the best
matching code snippet. In particular, DGMS not only captures more structural
information for individual query texts or code snippets but also learns the
fine-grained similarity between them by cross-attention based semantic matching
operations. We evaluate the proposed DGMS model on two public code retrieval
datasets with two representative programming languages (i.e., Java and Python).
Experiment results demonstrate that DGMS significantly outperforms
state-of-the-art baseline models by a large margin on both datasets. Moreover,
our extensive ablation studies systematically investigate and illustrate the
impact of each part of DGMS.
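The cross-attention based semantic matching that the abstract describes can be illustrated with a minimal sketch. The following is a hypothetical NumPy implementation, not the authors' code: it assumes each graph (query or code snippet) has already been encoded by a GNN into a matrix of node embeddings, lets every node attend over the nodes of the other graph, pools both views into graph-level vectors, and scores the pair with cosine similarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_match(h_q, h_c):
    """Score a (query graph, code graph) pair.

    h_q: (n_q, d) node embeddings of the query-text graph
    h_c: (n_c, d) node embeddings of the code-snippet graph
    Both are assumed to come from a GNN encoder (not shown here).
    """
    # attention scores between every query node and every code node
    scores = h_q @ h_c.T                 # (n_q, n_c)
    a_q = softmax(scores, axis=1)        # query nodes attend to code nodes
    a_c = softmax(scores.T, axis=1)      # code nodes attend to query nodes
    # each node's attended counterpart from the other graph
    h_q_att = a_q @ h_c                  # (n_q, d)
    h_c_att = a_c @ h_q                  # (n_c, d)
    # mean-pool original and attended views into graph-level vectors
    g_q = np.concatenate([h_q.mean(axis=0), h_q_att.mean(axis=0)])
    g_c = np.concatenate([h_c.mean(axis=0), h_c_att.mean(axis=0)])
    # cosine similarity as the final matching score
    return float(g_q @ g_c / (np.linalg.norm(g_q) * np.linalg.norm(g_c)))
```

At retrieval time, a query would be scored against every candidate snippet this way and the highest-scoring snippet returned; the actual DGMS model uses learned attention parameters and a trained readout rather than this fixed mean-pooling.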
Related papers
- CodeXEmbed: A Generalist Embedding Model Family for Multilingual and Multi-task Code Retrieval [103.116634967815]
We introduce CodeXEmbed, a family of large-scale code embedding models ranging from 400M to 7B parameters.
Our novel training pipeline unifies multiple programming languages and transforms various code-related tasks into a common retrieval framework.
Our 7B model sets a new state-of-the-art (SOTA) in code retrieval, outperforming the previous leading model, Voyage-Code, by over 20% on the CoIR benchmark.
arXiv Detail & Related papers (2024-11-19T16:54:45Z) - CoIR: A Comprehensive Benchmark for Code Information Retrieval Models [56.691926887209895]
We present CoIR (Code Information Retrieval Benchmark), a robust and comprehensive benchmark specifically designed to assess code retrieval capabilities.
CoIR comprises ten meticulously curated code datasets, spanning eight distinctive retrieval tasks across seven diverse domains.
We evaluate nine widely used retrieval models on CoIR, uncovering significant difficulties in performing code retrieval tasks even with state-of-the-art systems.
arXiv Detail & Related papers (2024-07-03T07:58:20Z) - MURMUR: Modular Multi-Step Reasoning for Semi-Structured Data-to-Text
Generation [102.20036684996248]
We propose MURMUR, a neuro-symbolic modular approach to text generation from semi-structured data with multi-step reasoning.
We conduct experiments on two data-to-text generation tasks, WebNLG and LogicNLG.
arXiv Detail & Related papers (2022-12-16T17:36:23Z) - Enhancing Semantic Code Search with Multimodal Contrastive Learning and
Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z) - GraphSearchNet: Enhancing GNNs via Capturing Global Dependency for
Semantic Code Search [15.687959123626003]
We design a novel neural network framework, named GraphSearchNet, to enable an effective and accurate source code search.
Specifically, we propose to encode both source code and queries into two graphs with BiGGNN to capture the local structure information of the graphs.
The experiments on both Java and Python datasets illustrate that GraphSearchNet outperforms current state-of-the-art works by a significant margin.
arXiv Detail & Related papers (2021-11-04T07:38:35Z) - Multimodal Representation for Neural Code Search [18.371048875103497]
We introduce tree-serialization methods on a simplified form of AST and build the multimodal representation for the code data.
Our results show that both our tree-serialized representations and multimodal learning model improve the performance of neural code search.
arXiv Detail & Related papers (2021-07-02T12:08:19Z) - deGraphCS: Embedding Variable-based Flow Graph for Neural Code Search [15.19181807445119]
We propose a learnable deep Graph for Code Search (called deGraphCS) to transfer source code into variable-based flow graphs.
We collect a large-scale dataset from GitHub containing 41,152 code snippets written in C language.
arXiv Detail & Related papers (2021-03-24T06:57:44Z) - A Multi-Perspective Architecture for Semantic Code Search [58.73778219645548]
We propose a novel multi-perspective cross-lingual neural framework for code-text matching.
Our experiments on the CoNaLa dataset show that our proposed model yields better performance than previous approaches.
arXiv Detail & Related papers (2020-05-06T04:46:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.