Is a Single Model Enough? MuCoS: A Multi-Model Ensemble Learning for
Semantic Code Search
- URL: http://arxiv.org/abs/2107.04773v2
- Date: Tue, 13 Jul 2021 02:42:51 GMT
- Title: Is a Single Model Enough? MuCoS: A Multi-Model Ensemble Learning for
Semantic Code Search
- Authors: Lun Du, Xiaozhou Shi, Yanlin Wang, Ensheng Shi, Shi Han and Dongmei
Zhang
- Abstract summary: We propose MuCoS, a multi-model ensemble learning architecture for semantic code search.
We train the individual learners on different datasets which contain different perspectives of code information.
Then we ensemble the learners to capture comprehensive features of code snippets.
- Score: 22.9351865820122
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, deep learning methods have become mainstream in code search since
they do better at capturing semantic correlations between code snippets and
search queries and have promising performance. However, code snippets have
diverse information from different dimensions, such as business logic, specific
algorithms, and hardware communication, so it is hard for a single code
representation module to cover all the perspectives. On the other hand, as a
specific query may focus on one or several perspectives, it is difficult for a
single query representation module to represent different user intents. In this
paper, we propose MuCoS, a multi-model ensemble learning architecture for
semantic code search. It combines several individual learners, each of which
emphasizes a specific perspective of code snippets. We train the individual
learners on different datasets which contain different perspectives of code
information, and we use a data augmentation strategy to get these different
datasets. Then we ensemble the learners to capture comprehensive features of
code snippets.
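The ensemble idea in the abstract can be sketched in a few lines. This is only an illustrative toy, not the MuCoS implementation: the abstract does not say how the learners are combined, so a plain average of per-learner similarity scores is assumed here. The names `ToyLearner` and `ensemble_rank`, and the keyword vocabularies standing in for "perspectives", are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text, vocabulary):
    """Count the lowercase alphabetic tokens that fall in the given vocabulary."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t in vocabulary)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyLearner:
    """Hypothetical stand-in for one individual learner.

    A real learner would be a trained neural encoder pair; here a keyword
    vocabulary plays the role of the perspective the learner emphasizes.
    """

    def __init__(self, vocabulary):
        self.vocabulary = set(vocabulary)

    def score(self, query, code):
        return cosine(bag_of_words(query, self.vocabulary),
                      bag_of_words(code, self.vocabulary))


def ensemble_rank(learners, query, snippets):
    """Rank snippets by the mean score over all learners (assumed combination rule)."""
    scored = []
    for snippet in snippets:
        mean = sum(l.score(query, snippet) for l in learners) / len(learners)
        scored.append((mean, snippet))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [snippet for _, snippet in scored]


# Two toy perspectives: algorithmic vocabulary and I/O vocabulary.
learners = [
    ToyLearner({"sort", "list", "binary", "search"}),
    ToyLearner({"read", "file", "socket", "send"}),
]
snippets = [
    "def sort_list(items): return sorted(items)",
    "def read_file(path): return open(path).read()",
]
ranked = ensemble_rank(learners, "sort a list", snippets)
print(ranked[0])
```

Averaging keeps every learner's vote equal. A learned combination layer over the learners' outputs would be the natural next step, but that goes beyond what the abstract specifies.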
Related papers
- Survey of Code Search Based on Deep Learning [11.94599964179766]
This survey focuses on code search, that is, to retrieve code that matches a given query.
Deep learning, being able to extract complex semantics information, has achieved great success in this field.
We propose a new taxonomy to illustrate the state-of-the-art deep learning-based code search.
arXiv Detail & Related papers (2023-05-10T08:07:04Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- Probing Pretrained Models of Source Code [14.904366372190943]
General pretrained models have been shown to outperform task-specific models in many applications.
We show that pretrained models of code indeed contain information about code syntactic structure and correctness, the notions of identifiers, data flow and correctness, and natural language naming.
arXiv Detail & Related papers (2022-02-16T10:26:14Z)
- Learning Deep Semantic Model for Code Search using CodeSearchNet Corpus [17.6095840480926]
We propose a novel deep semantic model which makes use of the utilities of multi-modal sources.
We apply the proposed model to tackle the CodeSearchNet challenge about semantic code search.
Our model is trained on the CodeSearchNet corpus and evaluated on held-out data; the final model achieves 0.384 NDCG and won first place in this benchmark.
arXiv Detail & Related papers (2022-01-27T04:15:59Z)
- CodeRetriever: Unimodal and Bimodal Contrastive Learning [128.06072658302165]
We propose the CodeRetriever model, which combines the unimodal and bimodal contrastive learning to train function-level code semantic representations.
For unimodal contrastive learning, we design a semantic-guided method to build positive code pairs based on the documentation and function name.
For bimodal contrastive learning, we leverage the documentation and in-line comments of code to build text-code pairs.
arXiv Detail & Related papers (2022-01-26T10:54:30Z)
- Contrastive Learning for Source Code with Structural and Functional Properties [66.10710134948478]
We present BOOST, a novel self-supervised model to focus pre-training based on the characteristics of source code.
We employ automated, structure-guided code transformation algorithms that generate functionally equivalent code that looks drastically different from the original one.
We train our model in a way that brings the functionally equivalent code closer and distinct code further through a contrastive learning objective.
arXiv Detail & Related papers (2021-10-08T02:56:43Z)
- Multimodal Representation for Neural Code Search [18.371048875103497]
We introduce tree-serialization methods on a simplified form of AST and build the multimodal representation for the code data.
Our results show that both our tree-serialized representations and multimodal learning model improve the performance of neural code search.
arXiv Detail & Related papers (2021-07-02T12:08:19Z)
- Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos [69.61522804742427]
This paper proposes a self-supervised training framework that learns a common multimodal embedding space.
We extend the concept of instance-level contrastive learning with a multimodal clustering step to capture semantic similarities across modalities.
The resulting embedding space enables retrieval of samples across all modalities, even from unseen datasets and different domains.
arXiv Detail & Related papers (2021-04-26T15:55:01Z)
- Deep Graph Matching and Searching for Semantic Code Retrieval [76.51445515611469]
We propose an end-to-end deep graph matching and searching model based on graph neural networks.
We first represent both natural language query texts and programming language code snippets with the unified graph-structured data.
In particular, DGMS not only captures more structural information for individual query texts or code snippets but also learns the fine-grained similarity between them.
arXiv Detail & Related papers (2020-10-24T14:16:50Z)
- GraphCodeBERT: Pre-training Code Representations with Data Flow [97.00641522327699]
We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code.
We use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables.
We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement.
arXiv Detail & Related papers (2020-09-17T15:25:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.