Retriever and Ranker Framework with Probabilistic Hard Negative Sampling
for Code Search
- URL: http://arxiv.org/abs/2305.04508v1
- Date: Mon, 8 May 2023 07:04:28 GMT
- Title: Retriever and Ranker Framework with Probabilistic Hard Negative Sampling
for Code Search
- Authors: Hande Dong, Jiayi Lin, Yichong Leng, Jiawei Chen, Yutao Xie
- Abstract summary: We introduce a cross-encoder architecture for code search that jointly encodes the semantic matching of query and code.
We also introduce a Retriever-Ranker framework that cascades the dual-encoder and cross-encoder to promote the efficiency of evaluation and online serving.
- Score: 11.39443308694887
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pretrained Language Models (PLMs) have emerged as the state-of-the-art
paradigm for code search tasks. The paradigm involves pretraining the model on
search-irrelevant tasks such as masked language modeling, followed by the
finetuning stage, which focuses on the search-relevant task. The typical
finetuning method is to employ a dual-encoder architecture to encode semantic
embeddings of query and code separately, and then calculate their similarity
based on the embeddings.
However, the typical dual-encoder architecture falls short in modeling
token-level interactions between query and code, which limits the model's
capabilities. In this paper, we propose a novel approach to address this
limitation, introducing a cross-encoder architecture for code search that
jointly encodes the semantic matching of query and code. We further introduce a
Retriever-Ranker (RR) framework that cascades the dual-encoder and
cross-encoder to promote the efficiency of evaluation and online serving.
Moreover, we present a probabilistic hard negative sampling method to improve
the cross-encoder's ability to distinguish hard negative codes, which further
enhances the cascade RR framework. Experiments on four datasets using three
code PLMs demonstrate the superiority of our proposed method.
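To make the dual-encoder finetuning described above concrete, here is a minimal scoring sketch (using Hugging Face Transformers; the microsoft/codebert-base checkpoint, [CLS] pooling, and cosine similarity are illustrative assumptions, not the paper's exact configuration). Query and code are encoded independently, so each candidate's score is a single dot product.

```python
# Minimal dual-encoder scoring sketch; checkpoint and pooling choice are assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
encoder = AutoModel.from_pretrained("microsoft/codebert-base")

def embed(texts):
    """Encode a batch of strings into L2-normalized [CLS] embeddings."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state   # (B, T, H)
    return F.normalize(hidden[:, 0], dim=-1)          # [CLS] pooling

query_emb = embed(["read a json file into a dict"])
code_emb = embed([
    "def load(path):\n    import json\n    return json.load(open(path))",
    "def add(a, b):\n    return a + b",
])
scores = query_emb @ code_emb.T   # one dot product per (query, code) pair
print(scores)
```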
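The cross-encoder and the Retriever-Ranker cascade can be sketched in the same spirit: the query and code are concatenated so every token attends across both inputs, and the expensive joint scoring is applied only to the candidates recalled by the cheap dual encoder. The scalar classification head, the default k of 100, and the helper names (cross_score, retrieve_then_rank, embed_fn) are assumptions for illustration rather than the paper's implementation.

```python
# Cross-encoder scorer plus a retriever-ranker cascade; head and k are assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

ce_tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
cross_encoder = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=1)  # scoring head, trained during finetuning

def cross_score(query, codes):
    """Jointly encode (query, code) pairs so tokens attend across both inputs."""
    batch = ce_tokenizer([query] * len(codes), codes,
                         padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        return cross_encoder(**batch).logits.squeeze(-1)  # one score per pair

def retrieve_then_rank(query, corpus, corpus_emb, embed_fn, k=100):
    """Retriever-Ranker cascade: cheap dual-encoder recall, cross-encoder rerank."""
    q = embed_fn([query])                                  # (1, H) query embedding
    top = (q @ corpus_emb.T).squeeze(0).topk(min(k, len(corpus))).indices
    candidates = [corpus[i] for i in top]
    order = cross_score(query, candidates).argsort(descending=True)
    return [candidates[i] for i in order]
```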
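Probabilistic hard negative sampling can likewise be sketched as drawing negatives with probability that grows with their retriever score, instead of always taking the top-scoring non-matches. The softmax form, the temperature of 0.05, and the function name are assumptions, not the paper's reported design; the sampled negatives would then be paired with the query as non-matching examples when finetuning the cross-encoder.

```python
# Hedged sketch of probabilistic hard negative sampling over retriever scores.
import torch

def sample_hard_negatives(query_emb, corpus_emb, positive_idx, n_neg=8, temperature=0.05):
    """Sample indices of hard negatives for one query."""
    sims = (query_emb @ corpus_emb.T).squeeze(0)       # (N,) dual-encoder scores
    sims[positive_idx] = float("-inf")                 # never sample the true positive
    probs = torch.softmax(sims / temperature, dim=-1)  # harder codes get more mass
    return torch.multinomial(probs, n_neg, replacement=False)
```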
Related papers
- Efficient Nearest Neighbor Search for Cross-Encoder Models using Matrix Factorization [60.91600465922932]
We present an approach that avoids the use of a dual-encoder for retrieval, relying solely on the cross-encoder.
Our approach provides test-time recall-vs-computational cost trade-offs superior to the current widely-used methods.
arXiv Detail & Related papers (2022-10-23T00:32:04Z)
- Revisiting Code Search in a Two-Stage Paradigm [67.02322603435628]
TOSS is a two-stage fusion code search framework.
It first uses IR-based and bi-encoder models to efficiently recall a small number of top-k code candidates.
It then uses fine-grained cross-encoders for finer ranking.
arXiv Detail & Related papers (2022-08-24T02:34:27Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- UniXcoder: Unified Cross-Modal Pre-training for Code Representation [65.6846553962117]
We present UniXcoder, a unified cross-modal pre-trained model for programming language.
We propose a one-to-one mapping method that transforms the AST into a sequence structure retaining all structural information from the tree (a toy flattening sketch appears after this list).
We evaluate UniXcoder on five code-related tasks over nine datasets.
arXiv Detail & Related papers (2022-03-08T04:48:07Z)
- On the Importance of Building High-quality Training Datasets for Neural Code Search [15.557818317497397]
We propose a data cleaning framework consisting of two successive filters: a rule-based syntactic filter and a model-based semantic filter (a toy version of both filters is sketched after this list).
We evaluate the effectiveness of our framework on two widely-used code search models and three manually-annotated code retrieval benchmarks.
arXiv Detail & Related papers (2022-02-14T12:02:41Z)
- Learning Deep Semantic Model for Code Search using CodeSearchNet Corpus [17.6095840480926]
We propose a novel deep semantic model that makes use of multi-modal sources.
We apply the proposed model to tackle the CodeSearchNet challenge about semantic code search.
Our model is trained on the CodeSearchNet corpus and evaluated on held-out data; it achieves 0.384 NDCG and won first place in this benchmark.
arXiv Detail & Related papers (2022-01-27T04:15:59Z)
- CodeRetriever: Unimodal and Bimodal Contrastive Learning [128.06072658302165]
We propose the CodeRetriever model, which combines the unimodal and bimodal contrastive learning to train function-level code semantic representations.
For unimodal contrastive learning, we design a semantic-guided method to build positive code pairs based on the documentation and function name.
For bimodal contrastive learning, we leverage the documentation and in-line comments of code to build text-code pairs.
arXiv Detail & Related papers (2022-01-26T10:54:30Z)
- GraphCodeBERT: Pre-training Code Representations with Data Flow [97.00641522327699]
We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code.
We use data flow in the pre-training stage, a semantic-level structure of code that encodes the "where-the-value-comes-from" relation between variables (a toy edge-extraction sketch appears after this list).
We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement.
arXiv Detail & Related papers (2020-09-17T15:25:56Z)
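For the UniXcoder entry above, the toy sketch below shows what a one-to-one AST-to-sequence mapping can look like: the tree is serialized with explicit brackets so its structure stays recoverable from the flat token sequence. The bracketed encoding and the handling of Name/Constant leaves are illustrative assumptions, not UniXcoder's actual mapping.

```python
# Toy AST flattening with explicit brackets; not UniXcoder's exact scheme.
import ast

def flatten(node):
    """Serialize an AST node as 'NodeType ( child ... child )' tokens."""
    label = type(node).__name__
    if isinstance(node, ast.Name):        # keep identifier text on leaves
        label += ":" + node.id
    elif isinstance(node, ast.Constant):  # keep literal values on leaves
        label += ":" + repr(node.value)
    tokens = [label, "("]
    for child in ast.iter_child_nodes(node):
        tokens.extend(flatten(child))
    tokens.append(")")
    return tokens

tree = ast.parse("def add(a, b):\n    return a + b")
print(" ".join(flatten(tree)))
```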
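For the training-data cleaning entry above, a toy two-stage pipeline might look as follows. The concrete rules (minimum comment length, URL check, parseability) and the 0.3 similarity threshold are assumptions rather than the paper's configuration, and similarity_fn stands in for whatever model-based scorer is used.

```python
# Illustrative rule-based syntactic filter followed by a model-based semantic filter.
import ast
import re

def syntactic_filter(comment, code):
    """Cheap rule-based checks on a (comment, code) training pair."""
    if len(comment.split()) < 3:            # too short to describe the code
        return False
    if re.search(r"https?://", comment):    # link-only or boilerplate comments
        return False
    try:
        ast.parse(code)                     # code must at least parse
    except SyntaxError:
        return False
    return True

def semantic_filter(comment, code, similarity_fn, threshold=0.3):
    """Model-based check: keep pairs whose comment and code actually match."""
    return similarity_fn(comment, code) >= threshold

def clean(pairs, similarity_fn):
    return [(c, s) for c, s in pairs
            if syntactic_filter(c, s) and semantic_filter(c, s, similarity_fn)]
```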
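For the GraphCodeBERT entry above, the data-flow idea can be illustrated by linking each variable use to its most recent definition. The real pipeline is built on tree-sitter, covers multiple languages, and handles far more cases, so this Python-only walk is purely illustrative.

```python
# Toy "where-the-value-comes-from" edges: link each variable use to its last definition.
import ast

def data_flow_edges(source):
    """Return (use_line, var, def_line) triples for a Python snippet."""
    names = [n for n in ast.walk(ast.parse(source)) if isinstance(n, ast.Name)]
    # Visit in source order, loads before stores on the same line, so that
    # "x = x + y" links the right-hand x to the previous definition of x.
    names.sort(key=lambda n: (n.lineno, isinstance(n.ctx, ast.Store), n.col_offset))
    last_def, edges = {}, []
    for n in names:
        if isinstance(n.ctx, ast.Store):
            last_def[n.id] = n.lineno
        elif isinstance(n.ctx, ast.Load) and n.id in last_def:
            edges.append((n.lineno, n.id, last_def[n.id]))
    return edges

print(data_flow_edges("x = 1\ny = x + 2\nx = x + y\n"))
# [(2, 'x', 1), (3, 'x', 1), (3, 'y', 2)]
```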