Retriever and Ranker Framework with Probabilistic Hard Negative Sampling
for Code Search
- URL: http://arxiv.org/abs/2305.04508v1
- Date: Mon, 8 May 2023 07:04:28 GMT
- Title: Retriever and Ranker Framework with Probabilistic Hard Negative Sampling
for Code Search
- Authors: Hande Dong, Jiayi Lin, Yichong Leng, Jiawei Chen, Yutao Xie
- Abstract summary: We introduce a cross-encoder architecture for code search that jointly encodes the semantic matching of query and code.
We also introduce a Retriever-Ranker framework that cascades the dual-encoder and cross-encoder to improve the efficiency of evaluation and online serving.
- Score: 11.39443308694887
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pretrained Language Models (PLMs) have emerged as the state-of-the-art
paradigm for code search tasks. The paradigm involves pretraining the model on
search-irrelevant tasks such as masked language modeling, followed by the
finetuning stage, which focuses on the search-relevant task. The typical
finetuning method is to employ a dual-encoder architecture to encode semantic
embeddings of query and code separately, and then calculate their similarity
based on the embeddings.
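As a concrete illustration of this dual-encoder setup, here is a minimal sketch; the checkpoint, mean pooling, and cosine similarity below are illustrative assumptions, not the paper's exact configuration:
```python
import torch
from transformers import AutoModel, AutoTokenizer

# Checkpoint is an assumption: any code PLM with a standard encoder works here.
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
encoder = AutoModel.from_pretrained("microsoft/codebert-base")

def embed(text: str) -> torch.Tensor:
    """Encode a query or a code snippet independently (the dual-encoder
    property: no token-level interaction between the two inputs)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state       # [1, seq_len, dim]
    mask = inputs["attention_mask"].unsqueeze(-1)          # [1, seq_len, 1]
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # mean-pooled [1, dim]

query_vec = embed("sort a list of integers")
code_vec = embed("def sort_ints(xs):\n    return sorted(xs)")
score = torch.cosine_similarity(query_vec, code_vec)       # similarity from embeddings
```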
However, the typical dual-encoder architecture falls short in modeling
token-level interactions between query and code, which limits the model's
capabilities. In this paper, we propose a novel approach to address this
limitation, introducing a cross-encoder architecture for code search that
jointly encodes the semantic matching of query and code. We further introduce a
Retriever-Ranker (RR) framework that cascades the dual-encoder and
cross-encoder to improve the efficiency of evaluation and online serving.
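A minimal sketch of such a cascade, assuming a single-logit classification head as the cross-encoder and an illustrative k=50; neither choice is specified in the abstract:
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# The single-logit head standing in for the cross-encoder is an assumption.
tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
cross_encoder = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=1)

def cross_score(query: str, code: str) -> float:
    """Jointly encode the (query, code) pair: both share one input sequence,
    so self-attention models token-level interactions between them."""
    inputs = tok(query, code, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        return cross_encoder(**inputs).logits.item()

def retriever_ranker(query, query_vec, corpus, code_vecs, k=50):
    # Stage 1 (retriever): cheap similarity over precomputed dual-encoder embeddings.
    sims = torch.nn.functional.cosine_similarity(query_vec, code_vecs)
    shortlist = torch.topk(sims, k=min(k, len(corpus))).indices.tolist()
    # Stage 2 (ranker): expensive joint scoring, but only over the k candidates.
    scored = [(i, cross_score(query, corpus[i])) for i in shortlist]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```
Since the expensive joint encoder scores only the k retrieved candidates instead of the whole corpus, the cascade keeps cross-encoder quality at close to dual-encoder cost.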
Moreover, we present a probabilistic hard negative sampling method to improve
the cross-encoder's ability to distinguish hard negative codes, which further
enhances the cascade RR framework. Experiments on four datasets using three
code PLMs demonstrate the superiority of our proposed method.
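The abstract does not spell out the sampling distribution; a plausible sketch draws negatives with probability given by a temperature-scaled softmax over retriever similarities, so near-miss codes are sampled more often than easy ones:
```python
import torch

def sample_hard_negatives(sims: torch.Tensor, positive_idx: int,
                          num_negatives: int = 4, temperature: float = 0.1):
    """sims: retriever similarities between one query and every candidate code.
    Lower temperature concentrates probability mass on the hardest
    (highest-similarity) negatives."""
    logits = sims / temperature
    logits[positive_idx] = float("-inf")   # the true match is never a negative
    probs = torch.softmax(logits, dim=0)   # hard negatives get higher probability
    return torch.multinomial(probs, num_negatives, replacement=False).tolist()

# Toy example: index 0 is the ground-truth code; 1 and 3 are "hard" near-misses.
sims = torch.tensor([0.91, 0.88, 0.35, 0.80, 0.10])
negatives = sample_hard_negatives(sims, positive_idx=0)
# The cross-encoder is then trained to separate the positive from these samples.
```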
Related papers
- Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval [55.90407811819347]
We consider the task of paraphrased text-to-image retrieval where a model aims to return similar results given a pair of paraphrased queries.
We train a dual-encoder model starting from a language model pretrained on a large text corpus.
Compared to public dual-encoder models such as CLIP and OpenCLIP, the model trained with our best adaptation strategy achieves a significantly higher ranking similarity for paraphrased queries.
arXiv Detail & Related papers (2024-05-06T06:30:17Z)
- Can the Query-based Object Detector Be Designed with Fewer Stages? [15.726619371300558]
We propose a novel model called GOLO, which follows a two-stage decoding paradigm.
Compared to other mainstream query-based models with multi-stage decoders, our model employs fewer decoder stages while still achieving considerable performance.
arXiv Detail & Related papers (2023-09-28T09:58:52Z)
- Efficient Nearest Neighbor Search for Cross-Encoder Models using Matrix Factorization [60.91600465922932]
We present an approach that avoids the use of a dual-encoder for retrieval, relying solely on the cross-encoder.
Our approach provides test-time recall-vs-computational cost trade-offs superior to the current widely-used methods.
arXiv Detail & Related papers (2022-10-23T00:32:04Z)
- Revisiting Code Search in a Two-Stage Paradigm [67.02322603435628]
TOSS is a two-stage fusion code search framework.
It first uses IR-based and bi-encoder models to efficiently recall the top-k code candidates.
It then uses cross-encoders for fine-grained re-ranking.
arXiv Detail & Related papers (2022-08-24T02:34:27Z)
- UnifieR: A Unified Retriever for Large-Scale Retrieval [84.61239936314597]
Large-scale retrieval aims to recall relevant documents from a huge collection given a query.
Recent retrieval methods based on pre-trained language models (PLM) can be coarsely categorized into either dense-vector or lexicon-based paradigms.
We propose a new learning framework, UnifieR, which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability.
arXiv Detail & Related papers (2022-05-23T11:01:59Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- UniXcoder: Unified Cross-Modal Pre-training for Code Representation [65.6846553962117]
We present UniXcoder, a unified cross-modal pre-trained model for programming languages.
We propose a one-to-one mapping method to transform the AST into a sequence structure that retains all structural information from the tree.
We evaluate UniXcoder on five code-related tasks over nine datasets.
arXiv Detail & Related papers (2022-03-08T04:48:07Z)
- On the Importance of Building High-quality Training Datasets for Neural Code Search [15.557818317497397]
We propose a data cleaning framework consisting of two successive filters: a rule-based syntactic filter followed by a model-based semantic filter.
We evaluate the effectiveness of our framework on two widely-used code search models and three manually-annotated code retrieval benchmarks.
arXiv Detail & Related papers (2022-02-14T12:02:41Z)
- Leveraging Code Generation to Improve Code Retrieval and Summarization via Dual Learning [18.354352985591305]
Code summarization generates a brief natural language description for a given source code snippet, while code retrieval fetches relevant source code given a natural language query.
Recent studies have combined these two tasks to improve their performance.
We propose a novel end-to-end model for the two tasks by introducing an additional code generation task.
arXiv Detail & Related papers (2020-02-24T12:26:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.