LaPraDoR: Unsupervised Pretrained Dense Retriever for Zero-Shot Text
Retrieval
- URL: http://arxiv.org/abs/2203.06169v1
- Date: Fri, 11 Mar 2022 18:53:12 GMT
- Title: LaPraDoR: Unsupervised Pretrained Dense Retriever for Zero-Shot Text
Retrieval
- Authors: Canwen Xu and Daya Guo and Nan Duan and Julian McAuley
- Abstract summary: Experimental results show that LaPraDoR achieves state-of-the-art performance compared with supervised dense retrieval models.
Compared to re-ranking, our lexicon-enhanced approach can be run in milliseconds (22.5x faster) while achieving superior performance.
- Score: 55.097573036580066
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose LaPraDoR, a pretrained dual-tower dense retriever
that does not require any supervised data for training. Specifically, we first
present Iterative Contrastive Learning (ICoL) that iteratively trains the query
and document encoders with a cache mechanism. ICoL not only enlarges the number
of negative instances but also keeps representations of cached examples in the
same hidden space. We then propose Lexicon-Enhanced Dense Retrieval (LEDR) as a
simple yet effective way to enhance dense retrieval with lexical matching. We
evaluate LaPraDoR on the recently proposed BEIR benchmark, including 18
datasets of 9 zero-shot text retrieval tasks. Experimental results show that
LaPraDoR achieves state-of-the-art performance compared with supervised dense
retrieval models, and further analysis reveals the effectiveness of our
training strategy and objectives. Compared to re-ranking, our lexicon-enhanced
approach can be run in milliseconds (22.5x faster) while achieving superior
performance.
Related papers
- A Fresh Take on Stale Embeddings: Improving Dense Retriever Training with Corrector Networks [81.2624272756733]
In dense retrieval, deep encoders provide embeddings for both inputs and targets.
We train a small parametric corrector network that adjusts stale cached target embeddings.
Our approach matches state-of-the-art results even when no target embedding updates are made during training.
arXiv Detail & Related papers (2024-09-03T13:29:13Z) - Multimodal Learned Sparse Retrieval with Probabilistic Expansion Control [66.78146440275093]
Learned retrieval (LSR) is a family of neural methods that encode queries and documents into sparse lexical vectors.
We explore the application of LSR to the multi-modal domain, with a focus on text-image retrieval.
Current approaches like LexLIP and STAIR require complex multi-step training on massive datasets.
Our proposed approach efficiently transforms dense vectors from a frozen dense model into sparse lexical vectors.
arXiv Detail & Related papers (2024-02-27T14:21:56Z) - Lexically-Accelerated Dense Retrieval [29.327878974130055]
'LADR' (Lexically-Accelerated Dense Retrieval) is a simple-yet-effective approach that improves the efficiency of existing dense retrieval models.
LADR consistently achieves both precision and recall that are on par with an exhaustive search on standard benchmarks.
arXiv Detail & Related papers (2023-07-31T15:44:26Z) - SPRINT: A Unified Toolkit for Evaluating and Demystifying Zero-shot
Neural Sparse Retrieval [92.27387459751309]
We provide SPRINT, a unified Python toolkit for evaluating neural sparse retrieval.
We establish strong and reproducible zero-shot sparse retrieval baselines across the well-acknowledged benchmark, BEIR.
We show that SPLADEv2 produces sparse representations with a majority of tokens outside of the original query and document.
arXiv Detail & Related papers (2023-07-19T22:48:02Z) - REINFOREST: Reinforcing Semantic Code Similarity for Cross-Lingual Code Search Models [11.78036105494679]
This paper introduces a novel code-to-code search technique that enhances the performance of Large Language Models (LLMs)
We present the first-ever code search method that encodes dynamic information during training without the need to execute either the corpus under search or the search query at inference time.
arXiv Detail & Related papers (2023-05-05T20:46:56Z) - Zero-Shot Listwise Document Reranking with a Large Language Model [58.64141622176841]
We propose Listwise Reranker with a Large Language Model (LRL), which achieves strong reranking effectiveness without using any task-specific training data.
Experiments on three TREC web search datasets demonstrate that LRL not only outperforms zero-shot pointwise methods when reranking first-stage retrieval results, but can also act as a final-stage reranker.
arXiv Detail & Related papers (2023-05-03T14:45:34Z) - Bridging the Training-Inference Gap for Dense Phrase Retrieval [104.4836127502683]
Building dense retrievers requires a series of standard procedures, including training and validating neural models.
In this paper, we explore how the gap between training and inference in dense retrieval can be reduced.
We propose an efficient way of validating dense retrievers using a small subset of the entire corpus.
arXiv Detail & Related papers (2022-10-25T00:53:06Z) - SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval [11.38022203865326]
SPLADE model provides highly sparse representations and competitive results with respect to state-of-the-art dense and sparse approaches.
We modify the pooling mechanism, benchmark a model solely based on document expansion, and introduce models trained with distillation.
Overall, SPLADE is considerably improved with more than $9$% gains on NDCG@10 on TREC DL 2019, leading to state-of-the-art results on the BEIR benchmark.
arXiv Detail & Related papers (2021-09-21T10:43:42Z) - Efficiently Teaching an Effective Dense Retriever with Balanced Topic
Aware Sampling [37.01593605084575]
TAS-Balanced is an efficient topic-aware query and balanced margin sampling technique.
We show that our TAS-Balanced training method achieves state-of-the-art low-latency (64ms per query) results on two TREC Deep Learning Track query sets.
arXiv Detail & Related papers (2021-04-14T16:49:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.