Bridging the Training-Inference Gap for Dense Phrase Retrieval
- URL: http://arxiv.org/abs/2210.13678v1
- Date: Tue, 25 Oct 2022 00:53:06 GMT
- Title: Bridging the Training-Inference Gap for Dense Phrase Retrieval
- Authors: Gyuwan Kim, Jinhyuk Lee, Barlas Oguz, Wenhan Xiong, Yizhe Zhang,
Yashar Mehdad, William Yang Wang
- Abstract summary: Building dense retrievers requires a series of standard procedures, including training and validating neural models.
In this paper, we explore how the gap between training and inference in dense retrieval can be reduced.
We propose an efficient way of validating dense retrievers using a small subset of the entire corpus.
- Score: 104.4836127502683
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Building dense retrievers requires a series of standard procedures, including
training and validating neural models and creating indexes for efficient
search. However, these procedures are often misaligned in that training
objectives do not exactly reflect the retrieval scenario at inference time. In
this paper, we explore how the gap between training and inference in dense
retrieval can be reduced, focusing on dense phrase retrieval (Lee et al., 2021)
where billions of representations are indexed at inference. Since validating
every dense retriever with a large-scale index is practically infeasible, we
propose an efficient way of validating dense retrievers using a small subset of
the entire corpus. This allows us to validate various training strategies
including unifying contrastive loss terms and using hard negatives for phrase
retrieval, which largely reduces the training-inference discrepancy. As a
result, we improve top-1 phrase retrieval accuracy by 2~3 points and top-20
passage retrieval accuracy by 2~4 points for open-domain question answering.
Our work urges modeling dense retrievers with careful consideration of training
and inference via efficient validation while advancing phrase retrieval as a
general solution for dense retrieval.
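The core practical device here, validating against a small corpus subset rather than the full billion-vector index, is easy to picture in code. The sketch below is our own illustration, not the authors' implementation: it assumes precomputed query and passage embeddings (random stand-ins here) and keeps the gold passages in the sampled pool so top-k accuracy stays well-defined.

```python
import numpy as np

rng = np.random.default_rng(0)
corpus_emb = rng.standard_normal((100_000, 128)).astype(np.float32)  # stand-in for the full index
query_emb = rng.standard_normal((50, 128)).astype(np.float32)
gold_ids = rng.integers(0, 100_000, size=50)                         # gold passage id per query

def topk_accuracy_on_subset(query_emb, corpus_emb, gold_ids, subset_size=10_000, k=20):
    """Approximate full-index top-k accuracy using a sampled sub-corpus.

    Keeping the gold passages in the pool makes the metric well-defined;
    a larger subset tightens the approximation to the full-index number.
    """
    sample = rng.choice(len(corpus_emb), size=subset_size, replace=False)
    subset_ids = np.union1d(sample, gold_ids)           # sampled pool, gold included
    scores = query_emb @ corpus_emb[subset_ids].T       # brute-force inner-product search
    topk = subset_ids[np.argsort(-scores, axis=1)[:, :k]]
    return float(np.mean([g in row for g, row in zip(gold_ids, topk)]))

print(f"top-20 accuracy (subset proxy): {topk_accuracy_on_subset(query_emb, corpus_emb, gold_ids):.3f}")
```

Because the subset search is orders of magnitude cheaper than rebuilding the full index, this kind of proxy metric can be recomputed after every epoch to compare checkpoints and training strategies.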
Related papers
- Towards Competitive Search Relevance For Inference-Free Learned Sparse Retrievers [6.773411876899064]
Inference-free sparse models lag far behind both sparse and dense siamese models in terms of search relevance.
We propose two different approaches for performance improvement. First, we introduce the IDF-aware FLOPS loss, which incorporates Inverted Document Frequency (IDF) into the sparsification of representations.
We find that it mitigates the negative impact of the FLOPS regularization on search relevance, allowing the model to achieve a better balance between accuracy and efficiency.
arXiv Detail & Related papers (2024-11-07T03:46:43Z)
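To make the idea concrete, here is a minimal PyTorch sketch of an IDF-aware FLOPS regularizer. It is our reading of the abstract, not the paper's code: the standard FLOPS term penalizes the squared mean activation of each vocabulary entry, and we assume the IDF-aware variant down-weights the penalty on high-IDF (rare, informative) terms.

```python
import torch

def idf_aware_flops_loss(reps: torch.Tensor, idf: torch.Tensor) -> torch.Tensor:
    """reps: (batch, vocab) non-negative sparse activations; idf: (vocab,) IDF values."""
    mean_act = reps.abs().mean(dim=0)        # standard FLOPS: sum_j (mean_i |w_ij|)^2
    weights = 1.0 / idf.clamp(min=1e-6)      # assumed weighting: common (low-IDF) terms cost more
    return (weights * mean_act.pow(2)).sum()

# Usage with dummy activations over a BERT-sized vocabulary:
reps = torch.rand(8, 30522)
idf = 1.0 + 9.0 * torch.rand(30522)
loss = idf_aware_flops_loss(reps, idf)
```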
- Improve Dense Passage Retrieval with Entailment Tuning [22.39221206192245]
Key to a retrieval system is calculating relevance scores for query-passage pairs.
We observe that a major class of relevance aligns with the concept of entailment in NLI tasks.
We design a method called entailment tuning to improve the embeddings of dense retrievers.
arXiv Detail & Related papers (2024-10-21T09:18:30Z)
- Dense X Retrieval: What Retrieval Granularity Should We Use? [56.90827473115201]
An often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g., document, passage, or sentence.
We introduce a novel retrieval unit, proposition, for dense retrieval.
Experiments reveal that indexing a corpus by fine-grained units such as propositions significantly outperforms passage-level units in retrieval tasks.
arXiv Detail & Related papers (2023-12-11T18:57:35Z)
- LED: Lexicon-Enlightened Dense Retriever for Large-Scale Retrieval [68.85686621130111]
We propose to align a dense retriever with a well-performing lexicon-aware representation model.
We evaluate our model on three public benchmarks, showing that with a comparable lexicon-aware retriever as the teacher, our proposed dense model brings consistent and significant improvements.
arXiv Detail & Related papers (2022-08-29T15:09:28Z)
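A generic form of this teacher-student alignment, sketched under our own assumptions rather than LED's exact recipe, is listwise distillation: make the dense student's in-batch score distribution match the lexicon-aware teacher's.

```python
import torch
import torch.nn.functional as F

def align_with_teacher(student_q, student_p, teacher_scores, tau=1.0):
    """student_q, student_p: (B, d) dense embeddings; teacher_scores: (B, B) lexicon-aware scores."""
    student_scores = student_q @ student_p.T            # in-batch dense score matrix
    return F.kl_div(
        F.log_softmax(student_scores / tau, dim=1),     # student distribution (log-space)
        F.softmax(teacher_scores / tau, dim=1),         # teacher distribution
        reduction="batchmean",
    )

# Dummy usage:
B, d = 16, 128
loss = align_with_teacher(torch.randn(B, d), torch.randn(B, d), torch.randn(B, B))
```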
- LaPraDoR: Unsupervised Pretrained Dense Retriever for Zero-Shot Text Retrieval [55.097573036580066]
Experimental results show that LaPraDoR achieves state-of-the-art performance compared with supervised dense retrieval models.
Compared to re-ranking, our lexicon-enhanced approach can be run in milliseconds (22.5x faster) while achieving superior performance.
arXiv Detail & Related papers (2022-03-11T18:53:12Z)
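As we read the abstract, the lexical signal is fused directly into the dense score at ranking time, which costs only an element-wise operation rather than another model forward pass. The multiplicative fusion below is our assumption, illustrating why the approach runs in milliseconds:

```python
import numpy as np

def lexicon_enhanced_rank(dense_scores, bm25_scores, k=10):
    """Both score matrices: (num_queries, num_docs); returns top-k doc ids per query."""
    fused = bm25_scores * dense_scores          # element-wise fusion, no extra model forward pass
    return np.argsort(-fused, axis=1)[:, :k]

# Dummy usage: 4 queries over 1,000 documents.
top10 = lexicon_enhanced_rank(np.random.rand(4, 1000), np.random.rand(4, 1000))
```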
- Phrase Retrieval Learns Passage Retrieval, Too [77.57208968326422]
We study whether phrase retrieval can serve as the basis for coarse-level retrieval including passages and documents.
We show that a dense phrase-retrieval system, without any retraining, already achieves better passage retrieval accuracy.
We also show that phrase filtering and vector quantization can reduce the size of our index by 4-10x.
arXiv Detail & Related papers (2021-09-16T17:42:45Z)
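As a rough illustration of where such savings come from (our sketch, not the paper's pipeline): scalar-quantizing float32 vectors to int8 alone shrinks the index 4x, and product quantization, consistent with the reported 4-10x range, compresses further.

```python
import numpy as np

vecs = np.random.randn(10_000, 128).astype(np.float32)   # stand-in phrase vectors
scale = np.abs(vecs).max() / 127.0
q = np.round(vecs / scale).astype(np.int8)                # int8 codes: 4x smaller than float32
approx = q.astype(np.float32) * scale                     # dequantize for scoring
print(vecs.nbytes / q.nbytes)                             # -> 4.0
print(float(np.abs(vecs - approx).mean()))                # mean quantization error
```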
- Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval [20.62375162628628]
This paper presents Approximate nearest neighbor Negative Contrastive Estimation (ANCE), a training mechanism that constructs negatives from an Approximate Nearest Neighbor (ANN) index of the corpus.
In our experiments, ANCE boosts the BERT-Siamese DR model to outperform all competitive dense and sparse retrieval baselines.
Using dot-product similarity in the ANCE-learned representation space, it nearly matches the accuracy of sparse-retrieval-and-BERT-reranking while providing an almost 100x speed-up.
arXiv Detail & Related papers (2020-07-01T23:15:56Z)
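The negative-mining mechanism is simple to sketch: periodically embed the corpus, search it with the current model, and take the highest-scoring non-gold passages as training negatives. Below is a minimal stand-in using brute-force search in place of the ANN index; it illustrates the idea and is not the ANCE codebase.

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.standard_normal((50_000, 128)).astype(np.float32)   # refreshed corpus embeddings
queries = rng.standard_normal((16, 128)).astype(np.float32)
gold = rng.integers(0, 50_000, size=16)                          # gold passage id per query

def mine_hard_negatives(queries, corpus, gold, num_negs=8):
    """Top-scoring non-gold passages under the current model become negatives."""
    scores = queries @ corpus.T                 # an ANN index replaces this at scale
    ranked = np.argsort(-scores, axis=1)
    return [[int(p) for p in row if p != g][:num_negs] for row, g in zip(ranked, gold)]

negatives = mine_hard_negatives(queries, corpus, gold)
```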
- Progressively Pretrained Dense Corpus Index for Open-Domain Question Answering [87.32442219333046]
We propose a simple and resource-efficient method to pretrain the paragraph encoder.
Our method outperforms an existing dense retrieval method that uses 7 times more computational resources for pretraining.
arXiv Detail & Related papers (2020-04-30T18:09:50Z)