Progressively Pretrained Dense Corpus Index for Open-Domain Question
Answering
- URL: http://arxiv.org/abs/2005.00038v2
- Date: Fri, 19 Feb 2021 04:37:42 GMT
- Title: Progressively Pretrained Dense Corpus Index for Open-Domain Question
Answering
- Authors: Wenhan Xiong, Hong Wang, William Yang Wang
- Abstract summary: We propose a simple and resource-efficient method to pretrain the paragraph encoder.
Our method outperforms an existing dense retrieval method that uses 7 times more computational resources for pretraining.
- Score: 87.32442219333046
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To extract answers from a large corpus, open-domain question answering (QA)
systems usually rely on information retrieval (IR) techniques to narrow the
search space. Standard inverted index methods such as TF-IDF are commonly used
thanks to their efficiency. However, their retrieval performance is limited
as they simply use shallow and sparse lexical features. To break the IR
bottleneck, recent studies show that stronger retrieval performance can be
achieved by pretraining an effective paragraph encoder that indexes paragraphs
into dense vectors. Once trained, the corpus can be pre-encoded into
low-dimensional vectors and stored within an index structure where the
retrieval can be efficiently implemented as maximum inner product search.
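The retrieval step just described, pre-encoding the corpus into dense vectors and searching by inner product, can be sketched in plain NumPy. This is a minimal illustration, not the authors' implementation: the trained paragraph encoder is stubbed out with random vectors.

```python
import numpy as np

# Pretend a trained paragraph encoder already produced these:
# 10k paragraphs pre-encoded into 128-dim dense vectors.
rng = np.random.default_rng(0)
corpus_index = rng.standard_normal((10_000, 128)).astype(np.float32)

def mips_search(query_vec: np.ndarray, index: np.ndarray, k: int = 5) -> np.ndarray:
    """Exhaustive maximum inner product search: score every paragraph
    vector against the query and return the ids of the top-k scores."""
    scores = index @ query_vec              # inner products, shape (N,)
    topk = np.argpartition(-scores, k)[:k]  # unordered top-k ids
    return topk[np.argsort(-scores[topk])]  # sorted by descending score

query = rng.standard_normal(128).astype(np.float32)
hits = mips_search(query, corpus_index)
print(hits.shape)  # (5,)
```

In practice the exhaustive scan above is replaced by an approximate MIPS index (e.g. FAISS) so retrieval stays fast at corpus scale, which is what makes the pre-encoded dense index usable at query time.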
Despite the promising results, pretraining such a dense index is expensive
and often requires a very large batch size. In this work, we propose a simple
and resource-efficient method to pretrain the paragraph encoder. First, instead
of using heuristically created pseudo question-paragraph pairs for pretraining,
we utilize an existing pretrained sequence-to-sequence model to build a strong
question generator that creates high-quality pretraining data. Second, we
propose a progressive pretraining algorithm to ensure the existence of
effective negative samples in each batch. Across three datasets, our method
outperforms an existing dense retrieval method that uses 7 times more
computational resources for pretraining.
Related papers
- Early Exit Strategies for Approximate k-NN Search in Dense Retrieval [10.48678957367324]
We build upon the state of the art for early-exit A-kNN and propose an unsupervised method based on the notion of patience.
We show that our techniques improve the A-kNN efficiency with up to 5x speedups while achieving negligible effectiveness losses.
arXiv Detail & Related papers (2024-08-09T10:17:07Z)
- DeeperImpact: Optimizing Sparse Learned Index Structures [4.92919246305126]
We focus on narrowing the effectiveness gap with the most effective versions of SPLADE.
Our results substantially narrow the effectiveness gap with the most effective versions of SPLADE.
arXiv Detail & Related papers (2024-05-27T12:08:59Z)
- Semi-Parametric Retrieval via Binary Token Index [71.78109794895065]
Semi-parametric Vocabulary Disentangled Retrieval (SVDR) is a novel semi-parametric retrieval framework.
It supports two types of indexes: an embedding-based index for high effectiveness, akin to existing neural retrieval methods; and a binary token index that allows for quick and cost-effective setup, resembling traditional term-based retrieval.
It achieves a 3% higher top-1 retrieval accuracy compared to the dense retriever DPR when using an embedding-based index and a 9% higher top-1 accuracy compared to BM25 when using a binary token index.
arXiv Detail & Related papers (2024-05-03T08:34:13Z)
- Dense X Retrieval: What Retrieval Granularity Should We Use? [56.90827473115201]
An often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g. document, passage, or sentence.
We introduce a novel retrieval unit, proposition, for dense retrieval.
Experiments reveal that indexing a corpus by fine-grained units such as propositions significantly outperforms passage-level units in retrieval tasks.
arXiv Detail & Related papers (2023-12-11T18:57:35Z)
- Augmented Embeddings for Custom Retrievals [13.773007276544913]
We introduce Adapted Dense Retrieval, a mechanism to transform embeddings to enable improved task-specific, heterogeneous and strict retrieval.
Adapted Dense Retrieval works by learning a low-rank residual adaptation of the pretrained black-box embedding.
arXiv Detail & Related papers (2023-10-09T03:29:35Z)
- Lexically-Accelerated Dense Retrieval [29.327878974130055]
'LADR' (Lexically-Accelerated Dense Retrieval) is a simple-yet-effective approach that improves the efficiency of existing dense retrieval models.
LADR consistently achieves both precision and recall that are on par with an exhaustive search on standard benchmarks.
arXiv Detail & Related papers (2023-07-31T15:44:26Z)
- Bridging the Training-Inference Gap for Dense Phrase Retrieval [104.4836127502683]
Building dense retrievers requires a series of standard procedures, including training and validating neural models.
In this paper, we explore how the gap between training and inference in dense retrieval can be reduced.
We propose an efficient way of validating dense retrievers using a small subset of the entire corpus.
arXiv Detail & Related papers (2022-10-25T00:53:06Z)
- Autoregressive Search Engines: Generating Substrings as Document Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work we propose an alternative that does not force any structure on the search space: using all n-grams in a passage as its possible identifiers.
arXiv Detail & Related papers (2022-04-22T10:45:01Z)
- Best-First Beam Search [78.71330480725668]
We show that the standard implementation of beam search can be made up to 10x faster in practice.
We propose a memory-reduced variant of Best-First Beam Search, which has a similar beneficial search bias in terms of downstream performance.
arXiv Detail & Related papers (2020-07-08T05:56:01Z)
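The best-first idea behind the last entry can be sketched with a priority queue: instead of expanding whole frontiers breadth-first, always pop the highest-scoring partial hypothesis, so search can stop as soon as enough complete hypotheses are found. The toy scoring function below is hypothetical; this shows the generic algorithmic idea, not the paper's optimized implementation.

```python
import heapq
import math

def best_first_beam_search(start, expand, is_final, beam_size=3, max_steps=20):
    """Best-first beam search over sequences scored by summed log-probs."""
    # heapq is a min-heap, so store negated log-probabilities.
    heap = [(-0.0, [start])]
    finished = []
    while heap and len(finished) < beam_size:
        neg_lp, seq = heapq.heappop(heap)
        # Treat over-long sequences as finished (simple truncation choice).
        if is_final(seq) or len(seq) >= max_steps:
            finished.append((-neg_lp, seq))
            continue
        # Keep only the beam_size best continuations of this hypothesis.
        for token, logp in sorted(expand(seq), key=lambda x: -x[1])[:beam_size]:
            heapq.heappush(heap, (neg_lp - logp, seq + [token]))
    return finished

# Hypothetical toy model: prefers 'a' over 'b', stops at '</s>'.
def expand(seq):
    return [("a", math.log(0.5)), ("b", math.log(0.2)), ("</s>", math.log(0.3))]

hyps = best_first_beam_search("<s>", expand, lambda s: s[-1] == "</s>")
print([" ".join(s) for _, s in hyps])
```

Because the queue is ordered by score, the first hypothesis to finish is provably the best one under this monotone scoring, which is the source of the speedups the paper reports over the standard frontier-by-frontier implementation.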
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.