Progressively Pretrained Dense Corpus Index for Open-Domain Question
Answering
- URL: http://arxiv.org/abs/2005.00038v2
- Date: Fri, 19 Feb 2021 04:37:42 GMT
- Title: Progressively Pretrained Dense Corpus Index for Open-Domain Question
Answering
- Authors: Wenhan Xiong, Hong Wang, William Yang Wang
- Abstract summary: We propose a simple and resource-efficient method to pretrain the paragraph encoder.
Our method outperforms an existing dense retrieval method that uses 7 times more computational resources for pretraining.
- Score: 87.32442219333046
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To extract answers from a large corpus, open-domain question answering (QA)
systems usually rely on information retrieval (IR) techniques to narrow the
search space. Standard inverted index methods such as TF-IDF are commonly used
thanks to their efficiency. However, their retrieval performance is limited
because they rely only on shallow, sparse lexical features. To break the IR
bottleneck, recent studies show that stronger retrieval performance can be
achieved by pretraining an effective paragraph encoder that indexes paragraphs
into dense vectors. Once trained, the corpus can be pre-encoded into
low-dimensional vectors and stored within an index structure where the
retrieval can be efficiently implemented as maximum inner product search.
Despite the promising results, pretraining such a dense index is expensive
and often requires a very large batch size. In this work, we propose a simple
and resource-efficient method to pretrain the paragraph encoder. First, instead
of using heuristically created pseudo question-paragraph pairs for pretraining,
we utilize an existing pretrained sequence-to-sequence model to build a strong
question generator that creates high-quality pretraining data. Second, we
propose a progressive pretraining algorithm to ensure the existence of
effective negative samples in each batch. Across three datasets, our method
outperforms an existing dense retrieval method that uses 7 times more
computational resources for pretraining.
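The retrieval step the abstract describes, where pre-encoded paragraph vectors are searched by maximum inner product, can be sketched as follows. This is a toy illustration, not the paper's system: the encoder below is a random stand-in for a trained model, and `mips_search` and the five-paragraph corpus are illustrative names.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a pre-encoded corpus: 5 paragraphs,
# each stored as a 128-dimensional dense vector.
corpus_vecs = rng.normal(size=(5, 128)).astype(np.float32)

def encode_question(question: str) -> np.ndarray:
    # Placeholder encoder; a real system would run a neural model here.
    seed = abs(hash(question)) % (2**32)
    return np.random.default_rng(seed).normal(size=128).astype(np.float32)

def mips_search(query_vec: np.ndarray, index: np.ndarray, k: int = 2):
    # Maximum inner product search: score every paragraph by dot product
    # with the query and return the indices of the top-k scores.
    scores = index @ query_vec
    topk = np.argsort(-scores)[:k]
    return topk.tolist(), scores[topk].tolist()

ids, scores = mips_search(encode_question("who wrote hamlet?"), corpus_vecs)
print(ids)  # indices of the two highest-scoring paragraphs
```

In practice the brute-force dot product above is replaced by an approximate MIPS index (e.g. a FAISS inner-product index) so that retrieval stays fast over millions of paragraphs.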
Related papers
- Semi-Parametric Retrieval via Binary Token Index [71.78109794895065]
Semi-parametric Vocabulary Disentangled Retrieval (SVDR) is a novel semi-parametric retrieval framework.
It supports two types of indexes: an embedding-based index for high effectiveness, akin to existing neural retrieval methods; and a binary token index that allows for quick and cost-effective setup, resembling traditional term-based retrieval.
It achieves a 3% higher top-1 retrieval accuracy compared to the dense retriever DPR when using an embedding-based index and a 9% higher top-1 accuracy compared to BM25 when using a binary token index.

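The binary token index described above resembles a classical term-based inverted index: each document is reduced to the set of tokens it contains, and candidates are scored by how many query tokens they match. A minimal sketch under that reading, assuming nothing about SVDR's actual implementation (all names and documents below are illustrative):

```python
from collections import defaultdict

# Toy corpus; in a real system tokens would be subword ids, not words.
docs = {
    0: "dense retrieval needs a trained encoder",
    1: "inverted index methods use sparse lexical features",
    2: "question answering systems retrieve paragraphs",
}

# Binary token index: token -> set of documents containing it.
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for tok in set(text.split()):
        inverted[tok].add(doc_id)

def binary_token_search(query: str):
    # Each matching query token counts once (binary, not frequency-weighted).
    scores = defaultdict(int)
    for tok in set(query.split()):
        for doc_id in inverted[tok]:
            scores[doc_id] += 1
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(binary_token_search("sparse inverted index"))
```

Because the index holds only token sets, it can be built without encoding the corpus through a neural model, which is what makes the setup quick and cheap relative to an embedding-based index.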
arXiv Detail & Related papers (2024-05-03T08:34:13Z)
- Dense X Retrieval: What Retrieval Granularity Should We Use? [59.359325855708974]
An often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g. a document, passage, or sentence.
We introduce a novel retrieval unit, proposition, for dense retrieval. Propositions are defined as atomic expressions within text, each encapsulating a distinct factoid.
Our results reveal that proposition-based retrieval significantly outperforms traditional passage or sentence-based methods in dense retrieval.
arXiv Detail & Related papers (2023-12-11T18:57:35Z)
- Augmented Embeddings for Custom Retrievals [13.773007276544913]
We introduce Adapted Dense Retrieval, a mechanism to transform embeddings to enable improved task-specific, heterogeneous and strict retrieval.
Adapted Dense Retrieval works by learning a low-rank residual adaptation of the pretrained black-box embedding.
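The low-rank residual adaptation might be sketched as below. The frozen black-box embedding is stood in for by a random projection, and the adapter matrices `A` and `B` are hypothetical untrained placeholders; in the actual method they would be learned on task-specific data.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4  # embedding dimension and (much smaller) adapter rank

def base_embed(x: np.ndarray) -> np.ndarray:
    # Frozen "black-box" embedding; here just a fixed random projection.
    W = np.random.default_rng(1).normal(size=(d, x.shape[0]))
    return W @ x

# Low-rank adapter: only A (d x r) and B (r x d) would be trained,
# i.e. 2*d*r parameters instead of a full d x d transform.
A = rng.normal(size=(d, r)) * 0.01
B = rng.normal(size=(r, d)) * 0.01

def adapted_embed(x: np.ndarray) -> np.ndarray:
    h = base_embed(x)
    # Residual form keeps the adapted space close to the original embedding.
    return h + A @ (B @ h)
```

The residual form means an untrained (or lightly trained) adapter leaves the original embedding nearly intact, so adaptation cannot catastrophically distort the pretrained space.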
arXiv Detail & Related papers (2023-10-09T03:29:35Z)
- Lexically-Accelerated Dense Retrieval [29.327878974130055]
'LADR' (Lexically-Accelerated Dense Retrieval) is a simple-yet-effective approach that improves the efficiency of existing dense retrieval models.
LADR consistently achieves both precision and recall that are on par with an exhaustive search on standard benchmarks.
arXiv Detail & Related papers (2023-07-31T15:44:26Z) - Bridging the Training-Inference Gap for Dense Phrase Retrieval [104.4836127502683]
Building dense retrievers requires a series of standard procedures, including training and validating neural models.
In this paper, we explore how the gap between training and inference in dense retrieval can be reduced.
We propose an efficient way of validating dense retrievers using a small subset of the entire corpus.
arXiv Detail & Related papers (2022-10-25T00:53:06Z)
- Autoregressive Search Engines: Generating Substrings as Document Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work we propose an alternative that does not force any structure on the search space: using all n-grams in a passage as its possible identifiers.
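Mapping every n-gram of a passage back to the passages that contain it can be sketched as below. The tiny corpus and the `ngrams` helper are illustrative, not the paper's actual index machinery; the idea shown is only that any generated substring identifies the passages containing it.

```python
from collections import defaultdict

passages = {
    0: "the quick brown fox",
    1: "the lazy brown dog",
}

def ngrams(tokens, n):
    # All contiguous n-token substrings of a token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Index every n-gram of every passage, for all lengths n.
index = defaultdict(set)
for pid, text in passages.items():
    toks = text.split()
    for n in range(1, len(toks) + 1):
        for g in ngrams(toks, n):
            index[g].add(pid)

# A generated substring identifies every passage that contains it:
print(index["brown"])        # shared by both passages
print(index["quick brown"])  # unique to passage 0
```

Longer generated substrings narrow the candidate set, which is why unconstrained n-gram identifiers can still pinpoint individual passages.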
arXiv Detail & Related papers (2022-04-22T10:45:01Z)
- A New Sentence Ordering Method Using BERT Pretrained Model [2.1793134762413433]
We propose a method for sentence ordering that requires neither a training phase nor, consequently, a large corpus for learning.
Our proposed method outperformed other baselines on ROCStories, a corpus of 5-sentence human-made stories.
Among the other advantages of this method are its interpretability and the fact that it requires no linguistic knowledge.
arXiv Detail & Related papers (2021-08-26T18:47:15Z)
- Best-First Beam Search [78.71330480725668]
We show that the standard implementation of beam search can be made up to 10x faster in practice.
We propose a memory-reduced variant of Best-First Beam Search, which has a similar beneficial search bias in terms of downstream performance.
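Best-first beam search expands the single highest-scoring partial hypothesis at each step, rather than a whole beam level at a time. A toy sketch with a priority queue, using a made-up transition table (this is an illustration of the general idea, not the paper's algorithm or its memory-reduced variant):

```python
import heapq
import math

# Hypothetical next-token probabilities for a tiny language model.
next_probs = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"a": 0.1, "</s>": 0.9},
    "b": {"b": 0.5, "</s>": 0.5},
}

def best_first_beam_search(k=2, max_len=5):
    # Min-heap of (negative log-prob, sequence): popping gives the
    # best-scoring partial hypothesis first.
    heap = [(0.0, ["<s>"])]
    finished = []
    while heap and len(finished) < k:
        neg_lp, seq = heapq.heappop(heap)
        if seq[-1] == "</s>" or len(seq) >= max_len:
            finished.append((math.exp(-neg_lp), seq))
            continue
        for tok, p in next_probs[seq[-1]].items():
            heapq.heappush(heap, (neg_lp - math.log(p), seq + [tok]))
    return finished

print(best_first_beam_search())
```

Because hypotheses are expanded strictly in score order, the search can stop as soon as k hypotheses are complete, skipping the lower-scoring expansions that standard level-by-level beam search would still evaluate.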
arXiv Detail & Related papers (2020-07-08T05:56:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.