Progressively Pretrained Dense Corpus Index for Open-Domain Question
Answering
- URL: http://arxiv.org/abs/2005.00038v2
- Date: Fri, 19 Feb 2021 04:37:42 GMT
- Title: Progressively Pretrained Dense Corpus Index for Open-Domain Question
Answering
- Authors: Wenhan Xiong, Hong Wang, William Yang Wang
- Abstract summary: We propose a simple and resource-efficient method to pretrain the paragraph encoder.
Our method outperforms an existing dense retrieval method that uses 7 times more computational resources for pretraining.
- Score: 87.32442219333046
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To extract answers from a large corpus, open-domain question answering (QA)
systems usually rely on information retrieval (IR) techniques to narrow the
search space. Standard inverted index methods such as TF-IDF are commonly used
thanks to their efficiency. However, their retrieval performance is limited
because they rely only on shallow, sparse lexical features. To break the IR
bottleneck, recent studies show that stronger retrieval performance can be
achieved by pretraining an effective paragraph encoder that indexes paragraphs
into dense vectors. Once trained, the corpus can be pre-encoded into
low-dimensional vectors and stored within an index structure where the
retrieval can be efficiently implemented as maximum inner product search.
Despite the promising results, pretraining such a dense index is expensive
and often requires a very large batch size. In this work, we propose a simple
and resource-efficient method to pretrain the paragraph encoder. First, instead
of using heuristically created pseudo question-paragraph pairs for pretraining,
we utilize an existing pretrained sequence-to-sequence model to build a strong
question generator that creates high-quality pretraining data. Second, we
propose a progressive pretraining algorithm to ensure the existence of
effective negative samples in each batch. Across three datasets, our method
outperforms an existing dense retrieval method that uses 7 times more
computational resources for pretraining.
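The retrieval step the abstract describes, where pre-encoded paragraph vectors are searched by maximum inner product, can be sketched as follows. This is a toy illustration, not the paper's system: the encoder below is a random stand-in for a trained model, and `mips_search` and the five-paragraph corpus are illustrative names.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a pre-encoded corpus: 5 paragraphs,
# each stored as a 128-dimensional dense vector.
corpus_vecs = rng.normal(size=(5, 128)).astype(np.float32)

def encode_question(question: str) -> np.ndarray:
    # Placeholder encoder; a real system would run a neural model here.
    seed = abs(hash(question)) % (2**32)
    return np.random.default_rng(seed).normal(size=128).astype(np.float32)

def mips_search(query_vec: np.ndarray, index: np.ndarray, k: int = 2):
    # Maximum inner product search: score every paragraph by dot product
    # with the query and return the indices of the top-k scores.
    scores = index @ query_vec
    topk = np.argsort(-scores)[:k]
    return topk.tolist(), scores[topk].tolist()

ids, scores = mips_search(encode_question("who wrote hamlet?"), corpus_vecs)
print(ids)  # indices of the two highest-scoring paragraphs
```

In practice the brute-force dot product above is replaced by an approximate MIPS index (e.g. a FAISS inner-product index) so that retrieval stays fast over millions of paragraphs.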
Related papers
- Semi-Parametric Retrieval via Binary Token Index [71.78109794895065]
Semi-parametric Vocabulary Disentangled Retrieval (SVDR) is a novel semi-parametric retrieval framework.
It supports two types of indexes: an embedding-based index for high effectiveness, akin to existing neural retrieval methods; and a binary token index that allows for quick and cost-effective setup, resembling traditional term-based retrieval.
It achieves a 3% higher top-1 retrieval accuracy compared to the dense retriever DPR when using an embedding-based index and a 9% higher top-1 accuracy compared to BM25 when using a binary token index.

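The binary token index described above resembles a classical term-based inverted index: each document is reduced to the set of tokens it contains, and candidates are scored by how many query tokens they match. A minimal sketch under that reading, assuming nothing about SVDR's actual implementation (all names and documents below are illustrative):

```python
from collections import defaultdict

# Toy corpus; in a real system tokens would be subword ids, not words.
docs = {
    0: "dense retrieval needs a trained encoder",
    1: "inverted index methods use sparse lexical features",
    2: "question answering systems retrieve paragraphs",
}

# Binary token index: token -> set of documents containing it.
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for tok in set(text.split()):
        inverted[tok].add(doc_id)

def binary_token_search(query: str):
    # Each matching query token counts once (binary, not frequency-weighted).
    scores = defaultdict(int)
    for tok in set(query.split()):
        for doc_id in inverted[tok]:
            scores[doc_id] += 1
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(binary_token_search("sparse inverted index"))
```

Because the index holds only token sets, it can be built without encoding the corpus through a neural model, which is what makes the setup quick and cheap relative to an embedding-based index.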
arXiv Detail & Related papers (2024-05-03T08:34:13Z)
- Dense X Retrieval: What Retrieval Granularity Should We Use? [59.359325855708974]
An often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g. a document, passage, or sentence.
We introduce a novel retrieval unit, proposition, for dense retrieval. Propositions are defined as atomic expressions within text, each encapsulating a distinct factoid.
Our results reveal that proposition-based retrieval significantly outperforms traditional passage or sentence-based methods in dense retrieval.
arXiv Detail & Related papers (2023-12-11T18:57:35Z)
- Augmented Embeddings for Custom Retrievals [13.773007276544913]
We introduce Adapted Dense Retrieval, a mechanism to transform embeddings to enable improved task-specific, heterogeneous and strict retrieval.
Adapted Dense Retrieval works by learning a low-rank residual adaptation of the pretrained black-box embedding.
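The low-rank residual adaptation might be sketched as below. The frozen black-box embedding is stood in for by a random projection, and the adapter matrices `A` and `B` are hypothetical untrained placeholders; in the actual method they would be learned on task-specific data.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4  # embedding dimension and (much smaller) adapter rank

def base_embed(x: np.ndarray) -> np.ndarray:
    # Frozen "black-box" embedding; here just a fixed random projection.
    W = np.random.default_rng(1).normal(size=(d, x.shape[0]))
    return W @ x

# Low-rank adapter: only A (d x r) and B (r x d) would be trained,
# i.e. 2*d*r parameters instead of a full d x d transform.
A = rng.normal(size=(d, r)) * 0.01
B = rng.normal(size=(r, d)) * 0.01

def adapted_embed(x: np.ndarray) -> np.ndarray:
    h = base_embed(x)
    # Residual form keeps the adapted space close to the original embedding.
    return h + A @ (B @ h)
```

The residual form means an untrained (or lightly trained) adapter leaves the original embedding nearly intact, so adaptation cannot catastrophically distort the pretrained space.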
arXiv Detail & Related papers (2023-10-09T03:29:35Z)
- Lexically-Accelerated Dense Retrieval [29.327878974130055]
'LADR' (Lexically-Accelerated Dense Retrieval) is a simple-yet-effective approach that improves the efficiency of existing dense retrieval models.
LADR consistently achieves both precision and recall that are on par with an exhaustive search on standard benchmarks.
arXiv Detail & Related papers (2023-07-31T15:44:26Z) - Bridging the Training-Inference Gap for Dense Phrase Retrieval [104.4836127502683]
Building dense retrievers requires a series of standard procedures, including training and validating neural models.
In this paper, we explore how the gap between training and inference in dense retrieval can be reduced.
We propose an efficient way of validating dense retrievers using a small subset of the entire corpus.
arXiv Detail & Related papers (2022-10-25T00:53:06Z)
- Autoregressive Search Engines: Generating Substrings as Document Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work we propose an alternative that does not force any structure on the search space: using all n-grams in a passage as its possible identifiers.
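Mapping every n-gram of a passage back to the passages that contain it can be sketched as below. The tiny corpus and the `ngrams` helper are illustrative, not the paper's actual index machinery; the idea shown is only that any generated substring identifies the passages containing it.

```python
from collections import defaultdict

passages = {
    0: "the quick brown fox",
    1: "the lazy brown dog",
}

def ngrams(tokens, n):
    # All contiguous n-token substrings of a token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Index every n-gram of every passage, for all lengths n.
index = defaultdict(set)
for pid, text in passages.items():
    toks = text.split()
    for n in range(1, len(toks) + 1):
        for g in ngrams(toks, n):
            index[g].add(pid)

# A generated substring identifies every passage that contains it:
print(index["brown"])        # shared by both passages
print(index["quick brown"])  # unique to passage 0
```

Longer generated substrings narrow the candidate set, which is why unconstrained n-gram identifiers can still pinpoint individual passages.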
arXiv Detail & Related papers (2022-04-22T10:45:01Z)
- A New Sentence Ordering Method Using BERT Pretrained Model [2.1793134762413433]
We propose a method for sentence ordering that requires neither a training phase nor, consequently, a large corpus for learning.
Our proposed method outperformed other baselines on ROCStories, a corpus of 5-sentence human-made stories.
Among the other advantages of this method are its interpretability and the fact that it requires no linguistic knowledge.
arXiv Detail & Related papers (2021-08-26T18:47:15Z)
- Best-First Beam Search [78.71330480725668]
We show that the standard implementation of beam search can be made up to 10x faster in practice.
We propose a memory-reduced variant of Best-First Beam Search, which has a similar beneficial search bias in terms of downstream performance.
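Best-first beam search expands the single highest-scoring partial hypothesis at each step, rather than a whole beam level at a time. A toy sketch with a priority queue, using a made-up transition table (this is an illustration of the general idea, not the paper's algorithm or its memory-reduced variant):

```python
import heapq
import math

# Hypothetical next-token probabilities for a tiny language model.
next_probs = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"a": 0.1, "</s>": 0.9},
    "b": {"b": 0.5, "</s>": 0.5},
}

def best_first_beam_search(k=2, max_len=5):
    # Min-heap of (negative log-prob, sequence): popping gives the
    # best-scoring partial hypothesis first.
    heap = [(0.0, ["<s>"])]
    finished = []
    while heap and len(finished) < k:
        neg_lp, seq = heapq.heappop(heap)
        if seq[-1] == "</s>" or len(seq) >= max_len:
            finished.append((math.exp(-neg_lp), seq))
            continue
        for tok, p in next_probs[seq[-1]].items():
            heapq.heappush(heap, (neg_lp - math.log(p), seq + [tok]))
    return finished

print(best_first_beam_search())
```

Because hypotheses are expanded strictly in score order, the search can stop as soon as k hypotheses are complete, skipping the lower-scoring expansions that standard level-by-level beam search would still evaluate.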
arXiv Detail & Related papers (2020-07-08T05:56:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.