Questions Are All You Need to Train a Dense Passage Retriever
- URL: http://arxiv.org/abs/2206.10658v4
- Date: Mon, 3 Apr 2023 00:28:34 GMT
- Title: Questions Are All You Need to Train a Dense Passage Retriever
- Authors: Devendra Singh Sachan and Mike Lewis and Dani Yogatama and Luke
Zettlemoyer and Joelle Pineau and Manzil Zaheer
- Abstract summary: ART is a new corpus-level autoencoding approach for training dense retrieval models that does not require any labeled training data.
It uses a new document-retrieval autoencoding scheme, where (1) an input question is used to retrieve a set of evidence documents, and (2) the documents are then used to compute the probability of reconstructing the original question.
- Score: 123.13872383489172
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce ART, a new corpus-level autoencoding approach for training dense
retrieval models that does not require any labeled training data. Dense
retrieval is a central challenge for open-domain tasks, such as Open QA, where
state-of-the-art methods typically require large supervised datasets with
custom hard-negative mining and denoising of positive examples. ART, in
contrast, only requires access to unpaired inputs and outputs (e.g. questions
and potential answer documents). It uses a new document-retrieval autoencoding
scheme, where (1) an input question is used to retrieve a set of evidence
documents, and (2) the documents are then used to compute the probability of
reconstructing the original question. Training for retrieval based on question
reconstruction enables effective unsupervised learning of both document and
question encoders, which can be later incorporated into complete Open QA
systems without any further finetuning. Extensive experiments demonstrate that
ART obtains state-of-the-art results on multiple QA retrieval benchmarks with
only generic initialization from a pre-trained language model, removing the
need for labeled data and task-specific losses.
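The training signal described above can be pictured with a short sketch. The snippet below is a minimal illustration, not the released implementation: it assumes the retriever's top-K passage scores and the frozen language model's question-reconstruction log-likelihoods have already been computed, and shows only the loss that aligns the two distributions.

```python
import torch
import torch.nn.functional as F

def art_loss(retriever_scores: torch.Tensor,
             reconstruction_logprobs: torch.Tensor,
             temperature: float = 1.0) -> torch.Tensor:
    """KL divergence between the frozen reconstruction distribution and the
    retriever's distribution over the top-K retrieved passages."""
    # Soft targets: how well each passage lets the frozen LM reconstruct the question.
    with torch.no_grad():
        teacher = F.softmax(reconstruction_logprobs / temperature, dim=-1)
    # Retriever's distribution over the same top-K passages (dot-product scores).
    log_student = F.log_softmax(retriever_scores, dim=-1)
    return F.kl_div(log_student, teacher, reduction="batchmean")

# Toy usage: one question, K=4 retrieved passages.
retriever_scores = torch.randn(1, 4, requires_grad=True)   # question-passage similarities
reconstruction_logprobs = torch.randn(1, 4)                # log P(question | passage) from a frozen LM
loss = art_loss(retriever_scores, reconstruction_logprobs)
loss.backward()
print(loss.item())
```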
Related papers
- Pre-Training Multi-Modal Dense Retrievers for Outside-Knowledge Visual
Question Answering [16.52970318866536]
This paper studies a category of visual question answering tasks, in which accessing external knowledge is necessary for answering the questions.
A major step in developing OK-VQA systems is to retrieve relevant documents for the given multi-modal query.
We propose an automatic data generation pipeline for pre-training passage retrieval models for OK-VQA tasks.
arXiv Detail & Related papers (2023-06-28T18:06:40Z)
- PIE-QG: Paraphrased Information Extraction for Unsupervised Question
Generation from Small Corpora [4.721845865189576]
PIE-QG uses Open Information Extraction (OpenIE) to generate synthetic training questions from paraphrased passages.
Triples in the form of <subject, predicate, object> are extracted from each passage, and questions are formed from the subject (or object) and the predicate, while the object (or subject) serves as the answer.
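As a rough illustration of that triple-to-question step, the sketch below turns one OpenIE triple into synthetic (question, answer) pairs; the wh-word templates are assumptions made here for illustration, not the exact patterns used in PIE-QG.

```python
# Illustrative sketch: each OpenIE triple <subject, predicate, object> yields
# synthetic (question, answer) pairs, pairing the subject (or object) with the
# predicate in the question and using the remaining argument as the answer.
# The templates below are hypothetical, chosen only to make the idea concrete.
from typing import List, Tuple

def questions_from_triple(subj: str, pred: str, obj: str) -> List[Tuple[str, str]]:
    return [
        (f"Who or what {pred} {obj}?", subj),  # ask for the subject
        (f"{subj} {pred} what?", obj),         # ask for the object
    ]

print(questions_from_triple("Marie Curie", "discovered", "polonium"))
# [('Who or what discovered polonium?', 'Marie Curie'), ('Marie Curie discovered what?', 'polonium')]
```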
arXiv Detail & Related papers (2023-01-03T12:20:51Z)
- Retrieval as Attention: End-to-end Learning of Retrieval and Reading
within a Single Transformer [80.50327229467993]
We show that a single model trained end-to-end can achieve both competitive retrieval and QA performance.
We show that end-to-end adaptation significantly boosts its performance on out-of-domain datasets in both supervised and unsupervised settings.
arXiv Detail & Related papers (2022-12-05T04:51:21Z)
- Generate rather than Retrieve: Large Language Models are Strong Context
Generators [74.87021992611672]
We present a novel perspective for solving knowledge-intensive tasks by replacing document retrievers with large language model generators.
We call our method generate-then-read (GenRead), which first prompts a large language model to generate contextual documents based on a given question, and then reads the generated documents to produce the final answer.
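A minimal sketch of that two-step pipeline follows; `generate_contexts` and `read_answer` are hypothetical stand-ins for a prompted large language model and a reader model, not GenRead's actual interfaces.

```python
# Sketch of generate-then-read: (1) prompt an LLM to write contextual documents
# for the question, (2) read those generated documents to produce the answer.
# Both callables are placeholders supplied by the caller.
from typing import Callable, List

def generate_then_read(question: str,
                       generate_contexts: Callable[[str, int], List[str]],
                       read_answer: Callable[[str, List[str]], str],
                       num_contexts: int = 3) -> str:
    contexts = generate_contexts(question, num_contexts)   # step 1: generate
    return read_answer(question, contexts)                 # step 2: read

# Toy usage with trivial stand-ins for the two models.
answer = generate_then_read(
    "Who wrote On the Origin of Species?",
    generate_contexts=lambda q, k: [f"Generated context {i} about: {q}" for i in range(k)],
    read_answer=lambda q, ctxs: "Charles Darwin",
)
print(answer)
```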
arXiv Detail & Related papers (2022-09-21T01:30:59Z)
- Autoregressive Search Engines: Generating Substrings as Document
Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de-facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work, we propose an alternative that does not force any structure on the search space: using all n-grams in a passage as its possible identifiers.
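The sketch below is a simplified illustration of treating every n-gram of a passage as a possible identifier; the actual system constrains autoregressive decoding with a compressed substring index (an FM-index), for which a plain dictionary stands in here.

```python
# Sketch: index every n-gram of every passage, so any substring generated by an
# autoregressive model can be resolved to the passages containing it.
from collections import defaultdict
from typing import Dict, Set, Tuple

def build_ngram_index(passages: Dict[str, str], max_n: int = 3) -> Dict[Tuple[str, ...], Set[str]]:
    index: Dict[Tuple[str, ...], Set[str]] = defaultdict(set)
    for pid, text in passages.items():
        tokens = text.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                index[tuple(tokens[i:i + n])].add(pid)
    return index

passages = {
    "p1": "dense retrieval maps questions and passages to vectors",
    "p2": "autoregressive models generate substrings as identifiers",
}
index = build_ngram_index(passages)
# A generated identifier (any n-gram) maps back to the passages that contain it.
print(index[("generate", "substrings")])   # {'p2'}
```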
arXiv Detail & Related papers (2022-04-22T10:45:01Z)
- Weakly Supervised Pre-Training for Multi-Hop Retriever [23.79574380039197]
We propose a new method for weakly supervised multi-hop retriever pre-training that requires no human effort.
Our method includes 1) a pre-training task for generating vector representations of complex questions, 2) a scalable data generation method that produces the nested structure of question and sub-question as weak supervision for pre-training, and 3) a pre-training model structure based on dense encoders.
arXiv Detail & Related papers (2021-06-18T08:06:02Z)
- Distilling Knowledge from Reader to Retriever for Question Answering [16.942581590186343]
We propose a technique to learn retriever models for downstream tasks, inspired by knowledge distillation.
We evaluate our method on question answering, obtaining state-of-the-art results.
arXiv Detail & Related papers (2020-12-08T17:36:34Z)
- Open Question Answering over Tables and Text [55.8412170633547]
In open question answering (QA), the answer to a question is produced by retrieving and then analyzing documents that might contain answers to the question.
Most open QA systems have considered only retrieving information from unstructured text.
We present a new large-scale dataset Open Table-and-Text Question Answering (OTT-QA) to evaluate performance on this task.
arXiv Detail & Related papers (2020-10-20T16:48:14Z)
- Pre-training Tasks for Embedding-based Large-scale Retrieval [68.01167604281578]
We consider the large-scale query-document retrieval problem.
Given a query (e.g., a question), the goal is to return the set of relevant documents from a large document corpus.
We show that the key ingredient of learning a strong embedding-based Transformer model is the set of pre-training tasks.
arXiv Detail & Related papers (2020-02-10T16:44:00Z)
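To make the notion of a retrieval pre-training task concrete, the sketch below builds a pseudo query-document pair with the Inverse Cloze Task, one representative task from this line of work; treating ICT as the example here is an illustrative assumption.

```python
# Sketch of one representative retrieval pre-training task, the Inverse Cloze
# Task: a sentence sampled from a passage serves as a pseudo-query and the
# remaining sentences serve as its positive "document".
import random
from typing import List, Tuple

def inverse_cloze_example(sentences: List[str], rng: random.Random) -> Tuple[str, str]:
    i = rng.randrange(len(sentences))
    query = sentences[i]                                   # held-out sentence = pseudo-query
    context = " ".join(sentences[:i] + sentences[i + 1:])  # rest of passage = positive document
    return query, context

rng = random.Random(0)
passage = [
    "Dense retrievers embed queries and documents in a shared vector space.",
    "They are typically initialized from pre-trained language models.",
    "Pre-training tasks can supply cheap positive query-document pairs.",
]
query, positive_doc = inverse_cloze_example(passage, rng)
print(query)
print(positive_doc)
```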
This list is automatically generated from the titles and abstracts of the papers on this site.