Learning to Retrieve Passages without Supervision
- URL: http://arxiv.org/abs/2112.07708v1
- Date: Tue, 14 Dec 2021 19:18:08 GMT
- Title: Learning to Retrieve Passages without Supervision
- Authors: Ori Ram, Gal Shachaf, Omer Levy, Jonathan Berant, Amir Globerson
- Score: 58.31911597824848
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Dense retrievers for open-domain question answering (ODQA) have been shown to
achieve impressive performance by training on large datasets of
question-passage pairs. We investigate whether dense retrievers can be learned
in a self-supervised fashion, and applied effectively without any annotations.
We observe that existing pretrained models for retrieval struggle in this
scenario, and propose a new pretraining scheme designed for retrieval:
recurring span retrieval. We use recurring spans across passages in a document
to create pseudo examples for contrastive learning. The resulting model --
Spider -- performs surprisingly well without any examples on a wide range of
ODQA datasets, and is competitive with BM25, a strong sparse baseline. In
addition, Spider often outperforms strong baselines like DPR trained on Natural
Questions, when evaluated on questions from other datasets. Our hybrid
retriever, which combines Spider with BM25, improves over its components across
all datasets, and is often competitive with in-domain DPR models, which are
trained on tens of thousands of examples.
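The recurring span idea described above can be illustrated with a minimal sketch: find word n-grams that appear in more than one passage of the same document, then pair up two passages sharing a span as a pseudo (query, positive) example for contrastive learning. This is an assumption-laden illustration, not the paper's exact implementation — the whitespace tokenization, fixed n-gram length, and pairing heuristic here are simplifications.

```python
from collections import defaultdict

def find_recurring_spans(passages, n=4):
    """Map each word n-gram to the set of passage indices containing it,
    keeping only spans that recur in at least two distinct passages."""
    span_to_passages = defaultdict(set)
    for idx, passage in enumerate(passages):
        tokens = passage.split()
        for i in range(len(tokens) - n + 1):
            span = " ".join(tokens[i : i + n])
            span_to_passages[span].add(idx)
    return {s: ids for s, ids in span_to_passages.items() if len(ids) >= 2}

def make_pseudo_examples(passages, n=4):
    """For each recurring span, treat one passage containing it as the
    pseudo-query and another as the positive passage."""
    examples = []
    for span, ids in find_recurring_spans(passages, n).items():
        first, second = sorted(ids)[:2]
        examples.append({
            "span": span,
            "query": passages[first],
            "positive": passages[second],
        })
    return examples
```

In a real pretraining pipeline, such pairs would feed a dual-encoder trained with a contrastive loss (other passages in the batch serving as negatives); span-length and frequency filters would matter in practice.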
Related papers
- W-RAG: Weakly Supervised Dense Retrieval in RAG for Open-domain Question Answering [28.79851078451609]
Large Language Models (LLMs) often struggle to generate factual answers relying solely on their internal (parametric) knowledge.
To address this limitation, Retrieval-Augmented Generation (RAG) systems enhance LLMs by retrieving relevant information from external sources.
We propose W-RAG by utilizing the ranking capabilities of LLMs to create weakly labeled data for training dense retrievers.
arXiv Detail & Related papers (2024-08-15T22:34:44Z)
- Zero-shot Retrieval: Augmenting Pre-trained Models with Search Engines [83.65380507372483]
Large pre-trained models can dramatically reduce the amount of task-specific data required to solve a problem, but they often fail to capture domain-specific nuances out of the box.
This paper shows how to leverage recent advances in NLP and multi-modal learning to augment a pre-trained model with search engine retrieval.
arXiv Detail & Related papers (2023-11-29T05:33:28Z)
- Noisy Self-Training with Synthetic Queries for Dense Retrieval [49.49928764695172]
We introduce a novel noisy self-training framework combined with synthetic queries.
Experimental results show that our method improves consistently over existing methods.
Our method is data efficient and outperforms competitive baselines.
arXiv Detail & Related papers (2023-11-27T06:19:50Z)
- SPRINT: A Unified Toolkit for Evaluating and Demystifying Zero-shot Neural Sparse Retrieval [92.27387459751309]
We provide SPRINT, a unified Python toolkit for evaluating neural sparse retrieval.
We establish strong and reproducible zero-shot sparse retrieval baselines on the widely used BEIR benchmark.
We show that SPLADEv2 produces sparse representations with a majority of tokens outside of the original query and document.
arXiv Detail & Related papers (2023-07-19T22:48:02Z)
- Unsupervised Dense Retrieval with Relevance-Aware Contrastive Pre-Training [81.3781338418574]
We propose relevance-aware contrastive learning.
We consistently improve the SOTA unsupervised Contriever model on the BEIR and open-domain QA retrieval benchmarks.
Our method not only beats BM25 after further pre-training on the target corpus but also serves as a good few-shot learner.
arXiv Detail & Related papers (2023-06-05T18:20:27Z)
- Unified Demonstration Retriever for In-Context Learning [56.06473069923567]
Unified Demonstration Retriever (UDR) is a single model to retrieve demonstrations for a wide range of tasks.
We propose a multi-task list-wise ranking training framework, with an iterative mining strategy to find high-quality candidates.
Experiments on 30+ tasks across 13 task families and multiple data domains show that UDR significantly outperforms baselines.
arXiv Detail & Related papers (2023-05-07T16:07:11Z)
- Towards Unsupervised Dense Information Retrieval with Contrastive Learning [38.42033176712396]
We show that contrastive learning can be used to train unsupervised dense retrievers.
Our model outperforms BM25 on 11 out of 15 datasets.
arXiv Detail & Related papers (2021-12-16T18:57:37Z)
- End-to-End Training of Neural Retrievers for Open-Domain Question Answering [32.747113232867825]
It remains unclear how unsupervised and supervised methods can be used most effectively for neural retrievers.
We propose an approach of unsupervised pre-training with the Inverse Cloze Task and masked salient spans.
We also explore two approaches for end-to-end supervised training of the reader and retriever components in OpenQA models.
arXiv Detail & Related papers (2021-01-02T09:05:34Z)
- Multi-task Retrieval for Knowledge-Intensive Tasks [21.725935960568027]
We propose a multi-task trained model for neural retrieval.
Our approach not only outperforms previous methods in the few-shot setting, but also rivals specialised neural retrievers.
With the help of our retriever, we improve existing models for downstream tasks and closely match or improve the state of the art on multiple benchmarks.
arXiv Detail & Related papers (2021-01-01T00:16:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.