Low-Resource Dense Retrieval for Open-Domain Question Answering: A
Comprehensive Survey
- URL: http://arxiv.org/abs/2208.03197v1
- Date: Fri, 5 Aug 2022 14:35:03 GMT
- Title: Low-Resource Dense Retrieval for Open-Domain Question Answering: A
Comprehensive Survey
- Authors: Xiaoyu Shen, Svitlana Vakulenko, Marco del Tredici, Gianni Barlacchi,
Bill Byrne and Adrià de Gispert
- Abstract summary: We provide a structured overview of mainstream techniques for low-resource DR.
We divide the techniques into three main categories: (1) only documents are needed; (2) documents and questions are needed; and (3) documents and question-answer pairs are needed.
For every technique, we introduce its general-form algorithm, highlight its open issues, and discuss its pros and cons. Promising directions are outlined for future research.
- Score: 23.854086903936647
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Dense retrieval (DR) approaches based on powerful pre-trained language models
(PLMs) have achieved significant advances and become a key component of modern
open-domain question-answering systems. However, they require large amounts of
manual annotation to perform competitively, which is infeasible at scale. To
address this, a growing body of research has recently focused on
improving DR performance under low-resource scenarios. These works differ in
what resources they require for training and employ a diverse set of
techniques. Understanding such differences is crucial for choosing the right
technique under a specific low-resource scenario. To facilitate this
understanding, we provide a thorough structured overview of mainstream
techniques for low-resource DR. Based on their required resources, we divide
the techniques into three main categories: (1) only documents are needed; (2)
documents and questions are needed; and (3) documents and question-answer pairs
are needed. For every technique, we introduce its general-form algorithm,
highlight its open issues, and discuss its pros and cons. Promising directions are outlined
for future research.
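For readers new to DR, the mechanism shared by all three categories is a dual encoder: questions and documents are embedded independently and matched by inner product. The sketch below is a minimal illustration with a toy hashing encoder standing in for the PLM; none of the names or numbers come from the surveyed paper.

```python
# Minimal dual-encoder sketch: a toy hashing encoder stands in for the
# pre-trained language model; everything here is illustrative.
import numpy as np

def encode(texts, dim=64):
    """Stand-in for a PLM encoder: hash tokens into bag-of-words vectors."""
    out = np.zeros((len(texts), dim))
    for i, text in enumerate(texts):
        for tok in text.lower().split():
            out[i, hash(tok) % dim] += 1.0
    # L2-normalize so the inner product behaves like cosine similarity.
    return out / (np.linalg.norm(out, axis=1, keepdims=True) + 1e-9)

docs = ["dense retrieval uses learned embeddings",
        "sparse retrieval relies on exact term matching",
        "question answering extracts answers from documents"]
doc_emb = encode(docs)                    # computed once, indexed offline

query_emb = encode(["how does dense retrieval work"])
scores = query_emb @ doc_emb.T            # one inner product per document
print(docs[int(scores.argmax())])
```

Because document embeddings are question-independent, they can be indexed offline; the three resource categories above differ only in the supervision available for training the two encoders.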
Related papers
- Open Domain Multi-document Summarization: A Comprehensive Study of Model Brittleness under Retrieval [42.73076855699184]
Multi-document summarization (MDS) traditionally assumes a set of topic-related documents is provided as input.
We study the more challenging open-domain setting, in which these documents must first be retrieved, by formalizing the task and bootstrapping it using existing datasets, retrievers and summarizers.
arXiv Detail & Related papers (2022-12-20T18:41:38Z)
- Questions Are All You Need to Train a Dense Passage Retriever [123.13872383489172]
ART is a new corpus-level autoencoding approach for training dense retrieval models that does not require any labeled training data.
It uses a document-retrieval autoencoding scheme, where (1) an input question is used to retrieve a set of evidence documents, and (2) the documents are then used to compute the probability of reconstructing the original question (a toy sketch of this signal follows below).
arXiv Detail & Related papers (2022-06-21T18:16:31Z)
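As a toy illustration of the ART-style reconstruction signal above, the sketch below scores each retrieved candidate by how likely it is to regenerate the question, using a smoothed unigram model in place of the PLM question generator; all names, the vocabulary size, and the smoothing constant are assumptions, not the paper's code.

```python
# Toy version of the ART-style signal: rank candidate documents by
# log P(question | document). A smoothed unigram model stands in for the
# PLM generator.
import math
from collections import Counter

def log_p_question_given_doc(question, doc, alpha=0.1, vocab_size=1000):
    counts = Counter(doc.lower().split())
    total = sum(counts.values())
    return sum(math.log((counts[w] + alpha) / (total + alpha * vocab_size))
               for w in question.lower().split())

question = "what is dense retrieval"
candidates = ["dense retrieval encodes text into vectors",
              "the weather is sunny today"]
# Documents that reconstruct the question well receive higher soft relevance,
# which can supervise the retriever without labeled question-document pairs.
ranked = sorted(candidates,
                key=lambda d: log_p_question_given_doc(question, d),
                reverse=True)
print(ranked[0])
```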
- The Use of NLP-Based Text Representation Techniques to Support Requirement Engineering Tasks: A Systematic Mapping Review [1.5469452301122177]
The research direction has changed from the use of lexical and syntactic features to the use of advanced embedding techniques.
We identify four gaps in the existing literature, explain why they matter, and outline how future research can begin to address them.
arXiv Detail & Related papers (2022-05-17T02:47:26Z)
- A Transfer Learning Pipeline for Educational Resource Discovery with Application in Leading Paragraph Generation [71.92338855383238]
We propose a pipeline that automates web resource discovery for novel domains.
The pipeline achieves F1 scores of 0.94 and 0.82 when evaluated on two similar but novel target domains.
This is the first study that considers various web resources for survey generation.
arXiv Detail & Related papers (2022-01-07T03:35:40Z)
- Adaptive Information Seeking for Open-Domain Question Answering [61.39330982757494]
We propose a novel adaptive information-seeking strategy for open-domain question answering, namely AISO.
Following its learned policy, AISO adaptively selects a suitable retrieval action to seek the missing evidence at each step (a toy illustration follows below).
AISO outperforms all baseline methods with predefined strategies in terms of both retrieval and answer evaluations.
arXiv Detail & Related papers (2021-09-14T15:08:13Z)
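The sketch below is a hand-written caricature of the adaptive loop in the AISO entry above: a policy picks one retrieval action per step until it judges the evidence sufficient. The action set and the fixed rules are illustrative assumptions; AISO learns its policy rather than following rules like these.

```python
# Caricature of adaptive information seeking: pick a retrieval action per step.
# The fixed alternation below is a placeholder for AISO's learned policy.
def sparse_retrieve(query):      # stand-in for a term-matching retriever
    return f"sparse_evidence({query})"

def dense_retrieve(query):       # stand-in for an embedding-based retriever
    return f"dense_evidence({query})"

def seek(question, max_steps=4):
    evidence = []
    for step in range(max_steps):
        action = sparse_retrieve if step % 2 == 0 else dense_retrieve
        evidence.append(action(question))
        if len(evidence) >= 2:   # placeholder for a learned "answer now" action
            break
    return f"answer_from({evidence})"

print(seek("who proposed adaptive information seeking?"))
```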
- Low-Resource Domain Adaptation for Compositional Task-Oriented Semantic Parsing [85.35582118010608]
Task-oriented semantic parsing is a critical component of virtual assistants.
Recent advances in deep learning have enabled several approaches to successfully parse more complex queries.
We propose a novel method that outperforms a supervised neural model while using 10 times less annotated data.
arXiv Detail & Related papers (2020-10-07T17:47:53Z)
- Tradeoffs in Sentence Selection Techniques for Open-Domain Question Answering [54.541952928070344]
We describe two groups of models for sentence selection: QA-based approaches, which run a full-fledged QA system to identify answer candidates, and retrieval-based models, which find parts of each passage specifically related to each question.
We show that very lightweight QA models can do well at this task, but retrieval-based models are faster still.
arXiv Detail & Related papers (2020-09-18T23:39:15Z)
- Extracting Topics from Open Educational Resources [0.0]
We propose an OER topic extraction approach, applying text mining techniques, to generate high-quality OER metadata about topic distribution.
This is done by: 1) collecting 123 lectures from Coursera and Khan Academy in the area of data-science-related skills, 2) applying Latent Dirichlet Allocation (LDA) to the collected resources to extract the topics they cover, and 3) deriving the topic distribution of each particular OER (a minimal sketch of steps 2-3 follows below).
arXiv Detail & Related papers (2020-06-19T12:50:55Z)
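A minimal sketch of steps 2-3 above, using scikit-learn's LDA on a toy corpus; the paper's actual toolkit, preprocessing, and hyperparameters are not given in the summary, so everything below is an assumption.

```python
# Steps 2-3 in miniature: fit LDA on lecture texts, then read off each
# resource's topic distribution as metadata.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

lectures = ["linear regression and gradient descent",
            "sql joins and database indexing",
            "gradient descent for neural networks"]
X = CountVectorizer().fit_transform(lectures)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_dist = lda.fit_transform(X)   # rows: lectures, columns: topic weights
print(topic_dist.round(2))
```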
- Pre-training Tasks for Embedding-based Large-scale Retrieval [68.01167604281578]
We consider the large-scale query-document retrieval problem.
Given a query (e.g., a question), the task is to return the set of relevant documents from a large document corpus.
We show that the key ingredient for learning a strong embedding-based Transformer model is the set of pre-training tasks (one such task is sketched below).
arXiv Detail & Related papers (2020-02-10T16:44:00Z)
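One widely used pre-training task of this kind is the Inverse Cloze Task (ICT), where a sentence acts as a pseudo-query and the rest of its passage as the pseudo-relevant document. The construction below is a hedged sketch, not necessarily the paper's exact recipe.

```python
# Inverse Cloze Task pair construction (sketch): a random sentence becomes the
# pseudo-query; the remaining passage becomes its pseudo-relevant document.
import random

def ict_pair(passage_sentences, seed=0):
    rng = random.Random(seed)
    i = rng.randrange(len(passage_sentences))
    query = passage_sentences[i]
    document = " ".join(s for j, s in enumerate(passage_sentences) if j != i)
    return query, document

q, d = ict_pair(["Dense retrieval needs training data.",
                 "Pre-training tasks can substitute for labels.",
                 "ICT builds pairs from raw text."])
print(q, "->", d)
```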