Pre-Training Multi-Modal Dense Retrievers for Outside-Knowledge Visual
Question Answering
- URL: http://arxiv.org/abs/2306.16478v1
- Date: Wed, 28 Jun 2023 18:06:40 GMT
- Title: Pre-Training Multi-Modal Dense Retrievers for Outside-Knowledge Visual
Question Answering
- Authors: Alireza Salemi, Mahta Rafiee, Hamed Zamani
- Abstract summary: This paper studies a category of visual question answering tasks, in which accessing external knowledge is necessary for answering the questions.
A major step in developing OK-VQA systems is to retrieve relevant documents for the given multi-modal query.
We propose an automatic data generation pipeline for pre-training passage retrieval models for OK-VQA tasks.
- Score: 16.52970318866536
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper studies a category of visual question answering tasks, in which
accessing external knowledge is necessary for answering the questions. This
category is called outside-knowledge visual question answering (OK-VQA). A
major step in developing OK-VQA systems is to retrieve relevant documents for
the given multi-modal query. The current state-of-the-art asymmetric dense
retrieval model for this task uses an architecture with a multi-modal query
encoder and a uni-modal document encoder. Such an architecture requires a large
amount of training data for effective performance. We propose an automatic data
generation pipeline for pre-training passage retrieval models for OK-VQA tasks.
The proposed approach leads to a 26.9% Precision@5 improvement compared to the
current state-of-the-art asymmetric architecture. Additionally, the proposed
pre-training approach performs well in zero-shot retrieval scenarios.
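As a rough illustration of the asymmetric architecture and its training signal, the sketch below pairs a multi-modal query encoder (question plus image features) with a uni-modal passage encoder and trains both with an in-batch contrastive loss. All dimensions, the fusion layer, and the random stand-in features are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of an asymmetric dense retriever: a multi-modal query
# encoder (question text + image) and a uni-modal passage encoder map
# into one embedding space, trained with an in-batch contrastive loss.
# Sizes and the fusion scheme are assumptions, not the paper's setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM = 256  # shared embedding size (assumed)

class MultiModalQueryEncoder(nn.Module):
    def __init__(self, text_dim=128, image_dim=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, DIM)
        self.image_proj = nn.Linear(image_dim, DIM)
        self.fuse = nn.Linear(2 * DIM, DIM)  # simple late fusion (assumed)

    def forward(self, text_feats, image_feats):
        t = self.text_proj(text_feats)
        v = self.image_proj(image_feats)
        return F.normalize(self.fuse(torch.cat([t, v], dim=-1)), dim=-1)

class PassageEncoder(nn.Module):
    def __init__(self, text_dim=128):
        super().__init__()
        self.proj = nn.Linear(text_dim, DIM)

    def forward(self, passage_feats):
        return F.normalize(self.proj(passage_feats), dim=-1)

def in_batch_contrastive_loss(q, p, temperature=0.05):
    # q, p: [batch, DIM]; passage i is the positive for query i,
    # the other passages in the batch serve as negatives.
    logits = q @ p.T / temperature
    targets = torch.arange(q.size(0))
    return F.cross_entropy(logits, targets)

# Toy step: random tensors stand in for real backbone features.
q_enc, p_enc = MultiModalQueryEncoder(), PassageEncoder()
text, image, passage = torch.randn(8, 128), torch.randn(8, 512), torch.randn(8, 128)
loss = in_batch_contrastive_loss(q_enc(text, image), p_enc(passage))
loss.backward()
```

Precision@5, the reported metric, is then the fraction of the top-5 passages ranked by this dot-product score that are relevant to the query.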
Related papers
- End-to-end Knowledge Retrieval with Multi-modal Queries [50.01264794081951]
ReMuQ requires a system to retrieve knowledge from a large corpus by integrating contents from both text and image queries.
We introduce a retriever model, ReViz, that can directly process input text and images to retrieve relevant knowledge in an end-to-end fashion.
We demonstrate superior performance in retrieval on two datasets under zero-shot settings.
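For contrast with the asymmetric two-tower design above, here is a minimal sketch of a fused, end-to-end query encoder in the spirit of ReViz: text tokens and image patches enter a single Transformer as one sequence. The dimensions and the mean pooling are assumptions for illustration, not the paper's architecture.

```python
# Hypothetical fused multi-modal query encoder: image patch embeddings
# and text token embeddings are concatenated into one sequence and
# processed jointly; the pooled output is the query embedding.
import torch
import torch.nn as nn

class FusedQueryEncoder(nn.Module):
    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)

    def forward(self, text_tokens, image_patches):
        # text_tokens: [B, T, dim], image_patches: [B, P, dim]
        fused = torch.cat([text_tokens, image_patches], dim=1)
        hidden = self.encoder(fused)
        return hidden.mean(dim=1)  # mean-pool into one query vector

query = FusedQueryEncoder()(torch.randn(2, 12, 256), torch.randn(2, 49, 256))
print(query.shape)  # torch.Size([2, 256])
```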
arXiv Detail & Related papers (2023-06-01T08:04:12Z)
- Multimodal Inverse Cloze Task for Knowledge-based Visual Question Answering [4.114444605090133]
We present a new pre-training method, Multimodal Inverse Cloze Task, for Knowledge-based Visual Question Answering about named Entities.
KVQAE is a recently introduced task that consists in answering questions about named entities, grounded in a visual context, using a Knowledge Base.
Our method is applicable to different neural network architectures and yields 9% relative-MRR and 15% relative-F1 gains for retrieval and reading comprehension, respectively.
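A minimal sketch of how inverse-cloze-style multi-modal training pairs can be built: a sentence held out from a passage becomes the pseudo-query, paired with an image from the same document. The field names and the sentence splitter below are assumptions for illustration, not the paper's pipeline.

```python
# Construct Multimodal Inverse Cloze Task pairs: remove one sentence
# from a passage to serve as the pseudo-query, keep the rest as the
# positive context, and attach a same-document image.
import random
import re

def make_mict_pairs(documents):
    """documents: iterable of dicts like {"text": str, "image": str}."""
    pairs = []
    for doc in documents:
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", doc["text"]) if s]
        if len(sentences) < 2:
            continue
        i = random.randrange(len(sentences))
        pseudo_query = sentences[i]                  # held-out sentence
        context = " ".join(sentences[:i] + sentences[i + 1:])
        pairs.append({"query_text": pseudo_query,
                      "query_image": doc["image"],   # same-document image
                      "positive_passage": context})
    return pairs

docs = [{"text": "Basenjis rarely bark. They were bred in Africa. "
                 "They hunt by sight.", "image": "basenji.jpg"}]
print(make_mict_pairs(docs))
```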
arXiv Detail & Related papers (2023-01-11T09:16:34Z)
- Retrieval as Attention: End-to-end Learning of Retrieval and Reading within a Single Transformer [80.50327229467993]
We show that a single model trained end-to-end can achieve both competitive retrieval and QA performance.
We show that end-to-end adaptation significantly boosts its performance on out-of-domain datasets in both supervised and unsupervised settings.
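A toy, single-layer illustration of the retrieval-as-attention idea: candidate passages share one key/value sequence, and the attention mass that question tokens place on each passage segment doubles as its retrieval score. This is a sketch under assumed dimensions, not the paper's T5-based model.

```python
# Retrieval score from attention mass: concatenating the candidate
# passages makes them compete for the question's attention.
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

def passage_scores(question_tokens, passages):
    kv = torch.cat(passages, dim=1)
    _, weights = attn(question_tokens, kv, kv)  # weights: [1, Tq, sum Tp]
    scores, start = [], 0
    for p in passages:
        end = start + p.size(1)
        # total attention mass the question places on this passage
        scores.append(weights[0, :, start:end].sum().item())
        start = end
    return scores

question = torch.randn(1, 8, 64)
candidates = [torch.randn(1, n, 64) for n in (30, 50, 40)]
print(passage_scores(question, candidates))  # higher = retrieved first
```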
arXiv Detail & Related papers (2022-12-05T04:51:21Z)
- UniKGQA: Unified Retrieval and Reasoning for Solving Multi-hop Question Answering Over Knowledge Graph [89.98762327725112]
Multi-hop Question Answering over Knowledge Graph (KGQA) aims to find the answer entities that are multiple hops away from the topic entities mentioned in a natural language question.
We propose UniKGQA, a novel approach for the multi-hop KGQA task that unifies retrieval and reasoning in both model architecture and parameter learning.
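A highly simplified sketch of the underlying multi-hop pattern: starting from the topic entities, score outgoing relations against the question and follow the best-matching edges hop by hop. UniKGQA performs this matching with a single pre-trained LM for both retrieval and reasoning; the toy embeddings below are placeholders standing in for that model.

```python
# Toy multi-hop traversal: at each hop, relations leaving the frontier
# are scored against the question and the best tails become candidates.
import torch
import torch.nn.functional as F

triples = [("basenji", "breed_of", "dog"), ("dog", "domesticated_from", "wolf")]
emb = {name: torch.randn(16) for name in
       {"breed_of", "domesticated_from", "question"}}  # placeholder vectors

def hop(frontier, question_vec):
    best = {}
    for head, rel, tail in triples:
        if head in frontier:
            score = F.cosine_similarity(emb[rel], question_vec, dim=0).item()
            if score > best.get(tail, float("-inf")):
                best[tail] = score
    return set(best)

frontier = {"basenji"}
for _ in range(2):                       # two hops
    frontier = hop(frontier, emb["question"]) or frontier
print(frontier)                          # candidate answer entities
```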
arXiv Detail & Related papers (2022-12-02T04:08:09Z)
- Retrieval Augmented Visual Question Answering with Outside Knowledge [14.371342370460685]
Outside-Knowledge Visual Question Answering (OK-VQA) is a challenging VQA task that requires retrieval of external knowledge to answer questions about images.
Recent OK-VQA systems use Dense Passage Retrieval (DPR) to retrieve documents from external knowledge bases, such as Wikipedia, but with DPR trained separately from answer generation.
We propose a joint training scheme which includes differentiable DPR integrated with answer generation so that the system can be trained in an end-to-end fashion.
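A minimal sketch of the differentiable-retrieval idea: retrieval scores become a softmax distribution over the retrieved passages, and the answer likelihood is marginalized over that distribution so gradients reach the retriever. The shapes and the stand-in generator log-probabilities are assumptions, not the paper's exact system.

```python
# End-to-end retriever + generator training via marginalization:
# log p(answer) = log sum_k p(passage_k) * p(answer | passage_k).
import torch
import torch.nn.functional as F

B, K = 1, 4                  # batch size, retrieved passages (assumed)
retrieval_logits = torch.randn(B, K, requires_grad=True)
# stand-in for log p(answer | question, passage_k) from a generator:
answer_logprob = torch.randn(B, K, requires_grad=True)

log_p_retrieve = F.log_softmax(retrieval_logits, dim=-1)
marginal = torch.logsumexp(log_p_retrieve + answer_logprob, dim=-1)
loss = -marginal.mean()
loss.backward()              # gradient also flows into the retriever
print(retrieval_logits.grad is not None)  # True: end-to-end trainable
```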
arXiv Detail & Related papers (2022-10-07T20:35:58Z)
- Generate rather than Retrieve: Large Language Models are Strong Context Generators [74.87021992611672]
We present a novel perspective for solving knowledge-intensive tasks by replacing document retrievers with large language model generators.
We call our method generate-then-read (GenRead): it first prompts a large language model to generate contextual documents based on a given question, and then reads the generated documents to produce the final answer.
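A skeletal sketch of the generate-then-read pattern follows; `generate` is a placeholder for any LLM call (an assumption here, not a specific API), and the prompts are illustrative.

```python
# Generate-then-read: ask an LLM for a background document, then answer
# the question from that generated context instead of retrieved text.
def generate(prompt: str) -> str:
    """Stand-in for an LLM call, e.g. an API or a local model."""
    raise NotImplementedError("plug in your LLM of choice")

def generate_then_read(question: str) -> str:
    context = generate(
        f"Generate a background document to answer the question:\n{question}"
    )
    return generate(
        f"Refer to the passage below and answer the question.\n"
        f"Passage: {context}\nQuestion: {question}"
    )
```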
arXiv Detail & Related papers (2022-09-21T01:30:59Z)
- Questions Are All You Need to Train a Dense Passage Retriever [123.13872383489172]
ART is a new corpus-level autoencoding approach for training dense retrieval models that does not require any labeled training data.
It uses a new document-retrieval autoencoding scheme, where (1) an input question is used to retrieve a set of evidence documents, and (2) the documents are then used to compute the probability of reconstructing the original question.
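A rough sketch of that autoencoding signal: passages retrieved for a question are re-scored by how well they "explain" the question, and this distribution supervises the retriever through a KL loss. The reconstruction scorer below is a toy stand-in for ART's frozen language-model scorer.

```python
# ART-style self-supervision: soft relevance labels come from
# log p(question | passage), no human labels required.
import torch
import torch.nn.functional as F

def question_reconstruction_logprob(question, passage):
    """Toy stand-in for log p(question | passage) from a frozen LM."""
    return -float(len(set(question.split()) - set(passage.split())))

question = "why do basenjis rarely bark"
passages = ["basenjis rarely bark because of their larynx shape",
            "the eiffel tower was completed in 1889"]

retriever_scores = torch.randn(len(passages), requires_grad=True)
soft_labels = F.softmax(torch.tensor(
    [question_reconstruction_logprob(question, p) for p in passages]), dim=0)
loss = F.kl_div(F.log_softmax(retriever_scores, dim=0), soft_labels,
                reduction="batchmean")
loss.backward()  # pulls the retriever toward the LM's relevance estimate
```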
arXiv Detail & Related papers (2022-06-21T18:16:31Z)
- Tradeoffs in Sentence Selection Techniques for Open-Domain Question Answering [54.541952928070344]
We describe two groups of models for sentence selection: QA-based approaches, which run a full-fledged QA system to identify answer candidates, and retrieval-based models, which find parts of each passage specifically related to each question.
We show that very lightweight QA models can do well at this task, but retrieval-based models are faster still.
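A minimal retrieval-based sentence selector of the lightweight kind compared in the paper: each sentence is scored by bag-of-words cosine similarity with the question. The plain term-overlap scorer is an illustrative simplification, not the paper's models.

```python
# Fast retrieval-based sentence selection: rank sentences by cosine
# similarity over word counts and keep the top k for the QA reader.
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def select_sentences(question, sentences, k=2):
    q = Counter(question.lower().split())
    return sorted(sentences,
                  key=lambda s: cosine(q, Counter(s.lower().split())),
                  reverse=True)[:k]

print(select_sentences(
    "when was the eiffel tower built",
    ["The Eiffel Tower was built between 1887 and 1889.",
     "It is located in Paris.",
     "Gustave Eiffel's company designed the tower."]))
```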
arXiv Detail & Related papers (2020-09-18T23:39:15Z)