A Symmetric Dual Encoding Dense Retrieval Framework for
Knowledge-Intensive Visual Question Answering
- URL: http://arxiv.org/abs/2304.13649v1
- Date: Wed, 26 Apr 2023 16:14:39 GMT
- Authors: Alireza Salemi, Juan Altmayer Pizzorno, Hamed Zamani
- Abstract summary: Knowledge-Intensive Visual Question Answering (KI-VQA) refers to answering a question about an image whose answer does not lie in the image.
This paper presents a new pipeline for KI-VQA tasks, consisting of a retriever and a reader.
- Score: 16.52970318866536
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Knowledge-Intensive Visual Question Answering (KI-VQA) refers to answering a
question about an image whose answer does not lie in the image. This paper
presents a new pipeline for KI-VQA tasks, consisting of a retriever and a
reader. First, we introduce DEDR, a symmetric dual encoding dense retrieval
framework in which documents and queries are encoded into a shared embedding
space using uni-modal (textual) and multi-modal encoders. We introduce an
iterative knowledge distillation approach that bridges the gap between the
representation spaces in these two encoders. Extensive evaluation on two
well-established KI-VQA datasets, i.e., OK-VQA and FVQA, suggests that DEDR
outperforms state-of-the-art baselines by 11.6% and 30.9% on OK-VQA and FVQA,
respectively. Utilizing the passages retrieved by DEDR, we further introduce
MM-FiD, an encoder-decoder multi-modal fusion-in-decoder model, for generating
a textual answer for KI-VQA tasks. MM-FiD encodes the question, the image, and
each retrieved passage separately and uses all passages jointly in its decoder.
Compared to competitive baselines in the literature, this approach leads to
5.5% and 8.5% improvements in terms of question answering accuracy on OK-VQA
and FVQA, respectively.
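The retriever and reader described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy vectors, the dot-product scoring, the KL-based distillation loss, and every function name below are assumptions chosen for illustration, and the image input is left out entirely. The sketch shows (1) ranking passages in a shared embedding space, (2) one possible way to measure the disagreement between a textual and a multi-modal encoder that distillation would reduce, and (3) the fusion-in-decoder idea of encoding each passage separately and handing the concatenation to the decoder.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rank_passages(query_vec, passage_vecs):
    """Score passages by dot product and return (ranked indices, scores)."""
    scores = [dot(query_vec, p) for p in passage_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores

def distillation_loss(teacher_scores, student_scores):
    """KL(teacher || student) over softmax-normalised retrieval scores.
    Assumption: the paper's exact distillation objective may differ."""
    t = softmax(teacher_scores)
    s = softmax(student_scores)
    return sum(ti * math.log(ti / si) for ti, si in zip(t, s))

def encode_separately(question, passages, encode):
    """Encode each (question, passage) pair on its own, FiD-style."""
    return [encode(question + " " + p) for p in passages]

def fuse_for_decoder(encoded_passages):
    """Concatenate per-passage token vectors into one decoder input."""
    return [tok for seq in encoded_passages for tok in seq]

# Hypothetical embeddings: three passages in a shared 2-d space, and the
# same query as seen by a textual and by a multi-modal encoder.
passages = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query_text = [0.9, 0.1]
query_mm = [0.6, 0.5]

order_text, scores_text = rank_passages(query_text, passages)
order_mm, scores_mm = rank_passages(query_mm, passages)
loss = distillation_loss(scores_text, scores_mm)

# Hypothetical stand-in encoder: one 1-d "token vector" per word.
toy_encode = lambda text: [[float(len(w))] for w in text.split()]
encoded = encode_separately("what is this", ["a red bus", "a dog"], toy_encode)
fused = fuse_for_decoder(encoded)
```

In this toy setup the two encoders rank the passages differently, so the loss is positive. Training one encoder to match the other's score distribution would push it toward zero, which is the intuition behind bridging the two representation spaces.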
Related papers
- FastFiD: Improve Inference Efficiency of Open Domain Question Answering via Sentence Selection [61.9638234358049]
FastFiD is a novel approach that executes sentence selection on encoded passages.
This aids in retaining valuable sentences while reducing the context length required for generating answers.
(2024-08-12T17:50:02Z)
- Multiple-Question Multiple-Answer Text-VQA [19.228969692887603]
Multiple-Question Multiple-Answer (MQMA) is a novel approach to text-VQA in encoder-decoder transformer models.
MQMA takes multiple questions and content as input at the encoder and predicts multiple answers at the decoder in an auto-regressive manner.
We propose a novel MQMA denoising pre-training task which is designed to teach the model to align and delineate multiple questions and content with associated answers.
(2023-11-15T01:00:02Z)
- Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering [56.96857992123026]
Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to utilize knowledge from external knowledge bases to answer visually-grounded questions.
This paper proposes Fine-grained Late-interaction Multi-modal Retrieval (FLMR) which significantly improves knowledge retrieval in RA-VQA.
(2023-09-29T10:54:10Z)
- Pre-Training Multi-Modal Dense Retrievers for Outside-Knowledge Visual Question Answering [16.52970318866536]
This paper studies a category of visual question answering tasks, in which accessing external knowledge is necessary for answering the questions.
A major step in developing OK-VQA systems is to retrieve relevant documents for the given multi-modal query.
We propose an automatic data generation pipeline for pre-training passage retrieval models for OK-VQA tasks.
(2023-06-28T18:06:40Z)
- Exploring Dual Encoder Architectures for Question Answering [17.59582094233306]
Dual encoders have been used for question-answering (QA) and information retrieval (IR) tasks with good results.
There are two major types of dual encoders: Siamese Dual Encoders (SDE) and Asymmetric Dual Encoders (ADE).
(2022-04-14T17:21:14Z)
- KG-FiD: Infusing Knowledge Graph in Fusion-in-Decoder for Open-Domain Question Answering [68.00631278030627]
We propose a novel method KG-FiD, which filters noisy passages by leveraging the structural relationship among the retrieved passages with a knowledge graph.
We show that KG-FiD can improve vanilla FiD by up to 1.5% on answer exact match score and achieve performance comparable to FiD at only 40% of the computation cost.
(2021-10-08T18:39:59Z)
- Question Answering Infused Pre-training of General-Purpose Contextualized Representations [70.62967781515127]
We propose a pre-training objective based on question answering (QA) for learning general-purpose contextual representations.
We accomplish this goal by training a bi-encoder QA model, which independently encodes passages and questions, to match the predictions of a more accurate cross-encoder model.
We show large improvements over both RoBERTa-large and previous state-of-the-art results on zero-shot and few-shot paraphrase detection.
(2021-06-15T14:45:15Z)
- Generating Diverse and Consistent QA pairs from Contexts with Information-Maximizing Hierarchical Conditional VAEs [62.71505254770827]
We propose a hierarchical conditional variational autoencoder (HCVAE) for generating QA pairs given unstructured texts as contexts.
Our model obtains impressive performance gains over all baselines on both tasks, using only a fraction of data for training.
(2020-05-28T08:26:06Z)
- Differentiable Reasoning over a Virtual Knowledge Base [156.94984221342716]
We consider the task of answering complex multi-hop questions using a corpus as a virtual knowledge base (KB).
In particular, we describe a neural module, DrKIT, that traverses textual data like a KB, softly following paths of relations between mentions of entities in the corpus.
DrKIT is very efficient, processing 10-100x more queries per second than existing multi-hop systems.
(2020-02-25T03:13:32Z)