Answer Mining from a Pool of Images: Towards Retrieval-Based Visual
Question Answering
- URL: http://arxiv.org/abs/2306.16713v1
- Date: Thu, 29 Jun 2023 06:22:43 GMT
- Title: Answer Mining from a Pool of Images: Towards Retrieval-Based Visual
Question Answering
- Authors: Abhirama Subramanyam Penamakuri, Manish Gupta, Mithun Das Gupta, Anand
Mishra
- Abstract summary: We study visual question answering in a setting where the answer has to be mined from a pool of relevant and irrelevant images given as a context.
We propose a unified Multi Image BART (MI-BART) that takes a question and the images retrieved by our relevance encoder and generates free-form, fluent answers.
Our proposed framework achieves an accuracy of 76.5% and a fluency of 79.3% on the proposed dataset, namely RETVQA.
- Score: 7.3532068640624395
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We study visual question answering in a setting where the answer has to be
mined from a pool of relevant and irrelevant images given as a context. For
such a setting, a model must first retrieve relevant images from the pool and
answer the question from these retrieved images. We refer to this problem as
retrieval-based visual question answering (or RETVQA in short). RETVQA is
distinctly different from, and more challenging than, the traditionally studied
Visual Question Answering (VQA), where a given question has to be answered with
a single relevant image in context. Towards solving the RETVQA task, we propose
a unified Multi Image BART (MI-BART) that takes a question and the images
retrieved by our relevance encoder and generates free-form, fluent answers. Further, we
introduce the largest dataset in this space, namely RETVQA, which has the
following salient features: multi-image and retrieval requirement for VQA,
metadata-independent questions over a pool of heterogeneous images, expecting a
mix of classification-oriented and open-ended generative answers. Our proposed
framework achieves an accuracy of 76.5% and a fluency of 79.3% on the proposed
RETVQA dataset, and it also outperforms state-of-the-art methods by 4.9% and
11.8% in accuracy and fluency, respectively, on the image segment of the
publicly available WebQA dataset.
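The description above is a two-stage retrieve-then-generate pipeline: a relevance
encoder first selects the question-relevant images from the pool, and MI-BART then
conditions on the question together with the retained images to produce a fluent
answer. The sketch below illustrates only that control flow; the cosine-similarity
scorer, the placeholder 512-dimensional embeddings, and the dummy generator are
illustrative stand-ins, not the paper's relevance encoder or MI-BART.

```python
# Hypothetical sketch of the retrieve-then-generate control flow described above.
# The cosine-similarity scorer and the dummy generator are placeholders, not the
# paper's relevance encoder or MI-BART.
import torch
import torch.nn.functional as F


def retrieve_relevant(question_emb: torch.Tensor,
                      image_embs: torch.Tensor,
                      k: int = 2) -> torch.Tensor:
    """Rank pool images by cosine similarity to the question and keep the top-k."""
    sims = F.cosine_similarity(image_embs, question_emb.unsqueeze(0), dim=-1)
    return sims.topk(k).indices


def answer_from_pool(question_emb: torch.Tensor,
                     image_embs: torch.Tensor,
                     generator,
                     k: int = 2) -> str:
    """Retrieve relevant images, then condition a generator on question + images."""
    top_idx = retrieve_relevant(question_emb, image_embs, k)
    context = torch.cat([question_emb.unsqueeze(0), image_embs[top_idx]], dim=0)
    return generator(context)


if __name__ == "__main__":
    torch.manual_seed(0)
    question = torch.randn(512)        # placeholder question embedding
    pool = torch.randn(10, 512)        # placeholder pool of 10 image embeddings
    dummy_generator = lambda ctx: f"free-form answer conditioned on {ctx.shape[0]} inputs"
    print(answer_from_pool(question, pool, dummy_generator, k=3))
```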
Related papers
- VQA4CIR: Boosting Composed Image Retrieval with Visual Question
Answering [68.47402250389685]
This work provides a Visual Question Answering (VQA) perspective to boost the performance of CIR.
The resulting VQA4CIR is a post-processing approach and can be directly plugged into existing CIR methods.
Experimental results show that our proposed method outperforms state-of-the-art CIR methods on the CIRR and Fashion-IQ datasets.
arXiv Detail & Related papers (2023-12-19T15:56:08Z)
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- Open-Set Knowledge-Based Visual Question Answering with Inference Paths [79.55742631375063]
The purpose of Knowledge-Based Visual Question Answering (KB-VQA) is to provide a correct answer to the question with the aid of external knowledge bases.
We propose a new retriever-ranker paradigm for KB-VQA, Graph pATH rankER (GATHER for brevity).
Specifically, it comprises graph construction, pruning, and path-level ranking, which not only retrieves accurate answers but also provides inference paths that explain the reasoning process.
arXiv Detail & Related papers (2023-10-12T09:12:50Z)
- Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering [56.96857992123026]
Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to utilize knowledge from external knowledge bases to answer visually-grounded questions.
This paper proposes Fine-grained Late-interaction Multi-modal Retrieval (FLMR), which significantly improves knowledge retrieval in RA-VQA; a minimal sketch of the late-interaction scoring idea appears after this list.
arXiv Detail & Related papers (2023-09-29T10:54:10Z)
- ChiQA: A Large Scale Image-based Real-World Question Answering Dataset for Multi-Modal Understanding [42.5118058527339]
ChiQA contains more than 40K questions and more than 200K question-image pairs.
ChiQA requires a deep understanding of both language and vision, including grounding, comparisons, and reading.
We evaluate several state-of-the-art visual-language models such as ALBEF, demonstrating that there is still large room for improvement on ChiQA.
arXiv Detail & Related papers (2022-08-05T07:55:28Z)
- Answer-checking in Context: A Multi-modal Fully Attention Network for Visual Question Answering [8.582218033859087]
We propose a fully attention-based Visual Question Answering architecture.
An answer-checking module is proposed to perform unified attention over the joint answer, question, and image representation.
Our model achieves state-of-the-art accuracy of 71.57% on the VQA-v2.0 test-standard split while using fewer parameters.
arXiv Detail & Related papers (2020-10-17T03:37:16Z)
- Visual Question Answering on Image Sets [70.4472272672716]
We introduce the task of Image-Set Visual Question Answering (ISVQA), which generalizes the commonly studied single-image VQA problem to multi-image settings.
Taking a natural language question and a set of images as input, it aims to answer the question based on the content of the images.
The questions can be about objects and relationships in one or more images or about the entire scene depicted by the image set.
arXiv Detail & Related papers (2020-08-27T08:03:32Z)
- REXUP: I REason, I EXtract, I UPdate with Structured Compositional Reasoning for Visual Question Answering [4.02726934790798]
We propose a deep reasoning VQA model with explicit visual structure-aware textual information.
The REXUP network consists of two branches, one image-object-oriented and one scene-graph-oriented, which work jointly with a super-diagonal fusion compositional attention network.
Our best model significantly outperforms the previous state-of-the-art, delivering 92.7% on the validation set and 73.1% on the test-dev set.
arXiv Detail & Related papers (2020-07-27T00:54:50Z)
- C3VQG: Category Consistent Cyclic Visual Question Generation [51.339348810676896]
Visual Question Generation (VQG) is the task of generating natural questions based on an image.
In this paper, we try to exploit the different visual cues and concepts in an image to generate questions using a variational autoencoder (VAE) without ground-truth answers.
Our approach addresses two major shortcomings of existing VQG systems: (i) it minimizes the level of supervision and (ii) it replaces generic questions with category-relevant generations.
arXiv Detail & Related papers (2020-05-15T20:25:03Z)
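On the FLMR entry above: late-interaction retrieval is commonly realised as
ColBERT-style MaxSim scoring, in which every query token embedding is matched to
its most similar candidate token embedding and the per-token maxima are summed.
The snippet below is a minimal sketch of that generic scoring rule, assuming
token embeddings are already computed and L2-normalised; it is not FLMR's actual
implementation.

```python
# Minimal sketch of late-interaction (MaxSim) scoring in the ColBERT style;
# a generic illustration, not FLMR's actual implementation.
import torch
import torch.nn.functional as F


def late_interaction_score(query_tokens: torch.Tensor,
                           doc_tokens: torch.Tensor) -> torch.Tensor:
    """Sum, over query tokens, of each token's best similarity to any doc token.

    query_tokens: (num_query_tokens, dim), assumed L2-normalised
    doc_tokens:   (num_doc_tokens, dim), assumed L2-normalised
    """
    sim = query_tokens @ doc_tokens.T        # (num_query_tokens, num_doc_tokens)
    return sim.max(dim=-1).values.sum()      # MaxSim per query token, then sum


if __name__ == "__main__":
    torch.manual_seed(0)
    query = F.normalize(torch.randn(8, 128), dim=-1)                 # query tokens
    candidates = [F.normalize(torch.randn(20, 128), dim=-1) for _ in range(3)]
    scores = [float(late_interaction_score(query, c)) for c in candidates]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    print(f"best candidate: {best}, scores: {[round(s, 3) for s in scores]}")
```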