Visual Question Answering on Image Sets
- URL: http://arxiv.org/abs/2008.11976v1
- Date: Thu, 27 Aug 2020 08:03:32 GMT
- Title: Visual Question Answering on Image Sets
- Authors: Ankan Bansal, Yuting Zhang, Rama Chellappa
- Abstract summary: We introduce the task of Image-Set Visual Question Answering (ISVQA), which generalizes the commonly studied single-image VQA problem to multi-image settings.
Taking a natural language question and a set of images as input, it aims to answer the question based on the content of the images.
The questions can be about objects and relationships in one or more images or about the entire scene depicted by the image set.
- Score: 70.4472272672716
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce the task of Image-Set Visual Question Answering (ISVQA), which
generalizes the commonly studied single-image VQA problem to multi-image
settings. Taking a natural language question and a set of images as input, it
aims to answer the question based on the content of the images. The questions
can be about objects and relationships in one or more images or about the
entire scene depicted by the image set. To enable research in this new topic,
we introduce two ISVQA datasets - indoor and outdoor scenes. They simulate the
real-world scenarios of indoor image collections and multiple car-mounted
cameras, respectively. The indoor-scene dataset contains 91,479 human annotated
questions for 48,138 image sets, and the outdoor-scene dataset has 49,617
questions for 12,746 image sets. We analyze the properties of the two datasets,
including question-and-answer distributions, types of questions, biases in the
datasets, and question-image dependencies. We also build new baseline models to
investigate new research challenges in ISVQA.
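To make the task definition above concrete, the following is a minimal sketch of how an ISVQA example (a natural language question paired with a set of images and a single ground-truth answer) might be represented in code. The class name, field names, and the trivial `answer_isvqa` placeholder are illustrative assumptions only; they do not correspond to the authors' released annotation format or baseline models.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ISVQAExample:
    """One hypothetical ISVQA item: a question paired with a *set* of images,
    not a single image. Field names are assumptions for illustration."""
    question: str            # natural-language question about the image set
    image_paths: List[str]   # all images in the set (e.g., multiple indoor views
                             # or frames from several car-mounted cameras)
    answer: str              # ground-truth answer, typically a short phrase


def answer_isvqa(example: ISVQAExample) -> str:
    """Placeholder baseline: a real ISVQA model would fuse visual features from
    every image in the set before answering; here we just return a constant."""
    return "unknown"


if __name__ == "__main__":
    ex = ISVQAExample(
        question="How many chairs are visible across the room?",
        image_paths=["view_front.jpg", "view_left.jpg", "view_right.jpg"],
        answer="4",
    )
    print(answer_isvqa(ex))
```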
Related papers
- Ask Questions with Double Hints: Visual Question Generation with Answer-awareness and Region-reference [107.53380946417003]
We propose a novel learning paradigm to generate visual questions with answer-awareness and region-reference.
We develop a simple methodology to self-learn the visual hints without introducing any additional human annotations.
arXiv Detail & Related papers (2024-07-06T15:07:32Z)
- StackOverflowVQA: Stack Overflow Visual Question Answering Dataset [0.04096453902709291]
This work focuses on the questions which need the understanding of images in addition to the question itself.
We introduce the StackOverflowVQA dataset, which includes questions from StackOverflow that have one or more accompanying images.
arXiv Detail & Related papers (2024-05-17T12:30:23Z)
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- Toloka Visual Question Answering Benchmark [7.71562336736357]
Toloka Visual Question Answering is a new crowdsourced dataset for comparing the performance of machine learning systems against human-level expertise on the grounding visual question answering task.
Our dataset contains 45,199 pairs of images and questions in English, provided with ground truth bounding boxes, split into train and two test subsets.
arXiv Detail & Related papers (2023-09-28T15:18:35Z)
- Answer Mining from a Pool of Images: Towards Retrieval-Based Visual Question Answering [7.3532068640624395]
We study visual question answering in a setting where the answer has to be mined from a pool of relevant and irrelevant images given as a context.
We propose a unified Multi Image BART (MI-BART) that takes a question and the images retrieved by our relevance encoder and generates free-form, fluent answers.
Our proposed framework achieves an accuracy of 76.5% and a fluency of 79.3% on the proposed dataset, namely RETVQA.
arXiv Detail & Related papers (2023-06-29T06:22:43Z)
- NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario [77.14723238359318]
NuScenes-QA is the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs.
We leverage existing 3D detection annotations to generate scene graphs and design question templates manually.
We develop a series of baselines that employ advanced 3D detection and VQA techniques.
arXiv Detail & Related papers (2023-05-24T07:40:50Z)
- ChiQA: A Large Scale Image-based Real-World Question Answering Dataset for Multi-Modal Understanding [42.5118058527339]
ChiQA contains more than 40K questions and more than 200K question-image pairs.
ChiQA requires a deep understanding of both language and vision, including grounding, comparisons, and reading.
We evaluate several state-of-the-art visual-language models such as ALBEF, demonstrating that there is still large room for improvement on ChiQA.
arXiv Detail & Related papers (2022-08-05T07:55:28Z)
- Grounding Answers for Visual Questions Asked by Visually Impaired People [16.978747012406266]
VizWiz-VQA-Grounding is the first dataset that visually grounds answers to visual questions asked by people with visual impairments.
We analyze our dataset and compare it with five VQA-Grounding datasets to demonstrate what makes it similar and different.
arXiv Detail & Related papers (2022-02-04T06:47:16Z)
- CapWAP: Captioning with a Purpose [56.99405135645775]
We propose a new task, Captioning with a Purpose (CapWAP).
Our goal is to develop systems that can be tailored to be useful for the information needs of an intended population.
We show that it is possible to use reinforcement learning to directly optimize for the intended information need.
arXiv Detail & Related papers (2020-11-09T09:23:55Z)
- Visual Question Answering on 360° Images [96.00046925811515]
VQA 360° is a novel task of visual question answering on 360° images.
We collect the first VQA 360 dataset, containing around 17,000 real-world image-question-answer triplets for a variety of question types.
arXiv Detail & Related papers (2020-01-10T08:18:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.