Grounding Answers for Visual Questions Asked by Visually Impaired People
- URL: http://arxiv.org/abs/2202.01993v1
- Date: Fri, 4 Feb 2022 06:47:16 GMT
- Title: Grounding Answers for Visual Questions Asked by Visually Impaired People
- Authors: Chongyan Chen, Samreen Anjum, Danna Gurari
- Abstract summary: VizWiz-VQA-Grounding is the first dataset that visually grounds answers to visual questions asked by people with visual impairments.
We analyze our dataset and compare it with five VQA-Grounding datasets to demonstrate what makes it similar and different.
- Score: 16.978747012406266
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual question answering is the task of answering questions about images. We
introduce the VizWiz-VQA-Grounding dataset, the first dataset that visually
grounds answers to visual questions asked by people with visual impairments. We
analyze our dataset and compare it with five VQA-Grounding datasets to
demonstrate what makes it similar and different. We then evaluate the SOTA VQA
and VQA-Grounding models and demonstrate that current SOTA algorithms often
fail to identify the correct visual evidence where the answer is located. These
models regularly struggle when the visual evidence occupies a small fraction of
the image, when images are of higher quality, and when visual questions require
text-recognition skills. The dataset, evaluation server, and
leaderboard all can be found at the following link:
https://vizwiz.org/tasks-and-datasets/answer-grounding-for-vqa/.
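To make the answer-grounding task concrete, below is a minimal sketch (not the official VizWiz evaluation code) of how a predicted grounding might be compared against the ground truth. It assumes both are binary masks of the same size and that overlap is scored with intersection-over-union; the function and variable names are illustrative only.
```python
# Minimal sketch, NOT the official VizWiz evaluation code.
# Assumes predicted and ground-truth answer groundings are binary
# masks of the same height/width (a common setup for this task).
import numpy as np

def grounding_iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Intersection-over-union between a predicted and a ground-truth mask."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Neither mask marks any pixels; treat as a vacuous perfect match.
        return 1.0
    intersection = np.logical_and(pred, gt).sum()
    return float(intersection / union)

# Hypothetical usage: score one example and report how small the
# ground-truth evidence region is, since the abstract notes that small
# regions are a common failure mode.
if __name__ == "__main__":
    h, w = 480, 640
    gt = np.zeros((h, w), dtype=bool)
    gt[200:220, 300:330] = True          # small ground-truth evidence region
    pred = np.zeros((h, w), dtype=bool)
    pred[190:230, 290:340] = True        # model's predicted region
    print(f"IoU: {grounding_iou(pred, gt):.3f}")
    print(f"Ground truth covers {gt.mean():.2%} of the image")
```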
Related papers
- Fully Authentic Visual Question Answering Dataset from Online Communities [72.0524198499719]
Visual Question Answering (VQA) entails answering questions about images.
We introduce the first VQA dataset in which all contents originate from an authentic use case.
We characterize this dataset and how it relates to eight mainstream VQA datasets.
arXiv Detail & Related papers (2023-11-27T06:19:00Z)
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- Making the V in Text-VQA Matter [1.2962828085662563]
Text-based VQA aims at answering questions by reading the text present in the images.
Recent studies have shown that the question-answer pairs in the dataset focus largely on the text present in the image.
Models trained on this dataset therefore predict biased answers due to a lack of understanding of the visual context.
arXiv Detail & Related papers (2023-08-01T05:28:13Z)
- NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario [77.14723238359318]
NuScenesQA is the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs.
We leverage existing 3D detection annotations to generate scene graphs and design question templates manually.
We develop a series of baselines that employ advanced 3D detection and VQA techniques.
arXiv Detail & Related papers (2023-05-24T07:40:50Z)
- What's Different between Visual Question Answering for Machine "Understanding" Versus for Accessibility? [8.373151777137792]
In visual question answering (VQA), a machine must answer a question given an associated image.
We examine discrepancies between machine "understanding" datasets (VQA-v2) and accessibility datasets (VizWiz) by evaluating a variety of VQA models.
Based on our findings, we discuss opportunities and challenges in VQA for accessibility and suggest directions for future work.
arXiv Detail & Related papers (2022-10-26T18:23:53Z)
- ChiQA: A Large Scale Image-based Real-World Question Answering Dataset for Multi-Modal Understanding [42.5118058527339]
ChiQA contains more than 40K questions and more than 200K question-image pairs.
ChiQA requires a deep understanding of both language and vision, including grounding, comparisons, and reading.
We evaluate several state-of-the-art vision-language models such as ALBEF, demonstrating that there is still large room for improvement on ChiQA.
arXiv Detail & Related papers (2022-08-05T07:55:28Z)
- A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge [39.788346536244504]
A-OKVQA is a crowdsourced dataset composed of about 25K questions.
We demonstrate the potential of this new dataset through a detailed analysis of its contents.
arXiv Detail & Related papers (2022-06-03T17:52:27Z)
- Human-Adversarial Visual Question Answering [62.30715496829321]
We benchmark state-of-the-art VQA models against human-adversarial examples.
We find that a wide range of state-of-the-art models perform poorly when evaluated on these examples.
arXiv Detail & Related papers (2021-06-04T06:25:32Z)
- CLEVR_HYP: A Challenge Dataset and Baselines for Visual Question Answering with Hypothetical Actions over Images [31.317663183139384]
We take visual understanding to a higher level where systems are challenged to answer questions that involve mentally simulating the hypothetical consequences of performing specific actions in a given scenario.
We formulate a vision-language question answering task based on the CLEVR dataset.
arXiv Detail & Related papers (2021-04-13T07:29:21Z)
- Knowledge-Routed Visual Question Reasoning: Challenges for Deep Representation Embedding [140.5911760063681]
We propose a novel dataset named Knowledge-Routed Visual Question Reasoning for VQA model evaluation.
We generate the question-answer pair based on both the Visual Genome scene graph and an external knowledge base with controlled programs.
arXiv Detail & Related papers (2020-12-14T00:33:44Z)
- Visual Question Answering on Image Sets [70.4472272672716]
We introduce the task of Image-Set Visual Question Answering (ISVQA), which generalizes the commonly studied single-image VQA problem to multi-image settings.
Taking a natural language question and a set of images as input, it aims to answer the question based on the content of the images.
The questions can be about objects and relationships in one or more images or about the entire scene depicted by the image set.
arXiv Detail & Related papers (2020-08-27T08:03:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.