What's Different between Visual Question Answering for Machine
"Understanding" Versus for Accessibility?
- URL: http://arxiv.org/abs/2210.14966v1
- Date: Wed, 26 Oct 2022 18:23:53 GMT
- Title: What's Different between Visual Question Answering for Machine
"Understanding" Versus for Accessibility?
- Authors: Yang Trista Cao, Kyle Seelman, Kyungjun Lee, Hal Daumé III
- Abstract summary: In visual question answering (VQA), a machine must answer a question given an associated image.
We evaluate a variety of VQA models to measure discrepancies between machine "understanding" datasets (VQA-v2) and accessibility datasets (VizWiz).
Based on our findings, we discuss opportunities and challenges in VQA for accessibility and suggest directions for future work.
- Score: 8.373151777137792
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In visual question answering (VQA), a machine must answer a question given an
associated image. Recently, accessibility researchers have explored whether VQA
can be deployed in a real-world setting where users with visual impairments
learn about their environment by capturing their visual surroundings and asking
questions. However, most of the existing benchmarking datasets for VQA focus on
machine "understanding" and it remains unclear how progress on those datasets
corresponds to improvements in this real-world use case. We aim to answer this
question by evaluating a variety of VQA models to measure the discrepancies
between a machine "understanding" dataset (VQA-v2) and an accessibility dataset
(VizWiz). Based on our findings, we discuss opportunities and challenges in VQA
for accessibility and suggest directions for future work.
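For context, both VQA-v2 and VizWiz score predictions with a consensus-based accuracy metric: an answer earns credit in proportion to how many of the ten human annotators gave the same answer, capped at 1. The sketch below illustrates that metric under simplifying assumptions; the function names and the basic string normalization are ours, and the official evaluation code additionally normalizes punctuation and articles and averages over subsets of the ten annotations.

# Minimal sketch (not the official evaluation script) of the consensus
# accuracy metric shared by the VQA-v2 and VizWiz leaderboards:
# a predicted answer scores min(#matching human answers / 3, 1).

def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Score one prediction against the (typically ten) human answers."""
    pred = predicted.strip().lower()
    matches = sum(1 for ans in human_answers if ans.strip().lower() == pred)
    return min(matches / 3.0, 1.0)

def dataset_accuracy(predictions: list[str], annotations: list[list[str]]) -> float:
    """Average per-question accuracy over a dataset split."""
    scores = [vqa_accuracy(p, a) for p, a in zip(predictions, annotations)]
    return sum(scores) / len(scores)

# Example: four of ten annotators agree with the model, so the score caps at 1.0.
print(vqa_accuracy("yes", ["yes", "yes", "no", "yes", "yes",
                           "unanswerable", "no", "maybe", "no", "no"]))

Comparing the same model's score under this metric on VQA-v2 versus VizWiz is the kind of cross-dataset discrepancy the paper examines.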
Related papers
- Fully Authentic Visual Question Answering Dataset from Online Communities [72.0524198499719]
Visual Question Answering (VQA) entails answering questions about images.
We introduce the first VQA dataset in which all contents originate from an authentic use case.
We characterize this dataset and how it relates to eight mainstream VQA datasets.
arXiv Detail & Related papers (2023-11-27T06:19:00Z)
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario [77.14723238359318]
NuScenes-QA is the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs.
We leverage existing 3D detection annotations to generate scene graphs and design question templates manually.
We develop a series of baselines that employ advanced 3D detection and VQA techniques.
arXiv Detail & Related papers (2023-05-24T07:40:50Z) - A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge [39.788346536244504]
A-OKVQA is a crowdsourced dataset composed of about 25K questions.
We demonstrate the potential of this new dataset through a detailed analysis of its contents.
arXiv Detail & Related papers (2022-06-03T17:52:27Z) - Grounding Answers for Visual Questions Asked by Visually Impaired People [16.978747012406266]
VizWiz-VQA-Grounding is the first dataset that visually grounds answers to visual questions asked by people with visual impairments.
We analyze our dataset and compare it with five VQA-Grounding datasets to demonstrate what makes it similar and different.
arXiv Detail & Related papers (2022-02-04T06:47:16Z) - Human-Adversarial Visual Question Answering [62.30715496829321]
We benchmark state-of-the-art VQA models against human-adversarial examples.
We find that a wide range of state-of-the-art models perform poorly when evaluated on these examples.
arXiv Detail & Related papers (2021-06-04T06:25:32Z) - Found a Reason for me? Weakly-supervised Grounded Visual Question
Answering using Capsules [85.98177341704675]
The problem of grounding VQA tasks has recently seen increased attention in the research community.
We propose a visual capsule module with a query-based selection mechanism of capsule features.
We show that integrating the proposed capsule module in existing VQA systems significantly improves their performance on the weakly supervised grounding task.
arXiv Detail & Related papers (2021-05-11T07:45:32Z) - A survey on VQA_Datasets and Approaches [0.0]
Visual question answering (VQA) is a task that combines the techniques of computer vision and natural language processing.
This paper will review and analyze existing datasets, metrics, and models proposed for the VQA task.
arXiv Detail & Related papers (2021-05-02T08:50:30Z) - Knowledge-Routed Visual Question Reasoning: Challenges for Deep
Representation Embedding [140.5911760063681]
We propose a novel dataset named Knowledge-Routed Visual Question Reasoning for VQA model evaluation.
We generate the question-answer pair based on both the Visual Genome scene graph and an external knowledge base with controlled programs.
arXiv Detail & Related papers (2020-12-14T00:33:44Z) - IQ-VQA: Intelligent Visual Question Answering [3.09911862091928]
We show that our framework improves consistency of VQA models by 15% on the rule-based dataset.
We also quantitatively show improvement in attention maps which highlights better multi-modal understanding of vision and language.
arXiv Detail & Related papers (2020-07-08T20:41:52Z)