VQA-MHUG: A Gaze Dataset to Study Multimodal Neural Attention in Visual
Question Answering
- URL: http://arxiv.org/abs/2109.13116v1
- Date: Mon, 27 Sep 2021 15:06:10 GMT
- Title: VQA-MHUG: A Gaze Dataset to Study Multimodal Neural Attention in Visual
Question Answering
- Authors: Ekta Sood, Fabian Kögel, Florian Strohm, Prajit Dhar, Andreas
Bulling
- Abstract summary: We present VQA-MHUG - a novel dataset of multimodal human gaze on both images and questions during visual question answering (VQA).
We use our dataset to analyze the similarity between human and neural attentive strategies learned by five state-of-the-art VQA models.
- Score: 15.017443876780286
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We present VQA-MHUG - a novel 49-participant dataset of multimodal human gaze
on both images and questions during visual question answering (VQA) collected
using a high-speed eye tracker. We use our dataset to analyze the similarity
between human and neural attentive strategies learned by five state-of-the-art
VQA models: Modular Co-Attention Network (MCAN) with either grid or region
features, Pythia, Bilinear Attention Network (BAN), and the Multimodal
Factorized Bilinear Pooling Network (MFB). While prior work has focused on
studying the image modality, our analyses show - for the first time - that for
all models, higher correlation with human attention on text is a significant
predictor of VQA performance. This finding points at a potential for improving
VQA performance and, at the same time, calls for further research on neural
text attention mechanisms and their integration into architectures for vision
and language tasks, including but potentially also beyond VQA.
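A minimal sketch of the kind of comparison the abstract describes: correlating a human gaze distribution over question tokens with a model's text-attention weights over the same tokens. The function and variable names, the use of Spearman's rank correlation, and the toy values are illustrative assumptions, not the paper's exact protocol.

```python
# Hypothetical sketch: compare human gaze with model text attention via rank correlation.
import numpy as np
from scipy.stats import spearmanr

def attention_correlation(human_gaze, model_attention):
    """Spearman rank correlation between a human gaze distribution and model
    attention weights over the same question tokens (both 1-D arrays)."""
    rho, _ = spearmanr(np.asarray(human_gaze, dtype=float),
                       np.asarray(model_attention, dtype=float))
    return rho

# Toy usage: fixation durations per token vs. softmax text attention from a VQA model.
gaze_per_token = [120.0, 40.0, 310.0, 15.0, 90.0]   # fixation time in ms, hypothetical
model_attn = [0.18, 0.05, 0.55, 0.02, 0.20]          # attention weights, hypothetical
print(f"Spearman rho = {attention_correlation(gaze_per_token, model_attn):.3f}")
```

Per-question correlations of this kind could then be aggregated across a dataset and related to answer accuracy, which is the type of analysis the abstract reports for the text modality.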
Related papers
- VQA$^2$: Visual Question Answering for Video Quality Assessment [76.81110038738699]
Video Quality Assessment (VQA) is a classic field in low-level visual perception.
Recent studies in the image domain have demonstrated that Visual Question Answering (VQA) can markedly enhance low-level visual quality evaluation.
We introduce the VQA2 Instruction dataset - the first visual question answering instruction dataset that focuses on video quality assessment.
The VQA2 series models interleave visual and motion tokens to enhance the perception of spatial-temporal quality details in videos.
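As a rough illustration of the interleaving idea mentioned above (not the VQA2 implementation; the shapes and names are assumed), per-frame visual tokens and motion tokens can simply be woven into a single sequence:

```python
# Illustrative sketch only: alternate visual and motion tokens in one sequence.
import torch

def interleave_tokens(visual_tokens, motion_tokens):
    """visual_tokens, motion_tokens: (num_frames, dim) tensors -> (2*num_frames, dim)."""
    assert visual_tokens.shape == motion_tokens.shape
    num_frames, dim = visual_tokens.shape
    out = torch.empty(2 * num_frames, dim)
    out[0::2] = visual_tokens   # even positions: per-frame visual content
    out[1::2] = motion_tokens   # odd positions: motion between frames
    return out

sequence = interleave_tokens(torch.randn(8, 256), torch.randn(8, 256))
print(sequence.shape)  # torch.Size([16, 256])
```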
arXiv Detail & Related papers (2024-11-06T09:39:52Z) - ViCLEVR: A Visual Reasoning Dataset and Hybrid Multimodal Fusion Model
for Visual Question Answering in Vietnamese [1.6340299456362617]
We introduce the ViCLEVR dataset, a pioneering collection for evaluating various visual reasoning capabilities in Vietnamese.
We conduct a comprehensive analysis of contemporary visual reasoning systems, offering valuable insights into their strengths and limitations.
We present PhoVIT, a comprehensive multimodal fusion model that identifies objects in images based on the question.
arXiv Detail & Related papers (2023-10-27T10:44:50Z) - From Pixels to Objects: Cubic Visual Attention for Visual Question
Answering [132.95819467484517]
Recently, attention-based Visual Question Answering (VQA) has achieved great success by utilizing the question to target different visual areas that are related to the answer.
We propose a Cubic Visual Attention (CVA) model that applies novel channel and spatial attention to object regions to improve the VQA task.
Experimental results show that our proposed method significantly outperforms the state of the art.
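A generic sketch of question-conditioned channel and spatial attention over object region features, in the spirit of the CVA summary above; the module names, shapes, and gating choices are assumptions, not the authors' architecture.

```python
# Hypothetical sketch of channel + spatial attention over object region features.
import torch
import torch.nn as nn

class RegionChannelSpatialAttention(nn.Module):
    def __init__(self, feat_dim=2048, q_dim=512):
        super().__init__()
        self.channel_gate = nn.Linear(q_dim, feat_dim)        # question-conditioned channel weights
        self.spatial_gate = nn.Linear(feat_dim + q_dim, 1)    # per-region relevance score

    def forward(self, regions, question):
        # regions: (B, num_regions, feat_dim); question: (B, q_dim)
        chan = torch.sigmoid(self.channel_gate(question)).unsqueeze(1)   # (B, 1, feat_dim)
        regions = regions * chan                                         # channel attention
        q_exp = question.unsqueeze(1).expand(-1, regions.size(1), -1)
        scores = self.spatial_gate(torch.cat([regions, q_exp], dim=-1))  # (B, num_regions, 1)
        weights = torch.softmax(scores, dim=1)                           # spatial attention
        return (weights * regions).sum(dim=1)                            # (B, feat_dim)

attn = RegionChannelSpatialAttention()
pooled = attn(torch.randn(2, 36, 2048), torch.randn(2, 512))
print(pooled.shape)  # torch.Size([2, 2048])
```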
arXiv Detail & Related papers (2022-06-04T07:03:18Z) - Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in
Visual Question Answering [71.6781118080461]
We propose a Graph Matching Attention (GMA) network for the Visual Question Answering (VQA) task.
It not only builds a graph for the image but also constructs a graph for the question in terms of both syntactic and embedding information.
Next, we explore the intra-modality relationships by a dual-stage graph encoder and then present a bilateral cross-modality graph matching attention to infer the relationships between the image and the question.
Experiments demonstrate that our network achieves state-of-the-art performance on the GQA dataset and the VQA 2.0 dataset.
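A simplified sketch of bilateral cross-modality attention between question-graph and image-graph node features; it omits the graph construction and dual-stage encoders and is not the GMA implementation, so all shapes and projections are assumptions.

```python
# Hypothetical sketch: each modality's graph nodes attend over the other modality's nodes.
import torch
import torch.nn as nn

class CrossModalGraphAttention(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)

    def forward(self, question_nodes, image_nodes):
        # question_nodes: (B, Nq, dim); image_nodes: (B, Nv, dim)
        q = self.q_proj(question_nodes)
        v = self.v_proj(image_nodes)
        # Affinity between every question node and every image node.
        affinity = torch.bmm(q, v.transpose(1, 2)) / q.size(-1) ** 0.5          # (B, Nq, Nv)
        # Bilateral attention: question nodes attend to image nodes and vice versa.
        q_attended = torch.bmm(torch.softmax(affinity, dim=2), image_nodes)     # (B, Nq, dim)
        v_attended = torch.bmm(torch.softmax(affinity, dim=1).transpose(1, 2),
                               question_nodes)                                  # (B, Nv, dim)
        return q_attended, v_attended

gma = CrossModalGraphAttention()
qa, va = gma(torch.randn(2, 14, 512), torch.randn(2, 36, 512))
print(qa.shape, va.shape)
```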
arXiv Detail & Related papers (2021-12-14T10:01:26Z) - Achieving Human Parity on Visual Question Answering [67.22500027651509]
The Visual Question Answering (VQA) task utilizes both visual image and language analysis to answer a textual question with respect to an image.
This paper describes our recent research on AliceMind-MMU, which obtains results similar to or even slightly better than those of human beings on VQA.
This is achieved by systematically improving the VQA pipeline, including: (1) pre-training with comprehensive visual and textual feature representation; (2) effective cross-modal interaction with learning to attend; and (3) a novel knowledge mining framework with specialized expert modules for the complex VQA task.
arXiv Detail & Related papers (2021-11-17T04:25:11Z) - Multimodal Integration of Human-Like Attention in Visual Question
Answering [13.85096308757021]
We present the Multimodal Human-like Attention Network (MULAN).
MULAN is the first method for multimodal integration of human-like attention on image and text during training of VQA models.
We show that MULAN achieves a new state-of-the-art performance of 73.98% accuracy on test-std and 73.72% on test-dev.
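One common way to integrate human-like attention during training, sketched here only as an assumed recipe (not necessarily MULAN's), is to add a KL-divergence term that pulls the model's attention distribution towards a human gaze target:

```python
# Hypothetical sketch: attention supervision added to the standard VQA answer loss.
import torch
import torch.nn.functional as F

def attention_supervision_loss(model_attn_logits, human_attn, eps=1e-8):
    """model_attn_logits: (B, N) unnormalised attention scores over tokens/regions.
    human_attn: (B, N) human gaze mass over the same units."""
    log_model = F.log_softmax(model_attn_logits, dim=-1)
    human = human_attn / (human_attn.sum(dim=-1, keepdim=True) + eps)
    return F.kl_div(log_model, human, reduction="batchmean")

def total_loss(answer_logits, answer_targets, model_attn_logits, human_attn, lam=0.5):
    task = F.cross_entropy(answer_logits, answer_targets)
    attn = attention_supervision_loss(model_attn_logits, human_attn)
    return task + lam * attn   # lam is an assumed trade-off weight, not a published value
```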
arXiv Detail & Related papers (2021-09-27T15:56:54Z) - A survey on VQA_Datasets and Approaches [0.0]
Visual question answering (VQA) is a task that combines the techniques of computer vision and natural language processing.
This paper will review and analyze existing datasets, metrics, and models proposed for the VQA task.
arXiv Detail & Related papers (2021-05-02T08:50:30Z) - Probabilistic Graph Attention Network with Conditional Kernels for
Pixel-Wise Prediction [158.88345945211185]
We present a novel approach that advances the state of the art on pixel-level prediction in a fundamental aspect, i.e., structured multi-scale feature learning and fusion.
We propose a probabilistic graph attention network structure based on a novel Attention-Gated Conditional Random Fields (AG-CRFs) model for learning and fusing multi-scale representations in a principled manner.
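A heavily simplified sketch of attention-gated multi-scale fusion, meant only to convey the flavour of gated message passing between scales; it is not the AG-CRF model, and all layer choices are assumptions.

```python
# Hypothetical sketch: each scale receives messages from the others through a learned gate.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedMultiScaleFusion(nn.Module):
    def __init__(self, channels=64, num_scales=3):
        super().__init__()
        self.message = nn.ModuleList([nn.Conv2d(channels, channels, 3, padding=1)
                                      for _ in range(num_scales)])
        self.gate = nn.ModuleList([nn.Conv2d(2 * channels, 1, 1)
                                   for _ in range(num_scales)])

    def forward(self, feats):
        # feats: list of (B, C, H_s, W_s) feature maps at different scales.
        target_size = feats[0].shape[-2:]
        aligned = [F.interpolate(f, size=target_size, mode="bilinear",
                                 align_corners=False) for f in feats]
        fused = []
        for i, f_i in enumerate(aligned):
            out = f_i
            for j, f_j in enumerate(aligned):
                if i == j:
                    continue
                msg = self.message[j](f_j)
                g = torch.sigmoid(self.gate[j](torch.cat([f_i, f_j], dim=1)))  # attention gate
                out = out + g * msg
            fused.append(out)
        return fused
```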
arXiv Detail & Related papers (2021-01-08T04:14:29Z) - Structured Multimodal Attentions for TextVQA [57.71060302874151]
We propose an end-to-end structured multimodal attention (SMA) neural network, mainly to address the first two issues above.
SMA first uses a structural graph representation to encode the object-object, object-text and text-text relationships appearing in the image, and then designs a multimodal graph attention network to reason over it.
Our proposed model outperforms the SoTA models on the TextVQA dataset and on two tasks of the ST-VQA dataset, among all models except the pre-training-based TAP.
arXiv Detail & Related papers (2020-06-01T07:07:36Z) - Accuracy vs. Complexity: A Trade-off in Visual Question Answering Models [39.338304913058685]
We study the trade-off between the model complexity and the performance on the Visual Question Answering task.
We focus on the effect of "multi-modal fusion" in VQA models, which is typically the most expensive step in a VQA pipeline.
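A back-of-the-envelope illustration of why fusion dominates cost: a full bilinear interaction between a d_v-dimensional image vector and a d_q-dimensional question vector grows multiplicatively in parameters, while projected elementwise fusion grows roughly additively. The dimensions below are illustrative, not taken from the paper.

```python
# Parameter-count comparison for two fusion strategies (illustrative dimensions).
d_v, d_q, d_o = 2048, 1024, 3000

full_bilinear = d_v * d_q * d_o                 # weight tensor in R^{d_o x d_v x d_q}
projected_elementwise = (d_v + d_q) * d_o       # project both inputs, then Hadamard product

print(f"full bilinear:         {full_bilinear:,} parameters")        # ~6.3 billion
print(f"projected elementwise: {projected_elementwise:,} parameters")  # ~9.2 million
```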
arXiv Detail & Related papers (2020-01-20T11:27:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.