Multimodal Integration of Human-Like Attention in Visual Question
Answering
- URL: http://arxiv.org/abs/2109.13139v1
- Date: Mon, 27 Sep 2021 15:56:54 GMT
- Title: Multimodal Integration of Human-Like Attention in Visual Question
Answering
- Authors: Ekta Sood, Fabian Kögel, Philipp Müller, Dominike Thomas, Mihai
Bace, Andreas Bulling
- Abstract summary: We present the Multimodal Human-like Attention Network (MULAN).
MULAN is the first method for multimodal integration of human-like attention on image and text during training of VQA models.
We show that MULAN achieves a new state-of-the-art performance of 73.98% accuracy on test-std and 73.72% on test-dev.
- Score: 13.85096308757021
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Human-like attention as a supervisory signal to guide neural attention has
shown significant promise but is currently limited to uni-modal integration -
even for inherently multimodal tasks such as visual question answering (VQA).
We present the Multimodal Human-like Attention Network (MULAN) - the first
method for multimodal integration of human-like attention on image and text
during training of VQA models. MULAN integrates attention predictions from two
state-of-the-art text and image saliency models into neural self-attention
layers of a recent transformer-based VQA model. Through evaluations on the
challenging VQAv2 dataset, we show that MULAN achieves a new state-of-the-art
performance of 73.98% accuracy on test-std and 73.72% on test-dev and, at the
same time, has approximately 80% fewer trainable parameters than prior work.
Overall, our work underlines the potential of integrating multimodal human-like
and neural attention for VQA.
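To make the integration idea concrete, below is a minimal, illustrative sketch of one way predicted human attention could be injected into a transformer self-attention layer: token-level saliency scores (e.g. from a text or image saliency model) are added as a log-space bias to the attention logits before the softmax. The function name, the additive-bias formulation, and the `alpha` weight are assumptions made for illustration; this is not the actual MULAN integration mechanism described in the paper.

```python
import torch
import torch.nn.functional as F

def saliency_biased_attention(q, k, v, human_saliency, alpha=1.0, eps=1e-6):
    """Single-head self-attention with a human-saliency bias (illustrative only).

    q, k, v:        (batch, seq_len, dim) query/key/value tensors
    human_saliency: (batch, seq_len) non-negative human attention predictions,
                    e.g. the output of a text or image saliency model
    alpha:          strength of the human-attention bias (hypothetical knob)
    """
    d = q.size(-1)
    # Scaled dot-product attention logits: (batch, seq_len, seq_len)
    logits = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5
    # Bias every query's distribution toward tokens humans attend to.
    bias = torch.log(human_saliency + eps).unsqueeze(1)   # (batch, 1, seq_len)
    weights = F.softmax(logits + alpha * bias, dim=-1)
    return torch.matmul(weights, v)                        # (batch, seq_len, dim)

if __name__ == "__main__":
    B, L, D = 2, 16, 64
    q = k = v = torch.randn(B, L, D)
    saliency = torch.rand(B, L)        # stand-in for a saliency model's output
    out = saliency_biased_attention(q, k, v, saliency)
    print(out.shape)                   # torch.Size([2, 16, 64])
```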
Related papers
- Advancing Vietnamese Visual Question Answering with Transformer and Convolutional Integration [0.40964539027092917]
This study aims to bridge the gap by conducting experiments on the Vietnamese Visual Question Answering dataset.
We have developed a model that enhances image representation capabilities, thereby improving overall performance in the ViVQA system.
Our experimental findings demonstrate that our model surpasses competing baselines, achieving promising performance.
arXiv Detail & Related papers (2024-07-30T22:32:50Z)
- Opinion-Unaware Blind Image Quality Assessment using Multi-Scale Deep Feature Statistics [54.08757792080732]
We propose integrating deep features from pre-trained visual models with a statistical analysis model to achieve opinion-unaware BIQA (OU-BIQA).
Our proposed model exhibits superior consistency with human visual perception compared to state-of-the-art BIQA models.
arXiv Detail & Related papers (2024-05-29T06:09:34Z)
- Closely Interactive Human Reconstruction with Proxemics and Physics-Guided Adaption [64.07607726562841]
Existing multi-person human reconstruction approaches mainly focus on recovering accurate poses or avoiding penetration.
In this work, we tackle the task of reconstructing closely interactive humans from a monocular video.
We propose to leverage knowledge from proxemic behavior and physics to compensate for the lack of visual information.
arXiv Detail & Related papers (2024-04-17T11:55:45Z)
- Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering [7.669872220702526]
We present a novel self-supervised approach that learns unimodal and multimodal feature representations of input images and text.
The proposed approach achieves state-of-the-art (SOTA) performance on three publicly available medical VQA datasets.
arXiv Detail & Related papers (2023-07-11T15:00:11Z)
- Assessor360: Multi-sequence Network for Blind Omnidirectional Image Quality Assessment [50.82681686110528]
Blind Omnidirectional Image Quality Assessment (BOIQA) aims to objectively assess the human perceptual quality of omnidirectional images (ODIs).
The quality assessment of ODIs is severely hampered by the fact that the existing BOIQA pipeline lacks the modeling of the observer's browsing process.
We propose a novel multi-sequence network for BOIQA called Assessor360, which is derived from the realistic multi-assessor ODI quality assessment procedure.
arXiv Detail & Related papers (2023-05-18T13:55:28Z)
- Correlation Information Bottleneck: Towards Adapting Pretrained Multimodal Models for Robust Visual Question Answering [63.87200781247364]
Correlation Information Bottleneck (CIB) seeks a tradeoff between compression and redundancy in representations (the classical information-bottleneck objective it builds on is sketched after this list).
We derive a tight theoretical upper bound for the mutual information between multimodal inputs and representations.
arXiv Detail & Related papers (2022-09-14T22:04:10Z)
- MulT: An End-to-End Multitask Learning Transformer [66.52419626048115]
We propose an end-to-end Multitask Learning Transformer framework, named MulT, to simultaneously learn multiple high-level vision tasks.
Our framework encodes the input image into a shared representation and makes predictions for each vision task using task-specific transformer-based decoder heads.
arXiv Detail & Related papers (2022-05-17T13:03:18Z)
- Multimodal End-to-End Group Emotion Recognition using Cross-Modal Attention [0.0]
Classifying group-level emotions is a challenging task due to the complexity of video.
Our model achieves a best validation accuracy of 60.37%, which is approximately 8.5% higher than the VGAF dataset baseline.
arXiv Detail & Related papers (2021-11-10T19:19:26Z)
- VQA-MHUG: A Gaze Dataset to Study Multimodal Neural Attention in Visual Question Answering [15.017443876780286]
We present VQA-MHUG, a novel dataset of multimodal human gaze on both images and questions during visual question answering (VQA).
We use our dataset to analyze the similarity between human and neural attentive strategies learned by five state-of-the-art VQA models.
arXiv Detail & Related papers (2021-09-27T15:06:10Z)
- SimVLM: Simple Visual Language Model Pretraining with Weak Supervision [48.98275876458666]
We present a minimalist pretraining framework, named Simple Visual Language Model (SimVLM).
SimVLM reduces the training complexity by exploiting large-scale weak supervision.
It achieves new state-of-the-art results on a wide range of discriminative and generative vision-language benchmarks.
arXiv Detail & Related papers (2021-08-24T18:14:00Z)
- Accuracy vs. Complexity: A Trade-off in Visual Question Answering Models [39.338304913058685]
We study the trade-off between model complexity and performance on the Visual Question Answering task.
We focus on the effect of "multi-modal fusion" in VQA models, which is typically the most expensive step in a VQA pipeline.
arXiv Detail & Related papers (2020-01-20T11:27:21Z)
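As context for the Correlation Information Bottleneck entry above, the block below sketches the classical information-bottleneck objective that such methods build on. The notation (X for the multimodal input, Z for the learned representation, Y for the target answer, beta for the trade-off weight) is standard and assumed here; the correlation-specific CIB objective and the upper bound derived in that paper are not reproduced.

```latex
% Classical information-bottleneck Lagrangian (background sketch only,
% not the CIB objective from the paper): keep what predicts the answer Y,
% compress everything else about the multimodal input X.
\[
  \max_{p(z \mid x)} \; I(Z; Y) \;-\; \beta \, I(Z; X), \qquad \beta > 0 .
\]
```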