An experimental study of the vision-bottleneck in VQA
- URL: http://arxiv.org/abs/2202.06858v1
- Date: Mon, 14 Feb 2022 16:43:32 GMT
- Title: An experimental study of the vision-bottleneck in VQA
- Authors: Pierre Marza, Corentin Kervadec, Grigory Antipov, Moez Baccouche,
Christian Wolf
- Abstract summary: We study the vision-bottleneck in Visual Question Answering (VQA).
We experiment with both the quantity and quality of visual objects extracted from images.
We also study the impact of two methods of incorporating the information about objects necessary for answering a question.
- Score: 17.132865538874352
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As in many tasks combining vision and language, both modalities play a
crucial role in Visual Question Answering (VQA). To properly solve the task, a
given model should both understand the content of the proposed image and the
nature of the question. While the fusion between modalities, which is another
obviously important part of the problem, has been studied extensively, the vision
part has received less attention in recent work. Current state-of-the-art
methods for VQA mainly rely on off-the-shelf object detectors delivering a set
of object bounding boxes and embeddings, which are then combined with question
word embeddings through a reasoning module. In this paper, we propose an
in-depth study of the vision-bottleneck in VQA, experimenting with both the
quantity and quality of visual objects extracted from images. We also study the
impact of two methods of incorporating the information about objects necessary
for answering a question: directly in the reasoning module, and earlier, in the
object selection stage. This work highlights the importance of vision in the
context of VQA, and the value of tailoring the vision methods used in VQA to the
task at hand.
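To make the pipeline described in the abstract concrete, the sketch below shows the generic architecture it refers to: pre-extracted object boxes and embeddings from an off-the-shelf detector, question word embeddings, a question-guided attention step standing in for the reasoning module, and an answer classifier. This is a minimal illustrative sketch, not the authors' model; every module name, dimension, and the single-step attention fusion are assumptions chosen for brevity.

```python
# Minimal sketch (not the authors' model) of the generic VQA pipeline the
# abstract describes: an off-the-shelf detector provides object boxes and
# embeddings, which a reasoning module fuses with question word embeddings.
# All names, dimensions, and the simple attention-based fusion are assumptions.
import torch
import torch.nn as nn


class GenericVQAModel(nn.Module):
    def __init__(self, vocab_size=20000, num_answers=3129,
                 obj_dim=2048, box_dim=4, hidden=512):
        super().__init__()
        # Project detector object embeddings (plus box coordinates) to a common space.
        self.obj_proj = nn.Linear(obj_dim + box_dim, hidden)
        # Encode the question from word embeddings with a GRU.
        self.word_emb = nn.Embedding(vocab_size, 300, padding_idx=0)
        self.q_enc = nn.GRU(300, hidden, batch_first=True)
        # Question-guided attention over objects (the "reasoning" step is
        # reduced to a single soft selection for brevity).
        self.att = nn.Linear(hidden * 2, 1)
        self.classifier = nn.Sequential(
            nn.Linear(hidden * 2, hidden), nn.ReLU(),
            nn.Linear(hidden, num_answers),
        )

    def forward(self, obj_feats, obj_boxes, question_tokens):
        # obj_feats: (B, N, obj_dim), obj_boxes: (B, N, 4), question_tokens: (B, T)
        v = self.obj_proj(torch.cat([obj_feats, obj_boxes], dim=-1))   # (B, N, H)
        _, q = self.q_enc(self.word_emb(question_tokens))              # (1, B, H)
        q = q.squeeze(0)                                               # (B, H)
        q_tiled = q.unsqueeze(1).expand_as(v)                          # (B, N, H)
        scores = self.att(torch.cat([v, q_tiled], dim=-1)).softmax(1)  # (B, N, 1)
        v_att = (scores * v).sum(dim=1)                                # (B, H)
        return self.classifier(torch.cat([v_att, q], dim=-1))          # (B, num_answers)


# Example with the typical 36 detector objects per image.
model = GenericVQAModel()
logits = model(torch.randn(2, 36, 2048), torch.rand(2, 36, 4),
               torch.randint(1, 20000, (2, 14)))
print(logits.shape)  # torch.Size([2, 3129])
```

Both levers studied in the paper act on this pipeline: the number and quality of detected objects fed into `obj_feats`, and where question information is injected (in the fusion/reasoning step above, or earlier, when the detector selects objects).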
Related papers
- VQA$^2$: Visual Question Answering for Video Quality Assessment [76.81110038738699]
Video Quality Assessment originally focused on quantitative video quality scoring.
It is now evolving towards more comprehensive visual quality understanding tasks.
We introduce the first visual question answering instruction dataset that focuses entirely on video quality assessment.
We conduct extensive experiments on both video quality scoring and video quality understanding tasks.
arXiv Detail & Related papers (2024-11-06T09:39:52Z) - From Image to Language: A Critical Analysis of Visual Question Answering (VQA) Approaches, Challenges, and Opportunities [2.0681376988193843]
The work presents a survey in the domain of Visual Question Answering (VQA) that delves into the intricacies of VQA datasets and methods over the field's history.
We further generalize VQA to multimodal question answering, explore tasks related to VQA, and present a set of open problems for future investigation.
arXiv Detail & Related papers (2023-11-01T05:39:41Z) - LOIS: Looking Out of Instance Semantics for Visual Question Answering [17.076621453814926]
We propose a model framework without bounding boxes to understand the causal nexus of object semantics in images.
We implement a mutual relation attention module to model sophisticated and deeper visual semantic relations between instance objects and background information.
Our proposed attention model can further analyze salient image regions by focusing on important question-related words.
arXiv Detail & Related papers (2023-07-26T12:13:00Z) - VQA with Cascade of Self- and Co-Attention Blocks [3.0013352260516744]
This work aims to learn an improved multi-modal representation through dense interaction of visual and textual modalities.
The proposed model has an attention block containing both self-attention and co-attention on image and text (a minimal sketch of such a block is given after this list).
arXiv Detail & Related papers (2023-02-28T17:20:40Z) - REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering [75.53187719777812]
This paper revisits visual representation in knowledge-based visual question answering (VQA).
We propose a new knowledge-based VQA method REVIVE, which tries to utilize the explicit information of object regions.
We achieve new state-of-the-art performance, i.e., 58.0% accuracy, surpassing the previous state-of-the-art method by a large margin.
arXiv Detail & Related papers (2022-06-02T17:59:56Z) - Achieving Human Parity on Visual Question Answering [67.22500027651509]
The Visual Question Answering (VQA) task utilizes both visual image and language analysis to answer a textual question with respect to an image.
This paper describes our recent research on AliceMind-MMU, which obtains results similar to or even slightly better than humans on VQA.
This is achieved by systematically improving the VQA pipeline, including: (1) pre-training with comprehensive visual and textual feature representation; (2) effective cross-modal interaction with learning to attend; and (3) a novel knowledge mining framework with specialized expert modules for the complex VQA task.
arXiv Detail & Related papers (2021-11-17T04:25:11Z) - Coarse-to-Fine Reasoning for Visual Question Answering [18.535633096397397]
We present a new reasoning framework to fill the gap between visual features and semantic clues in the Visual Question Answering (VQA) task.
Our method first extracts the features and predicates from the image and question.
We then propose a new reasoning framework to effectively jointly learn these features and predicates in a coarse-to-fine manner.
arXiv Detail & Related papers (2021-10-06T06:29:52Z) - Found a Reason for me? Weakly-supervised Grounded Visual Question Answering using Capsules [85.98177341704675]
The problem of grounding VQA tasks has recently seen increased attention in the research community.
We propose a visual capsule module with a query-based selection mechanism of capsule features.
We show that integrating the proposed capsule module in existing VQA systems significantly improves their performance on the weakly supervised grounding task.
arXiv Detail & Related papers (2021-05-11T07:45:32Z) - Loss re-scaling VQA: Revisiting the Language Prior Problem from a Class-imbalance View [129.392671317356]
We propose to interpret the language prior problem in VQA from a class-imbalance view.
It explicitly reveals why the VQA model tends to produce a frequent yet obviously wrong answer.
We also justify the validity of the class imbalance interpretation scheme on other computer vision tasks, such as face recognition and image classification.
arXiv Detail & Related papers (2020-10-30T00:57:17Z) - Neuro-Symbolic Visual Reasoning: Disentangling "Visual" from "Reasoning" [49.76230210108583]
We propose a framework to isolate and evaluate the reasoning aspect of visual question answering (VQA) separately from its perception.
We also propose a novel top-down calibration technique that allows the model to answer reasoning questions even with imperfect perception.
On the challenging GQA dataset, this framework is used to perform in-depth, disentangled comparisons between well-known VQA models.
arXiv Detail & Related papers (2020-06-20T08:48:29Z)
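As an illustration of the self-/co-attention pattern mentioned in the "VQA with Cascade of Self- and Co-Attention Blocks" entry above, the sketch below shows one block that applies self-attention within each modality and co-attention across modalities. This is an assumed, minimal rendering of such a block, not that paper's actual architecture; the layer sizes, residual connections, and use of nn.MultiheadAttention are illustrative choices.

```python
# Minimal sketch (assumed, not the paper's implementation) of a block combining
# self-attention within each modality with co-attention across modalities.
import torch
import torch.nn as nn


class SelfCoAttentionBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.self_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.co_img = nn.MultiheadAttention(dim, heads, batch_first=True)  # image attends to text
        self.co_txt = nn.MultiheadAttention(dim, heads, batch_first=True)  # text attends to image
        self.norm_img = nn.LayerNorm(dim)
        self.norm_txt = nn.LayerNorm(dim)

    def forward(self, img, txt):
        # img: (B, N_objects, dim), txt: (B, N_tokens, dim)
        img = img + self.self_img(img, img, img)[0]  # self-attention over image regions
        txt = txt + self.self_txt(txt, txt, txt)[0]  # self-attention over question tokens
        img = img + self.co_img(img, txt, txt)[0]    # co-attention: image queries, text keys/values
        txt = txt + self.co_txt(txt, img, img)[0]    # co-attention: text queries, image keys/values
        return self.norm_img(img), self.norm_txt(txt)


# Blocks of this kind are typically stacked (cascaded) several times.
block = SelfCoAttentionBlock()
img, txt = block(torch.randn(2, 36, 512), torch.randn(2, 14, 512))
print(img.shape, txt.shape)  # torch.Size([2, 36, 512]) torch.Size([2, 14, 512])
```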