Coarse-to-Fine Reasoning for Visual Question Answering
- URL: http://arxiv.org/abs/2110.02526v1
- Date: Wed, 6 Oct 2021 06:29:52 GMT
- Title: Coarse-to-Fine Reasoning for Visual Question Answering
- Authors: Binh X. Nguyen, Tuong Do, Huy Tran, Erman Tjiputra, Quang D. Tran, Anh
Nguyen
- Abstract summary: We present a new reasoning framework to fill the gap between visual features and semantic clues in the Visual Question Answering (VQA) task.
Our method first extracts the features and predicates from the image and question.
We then propose a new reasoning framework that jointly learns these features and predicates in a coarse-to-fine manner.
- Score: 18.535633096397397
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Bridging the semantic gap between image and question is an important step to
improve the accuracy of the Visual Question Answering (VQA) task. However, most
of the existing VQA methods focus on attention mechanisms or visual relations
to reason about the answer, while features at different semantic levels are
not fully utilized. In this paper, we present a new reasoning framework to fill
the gap between visual features and semantic clues in the VQA task. Our method
first extracts the features and predicates from the image and question. We then
propose a new reasoning framework that jointly learns these features
and predicates in a coarse-to-fine manner. Extensive experiments
on three large-scale VQA datasets show that our proposed approach achieves
superior accuracy compared with other state-of-the-art methods. Furthermore,
our reasoning framework also provides an explainable way to understand the
decision of the deep neural network when predicting the answer.
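To make the coarse-to-fine idea concrete, below is a minimal, hypothetical sketch (not the authors' implementation; the module names, dimensions, and the two-stage attention design are illustrative assumptions): a coarse pass attends over image-level grid features using the question embedding, and a fine pass attends over object/predicate features conditioned on the coarse summary before answer classification.
```python
# Illustrative coarse-to-fine fusion sketch; all names and shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseToFineReasoner(nn.Module):
    def __init__(self, dim: int = 512, num_answers: int = 3000):
        super().__init__()
        self.coarse_att = nn.Linear(2 * dim, 1)   # scores grid regions vs. question
        self.fine_att = nn.Linear(3 * dim, 1)     # scores objects vs. question + coarse context
        self.classifier = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, num_answers)
        )

    def attend(self, scores, values):
        # softmax over the region/object axis, then a weighted sum of the features
        weights = F.softmax(scores, dim=1)
        return (weights * values).sum(dim=1)

    def forward(self, grid_feats, obj_feats, q_emb):
        # grid_feats: (B, R, D) coarse image-level features
        # obj_feats:  (B, O, D) fine object/predicate features
        # q_emb:      (B, D)    question embedding
        q_r = q_emb.unsqueeze(1).expand(-1, grid_feats.size(1), -1)
        coarse_ctx = self.attend(self.coarse_att(torch.cat([grid_feats, q_r], -1)), grid_feats)

        q_o = q_emb.unsqueeze(1).expand(-1, obj_feats.size(1), -1)
        c_o = coarse_ctx.unsqueeze(1).expand(-1, obj_feats.size(1), -1)
        fine_ctx = self.attend(self.fine_att(torch.cat([obj_feats, q_o, c_o], -1)), obj_feats)

        return self.classifier(torch.cat([coarse_ctx, fine_ctx], dim=-1))

# Example: batch of 2 images, 36 grid regions, 10 detected objects, 512-d features.
model = CoarseToFineReasoner()
logits = model(torch.randn(2, 36, 512), torch.randn(2, 10, 512), torch.randn(2, 512))
print(logits.shape)  # torch.Size([2, 3000])
```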
Related papers
- Segmentation-guided Attention for Visual Question Answering from Remote Sensing Images [1.6932802756478726]
Visual Question Answering for Remote Sensing (RSVQA) is a task that aims at answering natural language questions about the content of a remote sensing image.
We propose to embed an attention mechanism guided by segmentation into an RSVQA pipeline.
We provide a new VQA dataset that exploits very high-resolution RGB orthophotos annotated with 16 segmentation classes and question/answer pairs.
arXiv Detail & Related papers (2024-07-11T16:59:32Z) - LOIS: Looking Out of Instance Semantics for Visual Question Answering [17.076621453814926]
We propose a model framework without bounding boxes to understand the causal nexus of object semantics in images.
We implement a mutual relation attention module to model sophisticated and deeper visual semantic relations between instance objects and background information.
Our proposed attention model can further analyze salient image regions by focusing on important word-related questions.
arXiv Detail & Related papers (2023-07-26T12:13:00Z) - Dynamic Clue Bottlenecks: Towards Interpretable-by-Design Visual Question Answering [58.64831511644917]
We introduce an interpretable-by-design model that factors model decisions into intermediate human-legible explanations.
We show that our inherently interpretable system improves by 4.64% over a comparable black-box system on reasoning-focused questions.
arXiv Detail & Related papers (2023-05-24T08:33:15Z) - REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual
Question Answering [75.53187719777812]
This paper revisits visual representation in knowledge-based visual question answering (VQA).
We propose a new knowledge-based VQA method REVIVE, which tries to utilize the explicit information of object regions.
We achieve new state-of-the-art performance, i.e., 58.0% accuracy, surpassing the previous state-of-the-art method by a large margin.
arXiv Detail & Related papers (2022-06-02T17:59:56Z) - VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks
for Visual Question Answering [79.22069768972207]
We propose VQA-GNN, a new VQA model that performs bidirectional fusion between unstructured and structured multimodal knowledge to obtain unified knowledge representations.
Specifically, we inter-connect the scene graph and the concept graph through a super node that represents the QA context.
On two challenging VQA tasks, our method outperforms strong baseline VQA methods by 3.2% on VCR and 4.6% on GQA, suggesting its strength in performing concept-level reasoning.
arXiv Detail & Related papers (2022-05-23T17:55:34Z) - Loss re-scaling VQA: Revisiting the Language Prior Problem from a
Class-imbalance View [129.392671317356]
We propose to interpret the language prior problem in VQA from a class-imbalance view.
It explicitly reveals why the VQA model tends to produce a frequent yet obviously wrong answer.
We also justify the validity of the class imbalance interpretation scheme on other computer vision tasks, such as face recognition and image classification.
arXiv Detail & Related papers (2020-10-30T00:57:17Z) - Cross-modal Knowledge Reasoning for Knowledge-based Visual Question
Answering [27.042604046441426]
Knowledge-based Visual Question Answering (KVQA) requires external knowledge beyond the visible content to answer questions about an image.
In this paper, we depict an image by multiple knowledge graphs from the visual, semantic and factual views.
We decompose the model into a series of memory-based reasoning steps, each performed by a Graph-based Read, Update, and Control module.
We achieve a new state-of-the-art performance on three popular benchmark datasets, including FVQA, Visual7W-KB and OK-VQA.
arXiv Detail & Related papers (2020-08-31T23:25:01Z) - Neuro-Symbolic Visual Reasoning: Disentangling "Visual" from "Reasoning" [49.76230210108583]
We propose a framework to isolate and evaluate the reasoning aspect of visual question answering (VQA) separately from its perception.
We also propose a novel top-down calibration technique that allows the model to answer reasoning questions even with imperfect perception.
On the challenging GQA dataset, this framework is used to perform in-depth, disentangled comparisons between well-known VQA models.
arXiv Detail & Related papers (2020-06-20T08:48:29Z) - Visual Question Answering with Prior Class Semantics [50.845003775809836]
We show how to exploit additional information pertaining to the semantics of candidate answers.
We extend the answer prediction process with a regression objective in a semantic space.
Our method brings improvements in consistency and accuracy over a range of question types; a rough sketch of this idea follows below.
arXiv Detail & Related papers (2020-05-04T02:46:31Z)
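The regression-in-semantic-space idea from the last entry can be pictured with a short sketch. This is not the paper's implementation; the embedding source, projection head, and loss weighting below are assumptions chosen for illustration: the model keeps a standard answer classifier and additionally projects the fused image-question feature toward the ground-truth answer's embedding.
```python
# Illustrative sketch of combining answer classification with a regression
# objective in a semantic answer-embedding space; names and weights are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticAnswerHead(nn.Module):
    def __init__(self, fused_dim: int, answer_embeddings: torch.Tensor):
        super().__init__()
        # answer_embeddings: (num_answers, emb_dim), e.g. pre-trained word vectors
        self.register_buffer("answer_embeddings", answer_embeddings)
        self.classifier = nn.Linear(fused_dim, answer_embeddings.size(0))
        self.projector = nn.Linear(fused_dim, answer_embeddings.size(1))

    def forward(self, fused):
        # returns classification logits and a vector in the answer-embedding space
        return self.classifier(fused), self.projector(fused)

    def loss(self, fused, target_idx, reg_weight=0.5):
        logits, projected = self.forward(fused)
        cls_loss = F.cross_entropy(logits, target_idx)
        # pull the projected vector toward the ground-truth answer embedding
        target_emb = self.answer_embeddings[target_idx]
        reg_loss = F.mse_loss(projected, target_emb)
        return cls_loss + reg_weight * reg_loss

# Example with random stand-ins: 3000 candidate answers embedded in 300-d space.
head = SemanticAnswerHead(fused_dim=1024, answer_embeddings=torch.randn(3000, 300))
loss = head.loss(torch.randn(8, 1024), torch.randint(0, 3000, (8,)))
print(loss.item())
```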