LOIS: Looking Out of Instance Semantics for Visual Question Answering
- URL: http://arxiv.org/abs/2307.14142v1
- Date: Wed, 26 Jul 2023 12:13:00 GMT
- Title: LOIS: Looking Out of Instance Semantics for Visual Question Answering
- Authors: Siyu Zhang, Yeming Chen, Yaoru Sun, Fang Wang, Haibo Shi, Haoran Wang
- Abstract summary: We propose a bounding-box-free model framework to understand the causal nexus of object semantics in images.
We implement a mutual relation attention module to model sophisticated, deeper visual semantic relations between instance objects and background information.
Our proposed attention model can further analyze salient image regions by focusing on the important question words.
- Score: 17.076621453814926
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual question answering (VQA) has been intensively studied as a multimodal task that requires bridging vision and language to infer answers correctly. Recent attempts have developed various attention-based modules for solving VQA tasks. However, model inference is largely bottlenecked by visual processing for semantic understanding. Most existing detection methods rely on bounding boxes, which makes it a serious challenge for VQA models to understand the causal nexus of object semantics in images and correctly infer contextual information. To this end, we propose a finer-grained framework without bounding boxes, termed Looking Out of Instance Semantics (LOIS), to tackle this issue. LOIS enables more fine-grained feature descriptions to produce visual facts. Furthermore, to overcome the label ambiguity caused by instance masks, two types of relation attention modules, 1) intra-modality and 2) inter-modality, are devised to infer the correct answers from the different multi-view features. Specifically, we implement a mutual relation attention module to model sophisticated, deeper visual semantic relations between instance objects and background information. In addition, our proposed attention model can further analyze salient image regions by focusing on the important question words. Experimental results on four benchmark VQA datasets show that our method improves visual reasoning capability.
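To make the intra-/inter-modality idea concrete, below is a minimal, hypothetical PyTorch sketch, not the paper's implementation: it applies self-attention within each modality and a mutual cross-attention step between instance-mask features and question-word features. The module names, feature dimensions, and the use of nn.MultiheadAttention are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): intra-modality self-attention over
# instance-mask features and question words, plus an inter-modality ("mutual")
# attention step in which each modality attends to the other.
import torch
import torch.nn as nn


class RelationAttentionSketch(nn.Module):
    """Hypothetical stand-in for LOIS-style intra-/inter-modality attention."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        # Intra-modality: relations among instance/background visual features,
        # and among question words.
        self.intra_visual = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.intra_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Inter-modality: visual regions query question words, and vice versa.
        self.text_to_visual = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.visual_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, visual: torch.Tensor, text: torch.Tensor):
        # visual: (B, num_instances, dim) mask-based instance features
        # text:   (B, num_words, dim) question word embeddings
        v, _ = self.intra_visual(visual, visual, visual)
        t, _ = self.intra_text(text, text, text)
        # Mutual relation attention: each modality is conditioned on the other.
        v_ctx, _ = self.text_to_visual(v, t, t)  # regions attend to words
        t_ctx, _ = self.visual_to_text(t, v, v)  # words attend to regions
        return v_ctx, t_ctx


if __name__ == "__main__":
    vis = torch.randn(2, 36, 512)  # e.g. 36 instance-mask features per image
    txt = torch.randn(2, 14, 512)  # e.g. 14 question tokens
    v_out, t_out = RelationAttentionSketch()(vis, txt)
    print(v_out.shape, t_out.shape)  # (2, 36, 512) and (2, 14, 512)
```

The sketch only shows the attention wiring; the paper's mask-based feature extraction, label-ambiguity handling, and answer classifier are omitted.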
Related papers
- Ask Questions with Double Hints: Visual Question Generation with Answer-awareness and Region-reference [107.53380946417003]
We propose a novel learning paradigm to generate visual questions with answer-awareness and region-reference.
We develop a simple methodology to self-learn the visual hints without introducing any additional human annotations.
arXiv Detail & Related papers (2024-07-06T15:07:32Z)
- Visual Commonsense based Heterogeneous Graph Contrastive Learning [79.22206720896664]
We propose a heterogeneous graph contrastive learning method to improve visual reasoning.
Our method is designed in a plug-and-play way, so that it can be quickly and easily combined with a wide range of representative methods.
arXiv Detail & Related papers (2023-11-11T12:01:18Z)
- Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z)
- VQA with Cascade of Self- and Co-Attention Blocks [3.0013352260516744]
This work aims to learn an improved multi-modal representation through dense interaction of the visual and textual modalities.
The proposed model has an attention block containing both self-attention and co-attention on image and text.
arXiv Detail & Related papers (2023-02-28T17:20:40Z)
- Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering [134.91774666260338]
Existing visual question answering methods often suffer from cross-modal spurious correlations and oversimplified event-level reasoning processes.
We propose a framework for cross-modal causal relational reasoning to address the task of event-level visual question answering.
arXiv Detail & Related papers (2022-07-26T04:25:54Z)
- Visual Perturbation-aware Collaborative Learning for Overcoming the Language Prior Problem [60.0878532426877]
We propose a novel collaborative learning scheme from the viewpoint of visual perturbation calibration.
Specifically, we devise a visual controller to construct two sorts of curated images with different perturbation extents.
The experimental results on two diagnostic VQA-CP benchmark datasets clearly demonstrate its effectiveness.
arXiv Detail & Related papers (2022-07-24T23:50:52Z)
- An experimental study of the vision-bottleneck in VQA [17.132865538874352]
We study the vision-bottleneck in Visual Question Answering (VQA).
We experiment with both the quantity and quality of visual objects extracted from images.
We also study the impact of two methods of incorporating the object information necessary for answering a question.
arXiv Detail & Related papers (2022-02-14T16:43:32Z)
- MGA-VQA: Multi-Granularity Alignment for Visual Question Answering [75.55108621064726]
Learning to answer visual questions is a challenging task since the multi-modal inputs lie in two different feature spaces.
We propose a Multi-Granularity Alignment architecture for the Visual Question Answering task (MGA-VQA).
Our model splits alignment into different levels to learn better correlations without needing additional data and annotations.
arXiv Detail & Related papers (2022-01-25T22:30:54Z)
- Dependent Multi-Task Learning with Causal Intervention for Image Captioning [10.6405791176668]
In this paper, we propose a dependent multi-task learning framework with causal intervention (DMTCI).
Firstly, we involve an intermediate task, bag-of-categories generation, before the final task, image captioning.
Secondly, we apply Pearl's do-calculus to the model, cutting off the link between the visual features and possible confounders.
Finally, we use a multi-agent reinforcement learning strategy to enable end-to-end training and reduce inter-task error accumulation.
arXiv Detail & Related papers (2021-05-18T14:57:33Z)
- Multi-View Attention Network for Visual Dialog [5.731758300670842]
It is necessary for an agent to 1) determine the semantic intent of the question and 2) align question-relevant textual and visual content.
We propose the Multi-View Attention Network (MVAN), which leverages multiple views of the heterogeneous inputs.
MVAN effectively captures the question-relevant information from the dialog history with two complementary modules.
arXiv Detail & Related papers (2020-04-29T08:46:38Z)