Cognitive Visual Commonsense Reasoning Using Dynamic Working Memory
- URL: http://arxiv.org/abs/2107.01671v4
- Date: Thu, 7 Dec 2023 23:22:52 GMT
- Title: Cognitive Visual Commonsense Reasoning Using Dynamic Working Memory
- Authors: Xuejiao Tang, Xin Huang, Wenbin Zhang, Travers B. Child, Qiong Hu,
Zhen Liu and Ji Zhang
- Abstract summary: Visual Commonsense Reasoning (VCR) predicts an answer with a corresponding rationale, given a question-image input.
Previous approaches to the VCR task generally rely on pre-training or on exploiting memory together with models that encode long-range dependency relationships.
We propose a dynamic working memory based cognitive VCR network, which stores commonsense accumulated across sentences to provide prior knowledge for inference.
- Score: 10.544312410674985
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Visual Commonsense Reasoning (VCR) predicts an answer with a corresponding
rationale, given a question-image input. VCR is a recently introduced visual
scene understanding task with a wide range of applications, including visual
question answering, automated vehicle systems, and clinical decision support.
Previous approaches to the VCR task generally rely on pre-training or on
exploiting memory together with models that encode long-range dependency relationships. However,
these approaches suffer from a lack of generalizability and prior knowledge. In
this paper we propose a dynamic working memory based cognitive VCR network,
which stores commonsense accumulated across sentences to provide prior
knowledge for inference. Extensive experiments show that the proposed model
yields significant improvements over existing methods on the benchmark VCR
dataset. Moreover, the proposed model provides an intuitive interpretation of
visual commonsense reasoning. A Python implementation of our mechanism is
publicly available at https://github.com/tanjatang/DMVCR
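As a rough illustration of the idea described above, the following PyTorch sketch shows one way a dynamic working memory could retrieve accumulated commonsense by attention over memory slots and write the current sentence representation back through a learned gate. The module name, slot count, dimensions, and update rule are illustrative assumptions, not the released DMVCR implementation.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicWorkingMemory(nn.Module):
    """Illustrative working-memory module: reads prior knowledge via attention
    over memory slots, then writes the current sentence representation back
    through a learned gate (a sketch, not DMVCR's code)."""

    def __init__(self, hidden_dim: int = 512, num_slots: int = 32):
        super().__init__()
        # Slots that accumulate commonsense across sentences (assumed layout).
        self.register_buffer("memory", torch.zeros(num_slots, hidden_dim))
        self.read_proj = nn.Linear(hidden_dim, hidden_dim)
        self.write_gate = nn.Linear(2 * hidden_dim, 1)

    def forward(self, sentence_emb: torch.Tensor) -> torch.Tensor:
        # sentence_emb: (batch, hidden_dim)
        query = self.read_proj(sentence_emb)                    # (B, H)
        attn = F.softmax(query @ self.memory.t(), dim=-1)       # (B, slots)
        retrieved = attn @ self.memory                          # (B, H) prior knowledge

        # Write phase: blend each sentence into its most relevant slot.
        gate = torch.sigmoid(
            self.write_gate(torch.cat([sentence_emb, retrieved], dim=-1)))  # (B, 1)
        with torch.no_grad():
            slots = attn.argmax(dim=-1)                         # (B,)
            for b, s in enumerate(slots.tolist()):
                self.memory[s] = ((1 - gate[b]) * self.memory[s]
                                  + gate[b] * sentence_emb[b])

        # Fuse retrieved commonsense with the current representation.
        return sentence_emb + retrieved

mem = DynamicWorkingMemory()
fused = mem(torch.randn(4, 512))   # retrieve prior knowledge and update memory
```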
Related papers
- Retrieval-Augmented Natural Language Reasoning for Explainable Visual Question Answering [2.98667511228225]
ReRe is an encoder-decoder model that uses a pre-trained CLIP vision encoder and a pre-trained GPT-2 language model as the decoder.
ReRe outperforms previous methods in VQA accuracy and explanation score, and produces natural language explanations (NLE) that are more persuasive and reliable.
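A minimal sketch, under stated assumptions, of an encoder-decoder of this kind: a CLIP vision encoder whose patch features are projected into GPT-2's embedding space as a visual prefix for generation. The projection layer and prefix conditioning below are assumptions, not the ReRe authors' released code.
```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, GPT2LMHeadModel, GPT2Tokenizer

class ReReSketch(nn.Module):
    """Hedged sketch of a CLIP-encoder / GPT-2-decoder model: CLIP patch
    features are projected into GPT-2's embedding space as a visual prefix."""

    def __init__(self):
        super().__init__()
        self.vision = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
        self.decoder = GPT2LMHeadModel.from_pretrained("gpt2")
        # Map CLIP hidden size to GPT-2 hidden size (both 768 for these checkpoints).
        self.proj = nn.Linear(self.vision.config.hidden_size,
                              self.decoder.config.n_embd)

    def forward(self, pixel_values, input_ids):
        patches = self.vision(pixel_values=pixel_values).last_hidden_state  # (B, P, 768)
        prefix = self.proj(patches)                                         # visual prefix
        text_emb = self.decoder.transformer.wte(input_ids)                  # token embeddings
        inputs_embeds = torch.cat([prefix, text_emb], dim=1)
        return self.decoder(inputs_embeds=inputs_embeds)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = ReReSketch()
ids = tokenizer("What is the person holding?", return_tensors="pt").input_ids
out = model(torch.randn(1, 3, 224, 224), ids)   # dummy image tensor for illustration
print(out.logits.shape)                          # (1, prefix_len + seq_len, vocab)
```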
arXiv Detail & Related papers (2024-08-30T04:39:43Z)
- Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR [51.72751335574947]
Visual Commonsense Reasoning (VCR) calls for explanatory reasoning behind question answering over visual scenes.
Progress on the benchmark dataset stems largely from the recent advancement of Vision-Language Transformers (VL Transformers).
This paper posits that the VL Transformers do not exhibit visual commonsense, which is the key to VCR.
arXiv Detail & Related papers (2024-05-27T08:26:58Z)
- A Memory Model for Question Answering from Streaming Data Supported by Rehearsal and Anticipation of Coreference Information [19.559853775982386]
We propose a memory model that performs rehearsal and anticipation while processing inputs, in order to retain the information important for solving question answering tasks from streaming data.
We validate our model on a short-sequence (bAbI) dataset as well as large-sequence textual (NarrativeQA) and video (ActivityNet-QA) question answering datasets.
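A loose sketch of the general idea of rehearsal and anticipation as auxiliary objectives over streamed inputs; the recurrent cell, prediction heads, and losses below are assumptions made for illustration, not the paper's architecture.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RehearsalAnticipationMemory(nn.Module):
    """Hedged sketch: alongside the QA objective, the memory state is trained
    to reconstruct what it has already seen (rehearsal) and to predict the
    upcoming input (anticipation)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.rnn = nn.GRUCell(dim, dim)         # memory state updated per streamed chunk
        self.rehearse = nn.Linear(dim, dim)     # reconstruct the previous chunk
        self.anticipate = nn.Linear(dim, dim)   # predict the next chunk

    def forward(self, chunks: torch.Tensor):
        # chunks: (T, B, dim) streamed input representations
        state = chunks.new_zeros(chunks.size(1), chunks.size(2))
        aux_loss = chunks.new_zeros(())
        for t in range(chunks.size(0)):
            state = self.rnn(chunks[t], state)
            if t > 0:                            # rehearsal objective
                aux_loss = aux_loss + F.mse_loss(self.rehearse(state), chunks[t - 1])
            if t + 1 < chunks.size(0):           # anticipation objective
                aux_loss = aux_loss + F.mse_loss(self.anticipate(state), chunks[t + 1])
        return state, aux_loss                   # final memory state + auxiliary loss
```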
arXiv Detail & Related papers (2023-05-12T15:46:36Z)
- See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning [60.43585179885355]
We propose a novel framework named Interactive Prompting Visual Reasoner (IPVR) for few-shot knowledge-based visual reasoning.
IPVR contains three stages: see, think, and confirm.
We conduct experiments on a range of knowledge-based visual reasoning datasets.
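The see-think-confirm loop might look roughly like the following sketch, where the three callables stand in for the vision and language models IPVR would actually prompt; all function names here are hypothetical placeholders, not the authors' API.
```python
from typing import Callable, Tuple

def ipvr_answer(image,
                question: str,
                see: Callable[[object], str],             # vision model: image -> textual evidence
                think: Callable[[str], Tuple[str, str]],  # LLM: prompt -> (answer, rationale)
                confirm: Callable[[object, str], bool],   # checks the rationale against the image
                max_rounds: int = 3) -> str:
    """Hypothetical see-think-confirm loop with placeholder models."""
    evidence = see(image)                                 # "see": describe the visual scene
    answer = ""
    for _ in range(max_rounds):
        prompt = (f"Visual evidence: {evidence}\n"
                  f"Question: {question}\n"
                  "Give an answer and a short rationale:")
        answer, rationale = think(prompt)                 # "think": propose answer + rationale
        if confirm(image, rationale):                     # "confirm": keep only grounded rationales
            return answer
        evidence += f"\nRejected rationale: {rationale}"  # feed the failure back and retry
    return answer
```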
arXiv Detail & Related papers (2023-01-12T18:59:50Z)
- Visual Commonsense-aware Representation Network for Video Captioning [84.67432867555044]
We propose a simple yet effective method for video captioning, called the Visual Commonsense-aware Representation Network (VCRN).
Our method reaches state-of-the-art performance, indicating its effectiveness.
arXiv Detail & Related papers (2022-11-17T11:27:15Z)
- Sparse Visual Counterfactual Explanations in Image Space [50.768119964318494]
We present a novel model for visual counterfactual explanations in image space.
We show that it can be used to detect undesired behavior of ImageNet classifiers due to spurious features in the ImageNet dataset.
arXiv Detail & Related papers (2022-05-16T20:23:11Z)
- Attention Mechanism based Cognition-level Scene Understanding [23.592893555879538]
The Visual Commonsense Reasoning (VCR) task requires a model to predict an answer with the corresponding rationale, which demands real-world inference ability.
Previous approaches to the VCR task generally rely on pre-training or on exploiting memory together with models that encode long-range dependency relationships.
We propose PAVCR, a parallel attention-based cognitive VCR network that efficiently fuses visual and textual information and encodes semantic information in parallel, enabling the model to capture rich information for cognition-level inference.
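As an illustration of parallel visual-textual fusion, the sketch below lets text tokens attend to visual regions and vice versa before merging both streams; the module layout and dimensions are assumptions, not PAVCR's released architecture.
```python
import torch
import torch.nn as nn

class ParallelCrossAttentionFusion(nn.Module):
    """Illustrative parallel fusion: text attends to visual regions while
    visual regions attend to text, and both streams are merged."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.txt2vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis2txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.merge = nn.Linear(2 * dim, dim)

    def forward(self, text: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # text: (B, T, D) token features; visual: (B, R, D) region features
        t_attended, _ = self.txt2vis(text, visual, visual)     # text queries regions
        v_attended, _ = self.vis2txt(visual, text, text)       # regions query text
        fused = torch.cat([t_attended.mean(dim=1),
                           v_attended.mean(dim=1)], dim=-1)    # pool both streams
        return self.merge(fused)                               # (B, D) joint representation

fusion = ParallelCrossAttentionFusion()
joint = fusion(torch.randn(2, 20, 512), torch.randn(2, 36, 512))
```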
arXiv Detail & Related papers (2022-04-17T15:04:44Z)
- Joint Answering and Explanation for Visual Commonsense Reasoning [46.44588492897933]
Visual Commonsense Reasoning pursues a higher level of visual comprehension.
It is composed of two indispensable processes: question answering over a given image and rationale inference for answer explanation.
We present a plug-and-play knowledge distillation enhanced framework to couple the question answering and rationale inference processes.
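A generic knowledge-distillation objective of the kind such a coupling could build on is sketched below: a cross-entropy term on gold labels plus a KL term toward a teacher's softened predictions. The temperature, weighting, and teacher/student roles are assumptions for illustration, not the paper's exact loss.
```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Standard distillation objective: supervised cross-entropy plus KL
    divergence toward the teacher's softened distribution."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                  F.softmax(teacher_logits / temperature, dim=-1),
                  reduction="batchmean") * temperature ** 2
    return alpha * ce + (1 - alpha) * kl

# e.g. distilling a jointly trained teacher's answer predictions into a student
# that also performs rationale inference (roles assumed for illustration)
loss = distillation_loss(torch.randn(8, 4), torch.randn(8, 4), torch.randint(0, 4, (8,)))
```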
arXiv Detail & Related papers (2022-02-25T11:26:52Z)
- Relation-aware Hierarchical Attention Framework for Video Question Answering [6.312182279855817]
We propose a novel Relation-aware Hierarchical Attention (RHA) framework to learn both the static and dynamic relations of the objects in videos.
In particular, videos and questions are first embedded by pre-trained models to obtain the visual and textual features.
We consider the temporal, spatial, and semantic relations, and fuse the multimodal features with a hierarchical attention mechanism to predict the answer.
arXiv Detail & Related papers (2021-05-13T09:35:42Z)
- Visual Commonsense R-CNN [102.5061122013483]
We present a novel unsupervised feature representation learning method, the Visual Commonsense Region-based Convolutional Neural Network (VC R-CNN).
VC R-CNN serves as an improved visual region encoder for high-level tasks such as captioning and VQA.
We extensively apply VC R-CNN features in prevailing models of three popular tasks: Image Captioning, VQA, and VCR, and observe consistent performance boosts across them.
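In downstream use, features of this kind are typically plugged in by concatenating them with the usual detector region features before the task head; the snippet below sketches that pattern with assumed feature dimensions, not the paper's exact pipeline.
```python
import torch
import torch.nn as nn

# Hedged sketch: pre-extracted commonsense region features are concatenated with
# standard bottom-up region features and projected before the downstream
# captioning/VQA/VCR head (all dimensions here are assumptions).
region_feats = torch.randn(2, 36, 2048)   # standard object-detector region features
vc_feats = torch.randn(2, 36, 1024)       # commonsense features per region

fuse = nn.Linear(2048 + 1024, 1024)
visual_input = fuse(torch.cat([region_feats, vc_feats], dim=-1))  # input to the task model
```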
arXiv Detail & Related papers (2020-02-27T15:51:19Z)