Attention Mechanism based Cognition-level Scene Understanding
- URL: http://arxiv.org/abs/2204.08027v2
- Date: Tue, 19 Apr 2022 02:40:42 GMT
- Title: Attention Mechanism based Cognition-level Scene Understanding
- Authors: Xuejiao Tang, Tai Le Quy, Eirini Ntoutsi, Kea Turner, Vasile Palade,
Israat Haque, Peng Xu, Chris Brown and Wenbin Zhang
- Abstract summary: The Visual Commonsense Reasoning (VCR) model can predict an answer with the corresponding rationale, which requires real-world inference ability.
Previous approaches to solving the VCR task generally rely on pre-training or on memory-augmented models that encode long dependency relationships.
We propose a parallel attention-based cognitive VCR network, PAVCR, which fuses visual-textual information efficiently and encodes semantic information in parallel to enable the model to capture rich information for cognition-level inference.
- Score: 23.592893555879538
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Given a question-image input, the Visual Commonsense Reasoning (VCR)
model can predict an answer with the corresponding rationale, which requires
real-world inference ability. The VCR task, which calls for exploiting
multi-source information as well as learning different levels of understanding
and extensive commonsense knowledge, is a cognition-level scene understanding
task. The VCR task has attracted researchers' interest due to its wide range of
applications, including visual question answering, automated vehicle systems,
and clinical decision support. Previous approaches to solving the VCR task
generally rely on pre-training or on memory-augmented models that encode long
dependency relationships. However, these approaches suffer from a lack of
generalizability and from information loss in long sequences. In this paper, we
propose a parallel attention-based cognitive VCR network, PAVCR, which fuses
visual-textual information efficiently and encodes semantic information in
parallel to enable the model to capture rich information for cognition-level
inference. Extensive experiments show that the proposed model yields
significant improvements over existing methods on the benchmark VCR dataset.
Moreover, the proposed model provides an intuitive interpretation of visual
commonsense reasoning.
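As a rough illustration of the parallel visual-textual attention fusion described in the abstract, the sketch below runs a text-to-visual branch and a visual-to-text branch side by side and fuses their pooled outputs. It is a minimal PyTorch sketch under assumed dimensions, module names, and fusion head; it is not the published PAVCR architecture.

```python
# Hypothetical sketch of parallel visual-textual attention fusion.
# Layer sizes, names, and the fusion strategy are illustrative assumptions,
# not the paper's exact PAVCR design.
import torch
import torch.nn as nn


class ParallelAttentionFusion(nn.Module):
    """Fuse visual and textual features with two attention branches run in parallel."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Branch 1: textual queries attend over visual region features.
        self.text_to_visual = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Branch 2: visual queries attend over textual tokens.
        self.visual_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Project the concatenated branch outputs into one joint representation.
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.LayerNorm(dim))

    def forward(self, text_feats: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # text_feats: (batch, n_tokens, dim); visual_feats: (batch, n_regions, dim)
        text_ctx, _ = self.text_to_visual(text_feats, visual_feats, visual_feats)
        vis_ctx, _ = self.visual_to_text(visual_feats, text_feats, text_feats)
        # Pool each branch and fuse into a single vector for answer/rationale scoring.
        joint = torch.cat([text_ctx.mean(dim=1), vis_ctx.mean(dim=1)], dim=-1)
        return self.fuse(joint)


if __name__ == "__main__":
    fusion = ParallelAttentionFusion()
    text = torch.randn(2, 20, 512)      # e.g., question + candidate answer tokens
    regions = torch.randn(2, 36, 512)   # e.g., detected object region features
    print(fusion(text, regions).shape)  # torch.Size([2, 512])
```

Because the two branches share no sequential dependency, they can be evaluated in parallel, which is the efficiency argument the abstract makes for this style of fusion.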
Related papers
- Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering [71.62961521518731]
HeurVidQA is a framework that leverages domain-specific entity-actions to refine pre-trained video-language foundation models.
Our approach treats these models as implicit knowledge engines, employing domain-specific entity-action prompters to direct the model's focus toward precise cues that enhance reasoning.
arXiv Detail & Related papers (2024-10-12T06:22:23Z)
- MissionGNN: Hierarchical Multimodal GNN-based Weakly Supervised Video Anomaly Recognition with Mission-Specific Knowledge Graph Generation [5.0923114224599555]
This paper introduces MissionGNN, a novel hierarchical graph neural network (GNN)-based model.
Our approach circumvents the limitations of previous methods by avoiding heavy gradient computations on large multimodal models.
Our model provides a practical and efficient solution for real-time video analysis without the constraints of previous segmentation-based or multimodal approaches.
arXiv Detail & Related papers (2024-06-27T01:09:07Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- Advancing Large Multi-modal Models with Explicit Chain-of-Reasoning and Visual Question Generation [34.45251681923171]
This paper presents a novel approach to developing large Vision-and-Language Models (VLMs).
We introduce a system that can ask a question to acquire necessary knowledge, thereby enhancing the robustness and explicability of the reasoning process.
The dataset covers a range of tasks, from common ones like caption generation to specialized VQA tasks that require expert knowledge.
arXiv Detail & Related papers (2024-01-18T14:21:56Z)
- Correlation Information Bottleneck: Towards Adapting Pretrained Multimodal Models for Robust Visual Question Answering [63.87200781247364]
Correlation Information Bottleneck (CIB) seeks a tradeoff between compression and redundancy in representations.
We derive a tight theoretical upper bound for the mutual information between multimodal inputs and representations.
arXiv Detail & Related papers (2022-09-14T22:04:10Z)
- INFOrmation Prioritization through EmPOWERment in Visual Model-Based RL [90.06845886194235]
We propose a modified objective for model-based reinforcement learning (RL).
We integrate a term inspired by variational empowerment into a state-space model based on mutual information.
We evaluate the approach on a suite of vision-based robot control tasks with natural video backgrounds.
arXiv Detail & Related papers (2022-04-18T23:09:23Z)
- KAT: A Knowledge Augmented Transformer for Vision-and-Language [56.716531169609915]
We propose a novel model - Knowledge Augmented Transformer (KAT) - which achieves a strong state-of-the-art result on the open-domain multimodal task of OK-VQA.
Our approach integrates implicit and explicit knowledge in an end-to-end encoder-decoder architecture, while still jointly reasoning over both knowledge sources during answer generation.
An additional benefit of explicit knowledge integration is seen in improved interpretability of model predictions in our analysis.
arXiv Detail & Related papers (2021-12-16T04:37:10Z)
- Cognitive Visual Commonsense Reasoning Using Dynamic Working Memory [10.544312410674985]
Visual Commonsense Reasoning (VCR) predicts an answer with a corresponding rationale, given a question-image input.
Previous approaches to solving the VCR task generally rely on pre-training or on memory-augmented models that encode long dependency relationships.
We propose a dynamic working memory based cognitive VCR network, which stores accumulated commonsense between sentences to provide prior knowledge for inference.
arXiv Detail & Related papers (2021-07-04T15:58:31Z)
- KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for Visual Commonsense Reasoning [4.787501955202053]
In the visual commonsense reasoning (VCR) task, a machine must answer correctly and then provide a rationale justifying its answer.
We propose a novel Knowledge Enhanced Visual-and-Linguistic BERT (KVL-BERT for short) model.
Besides taking visual and linguistic contents as input, external commonsense knowledge extracted from ConceptNet is integrated into the multi-layer Transformer.
arXiv Detail & Related papers (2020-12-13T08:22:33Z)
- Visual Relationship Detection with Visual-Linguistic Knowledge from Multimodal Representations [103.00383924074585]
Visual relationship detection aims to reason over relationships among salient objects in images.
We propose a novel approach named Visual-Linguistic Representations from Transformers (RVL-BERT).
RVL-BERT performs spatial reasoning with both visual and language commonsense knowledge learned via self-supervised pre-training.
arXiv Detail & Related papers (2020-09-10T16:15:09Z)
- Towards an Appropriate Query, Key, and Value Computation for Knowledge Tracing [2.1541440354538564]
We propose a novel Transformer based model for knowledge tracing, SAINT: Separated Self-AttentIve Neural Knowledge Tracing.
SAINT has an encoder-decoder structure in which the exercise embedding sequence enters the encoder and the response embedding sequence enters the decoder.
This is the first work to suggest an encoder-decoder model for knowledge tracing that applies deep self-attentive layers to exercises and responses separately.
arXiv Detail & Related papers (2020-02-14T09:21:19Z)
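The SAINT entry above describes a separated self-attentive encoder-decoder for knowledge tracing. A minimal sketch of that idea follows, assuming hypothetical vocabulary sizes, dimensions, and class names rather than the authors' released implementation: exercise embeddings feed the encoder and past response embeddings feed the decoder.

```python
# Minimal sketch of a separated encoder-decoder for knowledge tracing in the
# spirit of the SAINT entry above; all sizes and names here are assumptions.
import torch
import torch.nn as nn


class SeparatedKnowledgeTracer(nn.Module):
    """Exercise embeddings enter the encoder; response embeddings enter the decoder."""

    def __init__(self, n_exercises: int = 1000, dim: int = 128, heads: int = 4, layers: int = 2):
        super().__init__()
        self.exercise_emb = nn.Embedding(n_exercises, dim)
        self.response_emb = nn.Embedding(2, dim)  # 0 = incorrect, 1 = correct
        self.transformer = nn.Transformer(
            d_model=dim, nhead=heads,
            num_encoder_layers=layers, num_decoder_layers=layers,
            batch_first=True,
        )
        # Predict the probability of answering the next exercise correctly.
        self.predict = nn.Linear(dim, 1)

    def forward(self, exercise_ids: torch.Tensor, responses: torch.Tensor) -> torch.Tensor:
        # exercise_ids, responses: (batch, seq_len)
        enc_in = self.exercise_emb(exercise_ids)   # exercise sequence -> encoder
        dec_in = self.response_emb(responses)      # response sequence -> decoder
        causal = self.transformer.generate_square_subsequent_mask(responses.size(1))
        hidden = self.transformer(enc_in, dec_in, tgt_mask=causal)
        return torch.sigmoid(self.predict(hidden)).squeeze(-1)


if __name__ == "__main__":
    model = SeparatedKnowledgeTracer()
    ex = torch.randint(0, 1000, (2, 10))
    resp = torch.randint(0, 2, (2, 10))
    print(model(ex, resp).shape)  # torch.Size([2, 10])
```

Keeping the two input streams separate, as sketched here, is the design choice the entry highlights: self-attention layers are applied to exercises and to responses independently before they interact through cross-attention.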