Attention Mechanism based Cognition-level Scene Understanding
- URL: http://arxiv.org/abs/2204.08027v2
- Date: Tue, 19 Apr 2022 02:40:42 GMT
- Title: Attention Mechanism based Cognition-level Scene Understanding
- Authors: Xuejiao Tang, Tai Le Quy, Eirini Ntoutsi, Kea Turner, Vasile Palade,
  Israat Haque, Peng Xu, Chris Brown and Wenbin Zhang
- Abstract summary: The Visual Commonsense Reasoning (VCR) model can predict an answer with the corresponding rationale, which requires inference ability from the real world.
Previous approaches to solving the VCR task generally rely on pre-training or exploiting memory with long dependency relationship encoded models.
We propose a parallel attention-based cognitive VCR network PAVCR, which fuses visual-textual information efficiently and encodes semantic information in parallel to enable the model to capture rich information for cognition-level inference.
- Score: 23.592893555879538
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract:   Given a question-image input, the Visual Commonsense Reasoning (VCR) model
can predict an answer with the corresponding rationale, which requires
inference ability from the real world. The VCR task, which calls for exploiting
the multi-source information as well as learning different levels of
understanding and extensive commonsense knowledge, is a cognition-level scene
understanding task. The VCR task has aroused researchers' interest due to its
wide range of applications, including visual question answering, automated
vehicle systems, and clinical decision support. Previous approaches to solving
the VCR task generally rely on pre-training or exploiting memory with long
dependency relationship encoded models. However, these approaches suffer from a
lack of generalizability and losing information in long sequences. In this
paper, we propose a parallel attention-based cognitive VCR network PAVCR, which
fuses visual-textual information efficiently and encodes semantic information
in parallel to enable the model to capture rich information for cognition-level
inference. Extensive experiments show that the proposed model yields
significant improvements over existing methods on the benchmark VCR dataset.
Moreover, the proposed model provides intuitive interpretation into visual
commonsense reasoning.
 
      
Related papers
- VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning [45.39372905700317]
 We introduce VRAG-RL, a novel RL framework tailored for complex reasoning across visually rich information.
With this framework, VLMs interact with search engines, autonomously sampling single-turn or multi-turn reasoning trajectories.
Our approach highlights key limitations of RL in RAG domains.
 arXiv  Detail & Related papers  (2025-05-28T06:30:51Z)
- ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning [68.76048244253582]
 We introduce ViaRL, the first framework to leverage rule-based reinforcement learning (RL) for optimizing frame selection in video understanding.
ViaRL utilizes the answer accuracy of a downstream model as a reward signal to train a frame selector through trial-and-error.
ViaRL consistently delivers superior temporal grounding performance and robust generalization across diverse video understanding tasks.
 arXiv  Detail & Related papers  (2025-05-21T12:29:40Z)
- VITAL: More Understandable Feature Visualization through Distribution Alignment and Relevant Information Flow [57.96482272333649]
 Feature visualization (FV) is a powerful tool to decode what information neurons are responding to.
We propose to guide FV through statistics of prototypical image features combined with measures of relevant network flow to generate images.
Our approach yields human-understandable visualizations that both qualitatively and quantitatively improve over state-of-the-art FVs.
 arXiv  Detail & Related papers  (2025-03-28T13:08:18Z)
- Open-Ended and Knowledge-Intensive Video Question Answering [20.256081440725353]
 We investigate knowledge-intensive video question answering (KI-VideoQA) through the lens of multi-modal retrieval-augmented generation.
Our analysis examines various retrieval augmentation approaches using cutting-edge retrieval and vision language models.
We achieve a substantial 17.5% improvement in accuracy on multiple choice questions in the KnowIT VQA dataset.
 arXiv  Detail & Related papers  (2025-02-17T12:40:35Z)
- Video Representation Learning with Joint-Embedding Predictive Architectures [23.250749688875196]
 We present Video JEPA with Variance-Covariance Regularization (VJ-VCR): a joint-embedding predictive architecture for self-supervised video representation learning.
We show that hidden representations from our VJ-VCR contain abstract, high-level information about the input data.
 arXiv  Detail & Related papers  (2024-12-14T18:33:29Z)
- Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering [71.62961521518731]
 HeurVidQA is a framework that leverages domain-specific entity-actions to refine pre-trained video-language foundation models.
Our approach treats these models as implicit knowledge engines, employing domain-specific entity-action prompters to direct the model's focus toward precise cues that enhance reasoning.
 arXiv  Detail & Related papers  (2024-10-12T06:22:23Z)
- MissionGNN: Hierarchical Multimodal GNN-based Weakly Supervised Video Anomaly Recognition with Mission-Specific Knowledge Graph Generation [5.0923114224599555]
 This paper introduces MissionGNN, a novel hierarchical graph neural network (GNN) based model.
Our approach circumvents the limitations of previous methods by avoiding heavy gradient computations on large multimodal models.
Our model provides a practical and efficient solution for real-time video analysis without the constraints of previous segmentation-based or multimodal approaches.
 arXiv  Detail & Related papers  (2024-06-27T01:09:07Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
 The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
 arXiv  Detail & Related papers  (2024-03-19T17:59:52Z)
- Advancing Large Multi-modal Models with Explicit Chain-of-Reasoning and Visual Question Generation [34.45251681923171]
 This paper presents a novel approach to developing large Vision-and-Language Models (VLMs).
We introduce a system that can ask a question to acquire necessary knowledge, thereby enhancing the robustness and explicability of the reasoning process.
The dataset covers a range of tasks, from common ones like caption generation to specialized VQA tasks that require expert knowledge.
 arXiv  Detail & Related papers  (2024-01-18T14:21:56Z)
- Correlation Information Bottleneck: Towards Adapting Pretrained Multimodal Models for Robust Visual Question Answering [63.87200781247364]
 Correlation Information Bottleneck (CIB) seeks a tradeoff between compression and redundancy in representations.
We derive a tight theoretical upper bound for the mutual information between multimodal inputs and representations.
 arXiv  Detail & Related papers  (2022-09-14T22:04:10Z)
- INFOrmation Prioritization through EmPOWERment in Visual Model-Based RL [90.06845886194235]
 We propose a modified objective for model-based reinforcement learning (RL).
We integrate a term inspired by variational empowerment into a state-space model based on mutual information.
We evaluate the approach on a suite of vision-based robot control tasks with natural video backgrounds.
 arXiv  Detail & Related papers  (2022-04-18T23:09:23Z)
- KAT: A Knowledge Augmented Transformer for Vision-and-Language [56.716531169609915]
 We propose a novel model - Knowledge Augmented Transformer (KAT) - which achieves a strong state-of-the-art result on the open-domain multimodal task of OK-VQA.
Our approach integrates implicit and explicit knowledge in an end to end encoder-decoder architecture, while still jointly reasoning over both knowledge sources during answer generation.
An additional benefit of explicit knowledge integration is seen in improved interpretability of model predictions in our analysis.
 arXiv  Detail & Related papers  (2021-12-16T04:37:10Z)
- Cognitive Visual Commonsense Reasoning Using Dynamic Working Memory [10.544312410674985]
 Visual Commonsense Reasoning (VCR) predicts an answer with corresponding rationale, given a question-image input.
Previous approaches to solving the VCR task generally rely on pre-training or exploiting memory with long dependency relationship encoded models.
We propose a dynamic working memory based cognitive VCR network, which stores accumulated commonsense between sentences to provide prior knowledge for inference.
 arXiv  Detail & Related papers  (2021-07-04T15:58:31Z)
- KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for Visual Commonsense Reasoning [4.787501955202053]
 In the visual commonsense reasoning (VCR) task, a machine must answer correctly and then provide a rationale justifying its answer.
We propose a novel Knowledge Enhanced Visual-and-Linguistic BERT (KVL-BERT for short) model.
Besides taking visual and linguistic contents as input, external commonsense knowledge extracted from ConceptNet is integrated into the multi-layer Transformer.
 arXiv  Detail & Related papers  (2020-12-13T08:22:33Z)
- Visual Relationship Detection with Visual-Linguistic Knowledge from Multimodal Representations [103.00383924074585]
 Visual relationship detection aims to reason over relationships among salient objects in images.
We propose a novel approach named Visual-Linguistic Representations from Transformers (RVL-BERT).
RVL-BERT performs spatial reasoning with both visual and language commonsense knowledge learned via self-supervised pre-training.
 arXiv  Detail & Related papers  (2020-09-10T16:15:09Z)
- Towards an Appropriate Query, Key, and Value Computation for Knowledge Tracing [2.1541440354538564]
 We propose a novel Transformer based model for knowledge tracing, SAINT: Separated Self-AttentIve Neural Knowledge Tracing.
SAINT has an encoder-decoder structure in which the exercise and response embedding sequences enter the encoder and the decoder, respectively.
This is the first work to suggest an encoder-decoder model for knowledge tracing that applies deep self-attentive layers to exercises and responses separately.
 arXiv  Detail & Related papers  (2020-02-14T09:21:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     