Cross-Modal Causal Relational Reasoning for Event-Level Visual Question
Answering
- URL: http://arxiv.org/abs/2207.12647v8
- Date: Wed, 7 Jun 2023 07:47:27 GMT
- Title: Cross-Modal Causal Relational Reasoning for Event-Level Visual Question
Answering
- Authors: Yang Liu, Guanbin Li, Liang Lin
- Abstract summary: Existing visual question answering methods often suffer from cross-modal spurious correlations and oversimplified event-level reasoning processes.
We propose a framework for cross-modal causal relational reasoning to address the task of event-level visual question answering.
- Score: 134.91774666260338
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing visual question answering methods often suffer from cross-modal
spurious correlations and oversimplified event-level reasoning processes that
fail to capture event temporality, causality, and dynamics spanning over the
video. In this work, to address the task of event-level visual question
answering, we propose a framework for cross-modal causal relational reasoning.
In particular, a set of causal intervention operations is introduced to
discover the underlying causal structures across visual and linguistic
modalities. Our framework, named Cross-Modal Causal RelatIonal Reasoning
(CMCIR), involves three modules: i) Causality-aware Visual-Linguistic Reasoning
(CVLR) module for collaboratively disentangling the visual and linguistic
spurious correlations via front-door and back-door causal interventions; ii)
Spatial-Temporal Transformer (STT) module for capturing the fine-grained
interactions between visual and linguistic semantics; iii) Visual-Linguistic
Feature Fusion (VLFF) module for learning the global semantic-aware
visual-linguistic representations adaptively. Extensive experiments on four
event-level datasets demonstrate the superiority of our CMCIR in discovering
visual-linguistic causal structures and achieving robust event-level visual
question answering. The datasets, code, and models are available at
https://github.com/HCPLab-SYSU/CMCIR.
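The front-door and back-door causal interventions applied in the CVLR module instantiate Pearl's standard adjustment formulas. Stated generically (the paper's concrete parameterization of the confounder Z and the mediator M is given in the full text), they are:

P(Y \mid do(X)) = \sum_{z} P(Y \mid X, Z=z)\, P(Z=z) \qquad \text{(back-door adjustment)}

P(Y \mid do(X)) = \sum_{m} P(M=m \mid X) \sum_{x'} P(Y \mid M=m, X=x')\, P(X=x') \qquad \text{(front-door adjustment)}

The back-door form blocks confounding paths by stratifying over the observed confounder Z, while the front-door form routes the effect of X on Y through a mediator M when the confounder cannot be observed directly.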
Related papers
- Integrating Large Language Models with Graph-based Reasoning for Conversational Question Answering [58.17090503446995]
We focus on a conversational question answering task which combines the challenges of understanding questions in context and reasoning over evidence gathered from heterogeneous sources like text, knowledge graphs, tables, and infoboxes.
Our method utilizes a graph structured representation to aggregate information about a question and its context.
arXiv Detail & Related papers (2024-06-14T13:28:03Z)
- Vision-and-Language Navigation via Causal Learning [13.221880074458227]
The cross-modal causal transformer (GOAT) is a pioneering solution rooted in the paradigm of causal inference.
Its BACL and FACL modules promote unbiased learning by comprehensively mitigating potential spurious correlations.
To capture global confounder features, we propose a cross-modal feature pooling module supervised by contrastive learning.
arXiv Detail & Related papers (2024-04-16T02:40:35Z)
- Complex Reasoning over Logical Queries on Commonsense Knowledge Graphs [61.796960984541464]
We present COM2 (COMplex COMmonsense), a new dataset created by sampling logical queries.
We verbalize them into multiple-choice and text-generation questions using handcrafted rules and large language models.
Experiments show that language models trained on COM2 exhibit significant improvements in complex reasoning ability.
arXiv Detail & Related papers (2024-03-12T08:13:52Z)
- LOIS: Looking Out of Instance Semantics for Visual Question Answering [17.076621453814926]
We propose a model framework that understands the causal nexus of object semantics in images without relying on bounding boxes.
We implement a mutual relation attention module to model sophisticated and deeper visual semantic relations between instance objects and background information.
Our proposed attention model can further analyze salient image regions by focusing on important word-related questions.
arXiv Detail & Related papers (2023-07-26T12:13:00Z)
- Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
The multimodal entity linking (MEL) task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z)
- Visual Causal Scene Refinement for Video Question Answering [117.08431221482638]
We present a causal analysis of VideoQA and propose a framework for cross-modal causal reasoning, named Visual Causal Scene Refinement (VCSR).
Our VCSR involves two essential modules, one of which refines consecutive video frames guided by the question semantics to obtain more representative segment features for causal front-door intervention.
Experiments on the NExT-QA, Causal-VidQA, and MSRVTT-QA datasets demonstrate the superiority of our VCSR in discovering visual causal scenes and achieving robust video question answering.
arXiv Detail & Related papers (2023-05-07T09:05:19Z)
- VCD: Visual Causality Discovery for Cross-Modal Question Reasoning [11.161509939879428]
We propose a visual question reasoning framework named Cross-Modal Question Reasoning (CMQR).
To explicitly discover the visual causal structure, the Visual Causality Discovery (VCD) architecture is proposed to temporally locate question-critical scenes.
To align the fine-grained interactions between linguistic semantics and spatial-temporal representations, we build an Interactive Visual-Linguistic Transformer (IVLT).
arXiv Detail & Related papers (2023-04-17T08:56:16Z)
- Cross-Modal Causal Intervention for Medical Report Generation [109.83549148448469]
Medical report generation (MRG) is essential for computer-aided diagnosis and medication guidance.
Due to the spurious correlations within image-text data induced by visual and linguistic biases, it is challenging to generate accurate reports reliably describing lesion areas.
We propose a novel Visual-Linguistic Causal Intervention (VLCI) framework for MRG, which consists of a visual deconfounding module (VDM) and a linguistic deconfounding module (LDM).
arXiv Detail & Related papers (2023-03-16T07:23:55Z)
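In practice, back-door interventions like those described in several of the papers above are often approximated with a learned dictionary of confounder prototypes and an attention-based expectation over it. The following is a minimal, hypothetical PyTorch sketch of that generic pattern; the module name, dimensions, and the attention-based weighting are assumptions for illustration, not any specific paper's implementation.

```python
import torch
import torch.nn as nn


class BackdoorAdjustment(nn.Module):
    """Hypothetical sketch of dictionary-based back-door adjustment.

    P(Y | do(X)) = sum_z P(Y | X, z) P(z) is approximated by attending over
    a learned set of K confounder prototypes (an NWGM-style approximation).
    Names and dimensions are assumptions, not any paper's exact module.
    """

    def __init__(self, dim: int, num_confounders: int = 64):
        super().__init__()
        # Learned confounder dictionary Z: K prototypes of size `dim`
        self.confounders = nn.Parameter(torch.randn(num_confounders, dim))
        self.query = nn.Linear(dim, dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim) visual or linguistic feature to be deconfounded
        q = self.query(x)                                        # (B, D)
        scores = q @ self.confounders.t() / (q.size(-1) ** 0.5)  # (B, K)
        attn = torch.softmax(scores, dim=-1)                     # weights standing in for P(z)
        z = attn @ self.confounders                              # (B, D) expected confounder feature
        # Fuse the original feature with the confounder expectation
        return self.fuse(torch.cat([x, z], dim=-1))


# Usage sketch: deconfound a batch of 512-d features
# module = BackdoorAdjustment(dim=512)
# out = module(torch.randn(8, 512))   # -> (8, 512)
```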
This list is automatically generated from the titles and abstracts of the papers on this site.