VCD: Visual Causality Discovery for Cross-Modal Question Reasoning
- URL: http://arxiv.org/abs/2304.08083v2
- Date: Mon, 11 Dec 2023 04:54:39 GMT
- Title: VCD: Visual Causality Discovery for Cross-Modal Question Reasoning
- Authors: Yang Liu, Ying Tan, Jingzhou Luo, Weixing Chen
- Abstract summary: We propose a visual question reasoning framework named Cross-Modal Question Reasoning (CMQR).
To explicitly discover visual causal structure, the Visual Causality Discovery (VCD) architecture is proposed to find question-critical scenes temporally.
To align the fine-grained interactions between linguistic semantics and spatial-temporal representations, we build an Interactive Visual-Linguistic Transformer (IVLT).
- Score: 11.161509939879428
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing visual question reasoning methods usually fail to explicitly
discover the inherent causal mechanism and ignore jointly modeling cross-modal
event temporality and causality. In this paper, we propose a visual question
reasoning framework named Cross-Modal Question Reasoning (CMQR), to discover
temporal causal structure and mitigate visual spurious correlation by causal
intervention. To explicitly discover visual causal structure, the Visual
Causality Discovery (VCD) architecture is proposed to find question-critical
scenes temporally and to disentangle visual spurious correlations with an
attention-based front-door causal intervention module named the Local-Global
Causal Attention Module (LGCAM). To align the fine-grained interactions between
linguistic semantics and spatial-temporal representations, we introduce an
Interactive Visual-Linguistic Transformer (IVLT) that models the multi-modal
co-occurrence interactions between visual and linguistic content. Extensive
experiments on four datasets demonstrate the superiority of CMQR for
discovering visual causal structures and achieving robust question reasoning.
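As a rough illustration of what an attention-based front-door intervention of this kind might look like, the PyTorch sketch below fuses question-attended local frame features with an expectation over a learned global feature dictionary. The module structure, all shapes, and the learned dictionary are illustrative assumptions, not the authors' LGCAM implementation.

```python
# A minimal sketch of local-global causal attention, assuming a learned global
# dictionary stands in for dataset-level visual statistics.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalGlobalCausalAttention(nn.Module):
    """Fuses question-attended local frame features with a global feature
    dictionary, loosely mirroring the outer expectation of a front-door
    adjustment P(Y|do(X)) = sum_z P(z|X) sum_x P(x) P(Y|z, x)."""
    def __init__(self, dim: int, dict_size: int = 256):
        super().__init__()
        # Global dictionary; real systems would typically build this from
        # dataset-level statistics (e.g., clustered frame features).
        self.global_dict = nn.Parameter(torch.randn(dict_size, dim) * 0.02)
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, question: torch.Tensor, frames: torch.Tensor) -> torch.Tensor:
        # question: (B, D); frames: (B, T, D)
        scale = frames.size(-1) ** 0.5
        q = self.q_proj(question).unsqueeze(1)                       # (B, 1, D)
        # Local branch: attend to question-critical frames.
        local_w = F.softmax((q * self.k_proj(frames)).sum(-1) / scale, dim=-1)
        local = (local_w.unsqueeze(-1) * frames).sum(1)              # (B, D)
        # Global branch: expectation over the dictionary, approximating
        # the sum over x weighted by P(x).
        glob_w = F.softmax(question @ self.global_dict.t() / scale, dim=-1)
        glob = glob_w @ self.global_dict                             # (B, D)
        return self.fuse(torch.cat([local, glob], dim=-1))           # (B, D)
```

The fusion by concatenation is one simple choice; the key idea is that the output depends on global statistics as well as the observed clip, which is what weakens spurious local correlations.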
Related papers
- Towards Context-Aware Emotion Recognition Debiasing from a Causal Demystification Perspective via De-confounded Training [14.450673163785094]
Context-Aware Emotion Recognition (CAER) exploits context as a source of valuable semantic cues for recognizing the emotions of target persons.
Current approaches invariably focus on designing sophisticated structures to extract perceptually critical representations from contexts.
We present a Contextual Causal Intervention Module (CCIM) to de-confound the context bias through causal intervention.
arXiv Detail & Related papers (2024-07-06T05:29:02Z)
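A minimal sketch of what dictionary-based de-confounding can look like, assuming a backdoor-style adjustment over a context-confounder dictionary with a uniform prior; CCIM's actual design may differ.

```python
# Hedged sketch: approximate P(Y|do(X)) = sum_z P(Y|X, z) P(z) by augmenting
# the input feature with a prior-weighted expectation over confounder entries.
import torch
import torch.nn as nn

class ContextualCausalIntervention(nn.Module):
    def __init__(self, dim: int, num_confounders: int = 128):
        super().__init__()
        # z_dict could be built offline by clustering context features;
        # learned here purely for self-containment.
        self.z_dict = nn.Parameter(torch.randn(num_confounders, dim) * 0.02)
        # P(z): a uniform prior for simplicity; real systems may estimate it.
        self.register_buffer("p_z", torch.full((num_confounders,), 1.0 / num_confounders))
        self.out = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, D). With a uniform prior this expectation is just the mean
        # over the dictionary, E_z[z].
        z_bar = self.p_z @ self.z_dict             # (D,)
        z_bar = z_bar.unsqueeze(0).expand_as(x)    # (B, D)
        return self.out(torch.cat([x, z_bar], dim=-1))
```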
- Cause and Effect: Can Large Language Models Truly Understand Causality? [1.2334534968968969]
This research proposes a novel framework called Context-Aware Reasoning Enhancement with Counterfactual Analysis (CARE-CA).
The proposed framework incorporates an explicit causal detection module with ConceptNet and counterfactual statements, as well as implicit causal detection through Large Language Models.
The knowledge from ConceptNet enhances the performance of multiple causal reasoning tasks such as causal discovery, causal identification and counterfactual reasoning.
arXiv Detail & Related papers (2024-02-28T08:02:14Z)
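The toy sketch below illustrates the general idea of pairing explicit knowledge triples with counterfactual queries for an implicit (LLM) module to score. The tiny in-memory triple store merely stands in for ConceptNet, and no real LLM or ConceptNet API is called.

```python
# Illustrative only: a hand-made triple store standing in for ConceptNet,
# plus a counterfactual prompt builder a hypothetical LLM scorer could consume.
CAUSAL_TRIPLES = {
    ("rain", "Causes"): ["wet ground", "flood"],
    ("smoking", "Causes"): ["lung disease"],
}

def explicit_effects(cause: str) -> list[str]:
    """Look up asserted effects of a concept in the knowledge store."""
    return CAUSAL_TRIPLES.get((cause, "Causes"), [])

def counterfactual_prompt(cause: str, effect: str) -> str:
    """Build a counterfactual query an implicit (LLM) module could score."""
    return f"If '{cause}' had not occurred, would '{effect}' still occur?"

if __name__ == "__main__":
    for effect in explicit_effects("rain"):
        print(counterfactual_prompt("rain", effect))
```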
- Interpretable Visual Question Answering via Reasoning Supervision [4.76359068115052]
Transformer-based architectures have recently demonstrated remarkable performance in the Visual Question Answering (VQA) task.
We propose a novel architecture for visual question answering that leverages common sense reasoning as a supervisory signal.
We demonstrate both quantitatively and qualitatively that the proposed approach can boost the model's visual perception capability and lead to a performance increase.
arXiv Detail & Related papers (2023-09-07T14:12:31Z)
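A hedged sketch of reasoning supervision as an auxiliary loss: the answer cross-entropy is combined with a divergence term that pulls model attention toward supervised attention targets. The KL form and the weighting `lam` are assumptions for illustration, not the paper's exact objective.

```python
import torch.nn.functional as F

def vqa_loss_with_reasoning_supervision(answer_logits, answer_targets,
                                        attn_maps, attn_targets, lam=0.5):
    """Answer cross-entropy plus a KL term that pulls the model's attention
    toward supervised (common-sense) attention targets."""
    # answer_logits: (B, C); answer_targets: (B,) class indices.
    task_loss = F.cross_entropy(answer_logits, answer_targets)
    # attn_maps, attn_targets: (B, N) distributions over image regions;
    # attn_maps are assumed to be softmax outputs (strictly positive).
    sup_loss = F.kl_div(attn_maps.log(), attn_targets, reduction="batchmean")
    return task_loss + lam * sup_loss
```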
- Visual Causal Scene Refinement for Video Question Answering [117.08431221482638]
We present a causal analysis of VideoQA and propose a framework for cross-modal causal reasoning, named Visual Causal Scene Refinement (VCSR).
Our VCSR involves two essential modules that refine consecutive video frames guided by the question semantics, obtaining more representative segment features for causal front-door intervention.
Experiments on the NExT-QA, Causal-VidQA, and MSRVTT-QA datasets demonstrate the superiority of our VCSR in discovering visual causal scenes and achieving robust video question answering.
arXiv Detail & Related papers (2023-05-07T09:05:19Z)
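A minimal sketch of question-guided scene refinement: score frames against the question and keep the top-k before any downstream intervention. The cosine similarity and fixed k are illustrative choices rather than VCSR's actual refiner.

```python
import torch
import torch.nn.functional as F

def refine_causal_scene(question: torch.Tensor, frames: torch.Tensor, k: int = 8):
    """Keep the k frames most similar to the question (requires T >= k).

    question: (B, D); frames: (B, T, D) -> returns (B, k, D)."""
    sim = F.cosine_similarity(frames, question.unsqueeze(1), dim=-1)  # (B, T)
    idx = sim.topk(k, dim=1).indices                                  # (B, k)
    idx = idx.unsqueeze(-1).expand(-1, -1, frames.size(-1))           # (B, k, D)
    return frames.gather(1, idx)
```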
- Cross-Modal Causal Intervention for Medical Report Generation [109.83549148448469]
Medical report generation (MRG) is essential for computer-aided diagnosis and medication guidance.
Due to the spurious correlations within image-text data induced by visual and linguistic biases, it is challenging to generate accurate reports reliably describing lesion areas.
We propose a novel Visual-Linguistic Causal Intervention (VLCI) framework for MRG, which consists of a visual deconfounding module (VDM) and a linguistic deconfounding module (LDM).
arXiv Detail & Related papers (2023-03-16T07:23:55Z)
- Hierarchical Graph Neural Networks for Causal Discovery and Root Cause Localization [52.72490784720227]
REASON consists of Topological Causal Discovery and Individual Causal Discovery.
The Topological Causal Discovery component aims to model the fault propagation in order to trace back to the root causes.
The Individual Causal Discovery component focuses on capturing abrupt change patterns of a single system entity.
arXiv Detail & Related papers (2023-02-03T20:17:45Z)
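As a toy illustration of root-cause tracing over a causal graph, the sketch below walks backwards from a faulty entity to upstream entities with no parents. The example graph is invented; REASON learns such structure from monitoring data.

```python
# Toy root-cause tracing on a hand-written causal DAG of system entities.
from collections import deque

# Edges point from cause to effect (assumed example, not real system data).
CAUSAL_EDGES = {"db": ["api"], "api": ["frontend"], "disk": ["db"]}

def trace_root_causes(fault: str) -> set[str]:
    """Return entities with no upstream causes that can reach the fault."""
    parents: dict[str, list[str]] = {}
    for cause, effects in CAUSAL_EDGES.items():
        for e in effects:
            parents.setdefault(e, []).append(cause)
    roots, queue, seen = set(), deque([fault]), {fault}
    while queue:
        node = queue.popleft()
        ups = parents.get(node, [])
        if not ups:
            roots.add(node)  # no parents: a candidate root cause
        for u in ups:
            if u not in seen:
                seen.add(u)
                queue.append(u)
    return roots

print(trace_root_causes("frontend"))  # {'disk'}
```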
- Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering [134.91774666260338]
Existing visual question answering methods often suffer from cross-modal spurious correlations and oversimplified event-level reasoning processes.
We propose a framework for cross-modal causal relational reasoning to address the task of event-level visual question answering.
arXiv Detail & Related papers (2022-07-26T04:25:54Z)
- Guiding Visual Question Answering with Attention Priors [76.21671164766073]
We propose to guide the attention mechanism using explicit linguistic-visual grounding.
This grounding is derived by connecting structured linguistic concepts in the query to their referents among the visual objects.
The resultant algorithm is capable of probing attention-based reasoning models, injecting relevant associative knowledge, and regulating the core reasoning process.
arXiv Detail & Related papers (2022-05-25T09:53:47Z)
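A minimal sketch of injecting a grounding prior into attention: the learned attention distribution is mixed with a prior derived from linguistic-visual matching. The convex mixing and the coefficient `alpha` are assumptions for illustration, not the paper's mechanism.

```python
import torch
import torch.nn.functional as F

def guided_attention(attn_logits: torch.Tensor, grounding_prior: torch.Tensor,
                     alpha: float = 0.3) -> torch.Tensor:
    """Mix learned attention with a grounding prior over visual objects.

    attn_logits:     (B, N) raw attention scores over N objects
    grounding_prior: (B, N) nonnegative mass from matching query concepts
                     to detected objects
    """
    learned = F.softmax(attn_logits, dim=-1)
    # Normalize the prior into a distribution, guarding against empty priors.
    prior = grounding_prior / grounding_prior.sum(-1, keepdim=True).clamp_min(1e-8)
    return (1 - alpha) * learned + alpha * prior
```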
- Deconfounded Video Moment Retrieval with Causal Intervention [80.90604360072831]
We tackle the task of video moment retrieval (VMR), which aims to localize a specific moment in a video according to a textual query.
Existing methods primarily model the matching relationship between query and moment by complex cross-modal interactions.
We propose a causality-inspired VMR framework that builds a structural causal model to capture the true effect of the query and video content on the prediction.
arXiv Detail & Related papers (2021-06-03T01:33:26Z)
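The sketch below illustrates one way a causal-intervention scorer for VMR could marginalize the prediction over a discretized confounder (here assumed to be moment location) instead of conditioning on the observed one; this is an illustrative reading of the summary, not the paper's model.

```python
# Hedged sketch: average the moment score over all confounder strata,
# approximating sum_l P(Y | q, m, l) P(l) with a uniform P(l).
import torch
import torch.nn as nn

class DeconfoundedMomentScorer(nn.Module):
    def __init__(self, dim: int, num_locations: int = 10):
        super().__init__()
        # One embedding per discretized moment location (assumed confounder).
        self.loc_emb = nn.Embedding(num_locations, dim)
        self.score = nn.Linear(3 * dim, 1)

    def forward(self, query: torch.Tensor, moment: torch.Tensor) -> torch.Tensor:
        # query, moment: (B, D) -> returns (B,) deconfounded matching scores.
        B, D = query.shape
        L = self.loc_emb.num_embeddings
        locs = self.loc_emb.weight.unsqueeze(0).expand(B, L, D)   # (B, L, D)
        q = query.unsqueeze(1).expand(B, L, D)
        m = moment.unsqueeze(1).expand(B, L, D)
        scores = self.score(torch.cat([q, m, locs], dim=-1)).squeeze(-1)  # (B, L)
        return scores.mean(dim=1)
```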