Visual Causal Scene Refinement for Video Question Answering
- URL: http://arxiv.org/abs/2305.04224v2
- Date: Tue, 1 Aug 2023 02:46:43 GMT
- Title: Visual Causal Scene Refinement for Video Question Answering
- Authors: Yushen Wei, Yang Liu, Hong Yan, Guanbin Li, Liang Lin
- Abstract summary: We present a causal analysis of VideoQA and propose a framework for cross-modal causal relational reasoning, named Visual Causal Scene Refinement (VCSR).
Our VCSR involves two essential modules: a Question-Guided Refiner, which refines consecutive video frames guided by the question semantics to obtain more representative segment features for causal front-door intervention, and a Causal Scene Separator, which discovers visual causal and non-causal scenes based on visual-linguistic causal relevance.
Experiments on the NExT-QA, Causal-VidQA, and MSRVTT-QA datasets demonstrate the superiority of our VCSR in discovering visual causal scenes and achieving robust video question answering.
- Score: 117.08431221482638
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing methods for video question answering (VideoQA) often suffer from
spurious correlations between different modalities, leading to a failure in
identifying the dominant visual evidence and the intended question. Moreover,
these methods function as black boxes, making it difficult to interpret the
visual scene during the QA process. In this paper, to discover critical video
segments and frames that serve as the visual causal scene for generating
reliable answers, we present a causal analysis of VideoQA and propose a
framework for cross-modal causal relational reasoning, named Visual Causal
Scene Refinement (VCSR). Particularly, a set of causal front-door intervention
operations is introduced to explicitly find the visual causal scenes at both
segment and frame levels. Our VCSR involves two essential modules: i) the
Question-Guided Refiner (QGR) module, which refines consecutive video frames
guided by the question semantics to obtain more representative segment features
for causal front-door intervention; ii) the Causal Scene Separator (CSS)
module, which discovers a collection of visual causal and non-causal scenes
based on the visual-linguistic causal relevance and estimates the causal effect
of the scene-separating intervention in a contrastive learning manner.
Extensive experiments on the NExT-QA, Causal-VidQA, and MSRVTT-QA datasets
demonstrate the superiority of our VCSR in discovering visual causal scenes and
achieving robust video question answering. The code is available at
https://github.com/YangLiu9208/VCSR.
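For context on the "causal front-door intervention" mentioned above, the standard front-door adjustment from causal inference estimates the effect of the visual input on the answer through a mediating causal scene. The symbols below are generic, not the paper's exact notation (V = visual input, S = causal scene, A = answer):

```latex
P(A \mid \mathrm{do}(V)) \;=\; \sum_{s} P(S = s \mid V) \sum_{v'} P(A \mid v', S = s)\, P(v')
```

The following is a minimal PyTorch-style sketch of how the two described modules could fit together: a question-guided refiner that pools frames into segment features, and a scene separator that splits segments into putative causal and non-causal scenes. The module names, tensor shapes, and attention choices are illustrative assumptions only; the authors' actual implementation is in the repository linked above.

```python
# Hypothetical sketch of a question-guided refiner + causal scene separator.
# Shapes and design choices are assumptions, not the paper's implementation.
import torch
import torch.nn as nn


class QuestionGuidedRefiner(nn.Module):
    """Pools consecutive frame features into a segment feature, weighted by
    their relevance to the question embedding (cross-attention style)."""

    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, frames: torch.Tensor, question: torch.Tensor) -> torch.Tensor:
        # frames: (B, num_frames, dim), question: (B, dim)
        q = question.unsqueeze(1)                  # (B, 1, dim) query
        refined, _ = self.attn(q, frames, frames)  # question attends over frames
        return refined.squeeze(1)                  # (B, dim) segment feature


class CausalSceneSeparator(nn.Module):
    """Scores segments by visual-linguistic relevance and splits them into
    putative causal vs. non-causal scene representations."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, segments: torch.Tensor, question: torch.Tensor):
        # segments: (B, num_segments, dim), question: (B, dim)
        q = question.unsqueeze(1).expand(-1, segments.size(1), -1)
        relevance = self.score(torch.cat([segments, q], dim=-1)).squeeze(-1)
        weights = torch.softmax(relevance, dim=-1)           # causal relevance per segment
        causal = (weights.unsqueeze(-1) * segments).sum(dim=1)
        noncausal = ((1 - weights).unsqueeze(-1) * segments).sum(dim=1)  # unnormalized remainder
        return causal, noncausal, weights


if __name__ == "__main__":
    B, S, T, D = 2, 4, 16, 256                    # batch, segments, frames per segment, feature dim
    frames = torch.randn(B, S, T, D)
    question = torch.randn(B, D)
    refiner, separator = QuestionGuidedRefiner(D), CausalSceneSeparator(D)
    segments = torch.stack([refiner(frames[:, s], question) for s in range(S)], dim=1)
    causal, noncausal, w = separator(segments, question)
    print(causal.shape, noncausal.shape, w.shape)  # (2, 256), (2, 256), (2, 4)
```

In the paper's framing, the causal and non-causal scene representations would then feed a contrastive objective that estimates the effect of the scene-separating intervention; that training loop is omitted here.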
Related papers
- LOIS: Looking Out of Instance Semantics for Visual Question Answering [17.076621453814926]
We propose a model framework without bounding boxes to understand the causal nexus of object semantics in images.
We implement a mutual relation attention module to model sophisticated and deeper visual semantic relations between instance objects and background information.
Our proposed attention model can further analyze salient image regions by focusing on important word-related questions.
arXiv Detail & Related papers (2023-07-26T12:13:00Z)
- Towards Video Anomaly Retrieval from Video Anomaly Detection: New Benchmarks and Model [70.97446870672069]
Video anomaly detection (VAD) has received increasing attention due to its potential applications.
Video Anomaly Retrieval (VAR) aims to pragmatically retrieve relevant anomalous videos via cross-modal queries.
We present two benchmarks, UCFCrime-AR and XD-Violence, constructed on top of prevalent anomaly datasets.
arXiv Detail & Related papers (2023-07-24T06:22:37Z)
- Semantic-aware Dynamic Retrospective-Prospective Reasoning for Event-level Video Question Answering [14.659023742381777]
Event-Level Video Question Answering (EVQA) requires complex reasoning across video events to provide optimal answers.
We propose a semantic-aware dynamic retrospective-prospective reasoning approach for video-based question answering.
Our proposed approach achieves superior performance compared to previous state-of-the-art models.
arXiv Detail & Related papers (2023-05-14T03:57:11Z)
- VCD: Visual Causality Discovery for Cross-Modal Question Reasoning [11.161509939879428]
We propose a visual question reasoning framework named Cross-Modal Question Reasoning (CMQR).
To explicitly discover the visual causal structure, the Visual Causality Discovery (VCD) architecture is proposed to temporally locate question-critical scenes.
To align the fine-grained interactions between linguistic semantics and spatial-temporal representations, we build an Interactive Visual-Linguistic Transformer (IVLT).
arXiv Detail & Related papers (2023-04-17T08:56:16Z)
- Equivariant and Invariant Grounding for Video Question Answering [68.33688981540998]
Most leading VideoQA models work as black boxes, obscuring the visual-linguistic alignment behind the answering process.
We devise a self-interpretable framework, Equivariant and Invariant Grounding for Interpretable VideoQA (EIGV).
EIGV is able to distinguish the causal scene from the environment information, and explicitly present the visual-linguistic alignment.
arXiv Detail & Related papers (2022-07-26T10:01:02Z)
- Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering [134.91774666260338]
Existing visual question answering methods often suffer from cross-modal spurious correlations and oversimplified event-level reasoning processes.
We propose a framework for cross-modal causal relational reasoning to address the task of event-level visual question answering.
arXiv Detail & Related papers (2022-07-26T04:25:54Z)
- Invariant Grounding for Video Question Answering [72.87173324555846]
Video Question Answering (VideoQA) is the task of answering questions about a video.
In leading VideoQA models, the typical learning objective, empirical risk minimization (ERM), latches onto superficial correlations between video-question pairs and answers.
We propose a new learning framework, Invariant Grounding for VideoQA (IGV), to ground the question-critical scene.
arXiv Detail & Related papers (2022-06-06T04:37:52Z)
- Deconfounded Video Moment Retrieval with Causal Intervention [80.90604360072831]
We tackle the task of video moment retrieval (VMR), which aims to localize a specific moment in a video according to a textual query.
Existing methods primarily model the matching relationship between query and moment by complex cross-modal interactions.
We propose a causality-inspired VMR framework that builds a structural causal model to capture the true effect of the query and video content on the prediction.
arXiv Detail & Related papers (2021-06-03T01:33:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.