Equivariant and Invariant Grounding for Video Question Answering
- URL: http://arxiv.org/abs/2207.12783v1
- Date: Tue, 26 Jul 2022 10:01:02 GMT
- Title: Equivariant and Invariant Grounding for Video Question Answering
- Authors: Yicong Li, Xiang Wang, Junbin Xiao, and Tat-Seng Chua
- Abstract summary: Most leading VideoQA models work as black boxes, obscuring the visual-linguistic alignment behind the answering process.
We devise a self-interpretable framework, Equivariant and Invariant Grounding for Interpretable VideoQA (EIGV).
EIGV is able to distinguish the causal scene from the environment information and explicitly present the visual-linguistic alignment.
- Score: 68.33688981540998
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Video Question Answering (VideoQA) is the task of answering natural
language questions about a video. Producing an answer requires understanding the
interplay between the visual scenes in the video and the linguistic semantics of
the question. However, most leading VideoQA models work as black boxes, obscuring
the visual-linguistic alignment behind the answering process. This black-box
nature calls for visual explainability that reveals ``What part of the video
should the model look at to answer the question?''. Only a few works present
visual explanations, and they do so in a post-hoc fashion, emulating the target
model's answering process via an additional method. Nonetheless, such emulation
struggles to faithfully exhibit the visual-linguistic alignment during answering.
Instead of post-hoc explainability, we focus on intrinsic interpretability to
make the answering process transparent. At its core is grounding the
question-critical cues as the causal scene used to yield answers, while setting
aside the question-irrelevant information as the environment scene. Taking a
causal look at VideoQA, we devise a self-interpretable framework, Equivariant and
Invariant Grounding for Interpretable VideoQA (EIGV). Specifically, equivariant
grounding encourages the answering to be sensitive to semantic changes in the
causal scene and the question; in contrast, invariant grounding enforces the
answering to be insensitive to changes in the environment scene. By imposing both
constraints on the answering process, EIGV is able to distinguish the causal
scene from the environment information and to explicitly present the
visual-linguistic alignment. Extensive experiments on three benchmark datasets
demonstrate the superiority of EIGV over the leading baselines in terms of both
accuracy and visual interpretability.
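The two grounding constraints are stated at the level of objectives rather than implementation. As a rough, hypothetical sketch only (not the paper's actual architecture or losses), the PyTorch snippet below shows one way such constraints could be written down: a grounder predicts a soft frame mask that splits a clip into a causal part and an environment part, an invariance term keeps the answer distribution stable when the environment frames are swapped with frames from another clip, and an equivariance term penalises the prediction for staying unchanged when the causal frames are corrupted. The names (`ToyGroundedVideoQA`, `eigv_style_losses`), the mask-based scene split, and the specific loss forms are all assumptions made for illustration.

```python
# Illustrative sketch only -- NOT the official EIGV code. Hypothetical modules and
# loss forms; it merely mirrors the "sensitive to causal edits, insensitive to
# environment edits" idea from the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyGroundedVideoQA(nn.Module):
    """Splits a clip into causal/environment frames with a soft mask, then
    answers from the causal part plus the question."""

    def __init__(self, dim=256, num_answers=1000):
        super().__init__()
        self.grounder = nn.Linear(dim * 2, 1)          # scores each frame against the question
        self.answerer = nn.Linear(dim * 2, num_answers)

    def forward(self, video, question):
        # video: (B, T, dim) frame features; question: (B, dim) sentence feature
        q = question.unsqueeze(1).expand(-1, video.size(1), -1)
        mask = torch.sigmoid(self.grounder(torch.cat([video, q], dim=-1)))  # (B, T, 1)
        causal = (mask * video).mean(dim=1)            # question-critical scene
        logits = self.answerer(torch.cat([causal, question], dim=-1))
        return logits, mask


def eigv_style_losses(model, video, question, answer, other_video):
    """One hedged reading of the two constraints:
    - invariance:   swapping in another clip's environment should not move the answer;
    - equivariance: corrupting the causal scene should move the answer."""
    logits, mask = model(video, question)
    qa_loss = F.cross_entropy(logits, answer)

    # Invariance: keep causal frames, replace environment frames with another clip's.
    env_swapped = mask * video + (1 - mask) * other_video
    logits_env, _ = model(env_swapped, question)
    invariant_loss = F.kl_div(F.log_softmax(logits_env, dim=-1),
                              F.softmax(logits, dim=-1).detach(),
                              reduction="batchmean")

    # Equivariance: corrupt the causal frames instead; penalise the model if the
    # two answer distributions remain too similar (hinge on their overlap).
    causal_corrupted = mask * other_video + (1 - mask) * video
    logits_cau, _ = model(causal_corrupted, question)
    overlap = (F.softmax(logits, dim=-1) * F.softmax(logits_cau, dim=-1)).sum(dim=-1)
    equivariant_loss = F.relu(overlap - 0.5).mean()    # 0.5 margin chosen arbitrarily

    return qa_loss + invariant_loss + equivariant_loss
```

A toy call, again purely illustrative, pairs each clip with a randomly permuted batch member as the `other_video` source:

```python
B, T, D = 4, 16, 256
model = ToyGroundedVideoQA(dim=D, num_answers=50)
video, question = torch.randn(B, T, D), torch.randn(B, D)
answer = torch.randint(0, 50, (B,))
loss = eigv_style_losses(model, video, question, answer, video[torch.randperm(B)])
loss.backward()
```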
Related papers
- Boosting Audio Visual Question Answering via Key Semantic-Aware Cues [8.526720031181027] (2024-07-30T09:41:37Z)
The Audio Visual Question Answering (AVQA) task aims to answer questions related to various visual objects, sounds, and their interactions in videos.
We propose a Temporal-Spatial Perception Model (TSPM), which aims to empower the model to perceive key visual and auditory cues related to the questions.
- Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models [59.05769810380928] (2023-10-09T16:57:57Z)
Rephrase, Augment and Reason (RepARe) is a gradient-free framework that extracts salient details about the image using the underlying vision-language model.
We show that RepARe yields absolute gains in zero-shot accuracy of 3.85% on VQAv2, 6.41% on A-OKVQA, and 7.94% on VizWiz.
- Interpretable Visual Question Answering via Reasoning Supervision [4.76359068115052] (2023-09-07T14:12:31Z)
Transformer-based architectures have recently demonstrated remarkable performance in the Visual Question Answering (VQA) task.
We propose a novel architecture for visual question answering that leverages common sense reasoning as a supervisory signal.
We demonstrate both quantitatively and qualitatively that the proposed approach can boost the model's visual perception capability and lead to a performance increase.
- Visual Causal Scene Refinement for Video Question Answering [117.08431221482638] (2023-05-07T09:05:19Z)
We present a causal analysis of VideoQA and propose a framework for cross-modal causal reasoning, named Visual Causal Scene Refinement (VCSR).
Our VCSR involves two essential modules that refine consecutive video frames, guided by the question semantics, to obtain more representative segment features for causal front-door intervention.
Experiments on the NExT-QA, Causal-VidQA, and MSRVTT-QA datasets demonstrate the superiority of our VCSR in discovering visual causal scenes and achieving robust video question answering.
- Tell Me the Evidence? Dual Visual-Linguistic Interaction for Answer Grounding [27.9150632791267] (2022-06-21T03:15:27Z)
We propose Dual Visual-Linguistic Interaction (DaVI), a novel unified end-to-end framework with the capability for both linguistic answering and visual grounding.
DaVI innovatively introduces two visual-linguistic interaction mechanisms: 1) visual-based linguistic encoder that understands questions incorporated with visual features and produces linguistic-oriented evidence for further answer decoding, and 2) linguistic-based visual decoder that focuses visual features on the evidence-related regions for answer grounding.
- Invariant Grounding for Video Question Answering [72.87173324555846] (2022-06-06T04:37:52Z)
Video Question Answering (VideoQA) is the task of answering questions about a video.
In leading VideoQA models, the typical learning objective, empirical risk minimization (ERM), latches onto superficial correlations between video-question pairs and answers.
We propose a new learning framework, Invariant Grounding for VideoQA (IGV), to ground the question-critical scene.
- Video as Conditional Graph Hierarchy for Multi-Granular Question Answering [80.94367625007352] (2021-12-12T10:35:19Z)
We argue that while a video is presented as a frame sequence, the visual elements are not sequential but rather hierarchical in semantic space.
We propose to model video as a conditional graph hierarchy which weaves together visual facts of different granularity in a level-wise manner.
- Unified Questioner Transformer for Descriptive Question Generation in Goal-Oriented Visual Dialogue [0.0] (2021-06-29T16:36:34Z)
Building an interactive artificial intelligence that can ask questions about the real world is one of the biggest challenges for vision and language problems.
We propose a novel Questioner architecture, called Unified Questioner Transformer (UniQer).
We build a goal-oriented visual dialogue task called CLEVR Ask. It synthesizes complex scenes that require the Questioner to generate descriptive questions.
- HySTER: A Hybrid Spatio-Temporal Event Reasoner [75.41988728376081] (2021-01-17T11:07:17Z)
We present HySTER, a Hybrid Spatio-Temporal Event Reasoner, to reason over physical events in videos.
We define a method based on general temporal, causal and physics rules which can be transferred across tasks.
This work sets the foundations for the incorporation of inductive logic programming in the field of VideoQA.