Invariant Grounding for Video Question Answering
- URL: http://arxiv.org/abs/2206.02349v1
- Date: Mon, 6 Jun 2022 04:37:52 GMT
- Title: Invariant Grounding for Video Question Answering
- Authors: Yicong Li, Xiang Wang, Junbin Xiao, Wei Ji, Tat-Seng Chua
- Abstract summary: Video Question Answering (VideoQA) is the task of answering questions about a video.
In leading VideoQA models, the typical learning objective, empirical risk minimization (ERM), latches onto superficial correlations between video-question pairs and answers.
We propose a new learning framework, Invariant Grounding for VideoQA (IGV), to ground the question-critical scene.
- Score: 72.87173324555846
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video Question Answering (VideoQA) is the task of answering questions about a
video. At its core is understanding the alignments between visual scenes in
video and linguistic semantics in question to yield the answer. In leading
VideoQA models, the typical learning objective, empirical risk minimization
(ERM), latches onto superficial correlations between video-question pairs and
answers as the alignments. However, ERM can be problematic, because it tends to
over-exploit the spurious correlations between question-irrelevant scenes and
answers, instead of inspecting the causal effect of question-critical scenes.
As a result, the VideoQA models suffer from unreliable reasoning. In this work,
we first take a causal look at VideoQA and argue that invariant grounding is
the key to ruling out the spurious correlations. Towards this end, we propose a
new learning framework, Invariant Grounding for VideoQA (IGV), to ground the
question-critical scene, whose causal relations with answers are invariant
across different interventions on the complement. With IGV, the VideoQA models
are forced to shield the answering process from the negative influence of
spurious correlations, which significantly improves the reasoning ability.
Experiments on three benchmark datasets validate the superiority of IGV in
terms of accuracy, visual explainability, and generalization ability over the
leading baselines.
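To make the invariant-grounding idea concrete, below is a minimal, hedged sketch (not the authors' released code) of how such an objective could be wired up in PyTorch: a grounding head softly splits video clip features into a question-critical scene and its complement, and the answer loss is also computed after swapping complements across the batch, as a stand-in for "interventions on the complement." All module names, feature shapes, and the exact loss composition are illustrative assumptions.

```python
# Minimal, illustrative sketch of an invariant-grounding objective
# (assumed names and shapes; not the authors' implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class InvariantGroundingSketch(nn.Module):
    def __init__(self, clip_dim=512, question_dim=512, num_answers=1000):
        super().__init__()
        # Scores how question-critical each clip is.
        self.grounder = nn.Linear(clip_dim + question_dim, 1)
        # Predicts the answer from a (scene, question) pair.
        self.answerer = nn.Linear(clip_dim + question_dim, num_answers)

    def ground(self, clips, question):
        # clips: (B, T, D) clip features; question: (B, D) question feature.
        q = question.unsqueeze(1).expand(-1, clips.size(1), -1)
        scores = torch.sigmoid(self.grounder(torch.cat([clips, q], dim=-1)))  # (B, T, 1)
        causal = scores * clips               # soft question-critical scene
        complement = (1.0 - scores) * clips   # the rest of the video
        return causal, complement

    def answer_logits(self, scene, question):
        pooled = scene.mean(dim=1)            # pool clips into one video vector
        return self.answerer(torch.cat([pooled, question], dim=-1))

def invariant_grounding_loss(model, clips, question, answers):
    causal, complement = model.ground(clips, question)

    # (1) The grounded scene alone should predict the answer.
    loss_causal = F.cross_entropy(model.answer_logits(causal, question), answers)

    # (2) "Intervention on the complement": pair each causal scene with a
    # complement taken from another video in the batch; the answer should
    # be unchanged, so the same labels are reused.
    perm = torch.randperm(clips.size(0), device=clips.device)
    intervened = causal + complement[perm]
    loss_invariant = F.cross_entropy(model.answer_logits(intervened, question), answers)

    return loss_causal + loss_invariant
```

In this sketch, keeping both loss terms low pushes the grounder toward clips whose relation to the answer does not change when the rest of the video is replaced, which is the invariance property the abstract describes.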
Related papers
- Can I Trust Your Answer? Visually Grounded Video Question Answering [88.11169242115416]
We study visually grounded VideoQA in response to the emerging trends of utilizing pretraining techniques for video-language understanding.
We construct NExT-GQA -- an extension of NExT-QA with 10.5K temporal grounding labels tied to the original QA pairs.
arXiv Detail & Related papers (2023-09-04T03:06:04Z)
- Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models [15.994664381976984]
We introduce a new benchmark, Open-vocabulary Video Question Answering (OVQA), to measure the generalizability of VideoQA models.
In addition, we introduce a novel GNN-based soft verbalizer that enhances the prediction on rare and unseen answers.
Our ablation studies and qualitative analyses demonstrate that our GNN-based soft verbalizer further improves the model performance.
arXiv Detail & Related papers (2023-08-18T07:45:10Z)
- Visual Causal Scene Refinement for Video Question Answering [117.08431221482638]
We present a causal analysis of VideoQA and propose a framework for cross-modal causal reasoning, named Visual Causal Scene Refinement (VCSR).
Our VCSR involves two essential modules, which refine consecutive video frames guided by the question semantics to obtain more representative segment features for causal front-door intervention.
Experiments on the NExT-QA, Causal-VidQA, and MSRVTT-QA datasets demonstrate the superiority of our VCSR in discovering visual causal scenes and achieving robust video question answering.
arXiv Detail & Related papers (2023-05-07T09:05:19Z)
- Language Models are Causal Knowledge Extractors for Zero-shot Video Question Answering [60.93164850492871]
Causal Video Question Answering (CVidQA) queries not only association or temporal relations but also causal relations in a video.
We propose a novel framework, Causal Knowledge Extraction from Language Models (CaKE-LM), leveraging causal commonsense knowledge from language models to tackle CVidQA.
CaKE-LM significantly outperforms conventional methods by 4% to 6% in zero-shot CVidQA accuracy on the NExT-QA and Causal-VidQA datasets.
arXiv Detail & Related papers (2023-04-07T17:45:49Z)
- Equivariant and Invariant Grounding for Video Question Answering [68.33688981540998]
Most leading VideoQA models work as black boxes, obscuring the visual-linguistic alignment behind the answering process.
We devise a self-interpretable framework, Equivariant and Invariant Grounding for Interpretable VideoQA (EIGV).
EIGV is able to distinguish the causal scene from the environment information, and explicitly present the visual-linguistic alignment.
arXiv Detail & Related papers (2022-07-26T10:01:02Z)
- NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions [80.60423934589515]
We introduce NExT-QA, a rigorously designed video question answering (VideoQA) benchmark.
We set up multi-choice and open-ended QA tasks targeting causal action reasoning, temporal action reasoning, and common scene comprehension.
We find that top-performing methods excel at shallow scene descriptions but are weak in causal and temporal action reasoning.
arXiv Detail & Related papers (2021-05-18T04:56:46Z)