Semantic-aware Dynamic Retrospective-Prospective Reasoning for
Event-level Video Question Answering
- URL: http://arxiv.org/abs/2305.08059v1
- Date: Sun, 14 May 2023 03:57:11 GMT
- Title: Semantic-aware Dynamic Retrospective-Prospective Reasoning for
Event-level Video Question Answering
- Authors: Chenyang Lyu, Tianbo Ji, Yvette Graham, Jennifer Foster
- Abstract summary: Event-Level Video Question Answering (EVQA) requires complex reasoning across video events to provide optimal answers.
We propose a semantic-aware dynamic retrospective-prospective reasoning approach for video-based question answering.
Our proposed approach achieves superior performance compared to previous state-of-the-art models.
- Score: 14.659023742381777
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Event-Level Video Question Answering (EVQA) requires complex reasoning across
video events to obtain the visual information needed to provide optimal
answers. However, despite significant progress in model performance, few
studies have focused on using the explicit semantic connections between the
question and visual information, especially at the event level. Such semantic
connections are needed to facilitate complex reasoning across
video frames. Therefore, we propose a semantic-aware dynamic
retrospective-prospective reasoning approach for video-based question
answering. Specifically, we explicitly use the Semantic Role Labeling (SRL)
structure of the question in the dynamic reasoning process where we decide to
move to the next frame based on which part of the SRL structure (agent, verb,
patient, etc.) of the question is being focused on. We conduct experiments on a
benchmark EVQA dataset - TrafficQA. Results show that our proposed approach
achieves superior performance compared to previous state-of-the-art models. Our
code will be made publicly available for research use.
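The abstract gives no implementation details, so the following is only a minimal sketch of the core idea, assuming an off-the-shelf SRL parser supplies the question's role labels (agent, verb, patient, etc.) and a frame encoder supplies per-frame features; all module names, dimensions, and the greedy step rule are illustrative assumptions, not the authors' code.

    # Hedged sketch: SRL-guided retrospective-prospective frame navigation.
    # Example question: "Why did the car [agent] hit [verb] the barrier [patient]?"
    # Roles are encoded as integer ids, e.g. {agent: 0, verb: 1, patient: 2, other: 3}.
    import torch
    import torch.nn as nn

    class SRLGuidedNavigator(nn.Module):
        def __init__(self, dim: int = 256, num_roles: int = 4):
            super().__init__()
            self.role_emb = nn.Embedding(num_roles, dim)   # one embedding per SRL role
            # Scores the decision to step back (retrospective), stay, or step
            # forward (prospective) given the current frame and the focused role.
            self.step_head = nn.Linear(2 * dim, 3)

        def forward(self, frame_feats, role_ids):
            # frame_feats: (num_frames, dim); role_ids: SRL roles in focus order.
            num_frames = frame_feats.size(0)
            pos, visited = 0, []
            for role_id in role_ids:
                role = self.role_emb(torch.tensor(role_id))
                logits = self.step_head(torch.cat([frame_feats[pos], role]))
                move = logits.argmax().item() - 1              # {0,1,2} -> {-1, 0, +1}
                pos = min(max(pos + move, 0), num_frames - 1)  # clamp to valid frames
                visited.append(pos)
            return visited

    # Toy usage: 8 frames, question roles agent -> verb -> patient.
    nav = SRLGuidedNavigator()
    print(nav(torch.randn(8, 256), role_ids=[0, 1, 2]))

In the paper the step decision would presumably be learned jointly with the QA objective; the greedy argmax above is only to make the retrospective-prospective control flow concrete.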
Related papers
- DVIS-DAQ: Improving Video Segmentation via Dynamic Anchor Queries [60.09774333024783]
We introduce Dynamic Anchor Queries (DAQ) to shorten the transition gap between the anchor and target queries.
We also introduce a query-level object Emergence and Disappearance Simulation (EDS) strategy, which unleashes DAQ's potential without any additional cost.
Experiments demonstrate that DVIS-DAQ achieves a new state-of-the-art (SOTA) performance on five mainstream video segmentation benchmarks.
arXiv Detail & Related papers (2024-03-29T17:58:50Z)
- Cross-Modal Reasoning with Event Correlation for Video Question Answering [32.332251488360185]
We introduce dense captions as a new auxiliary modality and distill event-correlated information from them to infer the correct answer.
We employ cross-modal reasoning modules for explicitly modeling inter-modal relationships and aggregating relevant information across different modalities.
We propose a question-guided self-adaptive multi-modal fusion module to collect the question-oriented and event-correlated evidence through multi-step reasoning.
arXiv Detail & Related papers (2023-12-20T02:30:39Z)
- Visual Causal Scene Refinement for Video Question Answering [117.08431221482638]
We present a causal analysis of VideoQA and propose a framework for cross-modal causal reasoning, named Visual Causal Scene Refinement (VCSR).
Our VCSR involves two essential modules, one of which refines consecutive video frames guided by the question semantics to obtain more representative segment features for causal front-door intervention.
Experiments on the NExT-QA, Causal-VidQA, and MSRVTT-QA datasets demonstrate the superiority of our VCSR in discovering visual causal scenes and achieving robust video question answering.
arXiv Detail & Related papers (2023-05-07T09:05:19Z)
- Locate before Answering: Answer Guided Question Localization for Video Question Answering [70.38700123685143]
LocAns integrates a question locator and an answer predictor into an end-to-end model.
It achieves state-of-the-art performance on two modern long-term VideoQA datasets.
arXiv Detail & Related papers (2022-10-05T08:19:16Z)
- Video as Conditional Graph Hierarchy for Multi-Granular Question Answering [80.94367625007352]
We argue that while a video is presented as a frame sequence, its visual elements are not sequential but rather hierarchical in semantic space.
We propose to model video as a conditional graph hierarchy which weaves together visual facts of different granularity in a level-wise manner.
arXiv Detail & Related papers (2021-12-12T10:35:19Z)
- NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions [80.60423934589515]
We introduce NExT-QA, a rigorously designed video question answering (VideoQA) benchmark.
We set up multi-choice and open-ended QA tasks targeting causal action reasoning, temporal action reasoning, and common scene comprehension.
We find that top-performing methods excel at shallow scene descriptions but are weak in causal and temporal action reasoning.
arXiv Detail & Related papers (2021-05-18T04:56:46Z)
- Relation-aware Hierarchical Attention Framework for Video Question Answering [6.312182279855817]
We propose a novel Relation-aware Hierarchical Attention (RHA) framework to learn both the static and dynamic relations of the objects in videos.
In particular, videos and questions are first embedded by pre-trained models to obtain visual and textual features.
We consider temporal, spatial, and semantic relations, and fuse the multimodal features with a hierarchical attention mechanism to predict the answer.
arXiv Detail & Related papers (2021-05-13T09:35:42Z)
- HySTER: A Hybrid Spatio-Temporal Event Reasoner [75.41988728376081]
We present HySTER: a Hybrid Spatio-Temporal Event Reasoner to reason over physical events in videos.
We define a method based on general temporal, causal and physics rules which can be transferred across tasks.
This work sets the foundations for the incorporation of inductive logic programming in the field of VideoQA.
arXiv Detail & Related papers (2021-01-17T11:07:17Z)
- Self-supervised pre-training and contrastive representation learning for multiple-choice video QA [39.78914328623504]
Video Question Answering (Video QA) requires fine-grained understanding of both video and language modalities to answer the given questions.
We propose novel training schemes for multiple-choice video question answering, with a self-supervised pre-training stage and supervised contrastive learning as an auxiliary task in the main stage.
We evaluate our proposed model on highly competitive benchmark datasets related to multiple-choice video QA: TVQA, TVQA+, and DramaQA.
arXiv Detail & Related papers (2020-09-17T03:37:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.