Discovering Spatio-Temporal Rationales for Video Question Answering
- URL: http://arxiv.org/abs/2307.12058v1
- Date: Sat, 22 Jul 2023 12:00:26 GMT
- Title: Discovering Spatio-Temporal Rationales for Video Question Answering
- Authors: Yicong Li, Junbin Xiao, Chun Feng, Xiang Wang, Tat-Seng Chua
- Abstract summary: This paper strives to solve complex video question answering (VideoQA), which features long videos containing multiple objects and events at different times.
We propose a Spatio-Temporal Rationalization (STR) that adaptively collects question-critical moments and objects using cross-modal interaction.
We also propose TranSTR, a Transformer-style neural network architecture that takes STR as the core and additionally underscores a novel answer interaction mechanism.
- Score: 68.33688981540998
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper strives to solve complex video question answering (VideoQA), which
features long videos containing multiple objects and events at different times.
To tackle the challenge, we highlight the importance of identifying
question-critical temporal moments and spatial objects from the vast amount of
video content. Towards this, we propose a Spatio-Temporal Rationalization
(STR), a differentiable selection module that adaptively collects
question-critical moments and objects using cross-modal interaction. The
discovered video moments and objects then serve as grounded rationales to
support answer reasoning. Based on STR, we further propose TranSTR, a
Transformer-style neural network architecture that takes STR as the core and
additionally underscores a novel answer interaction mechanism to coordinate STR
for answer decoding. Experiments on four datasets show that TranSTR achieves
new state-of-the-art (SoTA). In particular, on NExT-QA and Causal-VidQA, which
feature complex VideoQA, it significantly surpasses the previous SoTA by 5.8%
and 6.8%, respectively. We then conduct extensive studies to verify the
importance of STR as well as the proposed answer interaction mechanism. With
the success of TranSTR and our comprehensive analysis, we hope this work can
spark more future efforts in complex VideoQA. Code will be released at
https://github.com/yl3800/TranSTR.
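The abstract describes STR only at a high level: a differentiable module that scores video moments and objects against the question and keeps the critical ones. As a concrete illustration, below is a minimal PyTorch sketch of one common way to realize such question-conditioned differentiable selection (a Gumbel-softmax soft top-k); the class name, shapes, and hyperparameters are assumptions for illustration, not the released TranSTR code.

```python
# Minimal sketch of a question-conditioned differentiable selector in the
# spirit of STR. NOT the released TranSTR code; the module name, shapes,
# and the Gumbel-softmax soft top-k trick are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DifferentiableSelector(nn.Module):
    """Scores candidate units (video moments or object regions) against a
    question vector and softly keeps the top-k, staying differentiable."""

    def __init__(self, dim: int = 512, k: int = 4, tau: float = 1.0):
        super().__init__()
        self.k, self.tau = k, tau
        self.scorer = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1)
        )

    def forward(self, units: torch.Tensor, question: torch.Tensor) -> torch.Tensor:
        # units: (B, N, D) frame/object features; question: (B, D)
        B, N, D = units.shape
        q = question.unsqueeze(1).expand(B, N, D)
        # Cross-modal interaction: score each unit jointly with the question.
        logits = self.scorer(torch.cat([units, q], dim=-1)).squeeze(-1)  # (B, N)

        # Draw k Gumbel-softmax samples; each is a soft one-hot over the N
        # candidates, so the stack acts as a soft top-k selection mask.
        masks = torch.stack(
            [F.gumbel_softmax(logits, tau=self.tau) for _ in range(self.k)],
            dim=1,
        )  # (B, k, N)
        # Soft-selected rationale features: (B, k, D)
        return torch.bmm(masks, units)

# Usage: select 4 question-critical moments from 32 frame features.
frames = torch.randn(2, 32, 512)
question = torch.randn(2, 512)
rationales = DifferentiableSelector()(frames, question)
print(rationales.shape)  # torch.Size([2, 4, 512])
```

A hard top-k with straight-through gradients would be an equally plausible reading of "differentiable selection"; the soft variant above simply keeps the gradient path obvious.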
Related papers
- VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection [61.54044967253421]
We introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence.
Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o.
We propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM.
arXiv Detail & Related papers (2024-11-22T08:33:36Z)
- Neural-Symbolic VideoQA: Learning Compositional Spatio-Temporal Reasoning for Real-world Video Question Answering [0.9712140341805068]
We propose a neural-symbolic framework, Neural-Symbolic VideoQA (NSVideo-QA), for real-world VideoQA tasks.
NSVideo-QA exhibits internal consistency in answering compositional questions and significantly improves the capability of logical inference for VideoQA tasks.
arXiv Detail & Related papers (2024-04-05T10:30:38Z)
- Semantic-aware Dynamic Retrospective-Prospective Reasoning for Event-level Video Question Answering [14.659023742381777]
Event-Level Video Question Answering (EVQA) requires complex reasoning across video events to provide optimal answers.
We propose a semantic-aware dynamic retrospective-prospective reasoning approach for video-based question answering.
Our proposed approach achieves superior performance compared to previous state-of-the-art models.
arXiv Detail & Related papers (2023-05-14T03:57:11Z)
- Locate before Answering: Answer Guided Question Localization for Video Question Answering [70.38700123685143]
LocAns integrates a question locator and an answer predictor into an end-to-end model.
It achieves state-of-the-art performance on two modern long-term VideoQA datasets.
arXiv Detail & Related papers (2022-10-05T08:19:16Z)
- Structured Two-stream Attention Network for Video Question Answering [168.95603875458113]
We propose a Structured Two-stream Attention network, namely STA, to answer a free-form or open-ended natural language question.
First, we infer rich long-range temporal structures in videos using our structured segment component and encode text features.
Then, our structured two-stream attention component simultaneously localizes important visual instances, reduces the influence of background video, and focuses on the relevant text.
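The two sentences above only gesture at the mechanism. As a hedged sketch of one plausible reading, assuming cross-attention in both directions (text attending over video segments and video attending over text tokens), it could look like the following; none of this is the paper's actual STA architecture.

```python
# Hedged sketch of a two-stream attention idea: each modality attends over
# the other, down-weighting background content. Illustrative only.
import torch
import torch.nn as nn

class TwoStreamAttention(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.v_from_t = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.t_from_v = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, video: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # video: (B, S, D) segment features; text: (B, L, D) token features
        v_attended, _ = self.v_from_t(text, video, video)  # text queries video
        t_attended, _ = self.t_from_v(video, text, text)   # video queries text
        # Pool each stream and fuse into a joint representation.
        joint = torch.cat([v_attended.mean(1), t_attended.mean(1)], dim=-1)
        return self.fuse(joint)  # (B, D)
```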
arXiv Detail & Related papers (2022-06-02T12:25:52Z)
- Relation-aware Hierarchical Attention Framework for Video Question Answering [6.312182279855817]
We propose a novel Relation-aware Hierarchical Attention (RHA) framework to learn both the static and dynamic relations of the objects in videos.
In particular, videos and questions are first embedded by pre-trained models to obtain visual and textual features.
We consider the temporal, spatial, and semantic relations, and fuse the multimodal features with a hierarchical attention mechanism to predict the answer.
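The blurb names three relation types and a hierarchical attention fusion without further detail. The sketch below shows one generic two-level attention fusion consistent with that description; the module structure is my assumption, not the RHA paper's code.

```python
# Hedged sketch of hierarchical attention fusion over relation-specific
# features (temporal, spatial, semantic). Illustrative assumption, not RHA.
import torch
import torch.nn as nn

class HierarchicalFusion(nn.Module):
    def __init__(self, dim: int = 256, num_answers: int = 1000):
        super().__init__()
        # Level 1: attend within each relation stream; level 2: across streams.
        self.within = nn.Linear(dim, 1)
        self.across = nn.Linear(dim, 1)
        self.classifier = nn.Linear(dim, num_answers)

    def attend(self, x: torch.Tensor, scorer: nn.Linear) -> torch.Tensor:
        w = torch.softmax(scorer(x).squeeze(-1), dim=-1)  # (B, N)
        return torch.einsum("bn,bnd->bd", w, x)           # weighted pooling

    def forward(self, temporal, spatial, semantic):
        # Each stream: (B, N, D) relation features.
        streams = torch.stack(
            [self.attend(s, self.within) for s in (temporal, spatial, semantic)],
            dim=1,
        )  # (B, 3, D)
        fused = self.attend(streams, self.across)  # (B, D)
        return self.classifier(fused)              # answer logits
```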
arXiv Detail & Related papers (2021-05-13T09:35:42Z)
- Hierarchical Conditional Relation Networks for Multimodal Video Question Answering [67.85579756590478]
Video QA adds at least two more layers of complexity - selecting relevant content for each channel in the context of a linguistic query.
Conditional Relation Network (CRN) takes as input a set of tensorial objects and translates them into a new set of objects that encode relations of the inputs.
CRN is then applied for Video QA in two forms, short-form where answers are reasoned solely from the visual content, and long-form where associated information, such as subtitles, is presented.
arXiv Detail & Related papers (2020-10-18T02:31:06Z)
- Hierarchical Conditional Relation Networks for Video Question Answering [62.1146543269993]
We introduce a general-purpose reusable neural unit called Conditional Relation Network (CRN).
CRN serves as a building block to construct more sophisticated structures for representation and reasoning over video.
Our evaluations on well-known datasets achieved new SoTA results, demonstrating the impact of building a general-purpose reasoning unit on complex domains such as VideoQA.
arXiv Detail & Related papers (2020-02-25T07:00:48Z)
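Both CRN entries describe the unit only abstractly: a set of input objects is mapped to a new set of objects encoding their relations, conditioned on context. The sketch below is a generic instantiation of that idea under my own assumptions (pairwise aggregation, a single conditioning vector); it is not the published CRN code.

```python
# Hedged sketch of a conditional relation unit: map a set of input objects
# to a set of relation encodings, conditioned on a query/context vector.
import itertools
import torch
import torch.nn as nn

class ConditionalRelationUnit(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.relate = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.condition = nn.Linear(2 * dim, dim)

    def forward(self, objects: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # objects: (B, N, D) input set; context: (B, D) conditioning feature
        B, N, D = objects.shape
        outputs = []
        for i, j in itertools.combinations(range(N), 2):
            pair = torch.cat([objects[:, i], objects[:, j]], dim=-1)  # (B, 2D)
            rel = self.relate(pair)                                   # (B, D)
            # Fuse each pairwise relation with the conditioning context.
            outputs.append(self.condition(torch.cat([rel, context], dim=-1)))
        return torch.stack(outputs, dim=1)  # (B, N*(N-1)/2, D) relation set
```

Stacking such units hierarchically, as the entry titles suggest, would then compose relations over progressively longer video spans.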
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.