STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results
for Video Question Answering
- URL: http://arxiv.org/abs/2401.03901v1
- Date: Mon, 8 Jan 2024 14:01:59 GMT
- Title: STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results
for Video Question Answering
- Authors: Yueqian Wang, Yuxuan Wang, Kai Chen, Dongyan Zhao
- Abstract summary: We propose STAIR, a Spatial-Temporal Reasoning model with Auditable Intermediate Results for video question answering.
STAIR is a neural module network, which contains a program generator to decompose a given question into a hierarchical combination of several sub-tasks.
We conduct extensive experiments on several video question answering datasets to show STAIR's performance, explainability, compatibility with pre-trained models, and applicability when program annotations are not available.
- Score: 42.173245795917026
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recently we have witnessed the rapid development of video question answering
models. However, most models can only handle simple videos in terms of temporal
reasoning, and their performance tends to drop when answering
temporal-reasoning questions on long and informative videos. To tackle this
problem, we propose STAIR, a Spatial-Temporal Reasoning model with Auditable
Intermediate Results for video question answering. STAIR is a neural module
network, which contains a program generator to decompose a given question into
a hierarchical combination of several sub-tasks, and a set of lightweight
neural modules to complete each of these sub-tasks. Though neural module
networks are already widely studied on image-text tasks, applying them to
videos is non-trivial, as reasoning on videos requires different
abilities. In this paper, we define a set of basic video-text sub-tasks for
video question answering and design a set of lightweight modules to complete
them. Different from most prior works, modules of STAIR return intermediate
outputs specific to their intentions instead of always returning attention
maps, which makes it easier to interpret and collaborate with pre-trained
models. We also introduce intermediate supervision to make these intermediate
outputs more accurate. We conduct extensive experiments on several video
question answering datasets under various settings to show STAIR's performance,
explainability, compatibility with pre-trained models, and applicability when
program annotations are not available. Code:
https://github.com/yellow-binary-tree/STAIR
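
To make the architecture concrete, below is a minimal Python sketch of the neural-module-network idea described in the abstract: a question is decomposed into a program of sub-tasks, and each sub-task is handled by a lightweight module whose output can be inspected directly. The module names, the program format, and the toy per-frame object lists are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

# Minimal sketch of the idea above, NOT the authors' implementation: a question
# becomes a program of sub-task steps, each run by a small module that returns
# an auditable intermediate result. All names here are illustrative assumptions.
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

@dataclass
class Step:
    module: str        # name of the sub-task module to run
    inputs: List[Any]  # literal arguments, or "#k" references to earlier steps

# Toy "video": one dict of detected object labels per frame (a stand-in for real features).
VIDEO = [
    {"objects": ["person"]},         # frame 0
    {"objects": ["person", "cup"]},  # frame 1
    {"objects": ["cup"]},            # frame 2
]

# Lightweight "modules". In a neural module network these would be small neural
# networks; plain functions are used here so every intermediate output is easy to audit.
MODULES: Dict[str, Callable[..., Any]] = {
    # indices of frames containing the named object
    "filter_frames": lambda obj: [i for i, f in enumerate(VIDEO) if obj in f["objects"]],
    # earliest frame index in a list of frame indices
    "earliest": lambda frames: min(frames) if frames else None,
    # answer an existence question over a list of frame indices
    "exists": lambda frames: "yes" if frames else "no",
}

def execute(program: List[Step]) -> Any:
    """Run a linearized program, printing each intermediate result for auditing."""
    trace: List[Any] = []
    for step in program:
        # resolve "#k" references to the outputs of earlier steps
        args = [trace[int(a[1:])] if isinstance(a, str) and a.startswith("#") else a
                for a in step.inputs]
        out = MODULES[step.module](*args)
        trace.append(out)
        print(f"{step.module}({args}) -> {out!r}")  # the auditable intermediate result
    return trace[-1]

# Hypothetical program for "Is there a cup in the video?"; in STAIR such a program
# would be produced by the learned program generator rather than written by hand.
program = [
    Step("filter_frames", ["cup"]),  # step 0
    Step("exists", ["#0"]),          # step 1 consumes step 0's output
]
print("answer:", execute(program))

Because every step's output is kept in the trace, a wrong final answer can be followed back to the specific module that produced the faulty intermediate result, which is the auditability property the abstract emphasizes.
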
Related papers
- Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs [20.168429351519055]
Video understanding is a crucial next step for multimodal large language models (MLLMs).
We propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework through synthetic video generation.
We conduct a comprehensive evaluation of both proprietary and open-source models, uncovering significant differences in their video understanding capabilities.
arXiv Detail & Related papers (2024-06-13T17:50:05Z)
- Spatio-temporal Prompting Network for Robust Video Feature Extraction [74.54597668310707]
Frame quality deterioration is one of the main challenges in the field of video understanding.
Recent approaches exploit transformer-based integration modules to obtain spatio-temporal information.
We present a neat and unified framework called Spatio-temporal Prompting Network (STPN).
It can efficiently extract video features by adjusting the input features in the network backbone.
arXiv Detail & Related papers (2024-02-04T17:52:04Z)
- Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks [6.925770576386087]
We propose a detailed study on the generalization abilities of image-text models when evaluated on video understanding tasks in a zero-shot setting.
Our experiments show that image-text models exhibit impressive zero-shot performance on video action recognition (AR), video retrieval (RT), and video multiple choice (MC).
These findings shed light on the benefits of adapting foundational image-text models to an array of video tasks while avoiding the costly pretraining step.
arXiv Detail & Related papers (2023-10-07T20:57:54Z)
- Locate before Answering: Answer Guided Question Localization for Video Question Answering [70.38700123685143]
LocAns integrates a question locator and an answer predictor into an end-to-end model.
It achieves state-of-the-art performance on two modern long-term VideoQA datasets.
arXiv Detail & Related papers (2022-10-05T08:19:16Z)
- Revealing Single Frame Bias for Video-and-Language Learning [115.01000652123882]
We show that a single-frame trained model can achieve better performance than existing methods that use multiple frames for training.
This result reveals the existence of a strong "static appearance bias" in popular video-and-language datasets.
We propose two new retrieval tasks based on existing fine-grained action recognition datasets that encourage temporal modeling.
arXiv Detail & Related papers (2022-06-07T16:28:30Z)
- Fill-in-the-blank as a Challenging Video Understanding Evaluation Framework [19.031957183047048]
We introduce a novel dataset consisting of 28,000 videos and fill-in-the-blank tests.
We show that both a multimodal model and a strong language model have a large gap with human performance.
arXiv Detail & Related papers (2021-04-09T04:00:10Z)
- Text Modular Networks: Learning to Decompose Tasks in the Language of Existing Models [61.480085460269514]
We propose a framework for building interpretable systems that learn to solve complex tasks by decomposing them into simpler ones solvable by existing models.
We use this framework to build ModularQA, a system that can answer multi-hop reasoning questions by decomposing them into sub-questions answerable by a neural factoid single-span QA model and a symbolic calculator.
arXiv Detail & Related papers (2020-09-01T23:45:42Z)
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model is also comprised of dual-level attention (word/object and frame level), multi-head self/cross-integration for different sources (video and dense captions), and gates which pass on the more relevant information.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)