NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions
- URL: http://arxiv.org/abs/2105.08276v1
- Date: Tue, 18 May 2021 04:56:46 GMT
- Title: NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions
- Authors: Junbin Xiao, Xindi Shang, Angela Yao and Tat-Seng Chua
- Abstract summary: We introduce NExT-QA, a rigorously designed video question answering (VideoQA) benchmark.
We set up multi-choice and open-ended QA tasks targeting causal action reasoning, temporal action reasoning, and common scene comprehension.
We find that top-performing methods excel at shallow scene descriptions but are weak in causal and temporal action reasoning.
- Score: 80.60423934589515
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We introduce NExT-QA, a rigorously designed video question answering (VideoQA) benchmark to advance video understanding from describing to explaining temporal actions. Based on the dataset, we set up multi-choice
and open-ended QA tasks targeting causal action reasoning, temporal action
reasoning, and common scene comprehension. Through extensive analysis of
baselines and established VideoQA techniques, we find that top-performing
methods excel at shallow scene descriptions but are weak in causal and temporal
action reasoning. Furthermore, the models that are effective on multi-choice
QA, when adapted to open-ended QA, still struggle in generalizing the answers.
This raises doubt on the ability of these models to reason and highlights
possibilities for improvement. With detailed results for different question
types and heuristic observations for future works, we hope NExT-QA will guide
the next generation of VQA research to go beyond superficial scene description
towards a deeper understanding of videos. (The dataset and related resources
are available at https://github.com/doc-doc/NExT-QA.git)
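To make the multiple-choice protocol concrete, here is a minimal scoring sketch in the spirit of the benchmark: per-question-type accuracy over a ground-truth CSV. The column names (qid, type, answer) and the 0-4 answer indexing are assumptions about the released files, not a confirmed schema; consult the repository above for the actual layout.

```python
import csv
from collections import defaultdict

def per_type_accuracy(gt_csv_path, predictions):
    """predictions: dict mapping question id -> predicted option index (0-4)."""
    correct, total = defaultdict(int), defaultdict(int)
    with open(gt_csv_path, newline="") as f:
        for row in csv.DictReader(f):
            qtype = row["type"]  # assumed column: causal/temporal/descriptive code
            total[qtype] += 1
            if predictions.get(row["qid"]) == int(row["answer"]):
                correct[qtype] += 1
    return {t: correct[t] / total[t] for t in total}
```

Reporting accuracy per question type, rather than a single overall number, is what exposes the causal/temporal weakness noted in the abstract.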
Related papers
- TimeLogic: A Temporal Logic Benchmark for Video QA [64.32208175236323]
We introduce the TimeLogic QA (TLQA) framework to automatically generate temporal logical questions.
We leverage four datasets (STAR, Breakfast, AGQA, and CrossTask) and generate 2k and 10k QA pairs for each category.
We assess VideoQA models' temporal reasoning performance on 16 categories of temporal logic with varying temporal complexity.
arXiv Detail & Related papers (2025-01-13T11:12:59Z)
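TLQA builds its questions automatically from existing timestamped annotations. As a toy illustration of template-based generation (the paper's actual templates and its 16 logic categories are not reproduced here), a minimal before/after generator might look like this:

```python
def before_after_questions(actions):
    """actions: list of (label, start_sec, end_sec), assumed sorted by start time."""
    qa_pairs = []
    for i, (a, _, a_end) in enumerate(actions):
        for b, b_start, _ in actions[i + 1:]:
            if a_end <= b_start:  # action a finishes before action b starts
                qa_pairs.append((f"Does '{a}' happen before '{b}'?", "yes"))
                qa_pairs.append((f"Does '{b}' happen before '{a}'?", "no"))
    return qa_pairs

# e.g. [("Does 'pour milk' happen before 'stir'?", 'yes'), ...]
print(before_after_questions([("pour milk", 0.0, 4.0), ("stir", 5.0, 9.0)]))
```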
- Can I Trust Your Answer? Visually Grounded Video Question Answering [88.11169242115416]
We study visually grounded VideoQA in response to the emerging trend of using pretraining techniques for video-language understanding.
We construct NExT-GQA, an extension of NExT-QA with 10.5K temporal grounding labels tied to the original QA pairs.
arXiv Detail & Related papers (2023-09-04T03:06:04Z)
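Temporal grounding of this kind is typically scored by the overlap between predicted and labeled segments. Below is the standard temporal IoU measure as a self-contained sketch; the exact metrics and thresholds used for NExT-GQA are defined in that paper.

```python
def temporal_iou(pred, gt):
    """pred, gt: (start_sec, end_sec) segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

assert abs(temporal_iou((2.0, 6.0), (4.0, 8.0)) - 1.0 / 3.0) < 1e-9
```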
- Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models [15.994664381976984]
We introduce a new benchmark, Open-vocabulary Video Question Answering (OVQA), to measure the generalizability of VideoQA models.
In addition, we introduce a novel GNN-based soft verbalizer that enhances prediction of rare and unseen answers.
Our ablation studies and qualitative analyses demonstrate that the GNN-based soft verbalizer further improves model performance.
arXiv Detail & Related papers (2023-08-18T07:45:10Z)
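The summary does not spell out the soft verbalizer, so the following is only one plausible reading: smooth answer scores over a nearest-neighbour graph of answer embeddings, so rare or unseen answers borrow evidence from similar frequent ones. The graph construction, names, and blending rule are illustrative assumptions, not the paper's GNN formulation.

```python
import numpy as np

def smooth_scores(scores, emb, k=3, alpha=0.5):
    """scores: (V,) raw answer logits; emb: (V, d) answer-word embeddings."""
    unit = emb / (np.linalg.norm(emb, axis=1, keepdims=True) + 1e-8)
    sim = unit @ unit.T                      # cosine similarity between answers
    np.fill_diagonal(sim, -np.inf)           # ignore self-similarity
    nbrs = np.argsort(sim, axis=1)[:, -k:]   # k most similar answers per answer
    return alpha * scores + (1 - alpha) * scores[nbrs].mean(axis=1)
```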
- Locate before Answering: Answer Guided Question Localization for Video Question Answering [70.38700123685143]
The proposed LocAns model integrates a question locator and an answer predictor into an end-to-end model, and achieves state-of-the-art performance on two modern long-term VideoQA datasets.
arXiv Detail & Related papers (2022-10-05T08:19:16Z)
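As a hedged sketch of the locate-before-answering idea: one head predicts a soft temporal localization over frame features, and the answer head reads from the attended segment. Module names and dimensions here are illustrative assumptions; this is not the LocAns architecture itself.

```python
import torch
import torch.nn as nn

class LocateThenAnswer(nn.Module):
    def __init__(self, d_video=512, d_question=512, n_answers=1000):
        super().__init__()
        self.locator = nn.Linear(d_video + d_question, 1)
        self.answerer = nn.Linear(d_video + d_question, n_answers)

    def forward(self, frames, question):
        # frames: (B, T, d_video); question: (B, d_question)
        q = question.unsqueeze(1).expand(-1, frames.size(1), -1)
        fused = torch.cat([frames, q], dim=-1)
        weights = self.locator(fused).softmax(dim=1)   # soft temporal localization
        segment = (weights * frames).sum(dim=1)        # attended video segment
        return self.answerer(torch.cat([segment, question], dim=-1))
```

Both heads are trained jointly from answer supervision alone, which is what "end-to-end" amounts to in this setup.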
- Invariant Grounding for Video Question Answering [72.87173324555846]
Video Question Answering (VideoQA) is the task of answering questions about a video.
In leading VideoQA models, the typical learning objective, empirical risk minimization (ERM), latches onto superficial correlations between video-question pairs and answers.
We propose a new learning framework, Invariant Grounding for VideoQA (IGV), to ground the question-critical scene.
arXiv Detail & Related papers (2022-06-06T04:37:52Z)
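As a simplified sketch of the invariant-grounding intuition (a simplification, not the IGV objective from the paper): answer from the grounded, question-critical scene, and penalize the model when the leftover "environment" scene remains predictive of the answer.

```python
import torch
import torch.nn.functional as F

def invariant_grounding_loss(causal_logits, env_logits, answers):
    """causal_logits/env_logits: (B, C); answers: (B,) class indices."""
    ce = F.cross_entropy(causal_logits, answers)  # answer from the grounded scene
    uniform = torch.full_like(env_logits, 1.0 / env_logits.size(-1))
    # push the environment branch towards an uninformative (uniform) prediction
    kl = F.kl_div(env_logits.log_softmax(dim=-1), uniform, reduction="batchmean")
    return ce + kl
```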
- Video Question Answering: Datasets, Algorithms and Challenges [99.9179674610955]
Video Question Answering (VideoQA) aims to answer natural language questions about given videos.
This paper provides a clear taxonomy and comprehensive analysis of VideoQA, focusing on datasets, algorithms, and unique challenges.
arXiv Detail & Related papers (2022-03-02T16:34:09Z)