Video Question Answering: Datasets, Algorithms and Challenges
- URL: http://arxiv.org/abs/2203.01225v1
- Date: Wed, 2 Mar 2022 16:34:09 GMT
- Title: Video Question Answering: Datasets, Algorithms and Challenges
- Authors: Yaoyao Zhong, Wei Ji, Junbin Xiao, Yicong Li, Weihong Deng, Tat-Seng
Chua
- Abstract summary: Video Question Answering (VideoQA) aims to answer natural language questions according to the given videos.
This paper provides a clear taxonomy and comprehensive analyses to VideoQA, focusing on the datasets, algorithms, and unique challenges.
- Score: 99.9179674610955
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video Question Answering (VideoQA) aims to answer natural language questions
according to the given videos. It has earned increasing attention with recent
research trends in joint vision and language understanding. Yet, compared with
ImageQA, VideoQA is largely underexplored and progresses slowly. Although
different algorithms have continually been proposed and shown success on
different VideoQA datasets, we find that there lacks a meaningful survey to
categorize them, which seriously impedes its advancements. This paper thus
provides a clear taxonomy and comprehensive analyses to VideoQA, focusing on
the datasets, algorithms, and unique challenges. We then point out the research
trend of studying beyond factoid QA to inference QA towards the cognition of
video contents, Finally, we conclude some promising directions for future
exploration.
Related papers
- Sports-QA: A Large-Scale Video Question Answering Benchmark for Complex
and Professional Sports [90.79212954022218]
We introduce the first dataset, named Sports-QA, specifically designed for the sports VideoQA task.
Sports-QA dataset includes various types of questions, such as descriptions, chronologies, causalities, and counterfactual conditions.
We propose a new Auto-Focus Transformer (AFT) capable of automatically focusing on particular scales of temporal information for question answering.
arXiv Detail & Related papers (2024-01-03T02:22:34Z) - MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie
Understanding [69.04413943858584]
We introduce MoVQA, a long-form movie question-answering dataset.
We also benchmark to assess the diverse cognitive capabilities of multimodal systems.
arXiv Detail & Related papers (2023-12-08T03:33:38Z) - Can I Trust Your Answer? Visually Grounded Video Question Answering [88.11169242115416]
We study visually grounded VideoQA in response to the emerging trends of utilizing pretraining techniques for video-language understanding.
We construct NExT-GQA -- an extension of NExT-QA with 10.5$K$ temporal grounding labels tied to the original QA pairs.
arXiv Detail & Related papers (2023-09-04T03:06:04Z) - Video Question Answering with Iterative Video-Text Co-Tokenization [77.66445727743508]
We propose a novel multi-stream video encoder for video question answering.
We experimentally evaluate the model on several datasets, such as MSRVTT-QA, MSVD-QA, IVQA.
Our model reduces the required GFLOPs from 150-360 to only 67, producing a highly efficient video question answering model.
arXiv Detail & Related papers (2022-08-01T15:35:38Z) - Modern Question Answering Datasets and Benchmarks: A Survey [5.026863544662493]
Question Answering (QA) is one of the most important natural language processing (NLP) tasks.
It aims using NLP technologies to generate a corresponding answer to a given question based on the massive unstructured corpus.
In this paper, we investigate influential QA datasets that have been released in the era of deep learning.
arXiv Detail & Related papers (2022-06-30T05:53:56Z) - NExT-QA:Next Phase of Question-Answering to Explaining Temporal Actions [80.60423934589515]
We introduce NExT-QA, a rigorously designed video question answering (VideoQA) benchmark.
We set up multi-choice and open-ended QA tasks targeting causal action reasoning, temporal action reasoning, and common scene comprehension.
We find that top-performing methods excel at shallow scene descriptions but are weak in causal and temporal action reasoning.
arXiv Detail & Related papers (2021-05-18T04:56:46Z) - Recent Advances in Video Question Answering: A Review of Datasets and
Methods [0.0]
VQA helps to retrieve temporal and spatial information from the video scenes and interpret it.
To the best of our knowledge, no previous survey has been conducted for the VQA task.
arXiv Detail & Related papers (2021-01-15T03:26:24Z) - End-to-End Video Question-Answer Generation with Generator-Pretester
Network [27.31969951281815]
We study a novel task, Video Question-Answer Generation (VQAG) for challenging Video Question Answering (Video QA) task in multimedia.
As captions neither fully represent a video, nor are they always practically available, it is crucial to generate question-answer pairs based on a video via Video Question-Answer Generation (VQAG)
We evaluate our system with the only two available large-scale human-annotated Video QA datasets and achieves state-of-the-art question generation performances.
arXiv Detail & Related papers (2021-01-05T10:46:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.