ANetQA: A Large-scale Benchmark for Fine-grained Compositional Reasoning
over Untrimmed Videos
- URL: http://arxiv.org/abs/2305.02519v1
- Date: Thu, 4 May 2023 03:04:59 GMT
- Title: ANetQA: A Large-scale Benchmark for Fine-grained Compositional Reasoning
over Untrimmed Videos
- Authors: Zhou Yu, Lixiang Zheng, Zhou Zhao, Fei Wu, Jianping Fan, Kui Ren, Jun
Yu
- Abstract summary: We present ANetQA, a large-scale benchmark that supports fine-grained compositional reasoning over challenging untrimmed videos from ActivityNet.
ANetQA attains 1.4 billion unbalanced and 13.4 million balanced QA pairs, which is an order of magnitude larger than AGQA with a similar number of videos.
The best model achieves 44.5% accuracy while human performance tops out at 84.5%, leaving sufficient room for improvement.
- Score: 120.80589215132322
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Building benchmarks to systematically analyze different capabilities of video
question answering (VideoQA) models is challenging yet crucial. Existing
benchmarks often use non-compositional simple questions and suffer from
language biases, making it difficult to diagnose model weaknesses incisively. A
recent benchmark, AGQA, proposes a promising paradigm to generate QA pairs
automatically from pre-annotated scene graphs, enabling it to measure diverse
reasoning abilities with granular control. However, its questions have
limitations in reasoning about the fine-grained semantics in videos as such
information is absent in its scene graphs. To this end, we present ANetQA, a
large-scale benchmark that supports fine-grained compositional reasoning over
the challenging untrimmed videos from ActivityNet. Similar to AGQA, the QA
pairs in ANetQA are automatically generated from annotated video scene graphs.
The fine-grained properties of ANetQA are reflected in the following: (i)
untrimmed videos with fine-grained semantics; (ii) spatio-temporal scene graphs
with fine-grained taxonomies; and (iii) diverse questions generated from
fine-grained templates. ANetQA attains 1.4 billion unbalanced and 13.4 million
balanced QA pairs, which is an order of magnitude larger than AGQA with a
similar number of videos. Comprehensive experiments are performed on
state-of-the-art methods. The best model achieves 44.5% accuracy while human
performance tops out at 84.5%, leaving sufficient room for improvement.
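The generation pipeline described above, in which question templates are instantiated against annotated spatio-temporal scene graphs, can be illustrated with a small sketch. The snippet below is a hypothetical, minimal Python illustration of that idea only: the node fields, the two template strings, and the generate_qa helper are assumptions made for exposition, not the benchmark's actual templates or code, which additionally balance answer distributions to mitigate language bias.

```python
# Minimal, hypothetical sketch of template-based QA generation from a
# spatio-temporal scene graph (illustration only; not the ANetQA code).
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SceneGraphNode:
    object_name: str       # fine-grained object label, e.g. "acoustic guitar"
    attributes: List[str]  # fine-grained attributes, e.g. ["brown"]
    relationship: str      # relation to the person, e.g. "playing"
    start: float           # start time of the relation (seconds)
    end: float             # end time of the relation (seconds)


# Hypothetical templates; the real benchmark uses a far richer template set.
TEMPLATES = {
    "object_attribute": ("What is the color of the {obj} the person is {rel}?", "{attr}"),
    "action_existence": ("Is the person {rel} a {obj} at any point in the video?", "yes"),
}


def generate_qa(nodes: List[SceneGraphNode]) -> List[Tuple[str, str]]:
    """Instantiate each template against every compatible scene-graph node."""
    qa_pairs = []
    for node in nodes:
        if node.attributes:
            question, answer = TEMPLATES["object_attribute"]
            qa_pairs.append((
                question.format(obj=node.object_name, rel=node.relationship),
                answer.format(attr=node.attributes[0]),
            ))
        question, answer = TEMPLATES["action_existence"]
        qa_pairs.append((question.format(rel=node.relationship, obj=node.object_name), answer))
    return qa_pairs


if __name__ == "__main__":
    graph = [SceneGraphNode("acoustic guitar", ["brown"], "playing", 3.2, 17.8)]
    for q, a in generate_qa(graph):
        print(f"Q: {q}  A: {a}")
```

On the toy node above this prints a question such as "What is the color of the acoustic guitar the person is playing?" with the answer "brown", mirroring at a much smaller scale how fine-grained scene-graph taxonomies translate into fine-grained QA pairs.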
Related papers
- NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for
Autonomous Driving Scenario [77.14723238359318]
NuScenes-QA is the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs.
We leverage existing 3D detection annotations to generate scene graphs and design question templates manually.
We develop a series of baselines that employ advanced 3D detection and VQA techniques.
arXiv Detail & Related papers (2023-05-24T07:40:50Z)
- Learning Situation Hyper-Graphs for Video Question Answering [95.18071873415556]
We propose an architecture for Video Question Answering (VQA) that enables answering questions related to video content by predicting situation hyper-graphs.
We train a situation hyper-graph decoder to implicitly identify graph representations with actions and object/human-object relationships from the input video clip.
Our results show that learning the underlying situation hyper-graphs helps the system significantly improve its performance on novel video question-answering challenges.
arXiv Detail & Related papers (2023-04-18T01:23:11Z)
- RoMQA: A Benchmark for Robust, Multi-evidence, Multi-answer Question Answering [87.18962441714976]
We introduce RoMQA, the first benchmark for robust, multi-evidence, multi-answer question answering (QA).
We evaluate state-of-the-art large language models in zero-shot, few-shot, and fine-tuning settings, and find that RoMQA is challenging.
Our results show that RoMQA is a challenging benchmark for large language models, and provides a quantifiable test to build more robust QA methods.
arXiv Detail & Related papers (2022-10-25T21:39:36Z)
- NOAHQA: Numerical Reasoning with Interpretable Graph Question Answering Dataset [26.782937852417454]
We introduce NOAHQA, a bilingual QA dataset with questions requiring numerical reasoning with compound mathematical expressions.
We evaluate state-of-the-art QA models trained on existing QA datasets and show that the best among them achieves only a 55.5 exact-match score on NOAHQA.
We also present a new QA model that generates a reasoning graph, though its reasoning-graph metric still shows a large gap to human performance.
arXiv Detail & Related papers (2021-09-22T09:17:09Z)
- NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions [80.60423934589515]
We introduce NExT-QA, a rigorously designed video question answering (VideoQA) benchmark.
We set up multi-choice and open-ended QA tasks targeting causal action reasoning, temporal action reasoning, and common scene comprehension.
We find that top-performing methods excel at shallow scene descriptions but are weak in causal and temporal action reasoning.
arXiv Detail & Related papers (2021-05-18T04:56:46Z)
- AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning [33.29431287523664]
We present a new benchmark for pinpointing compositional spatio-temporal reasoning.
AGQA contains 192M unbalanced question-answer pairs for 9.6K videos.
Human evaluators marked 86.02% of our question-answer pairs as correct, while the best model achieves only 47.74% accuracy.
arXiv Detail & Related papers (2021-03-30T00:24:01Z)
- Harvesting and Refining Question-Answer Pairs for Unsupervised QA [95.9105154311491]
We introduce two approaches to improve unsupervised Question Answering (QA).
First, we harvest lexically and syntactically divergent questions from Wikipedia to automatically construct a corpus of question-answer pairs (named RefQA).
Second, we take advantage of the QA model to extract more appropriate answers, which iteratively refines data over RefQA.
arXiv Detail & Related papers (2020-05-06T15:56:06Z)