AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning
- URL: http://arxiv.org/abs/2103.16002v1
- Date: Tue, 30 Mar 2021 00:24:01 GMT
- Title: AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning
- Authors: Madeleine Grunde-McLaughlin, Ranjay Krishna, Maneesh Agrawala
- Abstract summary: We present Action Genome Question Answering (AGQA), a new benchmark for compositional spatio-temporal reasoning.
AGQA contains $192M$ unbalanced question answer pairs for $9.6K$ videos.
Human evaluators marked $86.02\%$ of our question-answer pairs as correct, yet the best model achieves only $47.74\%$ accuracy.
- Score: 33.29431287523664
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual events are a composition of temporal actions involving actors
spatially interacting with objects. When developing computer vision models that
can reason about compositional spatio-temporal events, we need benchmarks that
can analyze progress and uncover shortcomings. Existing video question
answering benchmarks are useful, but they often conflate multiple sources of
error into one accuracy metric and have strong biases that models can exploit,
making it difficult to pinpoint model weaknesses. We present Action Genome
Question Answering (AGQA), a new benchmark for compositional spatio-temporal
reasoning. AGQA contains $192M$ unbalanced question answer pairs for $9.6K$
videos. We also provide a balanced subset of $3.9M$ question answer pairs, $3$
orders of magnitude larger than existing benchmarks, that minimizes bias by
balancing the answer distributions and types of question structures. Although
human evaluators marked $86.02\%$ of our question-answer pairs as correct, the
best model achieves only $47.74\%$ accuracy. In addition, AGQA introduces
multiple training/test splits to test for various reasoning abilities,
including generalization to novel compositions, to indirect references, and to
more compositional steps. Using AGQA, we evaluate modern visual reasoning
systems, demonstrating that the best models barely perform better than
non-visual baselines exploiting linguistic biases and that none of the existing
models generalize to novel compositions unseen during training.
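As a rough illustration of the answer-distribution balancing described in the abstract, the sketch below downsamples question-answer pairs so that, within each question template, no single answer dominates. This is a minimal, hypothetical Python sketch, not the authors' actual AGQA balancing procedure; the `template`/`answer` field names and the downsampling rule are assumptions.

```python
import random
from collections import defaultdict

# Minimal sketch of answer-distribution balancing by downsampling.
# This is NOT the authors' AGQA procedure; the field names and the
# per-template capping rule are illustrative assumptions.

def balance_answers(qa_pairs, seed=0):
    """Downsample so every answer within a question template is equally frequent.

    qa_pairs: list of dicts with (assumed) keys 'template', 'answer', 'question'.
    Returns a subset in which, per template, each answer appears at most as
    often as the rarest answer for that template.
    """
    rng = random.Random(seed)

    # Group pairs by (template, answer).
    by_template_answer = defaultdict(list)
    for qa in qa_pairs:
        by_template_answer[(qa["template"], qa["answer"])].append(qa)

    # For each template, find the count of its rarest answer.
    min_count = defaultdict(lambda: float("inf"))
    for (template, _), items in by_template_answer.items():
        min_count[template] = min(min_count[template], len(items))

    # Cap every (template, answer) group at that count.
    balanced = []
    for (template, _), items in by_template_answer.items():
        rng.shuffle(items)
        balanced.extend(items[: min_count[template]])
    return balanced


if __name__ == "__main__":
    toy = [
        {"template": "what_object", "question": "What did they hold?", "answer": "phone"},
        {"template": "what_object", "question": "What did they carry?", "answer": "phone"},
        {"template": "what_object", "question": "What did they take?", "answer": "dish"},
        {"template": "yes_no", "question": "Did they sit before eating?", "answer": "yes"},
        {"template": "yes_no", "question": "Did they eat before sitting?", "answer": "no"},
    ]
    print(len(balance_answers(toy)))  # 4: 'phone' is capped at the count of 'dish'
```

Capping each answer at the frequency of the rarest answer for its template removes the most obvious linguistic prior a blind model could exploit; the actual AGQA pipeline additionally balances the types of question structures, which this sketch does not attempt.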
Related papers
- ANetQA: A Large-scale Benchmark for Fine-grained Compositional Reasoning over Untrimmed Videos [120.80589215132322]
We present ANetQA, a large-scale benchmark that supports fine-grained compositional reasoning over challenging untrimmed videos from ActivityNet.
ANetQA contains 1.4 billion unbalanced and 13.4 million balanced QA pairs, which is an order of magnitude larger than AGQA with a similar number of videos.
The best model achieves 44.5% accuracy while human performance tops out at 84.5%, leaving sufficient room for improvement.
arXiv Detail & Related papers (2023-05-04T03:04:59Z)
- RoMQA: A Benchmark for Robust, Multi-evidence, Multi-answer Question Answering [87.18962441714976]
We introduce RoMQA, the first benchmark for robust, multi-evidence, multi-answer question answering (QA).
We evaluate state-of-the-art large language models in zero-shot, few-shot, and fine-tuning settings.
Our results show that RoMQA is a challenging benchmark for large language models and provides a quantifiable test for building more robust QA methods.
arXiv Detail & Related papers (2022-10-25T21:39:36Z)
- Measuring Compositional Consistency for Video Question Answering [32.6742789254609]
Recent video question answering benchmarks indicate that state-of-the-art models struggle to answer compositional questions.
We present AGQA-Decomp, a benchmark containing an average of $11.49$ sub-questions per graph and $4.55M$ total new sub-questions.
We find models either cannot reason correctly through most compositions or are reliant on incorrect reasoning to reach answers.
arXiv Detail & Related papers (2022-04-14T18:52:34Z)
- AGQA 2.0: An Updated Benchmark for Compositional Spatio-Temporal Reasoning [45.60498204105834]
Action Genome Question Answering (AGQA) is one such benchmark for compositional spatio-temporal reasoning.
We introduce AGQA 2.0, a version of this benchmark with several improvements.
arXiv Detail & Related papers (2022-04-12T22:30:12Z)
- Beyond Accuracy: A Consolidated Tool for Visual Question Answering Benchmarking [30.155625852894797]
We propose a browser-based benchmarking tool for researchers and challenge organizers.
Our tool helps test generalization capabilities of models across multiple datasets.
Interactive filtering facilitates discovery of problematic behavior.
arXiv Detail & Related papers (2021-10-11T11:08:35Z)
- NOAHQA: Numerical Reasoning with Interpretable Graph Question Answering Dataset [26.782937852417454]
We introduce NOAHQA, a bilingual QA dataset with questions requiring numerical reasoning with compound mathematical expressions.
We evaluate state-of-the-art QA models trained on existing QA datasets and show that the best among them achieves only a 55.5 exact match score on NOAHQA.
We also present a new QA model that generates a reasoning graph, though its reasoning-graph metric still shows a large gap compared with that of humans.
arXiv Detail & Related papers (2021-09-22T09:17:09Z)
- Human-Adversarial Visual Question Answering [62.30715496829321]
We benchmark state-of-the-art VQA models against human-adversarial examples.
We find that a wide range of state-of-the-art models perform poorly when evaluated on these examples.
arXiv Detail & Related papers (2021-06-04T06:25:32Z)
- Counterfactual Variable Control for Robust and Interpretable Question Answering [57.25261576239862]
Deep neural network based question answering (QA) models are neither robust nor explainable in many cases.
In this paper, we inspect this spurious "capability" of QA models, namely their reliance on shortcut correlations, using causal inference.
We propose a novel approach called Counterfactual Variable Control (CVC) that explicitly mitigates any shortcut correlation.
arXiv Detail & Related papers (2020-10-12T10:09:05Z)
- SQuINTing at VQA Models: Introspecting VQA Models with Sub-Questions [66.86887670416193]
We show that state-of-the-art VQA models have comparable performance in answering perception and reasoning questions, but suffer from consistency problems.
To address this shortcoming, we propose an approach called Sub-Question-aware Network Tuning (SQuINT).
We show that SQuINT improves model consistency by 5% and marginally improves performance on the Reasoning questions in VQA, while also producing better attention maps.
arXiv Detail & Related papers (2020-01-20T01:02:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.