AGQA 2.0: An Updated Benchmark for Compositional Spatio-Temporal
Reasoning
- URL: http://arxiv.org/abs/2204.06105v1
- Date: Tue, 12 Apr 2022 22:30:12 GMT
- Title: AGQA 2.0: An Updated Benchmark for Compositional Spatio-Temporal
Reasoning
- Authors: Madeleine Grunde-McLaughlin, Ranjay Krishna, Maneesh Agrawala
- Abstract summary: Action Genome Question Answering (AGQA) is one such benchmark.
We introduce AGQA 2.0, a version of this benchmark with several improvements.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prior benchmarks have analyzed models' answers to questions about videos in
order to measure visual compositional reasoning. Action Genome Question
Answering (AGQA) is one such benchmark. AGQA provides a training/test split
with balanced answer distributions to reduce the effect of linguistic biases.
However, some biases remain in several AGQA categories. We introduce AGQA 2.0,
a version of this benchmark with several improvements, most notably a stricter
balancing procedure. We then report results on the updated benchmark for all
experiments.
Related papers
- SQUARE: Automatic Question Answering Evaluation using Multiple Positive
and Negative References [73.67707138779245]
We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation).
We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems.
arXiv Detail & Related papers (2023-09-21T16:51:30Z)
- ANetQA: A Large-scale Benchmark for Fine-grained Compositional Reasoning
over Untrimmed Videos [120.80589215132322]
We present ANetQA, a large-scale benchmark that supports fine-grained compositional reasoning over challenging untrimmed videos from ActivityNet.
ANetQA attains 1.4 billion unbalanced and 13.4 million balanced QA pairs, which is an order of magnitude larger than AGQA with a similar number of videos.
The best model achieves 44.5% accuracy while human performance tops out at 84.5%, leaving sufficient room for improvement.
arXiv Detail & Related papers (2023-05-04T03:04:59Z)
- RoMQA: A Benchmark for Robust, Multi-evidence, Multi-answer Question
Answering [87.18962441714976]
We introduce RoMQA, the first benchmark for robust, multi-evidence, multi-answer question answering (QA).
We evaluate state-of-the-art large language models in zero-shot, few-shot, and fine-tuning settings, and find that RoMQA is challenging for them.
Our results show that RoMQA provides a quantifiable test for building more robust QA methods.
arXiv Detail & Related papers (2022-10-25T21:39:36Z)
- Human-Adversarial Visual Question Answering [62.30715496829321]
We benchmark state-of-the-art VQA models against human-adversarial examples.
We find that a wide range of state-of-the-art models perform poorly when evaluated on these examples.
arXiv Detail & Related papers (2021-06-04T06:25:32Z)
- AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning [33.29431287523664]
We present a new benchmark for pinpointing compositional spatio-temporal reasoning.
AGQA contains 192M unbalanced question-answer pairs for 9.6K videos.
Human evaluators marked 86.02% of our question-answer pairs as correct, while the best model achieves only 47.74% accuracy.
arXiv Detail & Related papers (2021-03-30T00:24:01Z)
- Harvesting and Refining Question-Answer Pairs for Unsupervised QA [95.9105154311491]
We introduce two approaches to improve unsupervised Question Answering (QA).
First, we harvest lexically and syntactically divergent questions from Wikipedia to automatically construct a corpus of question-answer pairs (named RefQA).
Second, we take advantage of the QA model to extract more appropriate answers, iteratively refining the data over RefQA.
arXiv Detail & Related papers (2020-05-06T15:56:06Z)