Measuring Compositional Consistency for Video Question Answering
- URL: http://arxiv.org/abs/2204.07190v1
- Date: Thu, 14 Apr 2022 18:52:34 GMT
- Title: Measuring Compositional Consistency for Video Question Answering
- Authors: Mona Gandhi, Mustafa Omer Gul, Eva Prakash, Madeleine
Grunde-McLaughlin, Ranjay Krishna and Maneesh Agrawala
- Abstract summary: Recent video question answering benchmarks indicate that state-of-the-art models struggle to answer compositional questions.
We present AGQA-Decomp, a benchmark containing $2.3M$ question graphs, with an average of $11.49$ sub-questions per graph and $4.55M$ total new sub-questions.
We find models either cannot reason correctly through most compositions or are reliant on incorrect reasoning to reach answers.
- Score: 32.6742789254609
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent video question answering benchmarks indicate that state-of-the-art
models struggle to answer compositional questions. However, it remains unclear
which types of compositional reasoning cause models to mispredict. Furthermore,
it is difficult to discern whether models arrive at answers using compositional
reasoning or by leveraging data biases. In this paper, we develop a question
decomposition engine that programmatically deconstructs a compositional
question into a directed acyclic graph of sub-questions. The graph is designed
such that each parent question is a composition of its children. We present
AGQA-Decomp, a benchmark containing $2.3M$ question graphs, with an average of
$11.49$ sub-questions per graph, and $4.55M$ total new sub-questions. Using
question graphs, we evaluate three state-of-the-art models with a suite of
novel compositional consistency metrics. We find that models either cannot
reason correctly through most compositions or are reliant on incorrect
reasoning to reach answers, frequently contradicting themselves or achieving
high accuracies when failing at intermediate reasoning steps.
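To make the setup concrete, here is a minimal Python sketch of a question graph and one compositional consistency check in the spirit the abstract describes: a model is consistent on a composition only if, whenever it answers a parent question correctly, it also answers all of that parent's sub-questions correctly. All names (QuestionNode, compositional_consistency) and the toy example are hypothetical, not the paper's actual code or metric definitions.

from dataclasses import dataclass, field

@dataclass
class QuestionNode:
    question: str
    gold_answer: str
    model_answer: str
    children: list = field(default_factory=list)  # sub-questions this question composes

    def correct(self) -> bool:
        return self.model_answer == self.gold_answer

def compositional_consistency(root: QuestionNode) -> float:
    """Fraction of correctly answered parent questions whose sub-questions
    are also all answered correctly (one of many possible consistency checks)."""
    consistent = total = 0
    stack = [root]
    while stack:
        node = stack.pop()
        if node.children and node.correct():
            total += 1
            consistent += all(c.correct() for c in node.children)
        stack.extend(node.children)
    return consistent / total if total else 1.0

# Toy graph: the parent is answered correctly even though a child is wrong,
# i.e. a right answer reached through faulty intermediate reasoning.
holds = QuestionNode("Did they hold a phone?", "yes", "yes")
watch = QuestionNode("Did they watch TV?", "no", "yes")
parent = QuestionNode("Did they hold a phone before watching TV?", "yes", "yes",
                      children=[holds, watch])
print(compositional_consistency(parent))  # 0.0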
Related papers
- Measuring and Narrowing the Compositionality Gap in Language Models [116.5228850227024]
We measure how often models can correctly answer all sub-problems but not generate the overall solution.
We present a new method, self-ask, that further improves on chain of thought.
arXiv Detail & Related papers (2022-10-07T06:50:23Z)
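The compositionality gap that summary refers to has a simple operational form; this is a hypothetical sketch (the function name and data layout are illustrative, not the paper's code):

def compositionality_gap(examples) -> float:
    """examples: list of (sub_correct: list of bool, overall_correct: bool).
    Returns the share of questions whose sub-problems are all answered
    correctly but whose overall answer is wrong. Illustrative only."""
    solved_subs = [overall for subs, overall in examples if all(subs)]
    if not solved_subs:
        return 0.0
    return sum(1 for overall in solved_subs if not overall) / len(solved_subs)

# One question composed correctly, one not, despite all sub-answers being right:
print(compositionality_gap([([True, True], True), ([True, True], False)]))  # 0.5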
- ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning [7.192233658525916]
We present a benchmark covering 9.6K human-written questions and 23.1K questions generated from human-written chart summaries.
We present two transformer-based models that combine visual features and the data table of the chart in a unified way to answer questions.
arXiv Detail & Related papers (2022-03-19T05:00:30Z)
- Question-Answer Sentence Graph for Joint Modeling Answer Selection [122.29142965960138]
We train and integrate state-of-the-art (SOTA) models for computing scores between question-question, question-answer, and answer-answer pairs.
Online inference is then performed to solve the AS2 (answer sentence selection) task on unseen queries.
arXiv Detail & Related papers (2022-02-16T05:59:53Z)
- ExplaGraphs: An Explanation Graph Generation Task for Structured Commonsense Reasoning [65.15423587105472]
We present a new generative and structured commonsense-reasoning task (and an associated dataset) of explanation graph generation for stance prediction.
Specifically, given a belief and an argument, a model has to predict whether the argument supports or counters the belief and also generate a commonsense-augmented graph that serves as a non-trivial, complete, and unambiguous explanation for the predicted stance.
A significant 83% of our graphs contain external commonsense nodes with diverse structures and reasoning depths.
arXiv Detail & Related papers (2021-04-15T17:51:36Z)
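To make the ExplaGraphs task's input/output shape concrete, here is a hypothetical type sketch, assuming a triple-based graph encoding (the class name, relation labels, and example are illustrative, not the paper's actual data format):

from dataclasses import dataclass
from typing import List, Tuple

# (head concept, relation, tail concept)
Edge = Tuple[str, str, str]

@dataclass
class ExplaGraphsExample:
    belief: str
    argument: str
    stance: str        # "support" or "counter"
    graph: List[Edge]  # commonsense-augmented explanation graph

example = ExplaGraphsExample(
    belief="Factory farming should be banned.",
    argument="Factory farms cause large-scale animal suffering.",
    stance="support",
    graph=[("factory farms", "causes", "animal suffering"),
           ("animal suffering", "motivates", "a ban")],
)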
- AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning [33.29431287523664]
We present a new benchmark for pinpointing compositional spatio-temporal reasoning.
AGQA contains $192M$ unbalanced question-answer pairs for $9.6K$ videos.
Human evaluators marked $86.02%$ of our question-answer pairs as correct, while the best model achieves only $47.74%$ accuracy.
arXiv Detail & Related papers (2021-03-30T00:24:01Z)
- SOrT-ing VQA Models: Contrastive Gradient Learning for Improved Consistency [64.67155167618894]
We present a gradient-based interpretability approach to determine the questions most strongly correlated with the reasoning question on an image.
Next, we propose a contrastive gradient learning based approach called Sub-question Oriented Tuning (SOrT), which encourages models to rank relevant sub-questions higher than irrelevant questions for an <image, reasoning-question> pair.
We show that SOrT improves model consistency by up to 6.5 percentage points over existing baselines, while also improving visual grounding.
arXiv Detail & Related papers (2020-10-20T05:15:48Z)
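The ranking objective that the SOrT summary describes can be sketched as a hinge-style pairwise loss pushing relevant sub-question scores above irrelevant-question scores. This NumPy sketch is illustrative only and stands in for the authors' gradient-based scoring:

import numpy as np

def pairwise_ranking_loss(relevant_scores, irrelevant_scores, margin=1.0):
    """Hinge-style loss: every relevant sub-question should outscore every
    irrelevant question by at least `margin` for one
    <image, reasoning-question> pair."""
    rel = np.asarray(relevant_scores)[:, None]
    irr = np.asarray(irrelevant_scores)[None, :]
    return float(np.maximum(0.0, margin - (rel - irr)).mean())

# 0.225: one relevant sub-question scores too close to the irrelevant ones.
print(pairwise_ranking_loss([2.0, 0.5], [0.2, -0.3]))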
- Latent Compositional Representations Improve Systematic Generalization in Grounded Question Answering [46.87501300706542]
State-of-the-art models in grounded question answering often do not explicitly perform decomposition.
We propose a model that computes a representation and denotation for all question spans in a bottom-up, compositional manner.
Our model induces latent trees, driven by end-to-end supervision (the answer) only.
arXiv Detail & Related papers (2020-07-01T06:22:51Z)
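A bare-bones illustration of bottom-up span composition, assuming random stand-in weights and a toy scalar split score; the paper's CKY-style model is far richer, so treat this purely as a sketch of the control flow:

import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))  # stand-in composition weights

def compose(left, right):
    """Combine two child span vectors into a parent span vector."""
    return np.tanh(W @ np.concatenate([left, right]))

def bottom_up(token_vecs):
    """Build a vector for every span [i, j) from its best-scoring split,
    mimicking (very loosely) latent-tree induction over question spans."""
    n = len(token_vecs)
    span = {(i, i + 1): v for i, v in enumerate(token_vecs)}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            candidates = [compose(span[(i, k)], span[(k, j)])
                          for k in range(i + 1, j)]
            # scalar split score is a toy stand-in for a learned scorer
            span[(i, j)] = max(candidates, key=lambda v: float(v.sum()))
    return span

chart = bottom_up([rng.standard_normal(8) for _ in range(4)])
print(chart[(0, 4)].shape)  # (8,): representation for the whole question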
- SQuINTing at VQA Models: Introspecting VQA Models with Sub-Questions [66.86887670416193]
We show that state-of-the-art VQA models have comparable performance in answering perception and reasoning questions, but suffer from consistency problems.
To address this shortcoming, we propose an approach called Sub-Question-aware Network Tuning (SQuINT).
We show that SQuINT improves model consistency by 5%, also marginally improving performance on the Reasoning questions in VQA, while also displaying better attention maps.
arXiv Detail & Related papers (2020-01-20T01:02:36Z)