BinaryVQA: A Versatile Test Set to Evaluate the Out-of-Distribution
Generalization of VQA Models
- URL: http://arxiv.org/abs/2301.12032v1
- Date: Sat, 28 Jan 2023 00:03:44 GMT
- Title: BinaryVQA: A Versatile Test Set to Evaluate the Out-of-Distribution
Generalization of VQA Models
- Authors: Ali Borji
- Abstract summary: We introduce a new test set for visual question answering (VQA) called BinaryVQA to push the limits of VQA models.
Our dataset includes 7,800 questions across 1,024 images and covers a wide variety of objects, topics, and concepts.
Around 63% of the questions have positive answers.
- Score: 47.64219291655723
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a new test set for visual question answering (VQA) called
BinaryVQA to push the limits of VQA models. Our dataset includes 7,800
questions across 1,024 images and covers a wide variety of objects, topics, and
concepts. For easy model evaluation, we only consider binary questions.
Questions and answers are formulated and verified carefully and manually.
Around 63% of the questions have positive answers. The median number of
questions per image and the median question length are 7 and 5, respectively.
The state-of-the-art OFA model achieves 75% accuracy on the BinaryVQA dataset,
which is significantly lower than its performance on the VQA v2 test-dev
dataset (94.7%). We also analyze model behavior along several dimensions,
including: a) performance over different categories such as text, counting,
and gaze direction, b) model interpretability, c) the effect of question
length on accuracy, d) the bias of models towards positive answers, for which
we introduce a new score called ShuffleAcc, and e) sensitivity to spelling and
grammar errors. Our investigation demonstrates the difficulty of our dataset
and shows that it can challenge VQA models for the next few years. Data and
code are publicly
available at: DATA and CODE.
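The abstract highlights a bias of models towards positive ("yes") answers and a new ShuffleAcc score that accounts for it. The exact ShuffleAcc definition is given in the paper; purely as an illustration, the sketch below computes plain accuracy, the rate of "yes" predictions, and a shuffled-answer chance baseline for a binary VQA test set. The sample fields ("image", "question", "answer") and the model interface are assumptions.

```python
import random
from typing import Callable, Dict, List

# Illustrative harness for a binary (yes/no) VQA test set such as BinaryVQA.
# Field names and the model interface are assumptions; the paper defines
# ShuffleAcc precisely.

def evaluate(model: Callable[[str, str], str], samples: List[Dict]) -> Dict[str, float]:
    """Accuracy plus the fraction of 'yes' predictions (positive-answer bias)."""
    correct = yes_preds = 0
    for s in samples:
        pred = model(s["image"], s["question"]).strip().lower()
        correct += int(pred == s["answer"].strip().lower())
        yes_preds += int(pred == "yes")
    n = len(samples)
    return {"accuracy": correct / n, "yes_rate": yes_preds / n}

def shuffled_answer_baseline(samples: List[Dict], trials: int = 100, seed: int = 0) -> float:
    """Chance-level accuracy when ground-truth answers are shuffled across
    questions; a model scoring near this level likely exploits answer
    statistics (e.g., the ~63% 'yes' rate) rather than the image."""
    rng = random.Random(seed)
    answers = [s["answer"].strip().lower() for s in samples]
    total = 0.0
    for _ in range(trials):
        shuffled = answers[:]
        rng.shuffle(shuffled)
        total += sum(a == b for a, b in zip(answers, shuffled)) / len(answers)
    return total / trials
```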
Related papers
- Generalizing Visual Question Answering from Synthetic to Human-Written Questions via a Chain of QA with a Large Language Model [4.41132900194195]
We propose a new method called chain of QA for human-written questions (CoQAH).
CoQAH utilizes a sequence of QA interactions between a large language model and a VQA model trained on synthetic data to reason and derive logical answers for human-written questions.
We tested the effectiveness of CoQAH on two types of human-written VQA datasets for 3D-rendered and chest X-ray images.
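As a rough illustration only (not the authors' implementation), the sketch below shows the general chain-of-QA idea: a large language model repeatedly asks a VQA model simple sub-questions about the image and then commits to a final answer. `llm` and `vqa` are hypothetical callables.

```python
# Hypothetical sketch of a chain-of-QA loop in the spirit of CoQAH.
# `llm(prompt) -> str` and `vqa(image, question) -> str` are placeholders.

def chain_of_qa(llm, vqa, image, question, max_turns=5):
    transcript = [f"Target question: {question}"]
    for _ in range(max_turns):
        step = llm(
            "Ask one simple question about the image that helps answer the "
            "target question, or reply 'FINAL: <answer>' if you can answer.\n"
            + "\n".join(transcript)
        )
        if step.startswith("FINAL:"):
            return step[len("FINAL:"):].strip()
        transcript.append(f"Q: {step}\nA: {vqa(image, step)}")
    # Force a final answer if the turn budget is exhausted.
    return llm("Give the final answer now.\n" + "\n".join(transcript))
```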
arXiv Detail & Related papers (2024-01-12T06:49:49Z)
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
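The summary does not say which perturbations are used; purely as an illustration of the idea, the snippet below makes a question unanswerable for its image by swapping in an unrelated object word. The field names and the word-swap strategy are assumptions.

```python
import random

# Illustration only: a crude question perturbation that breaks the link
# between question and image, loosely following the idea of perturbing
# either the image or the question. Field names are assumptions.

def perturb_question(sample, object_vocab, seed=0):
    """Replace one word in the question with an unrelated object word."""
    rng = random.Random(seed)
    words = sample["question"].split()
    words[rng.randrange(len(words))] = rng.choice(object_vocab)
    return {"image": sample["image"],
            "question": " ".join(words),
            "answerable": False}
```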
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- Toward Unsupervised Realistic Visual Question Answering [70.67698100148414]
We study the problem of realistic VQA (RVQA), where a model has to reject unanswerable questions (UQs) and answer answerable ones (AQs).
We first point out two drawbacks in current RVQA research: (1) datasets contain too many unchallenging UQs, and (2) a large number of annotated UQs are required for training.
We propose a new testing dataset, RGQA, which combines AQs from an existing VQA dataset with around 29K human-annotated UQs.
For training, pseudo UQs are obtained by randomly pairing images and questions.
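A minimal sketch of the pseudo-UQ idea, assuming each sample is a dict with "image" and "question" fields: unanswerable training examples are manufactured by pairing each question with a randomly chosen different image.

```python
import random

# Sketch: build pseudo unanswerable questions (UQs) by re-pairing each
# question with a random, different image. Field names are assumptions.

def make_pseudo_uqs(samples, seed=0):
    rng = random.Random(seed)
    images = [s["image"] for s in samples]
    pseudo = []
    for s in samples:
        candidates = [im for im in images if im != s["image"]]
        pseudo.append({"image": rng.choice(candidates),
                       "question": s["question"],
                       "answerable": False})
    return pseudo
```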
arXiv Detail & Related papers (2023-03-09T06:58:29Z)
- Human-Adversarial Visual Question Answering [62.30715496829321]
We benchmark state-of-the-art VQA models against human-adversarial examples.
We find that a wide range of state-of-the-art models perform poorly when evaluated on these examples.
arXiv Detail & Related papers (2021-06-04T06:25:32Z)
- IQ-VQA: Intelligent Visual Question Answering [3.09911862091928]
We show that our framework improves consistency of VQA models by 15% on the rule-based dataset.
We also quantitatively show improvement in attention maps, which highlights better multi-modal understanding of vision and language.
arXiv Detail & Related papers (2020-07-08T20:41:52Z)
- Selective Question Answering under Domain Shift [90.021577320085]
Abstention policies based solely on the model's softmax probabilities fare poorly, since models are overconfident on out-of-domain inputs.
We train a calibrator to identify inputs on which the QA model errs, and abstain when it predicts an error is likely.
Our method answers 56% of questions while maintaining 80% accuracy; in contrast, directly using the model's probabilities only answers 48% at 80% accuracy.
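A small sketch of the calibrate-then-abstain idea, assuming per-example features (e.g., the model's softmax probability, input length) and correctness labels from a held-out set; the paper's actual features and calibrator may differ, and scikit-learn is used here only as a convenient stand-in.

```python
from sklearn.ensemble import GradientBoostingClassifier

# Sketch: train a calibrator to predict whether the QA model's answer is
# correct, then abstain whenever predicted correctness falls below a threshold.

def train_calibrator(features, was_correct):
    clf = GradientBoostingClassifier()
    clf.fit(features, was_correct)  # features: (n, d); was_correct: (n,) in {0, 1}
    return clf

def selective_answer(clf, feature_vec, answer, threshold=0.5):
    p_correct = clf.predict_proba([feature_vec])[0, 1]
    return answer if p_correct >= threshold else None  # None means abstain
```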
arXiv Detail & Related papers (2020-06-16T19:13:21Z)
- SQuINTing at VQA Models: Introspecting VQA Models with Sub-Questions [66.86887670416193]
We show that state-of-the-art VQA models have comparable performance in answering perception and reasoning questions, but suffer from consistency problems.
To address this shortcoming, we propose an approach called Sub-Question-aware Network Tuning (SQuINT).
We show that SQuINT improves model consistency by 5% and marginally improves performance on Reasoning questions in VQA, while also yielding better attention maps.
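As an illustration of the consistency being measured (not the SQuINT training procedure itself), the sketch below checks whether a model that answers a reasoning question correctly also answers its perception sub-questions correctly; the field names and model interface are assumptions.

```python
# Sketch of a sub-question consistency metric. `model(image, question) -> str`
# is a placeholder; each item pairs a reasoning question with its perception
# sub-questions and their ground-truth answers.

def sub_question_consistency(model, items):
    scored = consistent = 0
    for it in items:
        if model(it["image"], it["main_q"]) != it["main_a"]:
            continue  # measure consistency only on correctly answered main questions
        scored += 1
        consistent += all(model(it["image"], q) == a
                          for q, a in zip(it["sub_qs"], it["sub_as"]))
    return consistent / scored if scored else 0.0
```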
arXiv Detail & Related papers (2020-01-20T01:02:36Z)