Reliable Visual Question Answering: Abstain Rather Than Answer
Incorrectly
- URL: http://arxiv.org/abs/2204.13631v1
- Date: Thu, 28 Apr 2022 16:51:27 GMT
- Title: Reliable Visual Question Answering: Abstain Rather Than Answer
Incorrectly
- Authors: Spencer Whitehead, Suzanne Petryk, Vedaad Shakib, Joseph Gonzalez,
Trevor Darrell, Anna Rohrbach, Marcus Rohrbach
- Abstract summary: We promote a problem formulation for reliable visual question answering (VQA).
We analyze both the models' coverage, the portion of questions answered, and risk, the error on that portion.
We find that although the best performing models achieve over 71% accuracy on the VQA v2 dataset, introducing the option to abstain limits them to answering less than 8% of the questions to achieve a low risk of error (i.e., 1%).
This motivates us to utilize a multimodal selection function to directly estimate the correctness of the predicted answers, which we show can triple the coverage from, for example, 5.0% to 16.7% at 1% risk.
- Score: 100.60560477391732
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine learning has advanced dramatically, narrowing the accuracy gap to
humans in multimodal tasks like visual question answering (VQA). However, while
humans can say "I don't know" when they are uncertain (i.e., abstain from
answering a question), such ability has been largely neglected in multimodal
research, despite the importance of this problem to the usage of VQA in real
settings. In this work, we promote a problem formulation for reliable VQA,
where we prefer abstention over providing an incorrect answer. We first enable
abstention capabilities for several VQA models, and analyze both their
coverage, the portion of questions answered, and risk, the error on that
portion. For that we explore several abstention approaches. We find that
although the best performing models achieve over 71% accuracy on the VQA v2
dataset, introducing the option to abstain by directly using a model's softmax
scores limits them to answering less than 8% of the questions to achieve a low
risk of error (i.e., 1%). This motivates us to utilize a multimodal selection
function to directly estimate the correctness of the predicted answers, which
we show can triple the coverage from, for example, 5.0% to 16.7% at 1% risk.
While it is important to analyze both coverage and risk, these metrics have a
trade-off which makes comparing VQA models challenging. To address this, we
also propose an Effective Reliability metric for VQA that places a larger cost
on incorrect answers compared to abstentions. This new problem formulation,
metric, and analysis for VQA provide the groundwork for building effective and
reliable VQA models that have the self-awareness to abstain if and only if they
don't know the answer.
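To make the risk-coverage trade-off and the proposed metric concrete, here is a minimal Python sketch, not the paper's implementation: `confidence` stands in for either the softmax score or a learned multimodal selection function, and the `-cost` penalty for wrong answers is an assumed instantiation of the Effective Reliability idea ("a larger cost on incorrect answers compared to abstentions").

```python
import numpy as np

def coverage_risk(acc, confidence, threshold):
    """Coverage = fraction of questions answered; risk = error on that portion.

    acc:        per-question VQA accuracy in [0, 1]
    confidence: per-question selection score (e.g., max softmax probability
                or the output of a learned multimodal selection function)
    threshold:  answer only when confidence >= threshold, otherwise abstain
    """
    answered = confidence >= threshold
    coverage = answered.mean()
    if coverage == 0:
        return 0.0, 0.0
    risk = 1.0 - acc[answered].mean()  # error on the answered portion only
    return coverage, risk

def effective_reliability(acc, confidence, threshold, cost=1.0):
    """Assumed form: abstentions score 0, correct answers score their accuracy,
    and incorrect answers incur a penalty of -cost."""
    answered = confidence >= threshold
    per_question = np.where(answered, np.where(acc > 0, acc, -cost), 0.0)
    return per_question.mean()

# Toy usage with synthetic scores; real inputs would come from a VQA model.
rng = np.random.default_rng(0)
acc = (rng.random(1000) < 0.71).astype(float)            # ~71% accurate model
conf = np.clip(0.3 * acc + 0.7 * rng.random(1000), 0, 1)  # noisy confidence
for t in (0.5, 0.8, 0.95):
    cov, risk = coverage_risk(acc, conf, t)
    phi = effective_reliability(acc, conf, t, cost=1.0)
    print(f"threshold={t:.2f}  coverage={cov:.2%}  risk={risk:.2%}  phi={phi:.3f}")
```

Sweeping the threshold traces out a risk-coverage curve; plugging a learned selection score into `confidence` in place of the raw softmax score is what the abstract reports as tripling coverage at 1% risk.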
Related papers
- Exploring Question Decomposition for Zero-Shot VQA [99.32466439254821]
We investigate a question decomposition strategy for visual question answering.
We show that naive application of model-written decompositions can hurt performance.
We introduce a model-driven selective decomposition approach for second-guessing predictions and correcting errors.
arXiv Detail & Related papers (2023-10-25T23:23:57Z)
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- Improving Selective Visual Question Answering by Learning from Your Peers [74.20167944693424]
Visual Question Answering (VQA) models can have difficulties abstaining from answering when they are wrong.
We propose Learning from Your Peers (LYP) approach for training multimodal selection functions for making abstention decisions.
Our approach uses predictions from models trained on distinct subsets of the training data as targets for optimizing a Selective VQA model.
arXiv Detail & Related papers (2023-06-14T21:22:01Z)
- BinaryVQA: A Versatile Test Set to Evaluate the Out-of-Distribution Generalization of VQA Models [47.64219291655723]
We introduce a new test set for visual question answering (VQA) called BinaryVQA to push the limits of VQA models.
Our dataset includes 7,800 questions across 1,024 images and covers a wide variety of objects, topics, and concepts.
Around 63% of the questions have positive answers.
arXiv Detail & Related papers (2023-01-28T00:03:44Z)
- Contrast and Classify: Training Robust VQA Models [60.80627814762071]
We propose a novel training paradigm (ConClaT) that optimizes both cross-entropy and contrastive losses.
We find that optimizing both losses -- either alternately or jointly -- is key to effective training.
arXiv Detail & Related papers (2020-10-13T00:23:59Z)
- Selective Question Answering under Domain Shift [90.021577320085]
Abstention policies based solely on the model's softmax probabilities fare poorly, since models are overconfident on out-of-domain inputs.
We train a calibrator to identify inputs on which the QA model errs, and abstain when it predicts an error is likely.
Our method answers 56% of questions while maintaining 80% accuracy; in contrast, directly using the model's probabilities only answers 48% at 80% accuracy.
arXiv Detail & Related papers (2020-06-16T19:13:21Z)
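As a rough illustration of the calibrator idea in "Selective Question Answering under Domain Shift" above, the sketch below trains a small classifier to predict whether the base model is correct and abstains when an error looks likely. The feature set and the gradient-boosting choice are assumptions for illustration, not the paper's exact setup.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def make_features(probs, question_lengths):
    """Assumed per-question features: max softmax probability, entropy of the
    answer distribution, and question length."""
    max_p = probs.max(axis=1)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    return np.stack([max_p, entropy, question_lengths], axis=1)

# Toy held-out data: answer probabilities and whether the model was correct.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(10) * 0.3, size=2000)
qlen = rng.integers(3, 20, size=2000).astype(float)
correct = (probs.max(axis=1) + 0.1 * rng.standard_normal(2000)) > 0.5

X = make_features(probs, qlen)
calibrator = GradientBoostingClassifier().fit(X, correct)

# Abstain when the calibrator thinks an error is likely (the toy reuses X;
# a real setup would score new, unseen questions).
p_correct = calibrator.predict_proba(X)[:, 1]
answer_mask = p_correct >= 0.8  # tunable operating point
print(f"answered {answer_mask.mean():.1%} of questions")
```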
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.