Exploring Question Decomposition for Zero-Shot VQA
- URL: http://arxiv.org/abs/2310.17050v1
- Date: Wed, 25 Oct 2023 23:23:57 GMT
- Title: Exploring Question Decomposition for Zero-Shot VQA
- Authors: Zaid Khan, Vijay Kumar BG, Samuel Schulter, Manmohan Chandraker, Yun Fu
- Abstract summary: We investigate a question decomposition strategy for visual question answering.
We show that naive application of model-written decompositions can hurt performance.
We introduce a model-driven selective decomposition approach for second-guessing predictions and correcting errors.
- Score: 99.32466439254821
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual question answering (VQA) has traditionally been treated as a
single-step task where each question receives the same amount of effort, unlike
natural human question-answering strategies. We explore a question
decomposition strategy for VQA to overcome this limitation. We probe the
ability of recently developed large vision-language models to use human-written
decompositions and produce their own decompositions of visual questions,
finding they are capable of learning both tasks from demonstrations alone.
However, we show that naive application of model-written decompositions can
hurt performance. We introduce a model-driven selective decomposition approach
for second-guessing predictions and correcting errors, and validate its
effectiveness on eight VQA tasks across three domains, showing consistent
improvements in accuracy, including improvements of >20% on medical VQA
datasets and boosting the zero-shot performance of BLIP-2 above chance on a VQA
reformulation of the challenging Winoground task. Project Site:
https://zaidkhan.me/decomposition-0shot-vqa/
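To make the selective strategy described in the abstract concrete, the sketch below shows one way such a second-guessing loop could be wired up. It assumes a generic vision-language model exposed through three callables (answer_fn, confidence_fn, decompose_fn); these names and the confidence threshold are illustrative assumptions, not the paper's actual API.

```python
from typing import Callable, List, Tuple

def selective_decomposition(
    answer_fn: Callable[..., str],          # (image, question, context=None) -> answer
    confidence_fn: Callable[..., float],    # (image, question, answer) -> confidence in [0, 1]
    decompose_fn: Callable[..., List[str]], # (image, question) -> sub-questions
    image,
    question: str,
    threshold: float = 0.5,                 # illustrative cutoff, not a value from the paper
) -> str:
    """Answer directly; decompose only when the first answer looks unreliable."""
    # First pass: answer the question in a single step, as standard zero-shot VQA would.
    answer = answer_fn(image, question)
    # Keep confident predictions as-is; naively decomposing every question can hurt performance.
    if confidence_fn(image, question, answer) >= threshold:
        return answer
    # Second-guess: let the model write its own sub-questions, answer each of them,
    # then re-answer the original question conditioned on those sub-answers.
    sub_questions = decompose_fn(image, question)
    context: List[Tuple[str, str]] = [(q, answer_fn(image, q)) for q in sub_questions]
    return answer_fn(image, question, context=context)
```

In practice the confidence signal could come from the model's own likelihood of its first answer, which is why only low-confidence predictions pay the extra cost of decomposition.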
Related papers
- Reducing Hallucinations: Enhancing VQA for Flood Disaster Damage
Assessment with Visual Contexts [6.820160182829294]
We propose a zero-shot VQA method named Flood Disaster VQA with Two-Stage Prompt (VQA-TSP).
The model generates a thought process in the first stage and then uses that thought process to generate the final answer in the second stage.
Our method exceeds the performance of state-of-the-art zero-shot VQA models for flood disaster scenarios overall.
arXiv Detail & Related papers (2023-12-21T13:45:02Z)
- Modularized Zero-shot VQA with Pre-trained Models [20.674979268279728]
We propose a modularized zero-shot network that explicitly decomposes questions into sub-reasoning steps and is highly interpretable.
Our experiments on two VQA benchmarks under the zero-shot setting demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2023-05-27T05:00:14Z)
- Toward Unsupervised Realistic Visual Question Answering [70.67698100148414]
We study the problem of realistic VQA (RVQA), where a model has to reject unanswerable questions (UQs) and answer answerable ones (AQs).
We first point out 2 drawbacks in current RVQA research, where (1) datasets contain too many unchallenging UQs and (2) a large number of annotated UQs are required for training.
We propose a new testing dataset, RGQA, which combines AQs from an existing VQA dataset with around 29K human-annotated UQs.
This combines pseudo UQs obtained by randomly pairing images and questions, with an
arXiv Detail & Related papers (2023-03-09T06:58:29Z)
- Reliable Visual Question Answering: Abstain Rather Than Answer Incorrectly [100.60560477391732]
We promote a problem formulation for reliable visual question answering (VQA).
We analyze both their coverage, the portion of questions answered, and risk, the error rate on that portion (a small sketch of these two quantities appears after this list).
We find that although the best-performing models achieve over 71% accuracy on the VQA v2 dataset, introducing the option to abstain limits them to answering less than 8% of the questions in order to achieve a low risk of error (i.e., 1%).
This motivates us to utilize a multimodal selection function to directly estimate the correctness of the predicted answers, which we show can triple the coverage from, for example, 5.0% to 16.7% at
arXiv Detail & Related papers (2022-04-28T16:51:27Z)
- Loss re-scaling VQA: Revisiting the Language Prior Problem from a Class-imbalance View [129.392671317356]
We propose to interpret the language prior problem in VQA from a class-imbalance view.
It explicitly reveals why the VQA model tends to produce a frequent yet obviously wrong answer.
We also justify the validity of the class imbalance interpretation scheme on other computer vision tasks, such as face recognition and image classification.
arXiv Detail & Related papers (2020-10-30T00:57:17Z)
- Contrast and Classify: Training Robust VQA Models [60.80627814762071]
We propose a novel training paradigm (ConClaT) that optimizes both cross-entropy and contrastive losses.
We find that optimizing both losses -- either alternately or jointly -- is key to effective training.
arXiv Detail & Related papers (2020-10-13T00:23:59Z)
- Counterfactual Samples Synthesizing for Robust Visual Question Answering [104.72828511083519]
We propose a model-agnostic Counterfactual Samples Synthesizing (CSS) training scheme.
CSS generates numerous counterfactual training samples by masking critical objects in images or words in questions.
We achieve a record-breaking performance of 58.95% on VQA-CP v2, with 6.5% gains.
arXiv Detail & Related papers (2020-03-14T08:34:31Z)
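As referenced in the Reliable VQA entry above, the following is a minimal sketch of how coverage and risk could be computed for a selective VQA model. Representing abstention as None and using exact-match scoring are simplifying assumptions for illustration, not that paper's evaluation protocol.

```python
from typing import List, Optional, Tuple

def coverage_and_risk(predictions: List[Optional[str]], answers: List[str]) -> Tuple[float, float]:
    """Selective-VQA metrics, where a model abstains by predicting None.

    coverage = fraction of questions the model chooses to answer
    risk     = error rate measured only on the answered portion
    """
    answered = [(p, a) for p, a in zip(predictions, answers) if p is not None]
    if not answered:
        return 0.0, 0.0
    coverage = len(answered) / len(predictions)
    risk = sum(p != a for p, a in answered) / len(answered)
    return coverage, risk

# Example: 3 of 5 questions answered (coverage 0.6); 1 of those 3 is wrong (risk ~0.33).
preds = ["cat", None, "2", None, "red"]
gold = ["cat", "dog", "2", "blue", "green"]
print(coverage_and_risk(preds, gold))  # (0.6, 0.3333333333333333)
```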
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.