Modularized Zero-shot VQA with Pre-trained Models
- URL: http://arxiv.org/abs/2305.17369v2
- Date: Wed, 24 Jan 2024 12:22:40 GMT
- Title: Modularized Zero-shot VQA with Pre-trained Models
- Authors: Rui Cao and Jing Jiang
- Abstract summary: We propose a modularized zero-shot network that explicitly decomposes questions into sub-reasoning steps and is highly interpretable.
Our experiments on two VQA benchmarks under the zero-shot setting demonstrate the effectiveness of our method.
- Score: 20.674979268279728
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale pre-trained models (PTMs) show great zero-shot capabilities. In this paper, we study how to leverage them for zero-shot visual question answering (VQA). Our approach is motivated by a few observations. First, VQA questions often require multiple steps of reasoning, a capability that most PTMs still lack. Second, different steps in a VQA reasoning chain require different skills, such as object detection and relational reasoning, and a single PTM may not possess all of these skills. Third, recent work on zero-shot VQA does not explicitly model multi-step reasoning chains, which makes it less interpretable than a decomposition-based approach. We propose a modularized zero-shot network that explicitly decomposes questions into sub-reasoning steps and is highly interpretable. We convert the sub-reasoning tasks into objectives that PTMs can accept and assign each task to a suitable PTM without any adaptation. Our experiments on two VQA benchmarks under the zero-shot setting demonstrate the effectiveness of our method and its better interpretability compared with several baselines.
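To make the modular idea concrete, below is a minimal sketch of such a pipeline: a question is decomposed into sub-reasoning steps and each step is routed to a frozen pre-trained module, with every intermediate output kept as an interpretable trace. The skill names, the toy rule-based decomposition, and the placeholder module wrappers are illustrative assumptions, not the specific components described in the paper.

```python
# Minimal sketch of a modularized zero-shot VQA pipeline (illustrative only).
# Each "module" stands in for a frozen pre-trained model; no fine-tuning occurs.

from typing import Callable, Dict, List, Tuple

# A module takes (image, text query) and returns a textual intermediate result.
ModuleFn = Callable[[object, str], str]


def build_registry() -> Dict[str, ModuleFn]:
    """Hypothetical skill registry; a real system would wrap actual PTMs
    (e.g., an open-vocabulary detector, a captioner, a QA language model)."""
    return {
        "detect": lambda image, query: f"<regions matching '{query}'>",
        "relate": lambda image, query: f"<relation check: {query}>",
        "answer": lambda image, query: f"<answer to '{query}'>",
    }


def decompose(question: str) -> List[Tuple[str, str]]:
    """Toy rule-based decomposition into (skill, sub-query) steps.
    Only illustrates the interface, not the paper's decomposition method."""
    if " next to " in question:
        target = question.split(" next to ")[-1].rstrip("?")
        return [("detect", target), ("relate", question), ("answer", question)]
    return [("answer", question)]


def modular_vqa(image: object, question: str) -> str:
    """Run the sub-reasoning chain, printing each step as an interpretable trace."""
    registry = build_registry()
    result = ""
    for skill, sub_query in decompose(question):
        result = registry[skill](image, sub_query)
        print(f"[{skill}] {sub_query!r} -> {result}")
    return result


if __name__ == "__main__":
    print(modular_vqa(image=None, question="What is next to the red car?"))
```

In a real system each lambda would wrap an actual pre-trained model behind the same (image, text) -> text interface, so skills can be added or swapped without any adaptation of the underlying PTMs.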
Related papers
- Mixture of Rationale: Multi-Modal Reasoning Mixture for Visual Question Answering [19.351516992903697]
We propose Mixture of Rationales (MoR), a novel multi-modal reasoning method that mixes multiple rationales for zero-shot visual question answering.
MoR achieves a 12.43% accuracy improvement on NLVR2, and a 2.45% accuracy improvement on OKVQA-S.
arXiv Detail & Related papers (2024-06-03T15:04:47Z) - Improving Zero-shot Visual Question Answering via Large Language Models
with Reasoning Question Prompts [22.669502403623166]
We present Reasoning Question Prompts for VQA tasks, which can further activate the potential of Large Language Models.
We generate self-contained questions as reasoning question prompts via an unsupervised question edition module.
Each reasoning question prompt clearly indicates the intent of the original question.
Then, the candidate answers associated with their confidence scores, acting as answer heuristics, are fed into LLMs.
arXiv Detail & Related papers (2023-11-15T15:40:46Z) - Exploring Question Decomposition for Zero-Shot VQA [99.32466439254821]
We investigate a question decomposition strategy for visual question answering.
We show that naive application of model-written decompositions can hurt performance.
We introduce a model-driven selective decomposition approach for second-guessing predictions and correcting errors.
arXiv Detail & Related papers (2023-10-25T23:23:57Z) - Toward Unsupervised Realistic Visual Question Answering [70.67698100148414]
We study the problem of realistic VQA (RVQA), where a model has to reject unanswerable questions (UQs) and answer answerable ones (AQs).
We first point out two drawbacks in current RVQA research: (1) datasets contain too many unchallenging UQs, and (2) a large number of annotated UQs are required for training.
We propose a new testing dataset, RGQA, which combines AQs from an existing VQA dataset with around 29K human-annotated UQs.
For training, pseudo UQs are obtained by randomly pairing images and questions.
arXiv Detail & Related papers (2023-03-09T06:58:29Z) - Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models
with Zero Training [82.30343537942608]
We propose Plug-and-Play VQA (PNP-VQA), a modular framework for zero-shot VQA.
We first generate question-guided, informative image captions and pass them to a pre-trained language model (PLM) as context for question answering.
PNP-VQA achieves state-of-the-art results on zero-shot VQAv2 and GQA.
arXiv Detail & Related papers (2022-10-17T06:29:54Z) - Counterfactual Variable Control for Robust and Interpretable Question
Answering [57.25261576239862]
Deep neural network-based question answering (QA) models are neither robust nor explainable in many cases.
In this paper, we inspect such spurious "capability" of QA models using causal inference.
We propose a novel approach called Counterfactual Variable Control (CVC) that explicitly mitigates any shortcut correlation.
arXiv Detail & Related papers (2020-10-12T10:09:05Z) - Harvesting and Refining Question-Answer Pairs for Unsupervised QA [95.9105154311491]
We introduce two approaches to improve unsupervised Question Answering (QA).
First, we harvest lexically and syntactically divergent questions from Wikipedia to automatically construct a corpus of question-answer pairs (named RefQA).
Second, we take advantage of the QA model to extract more appropriate answers, which iteratively refines data over RefQA.
arXiv Detail & Related papers (2020-05-06T15:56:06Z) - Counterfactual Samples Synthesizing for Robust Visual Question Answering [104.72828511083519]
We propose a model-agnostic Counterfactual Samples Synthesizing (CSS) training scheme.
CSS generates numerous counterfactual training samples by masking critical objects in images or words in questions.
We achieve a record-breaking performance of 58.95% on VQA-CP v2, with 6.5% gains.
arXiv Detail & Related papers (2020-03-14T08:34:31Z)