Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training
- URL: http://arxiv.org/abs/2210.08773v3
- Date: Mon, 20 Mar 2023 02:55:26 GMT
- Title: Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training
- Authors: Anthony Meng Huat Tiong, Junnan Li, Boyang Li, Silvio Savarese, Steven C.H. Hoi
- Abstract summary: We propose Plug-and-Play VQA (PNP-VQA), a modular framework for zero-shot VQA.
We first generate question-guided informative image captions, and pass the captions to a PLM as context for question answering.
PNP-VQA achieves state-of-the-art results on zero-shot VQAv2 and GQA.
- Score: 82.30343537942608
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual question answering (VQA) is a hallmark of vision and language
reasoning and a challenging task under the zero-shot setting. We propose
Plug-and-Play VQA (PNP-VQA), a modular framework for zero-shot VQA. In contrast
to most existing works, which require substantial adaptation of pretrained
language models (PLMs) for the vision modality, PNP-VQA requires no additional
training of the PLMs. Instead, we propose to use natural language and network
interpretation as an intermediate representation that glues pretrained models
together. We first generate question-guided informative image captions, and
pass the captions to a PLM as context for question answering. Surpassing
end-to-end trained baselines, PNP-VQA achieves state-of-the-art results on
zero-shot VQAv2 and GQA. With 11B parameters, it outperforms the 80B-parameter
Flamingo model by 8.5% on VQAv2. With 738M PLM parameters, PNP-VQA achieves an
improvement of 9.1% on GQA over FewVLM with 740M PLM parameters. Code is
released at https://github.com/salesforce/LAVIS/tree/main/projects/pnp-vqa
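To make the two-stage design concrete, the sketch below outlines the pipeline described in the abstract: question-guided captioning followed by reading-comprehension-style question answering with a frozen PLM. The helper names (generate_question_guided_captions, answer_with_plm) and their defaults are illustrative assumptions, not the released LAVIS API; see the repository linked above for the actual implementation.

```python
# Minimal sketch of the PNP-VQA pipeline, assuming hypothetical helpers
# rather than the released LAVIS API.

from typing import List


def generate_question_guided_captions(image, question: str,
                                       num_captions: int = 100) -> List[str]:
    """Hypothetical stand-in for stage 1: a frozen captioning model (e.g. BLIP)
    samples diverse captions from image regions deemed relevant to the question
    via network interpretation (cross-attention saliency)."""
    raise NotImplementedError("plug in a pretrained captioning model here")


def answer_with_plm(question: str, context: str) -> str:
    """Hypothetical stand-in for stage 2: a frozen QA-capable PLM (e.g. UnifiedQA)
    answers the question, treating the captions as reading-comprehension context."""
    raise NotImplementedError("plug in a pretrained language model here")


def pnp_vqa(image, question: str) -> str:
    # Natural language (the captions) is the only interface between the
    # vision model and the PLM -- neither model receives additional training.
    captions = generate_question_guided_captions(image, question)
    context = " ".join(captions)
    return answer_with_plm(question, context)
```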
Related papers
- Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts [3.6064695344878093]
Visual question answering (VQA) is known as an AI-complete task, as it requires understanding, reasoning, and inference over both visual and language content.
This study explores the impact of incorporating image captioning as an intermediary process within the VQA pipeline.
arXiv Detail & Related papers (2024-04-12T16:35:23Z)
- Zero-shot Translation of Attention Patterns in VQA Models to Natural Language [65.94419474119162]
ZS-A2T is a framework that translates the transformer attention of a given model into natural language without requiring any training.
We consider this in the context of Visual Question Answering (VQA).
Our framework does not require any training and allows the drop-in replacement of different guiding sources.
arXiv Detail & Related papers (2023-11-08T22:18:53Z)
- Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models [59.05769810380928]
Rephrase, Augment and Reason (RepARe) is a gradient-free framework that extracts salient details about the image using the underlying vision-language model.
We show that RepARe can result in a 3.85% (absolute) increase in zero-shot accuracy on VQAv2, and gains of 6.41 and 7.94 percentage points on A-OKVQA and VizWiz, respectively.
arXiv Detail & Related papers (2023-10-09T16:57:57Z)
- Modular Visual Question Answering via Code Generation [134.59005611826777]
We present a framework that formulates visual question answering as modular code generation.
Our approach requires no additional training and relies on pre-trained language models (LMs), visual models pre-trained on image-caption pairs, and fifty VQA examples used for in-context learning.
Our approach improves accuracy on the COVR dataset by at least 3% and on the GQA dataset by roughly 2% compared to the few-shot baseline that does not employ code generation.
arXiv Detail & Related papers (2023-06-08T17:45:14Z)
- Modularized Zero-shot VQA with Pre-trained Models [20.674979268279728]
We propose a modularized zero-shot network that explicitly decomposes questions into sub-reasoning steps and is highly interpretable.
Our experiments on two VQA benchmarks under the zero-shot setting demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2023-05-27T05:00:14Z)
- From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models [111.42052290293965]
Large language models (LLMs) have demonstrated excellent zero-shot generalization to new language tasks.
End-to-end training on vision and language data may bridge the modality and task disconnections, but it is inflexible and computationally expensive.
We propose Img2Prompt, a plug-and-play module that provides prompts that can bridge these modality and task disconnections.
arXiv Detail & Related papers (2022-12-21T08:39:36Z)
- A Good Prompt Is Worth Millions of Parameters? Low-resource Prompt-based Learning for Vision-Language Models [50.27305012063483]
FewVLM is a few-shot prompt-based learner on vision-language tasks.
We pretrain a sequence-to-sequence Transformer model with both prefix language modeling (PrefixLM) and masked language modeling (MaskedLM).
We observe that prompts significantly affect zero-shot performance but marginally affect few-shot performance.
arXiv Detail & Related papers (2021-10-16T06:07:59Z)