A Good Prompt Is Worth Millions of Parameters? Low-resource Prompt-based
Learning for Vision-Language Models
- URL: http://arxiv.org/abs/2110.08484v1
- Date: Sat, 16 Oct 2021 06:07:59 GMT
- Title: A Good Prompt Is Worth Millions of Parameters? Low-resource Prompt-based
Learning for Vision-Language Models
- Authors: Woojeong Jin, Yu Cheng, Yelong Shen, Weizhu Chen, Xiang Ren
- Abstract summary: FewVLM is a few-shot prompt-based learner on vision-language tasks.
We pretrain a sequence-to-sequence Transformer model with both prefix language modeling (PrefixLM) and masked language modeling (MaskedLM).
We observe that prompts significantly affect zero-shot performance but marginally affect few-shot performance.
- Score: 50.27305012063483
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large pretrained vision-language (VL) models can learn a new task with a
handful of examples or generalize to a new task without fine-tuning. However,
these gigantic VL models are hard to deploy for real-world applications due to
their impractically huge model size and slow inference speed. In this work, we
propose FewVLM, a few-shot prompt-based learner on vision-language tasks. We
pretrain a sequence-to-sequence Transformer model with both prefix language
modeling (PrefixLM) and masked language modeling (MaskedLM), and introduce
simple prompts to improve zero-shot and few-shot performance on VQA and image
captioning. Experimental results on five VQA and captioning datasets show that
FewVLM outperforms Frozen, which is 31 times larger than ours, by 18.2 points on
zero-shot VQAv2 and achieves comparable results to a 246 times
larger model, PICa. We observe that (1) prompts significantly affect zero-shot
performance but marginally affect few-shot performance, (2) MaskedLM helps
few-shot VQA tasks while PrefixLM boosts captioning performance, and (3)
performance significantly increases when training set size is small.
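To make the pretraining objectives and prompts concrete, here is a minimal Python sketch, assuming a T5-style sentinel vocabulary, of how PrefixLM and MaskedLM (input, target) pairs could be built from a caption and how hand-crafted zero-shot prompts for VQA and captioning might look. The sentinel tokens and templates are illustrative stand-ins rather than FewVLM's exact ones, and the image features that would also be fed to the encoder are omitted.

```python
import random

# Hypothetical T5-style sentinel tokens; FewVLM's actual vocabulary and
# prompt templates may differ -- this is an illustrative sketch only.
SENTINELS = ["<extra_id_0>", "<extra_id_1>", "<extra_id_2>"]

def prefix_lm_pair(caption: str, split_ratio: float = 0.5):
    """PrefixLM: keep a prefix of the caption as encoder input, predict the rest."""
    words = caption.split()
    cut = max(1, int(len(words) * split_ratio))
    return " ".join(words[:cut]), " ".join(words[cut:])

def masked_lm_pair(caption: str, mask_prob: float = 0.15):
    """MaskedLM (span-corruption style): replace random words with sentinels
    and train the decoder to reconstruct them."""
    words = caption.split()
    n_mask = min(max(1, int(len(words) * mask_prob)), len(SENTINELS))
    positions = sorted(random.sample(range(len(words)), n_mask))
    inp, tgt = list(words), []
    for sid, pos in enumerate(positions):
        tgt.extend([SENTINELS[sid], words[pos]])
        inp[pos] = SENTINELS[sid]
    return " ".join(inp), " ".join(tgt)

def vqa_prompt(question: str) -> str:
    """Hypothetical zero-shot VQA prompt fed alongside the image features."""
    return f"question: {question} answer: {SENTINELS[0]}"

def caption_prompt() -> str:
    """Hypothetical zero-shot captioning prompt."""
    return f"an image of {SENTINELS[0]}"

if __name__ == "__main__":
    cap = "a brown dog catches a frisbee in the park"
    print(prefix_lm_pair(cap))
    print(masked_lm_pair(cap))
    print(vqa_prompt("what is the dog catching?"))
```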
Related papers
- PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning [78.23573511641548]
Vision-language pre-training has significantly elevated performance across a wide range of image-language applications.
Yet, the pre-training process for video-related tasks demands exceptionally large computational and data resources.
This paper investigates a straightforward, highly efficient, and resource-light approach to adapting an existing image-language pre-trained model for video understanding.
arXiv Detail & Related papers (2024-04-25T19:29:55Z)
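Since the PLLaVA entry above centers on a parameter-free way to extend an image-language model to video, here is a minimal sketch, assuming per-frame visual tokens have already been produced by an image encoder, of temporal pooling with no learned weights. The tensor shapes and the choice of adaptive average pooling are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def pool_frame_tokens(frame_feats: torch.Tensor, out_frames: int = 4) -> torch.Tensor:
    """Parameter-free temporal pooling of per-frame visual tokens.

    frame_feats: (T, N, D) = sampled frames x visual tokens per frame x hidden dim,
    as produced by applying an image encoder to each frame (shapes are assumed).
    Returns (out_frames * N, D) tokens to feed a language model, using adaptive
    average pooling over the temporal axis only.
    """
    t, n, d = frame_feats.shape
    x = frame_feats.permute(2, 0, 1).unsqueeze(0)               # (1, D, T, N)
    x = F.adaptive_avg_pool2d(x, output_size=(out_frames, n))   # (1, D, out_frames, N)
    return x.squeeze(0).permute(1, 2, 0).reshape(out_frames * n, d)

# Example: 16 sampled frames, 256 visual tokens each, hidden size 1024.
tokens = pool_frame_tokens(torch.randn(16, 256, 1024))
print(tokens.shape)  # torch.Size([1024, 1024]): 4 pooled frames x 256 tokens, dim 1024
```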
- VILA: On Pre-training for Visual Language Models [74.08039416548209]
We study the design options for VLM pre-training through step-by-step controllable comparisons.
We build VILA, a Visual Language model family that consistently outperforms the state-of-the-art models.
arXiv Detail & Related papers (2023-12-12T18:58:18Z)
- Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models [59.05769810380928]
Rephrase, Augment and Reason (RepARe) is a gradient-free framework that extracts salient details about the image using the underlying vision-language model.
We show that RepARe yields a 3.85% (absolute) increase in zero-shot accuracy on VQAv2, and increases of 6.41 and 7.94 percentage points on A-OKVQA and VizWiz, respectively.
arXiv Detail & Related papers (2023-10-09T16:57:57Z)
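To illustrate the gradient-free control flow the RepARe entry above describes, the sketch below rephrases a question using details extracted from the image and keeps the candidate the model answers most confidently. The `caption`, `rephrase`, and `answer_with_confidence` callables are hypothetical placeholders for the underlying vision-language model calls, not RepARe's actual interface.

```python
from typing import Callable, List, Tuple

def repare_style_query(
    image,
    question: str,
    caption: Callable[[object], str],
    rephrase: Callable[[str, str], List[str]],
    answer_with_confidence: Callable[[object, str], Tuple[str, float]],
) -> str:
    """Gradient-free rephrase-then-select loop (illustrative, not RepARe's exact method).

    1. Extract salient image details with the underlying VLM (here: a caption).
    2. Generate candidate rewrites of the question grounded in those details.
    3. Answer each candidate and keep the answer the model is most confident in.
    """
    details = caption(image)
    candidates = [question] + rephrase(question, details)
    best_answer, best_conf = None, float("-inf")
    for q in candidates:
        ans, conf = answer_with_confidence(image, q)
        if conf > best_conf:
            best_answer, best_conf = ans, conf
    return best_answer
```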
- MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge [35.45809761628721]
Large-scale Vision-Language (VL) models have shown tremendous success in aligning representations between visual and text modalities.
We propose an unsupervised approach that tunes on video data to improve zero-shot action recognition performance.
Our resulting models demonstrate high transferability to numerous unseen zero-shot downstream tasks.
arXiv Detail & Related papers (2023-03-15T20:17:41Z)
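As background for the zero-shot action recognition setting in the entry above, the following sketch shows the generic recipe an aligned vision-language model uses to classify an unseen action: embed the video, embed a bag of text prompts per class (for example, class names expanded with extra language knowledge), and pick the closest class by cosine similarity. This is the standard zero-shot inference step, not the paper's unsupervised finetuning procedure.

```python
import torch
import torch.nn.functional as F

def zero_shot_action(
    video_emb: torch.Tensor,        # (D,) pooled video embedding from a VL model
    class_text_embs: torch.Tensor,  # (C, K, D): K text prompts per class
) -> int:
    """Pick the action class whose text embeddings are closest to the video."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(class_text_embs, dim=-1)
    # Cosine similarity to every prompt, averaged over each class's prompt bag.
    sims = (t @ v).mean(dim=1)      # (C,)
    return int(sims.argmax())

# Example with random embeddings: 10 classes, 5 prompts each, embedding dim 512.
print(zero_shot_action(torch.randn(512), torch.randn(10, 5, 512)))
```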
- From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models [111.42052290293965]
Large language models (LLMs) have demonstrated excellent zero-shot generalization to new language tasks.
End-to-end training on vision and language data may bridge the disconnections, but is inflexible and computationally expensive.
We propose Img2Prompt, a plug-and-play module that provides prompts that bridge the aforementioned modality and task disconnections.
arXiv Detail & Related papers (2022-12-21T08:39:36Z)
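The Img2Prompt entry above converts an image into text so that a frozen, text-only LLM can answer visual questions. A minimal sketch of that plug-and-play idea is below, assuming a captioner and some exemplar QA pairs are available; the prompt template and the `captioner` / `llm` callables are assumptions, not the module's real API.

```python
from typing import Callable, List, Tuple

def image_to_prompt(
    image,
    question: str,
    captioner: Callable[[object], str],
    exemplar_qas: List[Tuple[str, str]],
) -> str:
    """Build a text-only prompt: image description plus in-context QA exemplars."""
    lines = [f"Context: {captioner(image)}"]
    for q, a in exemplar_qas:
        lines.append(f"Question: {q} Answer: {a}")
    lines.append(f"Question: {question} Answer:")
    return "\n".join(lines)

def zero_shot_vqa(image, question, captioner, exemplar_qas,
                  llm: Callable[[str], str]) -> str:
    """Zero-shot VQA: the frozen LLM only ever sees text produced from the image."""
    return llm(image_to_prompt(image, question, captioner, exemplar_qas))
```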
- EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning [19.354515754130592]
We introduce a distilling-then-pruning framework to compress large vision-language models into smaller, faster, and more accurate ones.
We apply our framework to train EfficientVLM, a fast and accurate vision-language model consisting of 6 vision layers, 3 text layers, and 3 cross-modal fusion layers.
EfficientVLM retains 98.4% of the teacher model's performance and accelerates its inference speed by 2.2x.
arXiv Detail & Related papers (2022-10-14T13:26:41Z)
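The EfficientVLM entry above relies on knowledge distillation before modal-adaptive pruning. The sketch below shows a generic distillation objective (soft-target KL from the teacher plus the usual hard-label cross-entropy) that a compressed student could be trained with; the temperature and weighting are illustrative defaults, and the pruning step is not shown.

```python
import torch
import torch.nn.functional as F

def distillation_loss(
    student_logits: torch.Tensor,  # (B, C) from the small student model
    teacher_logits: torch.Tensor,  # (B, C) from the frozen large VL teacher
    labels: torch.Tensor,          # (B,) hard labels, e.g. answer classes
    temperature: float = 2.0,
    alpha: float = 0.5,
) -> torch.Tensor:
    """Blend soft-target KL (teacher -> student) with the usual hard-label loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Example: batch of 8 examples, 100 answer classes.
loss = distillation_loss(torch.randn(8, 100), torch.randn(8, 100),
                         torch.randint(0, 100, (8,)))
print(loss.item())
```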
- SimVLM: Simple Visual Language Model Pretraining with Weak Supervision [48.98275876458666]
We present a minimalist pretraining framework named Simple Visual Language Model (SimVLM).
SimVLM reduces the training complexity by exploiting large-scale weak supervision.
It achieves new state-of-the-art results on a wide range of discriminative and generative vision-language benchmarks.
arXiv Detail & Related papers (2021-08-24T18:14:00Z)