Prompt Tuning based Adapter for Vision-Language Model Adaption
- URL: http://arxiv.org/abs/2303.15234v1
- Date: Fri, 24 Mar 2023 15:05:17 GMT
- Title: Prompt Tuning based Adapter for Vision-Language Model Adaption
- Authors: Jingchen Sun, Jiayu Qin, Zihao Lin, Changyou Chen
- Abstract summary: We introduce a new model, termed Prompt-Adapter, that combines pre-trained prompt tuning with an efficient adaptation network.
Our approach beats state-of-the-art methods in few-shot image classification on 11 public datasets.
Our proposed method demonstrates the promise of combining prompt tuning and parameter-efficient networks for efficient vision-language model adaptation.
- Score: 38.576215369504446
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large pre-trained vision-language (VL) models have shown significant promise
in adapting to various downstream tasks. However, fine-tuning the entire
network is challenging due to the massive number of model parameters. To
address this issue, efficient adaptation methods such as prompt tuning have
been proposed. We explore the idea of prompt tuning with multi-task pre-trained
initialization and find it can significantly improve model performance. Based
on our findings, we introduce a new model, termed Prompt-Adapter, that combines
pre-trained prompt tuning with an efficient adaptation network. Our approach
beats state-of-the-art methods in few-shot image classification on 11 public
datasets, especially in settings with very few labeled instances, such as 1,
2, 4, and 8 shots per class. Our proposed method demonstrates
the promise of combining prompt tuning and parameter-efficient networks for
efficient vision-language model adaptation. The code is publicly available at:
https://github.com/Jingchensun/prompt_adapter.
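To make the abstract's core idea concrete, below is a minimal, hypothetical sketch of pairing learnable prompt tokens with a small adaptation network over frozen CLIP-style encoders. It is not the authors' implementation (see the linked repository for that); the module names, encoder interfaces, dimensions, and the residual blend ratio are illustrative assumptions.
```python
# Minimal sketch (not the authors' code): learnable prompt tokens feed a frozen
# text encoder, and a small residual adapter refines the resulting class embeddings.
import torch
import torch.nn as nn

class PromptedClassifier(nn.Module):
    def __init__(self, frozen_text_encoder, frozen_image_encoder,
                 n_classes, n_prompt_tokens=16, dim=512, blend=0.2):
        super().__init__()
        self.text_encoder = frozen_text_encoder    # frozen CLIP-style text tower (assumed interface)
        self.image_encoder = frozen_image_encoder  # frozen CLIP-style image tower (assumed interface)
        for p in list(self.text_encoder.parameters()) + list(self.image_encoder.parameters()):
            p.requires_grad = False                # only the prompt and the adapter below are trained

        # Learnable prompt tokens shared across classes; per the abstract these would be
        # initialized from multi-task pre-trained prompts rather than at random.
        self.prompt = nn.Parameter(torch.randn(n_prompt_tokens, dim) * 0.02)

        # Lightweight bottleneck adapter applied to the prompt-derived class embeddings.
        self.adapter = nn.Sequential(
            nn.Linear(dim, dim // 4),
            nn.ReLU(inplace=True),
            nn.Linear(dim // 4, dim),
        )
        self.blend = blend          # residual mixing weight (illustrative choice)
        self.n_classes = n_classes

    def class_embeddings(self, class_token_embeds):
        # class_token_embeds: (n_classes, n_name_tokens, dim) embedded class-name tokens.
        prompts = self.prompt.unsqueeze(0).expand(self.n_classes, -1, -1)
        text_inputs = torch.cat([prompts, class_token_embeds], dim=1)
        feats = self.text_encoder(text_inputs)      # (n_classes, dim); encoder API is assumed
        feats = self.blend * self.adapter(feats) + (1.0 - self.blend) * feats
        return feats / feats.norm(dim=-1, keepdim=True)

    def forward(self, images, class_token_embeds):
        img = self.image_encoder(images)            # (batch, dim); encoder API is assumed
        img = img / img.norm(dim=-1, keepdim=True)
        return 100.0 * img @ self.class_embeddings(class_token_embeds).t()
```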
Related papers
- Cross-Modal Adapter: Parameter-Efficient Transfer Learning Approach for Vision-Language Models [38.751158173278796]
This work introduces a cross-modal parameter-efficient approach named XMAdapter.
XMAdapter establishes cache models for both text and image modalities.
It then leverages retrieval through visual-language bimodal information to gather clues for inference.
arXiv Detail & Related papers (2024-04-19T02:33:23Z)
- BLIP-Adapter: Parameter-Efficient Transfer Learning for Mobile Screenshot Captioning [0.5893124686141781]
This study proposes a combination of adapter methods, which necessitates tuning only the additional modules on the model.
By freezing the parameters of the image captioning models and training only the weights associated with the adapter methods, performance comparable to fine-tuning the entire model can be achieved.
arXiv Detail & Related papers (2023-09-26T09:16:44Z)
- AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation [89.63430567887718]
We propose a novel method utilizing latent diffusion models trained for text-to-image generation to generate images conditioned on audio recordings.
Using a pre-trained audio encoding model, the proposed method encodes audio into a new token, which can be considered as an adaptation layer between the audio and text representations.
arXiv Detail & Related papers (2023-05-22T14:02:44Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models, augmenting Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning (a rough sketch of this pattern appears after this list).
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
- UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling [49.134517040512414]
This paper proposes UniAdapter, which unifies unimodal and multimodal adapters for parameter-efficient cross-modal adaptation on vision-language models.
Experiments show that UniAdapter not only outperforms state-of-the-art methods, but also beats the full fine-tuning strategy.
arXiv Detail & Related papers (2023-02-13T18:59:10Z)
- Unified Vision and Language Prompt Learning [86.1530128487077]
We present a systematic study on two representative prompt tuning methods, namely text prompt tuning and visual prompt tuning.
A major finding is that text prompt tuning fails on data with high intra-class visual variances while visual prompt tuning cannot handle low inter-class variances.
To combine the best from both worlds, we propose a simple approach called Unified Prompt Tuning (UPT), which essentially learns a tiny neural network to jointly optimize prompts across different modalities (a minimal sketch of such a shared prompt generator also appears after this list).
arXiv Detail & Related papers (2022-10-13T17:50:24Z)
- SVL-Adapter: Self-Supervised Adapter for Vision-Language Pretrained Models [9.017387427570538]
Vision-language models such as CLIP are pretrained on large volumes of internet-sourced image and text pairs.
Due to their size, fine-tuning these models on new datasets can be prohibitively expensive, both in terms of the supervision and compute required.
We present a new approach called SVL-Adapter that combines the complementary strengths of both vision-language pretraining and self-supervised representation learning.
arXiv Detail & Related papers (2022-10-07T19:35:08Z)
- Pro-tuning: Unified Prompt Tuning for Vision Tasks [133.12978197265596]
Fine-tuning is the de-facto approach to leverage pre-trained vision models to perform downstream tasks.
In this work, we propose parameter-efficient Prompt tuning (Pro-tuning) to adapt frozen vision models to various downstream vision tasks.
arXiv Detail & Related papers (2022-07-28T21:09:31Z)
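The eP-ALM entry above describes a particularly aggressive form of parameter-efficient adaptation: more than 99% of parameters stay frozen, and only a single linear projection plus one prepended trainable token are learned. A rough sketch of that pattern, under assumed model interfaces (not the eP-ALM code), might look like:
```python
# Rough sketch (interfaces assumed, not the eP-ALM code): only `proj` and `soft_token` train.
import torch
import torch.nn as nn

class PerceptualPrefix(nn.Module):
    def __init__(self, frozen_lm, frozen_vision_encoder, vis_dim=768, lm_dim=1024):
        super().__init__()
        self.lm = frozen_lm                  # frozen language model (HF-style interface assumed)
        self.vision = frozen_vision_encoder  # frozen visual encoder (assumed interface)
        for p in list(self.lm.parameters()) + list(self.vision.parameters()):
            p.requires_grad = False          # the vast majority of parameters stay frozen

        self.proj = nn.Linear(vis_dim, lm_dim)                     # the one trainable projection
        self.soft_token = nn.Parameter(torch.zeros(1, 1, lm_dim))  # the one trainable token

    def forward(self, images, text_token_embeds):
        # text_token_embeds: (batch, seq, lm_dim) embeddings from the frozen LM's embedding table.
        vis = self.proj(self.vision(images)).unsqueeze(1)   # assumes encoder returns (batch, vis_dim)
        prefix = self.soft_token.expand(images.size(0), -1, -1)
        inputs = torch.cat([prefix, vis, text_token_embeds], dim=1)
        return self.lm(inputs_embeds=inputs)                # `inputs_embeds` kwarg is an assumption
```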
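Similarly, the Unified Prompt Tuning entry describes a tiny shared network that produces prompts for both modalities from one set of parameters. A minimal sketch, with all shapes and layer choices assumed rather than taken from the paper:
```python
# Minimal sketch (assumed shapes, not the UPT code): one shared learnable source is mapped
# by a tiny network into text prompts and visual prompts jointly.
import torch
import torch.nn as nn

class UnifiedPromptGenerator(nn.Module):
    def __init__(self, n_tokens=8, text_dim=512, vis_dim=768, hidden=128):
        super().__init__()
        self.base = nn.Parameter(torch.randn(n_tokens, hidden) * 0.02)  # shared prompt source
        self.shared = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU())
        self.to_text = nn.Linear(hidden, text_dim)    # prompts prepended to the text encoder
        self.to_visual = nn.Linear(hidden, vis_dim)   # prompts prepended to the image encoder

    def forward(self):
        h = self.shared(self.base)
        # Returns (n_tokens, text_dim) and (n_tokens, vis_dim) prompt tensors.
        return self.to_text(h), self.to_visual(h)
```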