CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models
- URL: http://arxiv.org/abs/2109.11797v1
- Date: Fri, 24 Sep 2021 08:07:29 GMT
- Title: CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models
- Authors: Yuan Yao, Ao Zhang, Zhengyan Zhang, Zhiyuan Liu, Tat-Seng Chua,
Maosong Sun
- Abstract summary: We present Cross-modal Prompt Tuning (CPT), a novel paradigm for tuning Vision-Language Models (VL-PTMs)
CPT reformulates visual grounding into a fill-in-the-blank problem with color-based co-referential markers in image and text, maximally mitigating the gap.
Comprehensive experimental results show that prompt tuned VL-PTMs outperform their fine-tuned counterparts by a large margin.
- Score: 101.5066760592534
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-Trained Vision-Language Models (VL-PTMs) have shown promising
capabilities in grounding natural language in image data, facilitating a broad
variety of cross-modal tasks. However, we note that there exists a significant
gap between the objective forms of model pre-training and fine-tuning,
resulting in a need for large amounts of labeled data to stimulate the visual
grounding capability of VL-PTMs for downstream tasks. To address the challenge,
we present Cross-modal Prompt Tuning (CPT, alternatively, Colorful Prompt
Tuning), a novel paradigm for tuning VL-PTMs, which reformulates visual
grounding into a fill-in-the-blank problem with color-based co-referential
markers in image and text, maximally mitigating the gap. In this way, our
prompt tuning approach enables strong few-shot and even zero-shot visual
grounding capabilities of VL-PTMs. Comprehensive experimental results show that
prompt tuned VL-PTMs outperform their fine-tuned counterparts by a large margin
(e.g., 17.3% absolute accuracy improvement, and 73.8% relative standard
deviation reduction on average with one shot in RefCOCO evaluation). All the
data and code will be available to facilitate future research.
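To make the color-based fill-in-the-blank reformulation concrete, below is a minimal sketch of the idea: candidate regions are marked with translucent colored blocks, and the referring expression is rewritten as a masked query whose answer is a color word. The palette, template wording, and helper names (`mark_regions`, `build_prompt`) are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of the color-based fill-in-the-blank idea; colors, template,
# and helper names are assumptions for illustration only.
from PIL import Image, ImageDraw

# A small palette of visually distinct colors paired with their color words.
PALETTE = [("red", (240, 0, 30)), ("green", (0, 200, 30)), ("blue", (0, 50, 255))]

def mark_regions(image: Image.Image, boxes, alpha: float = 0.5) -> Image.Image:
    """Overlay a translucent colored block on each candidate region."""
    overlay = image.convert("RGBA")
    for (word, rgb), box in zip(PALETTE, boxes):
        block = Image.new("RGBA", overlay.size, (0, 0, 0, 0))
        ImageDraw.Draw(block).rectangle(box, fill=rgb + (int(255 * alpha),))
        overlay = Image.alpha_composite(overlay, block)
    return overlay.convert("RGB")

def build_prompt(query: str) -> str:
    """Turn a referring expression into a fill-in-the-blank query.
    The masked slot is meant to be filled with a color word, so the
    predicted color identifies the grounded region."""
    return f"{query} is in [MASK] color"

if __name__ == "__main__":
    img = Image.new("RGB", (224, 224), "white")   # stand-in image
    boxes = [(10, 10, 80, 80), (100, 10, 170, 80), (10, 100, 80, 170)]
    marked = mark_regions(img, boxes)
    print(marked.size)
    print(build_prompt("the horse watched by the woman"))
    # A VL-PTM's masked-language-model head would then score each color word;
    # the region of the highest-scoring color is returned as the grounding result.
```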
Related papers
- Visual Fourier Prompt Tuning [63.66866445034855]
We propose the Visual Fourier Prompt Tuning (VFPT) method as a general and effective solution for adapting large-scale transformer-based models.
Our approach incorporates the Fast Fourier Transform into prompt embeddings and harmoniously considers both spatial and frequency domain information.
Our results demonstrate that our approach outperforms current state-of-the-art baselines on two benchmarks.
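As a rough illustration of mixing spatial and frequency-domain prompt information, the sketch below applies an FFT to a subset of the learnable prompt tokens before prepending them to the patch sequence. The split ratio, the use of the FFT magnitude, and the module name are assumptions, not the paper's architecture.

```python
# Minimal sketch of Fourier-augmented visual prompts; shapes, the split ratio,
# and the magnitude trick are assumptions for illustration only.
import torch
import torch.nn as nn

class FourierPrompt(nn.Module):
    def __init__(self, num_prompts: int = 10, dim: int = 768, fourier_ratio: float = 0.5):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.num_fourier = int(num_prompts * fourier_ratio)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, seq_len, dim) embedded image patches.
        spatial = self.prompts[self.num_fourier:]
        # Frequency-domain view of the remaining prompts (magnitude keeps them real-valued).
        freq = torch.fft.fft2(self.prompts[: self.num_fourier]).abs()
        prompt_bank = torch.cat([spatial, freq], dim=0)
        prompt_bank = prompt_bank.unsqueeze(0).expand(patch_tokens.size(0), -1, -1)
        # Prepend the prompts to the patch sequence, as in standard visual prompt tuning.
        return torch.cat([prompt_bank, patch_tokens], dim=1)

if __name__ == "__main__":
    x = torch.randn(2, 196, 768)          # dummy ViT patch embeddings
    print(FourierPrompt()(x).shape)       # torch.Size([2, 206, 768])
```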
arXiv Detail & Related papers (2024-11-02T18:18:35Z) - CVPT: Cross-Attention help Visual Prompt Tuning adapt visual task [15.642102189777072]
Cross Visual Prompt Tuning (CVPT) is a new type of visual fine-tuning.
CVPT calculates cross-attention between the prompt tokens and the embedded tokens, which models the semantic relationship between them.
CVPT significantly improves VPT's performance and efficiency in visual tasks.
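The cross-attention step described above can be sketched as follows: the prompt tokens act as queries over the embedded tokens, so each prompt aggregates input-dependent semantics before being prepended. Layer sizes and where the updated prompts are re-inserted are illustrative assumptions.

```python
# Hedged sketch of cross-attention between prompt tokens and embedded tokens;
# dimensions and the re-insertion point are assumptions for illustration.
import torch
import torch.nn as nn

class CrossPrompt(nn.Module):
    def __init__(self, num_prompts: int = 10, dim: int = 768, heads: int = 8):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim) embedded patch tokens from the backbone.
        q = self.prompts.unsqueeze(0).expand(tokens.size(0), -1, -1)
        # Prompts attend to the embedded tokens, so each prompt gathers
        # semantics from the current input before being prepended.
        updated, _ = self.cross_attn(query=q, key=tokens, value=tokens)
        return torch.cat([updated, tokens], dim=1)

if __name__ == "__main__":
    x = torch.randn(2, 196, 768)
    print(CrossPrompt()(x).shape)   # torch.Size([2, 206, 768])
```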
arXiv Detail & Related papers (2024-08-27T11:07:19Z) - Approximated Prompt Tuning for Vision-Language Pre-trained Models [54.326232586461614]
In vision-language pre-trained models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks.
We propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning.
arXiv Detail & Related papers (2023-06-27T05:43:47Z) - Unified Vision and Language Prompt Learning [86.1530128487077]
We present a systematic study on two representative prompt tuning methods, namely text prompt tuning and visual prompt tuning.
A major finding is that text prompt tuning fails on data with high intra-class visual variances while visual prompt tuning cannot handle low inter-class variances.
To combine the best from both worlds, we propose a simple approach called Unified Prompt Tuning (UPT), which essentially learns a tiny neural network to jointly optimize prompts across different modalities.
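A minimal sketch of the "tiny neural network" idea, under the assumption that a shared set of learnable tokens is mixed by a small transformer layer and then projected into separate text and visual prompt spaces; the dimensions and two-head design are guesses for illustration.

```python
# Hedged sketch of a unified prompt generator shared across modalities;
# the shared-token-plus-two-heads layout is an assumption, not UPT's exact design.
import torch
import torch.nn as nn

class UnifiedPrompt(nn.Module):
    def __init__(self, num_prompts: int = 4, dim: int = 512,
                 text_dim: int = 512, vis_dim: int = 768):
        super().__init__()
        self.shared = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        # A small transformer layer mixes the shared prompt tokens...
        self.mixer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        # ...and two linear heads project them into each modality's embedding space.
        self.to_text = nn.Linear(dim, text_dim)
        self.to_visual = nn.Linear(dim, vis_dim)

    def forward(self):
        mixed = self.mixer(self.shared.unsqueeze(0)).squeeze(0)
        return self.to_text(mixed), self.to_visual(mixed)

if __name__ == "__main__":
    text_prompts, visual_prompts = UnifiedPrompt()()
    print(text_prompts.shape, visual_prompts.shape)  # (4, 512) and (4, 768)
```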
arXiv Detail & Related papers (2022-10-13T17:50:24Z) - Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language
Models [107.05966685291067]
We propose test-time prompt tuning (TPT) to learn adaptive prompts on the fly with a single test sample.
TPT improves the zero-shot top-1 accuracy of CLIP by 3.6% on average.
In evaluating cross-dataset generalization with unseen categories, TPT performs on par with the state-of-the-art approaches that use additional training data.
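The abstract only states that prompts are adapted on the fly from a single test sample; the sketch below fills in a plausible unsupervised objective, minimizing the entropy of the averaged prediction over random augmentations. Both that objective and the `model(views, prompt)` interface are assumptions, not necessarily TPT's actual procedure.

```python
# Hedged sketch of per-sample prompt adaptation at test time; the entropy
# objective, augmentation scheme, and model interface are assumptions.
import torch
import torch.nn as nn

def test_time_tune(model, image: torch.Tensor, prompt: nn.Parameter,
                   num_augs: int = 8, steps: int = 1, lr: float = 5e-3):
    """Adapt `prompt` using a single unlabeled test image and return it."""
    optimizer = torch.optim.AdamW([prompt], lr=lr)
    for _ in range(steps):
        # Cheap stochastic augmentations of the single test image.
        views = torch.stack([image + 0.01 * torch.randn_like(image)
                             for _ in range(num_augs)])
        logits = model(views, prompt)                 # (num_augs, num_classes)
        probs = logits.softmax(dim=-1).mean(dim=0)    # average over views
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum()
        optimizer.zero_grad()
        entropy.backward()                            # sharpen the averaged prediction
        optimizer.step()
    return prompt

if __name__ == "__main__":
    class Dummy(nn.Module):
        """Toy stand-in for a prompted classifier."""
        def forward(self, x, prompt):
            return x.flatten(1).mean(dim=1, keepdim=True) + prompt  # (B, num_classes)

    p = nn.Parameter(torch.zeros(10))
    test_time_tune(Dummy(), torch.randn(3, 224, 224), p)
```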
arXiv Detail & Related papers (2022-09-15T17:55:11Z) - Pro-tuning: Unified Prompt Tuning for Vision Tasks [133.12978197265596]
Fine-tuning is the de facto approach to leveraging pre-trained vision models for downstream tasks.
In this work, we propose parameter-efficient Prompt tuning (Pro-tuning) to adapt frozen vision models to various downstream vision tasks.
arXiv Detail & Related papers (2022-07-28T21:09:31Z) - Declaration-based Prompt Tuning for Visual Question Answering [16.688288454811016]
We propose an innovative visual-language (VL) fine-tuning paradigm (named Declaration-based Prompt Tuning, abbreviated as DPT).
DPT jointly optimizes the objectives of pre-training and fine-tuning of the VQA model, boosting the effective adaptation of pre-trained VL models to the downstream task.
Experimental results on GQA dataset show that DPT outperforms the fine-tuned counterpart by a large margin regarding accuracy in both fully-supervised (2.68%) and zero-shot/few-shot (over 31%) settings.
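One way to read "declaration-based" is to rewrite the interrogative VQA question as a declarative statement with a masked answer slot, so the downstream task mirrors masked-language-model pre-training. The toy rewrite rules below are purely illustrative and not the paper's conversion module.

```python
# Hedged sketch of turning a VQA question into a declarative, masked statement;
# the rewrite rules are toy illustrations, not DPT's actual method.
def question_to_declaration(question: str, mask_token: str = "[MASK]") -> str:
    q = question.strip().rstrip("?").lower()
    if q.startswith("what color is "):
        thing = q[len("what color is "):]
        return f"{thing} is {mask_token} in color."
    if q.startswith("what is "):
        thing = q[len("what is "):]
        return f"{thing} is {mask_token}."
    # Fallback: append a masked answer slot.
    return f"{q}: the answer is {mask_token}."

if __name__ == "__main__":
    print(question_to_declaration("What color is the umbrella?"))
    # -> "the umbrella is [MASK] in color."
    print(question_to_declaration("How many people are on the bus?"))
    # -> "how many people are on the bus: the answer is [MASK]."
```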
arXiv Detail & Related papers (2022-05-05T05:56:55Z)