Related papers: CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models

CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models

URL: http://arxiv.org/abs/2109.11797v1
Date: Fri, 24 Sep 2021 08:07:29 GMT
Title: CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models
Authors: Yuan Yao, Ao Zhang, Zhengyan Zhang, Zhiyuan Liu, Tat-Seng Chua, Maosong Sun
Abstract summary: We present Cross-modal Prompt Tuning (CPT), a novel paradigm for tuning Vision-Language Models (VL-PTMs) CPT reformulates visual grounding into a fill-in-the-blank problem with color-based co-referential markers in image and text, maximally mitigating the gap. Comprehensive experimental results show that prompt tuned VL-PTMs outperform their fine-tuned counterparts by a large margin.
Score: 101.5066760592534
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Pre-Trained Vision-Language Models (VL-PTMs) have shown promising capabilities in grounding natural language in image data, facilitating a broad variety of cross-modal tasks. However, we note that there exists a significant gap between the objective forms of model pre-training and fine-tuning, resulting in a need for quantities of labeled data to stimulate the visual grounding capability of VL-PTMs for downstream tasks. To address the challenge, we present Cross-modal Prompt Tuning (CPT, alternatively, Colorful Prompt Tuning), a novel paradigm for tuning VL-PTMs, which reformulates visual grounding into a fill-in-the-blank problem with color-based co-referential markers in image and text, maximally mitigating the gap. In this way, our prompt tuning approach enables strong few-shot and even zero-shot visual grounding capabilities of VL-PTMs. Comprehensive experimental results show that prompt tuned VL-PTMs outperform their fine-tuned counterparts by a large margin (e.g., 17.3% absolute accuracy improvement, and 73.8% relative standard deviation reduction on average with one shot in RefCOCO evaluation). All the data and code will be available to facilitate future research.

Related papers

Noise is an Efficient Learner for Zero-Shot Vision-Language Models [9.519280130389935]
Test-Time Noise Tuning is a novel method for handling unpredictable shifts in the visual space. We introduce a novel approach for inter-view representation alignment by explicitly enforcing coherence in embedding. These improvements lay a strong foundation for adaptive out-of-distribution handling.
arXiv Detail & Related papers (2025-02-09T20:25:42Z)
Adaptive Prompt Tuning: Vision Guided Prompt Tuning with Cross-Attention for Fine-Grained Few-Shot Learning [5.242869847419834]
Few-shot, fine-grained classification in computer vision poses significant challenges due to the need to differentiate subtle class distinctions with limited data. This paper presents a novel method that enhances the Contrastive Language-Image Pre-Training model through adaptive prompt tuning.
arXiv Detail & Related papers (2024-12-19T08:51:01Z)
Visual Fourier Prompt Tuning [63.66866445034855]
We propose the Visual Fourier Prompt Tuning (VFPT) method as a general and effective solution for adapting large-scale transformer-based models. Our approach incorporates the Fast Fourier Transform into prompt embeddings and harmoniously considers both spatial and frequency domain information. Our results demonstrate that our approach outperforms current state-of-the-art baselines on two benchmarks.
arXiv Detail & Related papers (2024-11-02T18:18:35Z)
CVPT: Cross-Attention help Visual Prompt Tuning adapt visual task [15.642102189777072]
Cross Visual Prompt Tuning is a new type of visual fine-tuning. CVPT calculates cross-attention between the prompt tokens and the embedded tokens, which allows us to compute the semantic relationship between them. CVPT significantly improves VPT's performance and efficiency in visual tasks.
arXiv Detail & Related papers (2024-08-27T11:07:19Z)
Approximated Prompt Tuning for Vision-Language Pre-trained Models [54.326232586461614]
In vision-language pre-trained models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks. We propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning.
arXiv Detail & Related papers (2023-06-27T05:43:47Z)
Unified Vision and Language Prompt Learning [86.1530128487077]
We present a systematic study on two representative prompt tuning methods, namely text prompt tuning and visual prompt tuning. A major finding is that text prompt tuning fails on data with high intra-class visual variances while visual prompt tuning cannot handle low inter-class variances. To combine the best from both worlds, we propose a simple approach called Unified Prompt Tuning (UPT), which essentially learns a tiny neural network to jointly optimize prompts across different modalities.
arXiv Detail & Related papers (2022-10-13T17:50:24Z)
Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models [107.05966685291067]
We propose test-time prompt tuning (TPT) to learn adaptive prompts on the fly with a single test sample. TPT improves the zero-shot top-1 accuracy of CLIP by 3.6% on average. In evaluating cross-dataset generalization with unseen categories, TPT performs on par with the state-of-the-art approaches that use additional training data.
arXiv Detail & Related papers (2022-09-15T17:55:11Z)
Pro-tuning: Unified Prompt Tuning for Vision Tasks [133.12978197265596]
Fine-tuning is the de-facto approach to leverage pre-trained vision models to perform downstream tasks. In this work, we propose parameter-efficient Prompt tuning (Pro-tuning) to adapt frozen vision models to various downstream vision tasks.
arXiv Detail & Related papers (2022-07-28T21:09:31Z)
Declaration-based Prompt Tuning for Visual Question Answering [16.688288454811016]
We propose an innovative visual-language (VL) fine-tuning paradigm (named Declaration-based Prompt Tuning, abbreviated as DPT) DPT jointly optimize the objectives of pre-training and fine-tuning of VQA model, boosting the effective adaptation of pre-trained VL models to the downstream task. Experimental results on GQA dataset show that DPT outperforms the fine-tuned counterpart by a large margin regarding accuracy in both fully-supervised (2.68%) and zero-shot/few-shot (over 31%) settings.
arXiv Detail & Related papers (2022-05-05T05:56:55Z)

This list is automatically generated from the titles and abstracts of the papers in this site.