Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model
- URL: http://arxiv.org/abs/2208.08340v4
- Date: Fri, 7 Jul 2023 04:14:37 GMT
- Title: Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model
- Authors: Yinghui Xing, Qirui Wu, De Cheng, Shizhou Zhang, Guoqiang Liang, Peng
Wang, Yanning Zhang
- Abstract summary: We propose a novel Dual-modality Prompt Tuning (DPT) paradigm through learning text and visual prompts simultaneously.
To make the final image feature concentrate more on the target visual concept, a Class-Aware Visual Prompt Tuning scheme is proposed.
- Score: 39.722927180264584
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the emergence of large pre-trained vision-language models like CLIP,
transferable representations can be adapted to a wide range of downstream tasks
via prompt tuning. Prompt tuning tries to probe the beneficial information for
downstream tasks from the general knowledge stored in the pre-trained model. A
recently proposed method named Context Optimization (CoOp) introduces a set of
learnable vectors as the text prompt on the language side. However, tuning the
text prompt alone can only adjust the synthesized "classifier", while the
visual features computed by the image encoder remain unaffected, thus
leading to sub-optimal solutions. In this paper, we propose a novel
Dual-modality Prompt Tuning (DPT) paradigm through learning text and visual
prompts simultaneously. To make the final image feature concentrate more on the
target visual concept, a Class-Aware Visual Prompt Tuning (CAVPT) scheme is
further proposed in our DPT, where the class-aware visual prompt is generated
dynamically by performing cross attention between text prompt features and
image patch token embeddings to encode both the downstream task-related
information and visual instance information. Extensive experimental results on
11 datasets demonstrate the effectiveness and generalization ability of the
proposed method. Our code is available at https://github.com/fanrena/DPT.
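The CAVPT step described in the abstract can be illustrated with a short, non-authoritative sketch. The PyTorch snippet below assumes that text prompt features act as the cross-attention queries and image patch token embeddings act as keys and values; the module name, tensor shapes, and layer choices are illustrative assumptions and may differ from the official implementation linked above.
```python
# Hedged sketch of a class-aware visual prompt generator.
# Assumption (not taken from the paper's code): text prompt features are the
# queries and patch token embeddings are the keys/values of one cross-attention layer.
import torch
import torch.nn as nn


class CrossAttentionPromptGenerator(nn.Module):
    """Illustrative module: builds class-aware visual prompts by cross-attending
    from text prompt features to image patch token embeddings."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats: torch.Tensor, patch_tokens: torch.Tensor) -> torch.Tensor:
        # text_feats:   (batch, num_classes, dim)  features of the learnable text prompts,
        #               broadcast to the batch if they are shared across images
        # patch_tokens: (batch, num_patches, dim)  patch token embeddings from the image encoder
        prompts, _ = self.cross_attn(query=text_feats, key=patch_tokens, value=patch_tokens)
        # Each class yields one class-aware visual prompt carrying both
        # task-related (text) and instance-related (patch) information.
        return self.norm(prompts)


if __name__ == "__main__":
    gen = CrossAttentionPromptGenerator(dim=512)
    text_feats = torch.randn(2, 100, 512)    # e.g. 100 class prompts
    patch_tokens = torch.randn(2, 196, 512)  # e.g. 14x14 ViT patch grid
    print(gen(text_feats, patch_tokens).shape)  # torch.Size([2, 100, 512])
```
Under this reading, the generated prompts would be appended to the visual token sequence before the remaining transformer layers, steering the final image feature toward the target visual concept.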
Related papers
- TAI++: Text as Image for Multi-Label Image Classification by Co-Learning Transferable Prompt [15.259819430801402]
We propose a pseudo-visual prompt (PVP) module for implicit visual prompt tuning to address this problem.
Specifically, we first learn the pseudo-visual prompt for each category, mining diverse visual knowledge by the well-aligned space of pre-trained vision-language models.
Experimental results on the VOC2007, MS-COCO, and NUS-WIDE datasets demonstrate that our method can surpass state-of-the-art (SOTA) methods.
arXiv Detail & Related papers (2024-05-11T06:11:42Z)
- LoGoPrompt: Synthetic Text Images Can Be Good Visual Prompts for Vision-Language Models [28.983503845298824]
We show that synthetic text images are good visual prompts for vision-language models!
We propose our LoGoPrompt, which reformulates the classification objective to the visual prompt selection.
Our method consistently outperforms state-of-the-art methods in few-shot learning, base-to-new generalization, and domain generalization.
arXiv Detail & Related papers (2023-09-03T12:23:33Z)
- MuDPT: Multi-modal Deep-symphysis Prompt Tuning for Large Pre-trained Vision-Language Models [12.397136690734865]
We propose a novel approach called Multi-modal Deep-symphysis Prompt Tuning, dubbed as MuDPT.
MuDPT extends independent multi-modal prompt tuning by learning a model-agnostic transformative network to allow deep hierarchical bi-directional prompt fusion.
Compared with state-of-the-art methods, MuDPT achieves better recognition and generalization ability by a clear margin.
arXiv Detail & Related papers (2023-06-20T09:15:52Z)
- Unleashing Text-to-Image Diffusion Models for Visual Perception [84.41514649568094]
VPD (Visual Perception with a pre-trained diffusion model) is a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks.
We show that VPD can be adapted more quickly to downstream visual perception tasks.
arXiv Detail & Related papers (2023-03-03T18:59:47Z)
- Texts as Images in Prompt Tuning for Multi-Label Image Recognition [70.9310322461598]
We advocate that image-text contrastive learning makes it feasible to treat texts as images for prompt tuning and introduce TaI prompting.
Particularly, we apply TaI prompting to multi-label image recognition, where sentences in the wild serve as alternatives to images for prompt tuning.
Our proposed TaI-DPT outperforms zero-shot CLIP by a large margin on multiple benchmarks.
arXiv Detail & Related papers (2022-11-23T07:00:11Z)
- CPL: Counterfactual Prompt Learning for Vision and Language Models [76.18024920393245]
This paper presents a novel Counterfactual Prompt Learning (CPL) method for vision and language models.
CPL simultaneously employs counterfactual generation and contrastive learning in a joint optimization framework.
Experiments demonstrate that CPL can obtain superior few-shot performance on different vision and language tasks.
arXiv Detail & Related papers (2022-10-19T08:06:39Z)
- Unified Vision and Language Prompt Learning [86.1530128487077]
We present a systematic study on two representative prompt tuning methods, namely text prompt tuning and visual prompt tuning.
A major finding is that text prompt tuning fails on data with high intra-class visual variances while visual prompt tuning cannot handle low inter-class variances.
To combine the best from both worlds, we propose a simple approach called Unified Prompt Tuning (UPT), which essentially learns a tiny neural network to jointly optimize prompts across different modalities.
arXiv Detail & Related papers (2022-10-13T17:50:24Z)
- CLIP-Adapter: Better Vision-Language Models with Feature Adapters [79.52844563138493]
We show that there is an alternative path to achieve better vision-language models other than prompt tuning.
In this paper, we propose CLIP-Adapter to conduct fine-tuning with feature adapters on either the visual or the language branch (a minimal adapter sketch follows this list).
Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2021-10-09T11:39:30Z)
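As referenced in the CLIP-Adapter entry above, the adapter idea itself is compact enough to sketch. The snippet below is a minimal, hedged illustration of a bottleneck feature adapter with a residual blending ratio applied to frozen CLIP features; the layer sizes and the `reduction` and `ratio` values are illustrative assumptions rather than the authors' exact settings.
```python
# Hedged sketch of a CLIP-Adapter-style feature adapter: a small bottleneck MLP
# whose output is blended with the frozen CLIP feature through a residual ratio.
import torch
import torch.nn as nn


class FeatureAdapter(nn.Module):
    def __init__(self, dim: int, reduction: int = 4, ratio: float = 0.2):
        super().__init__()
        self.ratio = ratio
        self.fc = nn.Sequential(
            nn.Linear(dim, dim // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim, bias=False),
            nn.ReLU(inplace=True),
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # Blend the adapted feature with the original frozen CLIP feature.
        return self.ratio * self.fc(feat) + (1.0 - self.ratio) * feat


if __name__ == "__main__":
    adapter = FeatureAdapter(dim=512)
    image_feat = torch.randn(8, 512)  # frozen CLIP image features
    print(adapter(image_feat).shape)  # torch.Size([8, 512])
```
The same adapter could just as well be attached to the language branch by feeding it the text features instead of the image features.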