DPL: Decoupled Prompt Learning for Vision-Language Models
- URL: http://arxiv.org/abs/2308.10061v1
- Date: Sat, 19 Aug 2023 15:48:38 GMT
- Title: DPL: Decoupled Prompt Learning for Vision-Language Models
- Authors: Chen Xu, Yuhan Zhu, Guozhen Zhang, Haocheng Shen, Yixuan Liao, Xiaoxin
Chen, Gangshan Wu, Limin Wang
- Abstract summary: We propose a new method, Decoupled Prompt Learning, which reformulates the attention in prompt learning to alleviate this problem.
Our approach is flexible for both visual and textual modalities, making it easily extendable to multi-modal prompt learning.
- Score: 41.90997623029582
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Prompt learning has emerged as an efficient and effective approach for
transferring foundational Vision-Language Models (e.g., CLIP) to downstream
tasks. However, current methods tend to overfit to seen categories, thereby
limiting their generalization ability for unseen classes. In this paper, we
propose a new method, Decoupled Prompt Learning (DPL), which reformulates the
attention in prompt learning to alleviate this problem. Specifically, we
theoretically investigate the collaborative process between prompts and
instances (i.e., image patches/text tokens) by reformulating the original
self-attention into four separate sub-processes. Through detailed analysis, we
observe that certain sub-processes can be strengthened to bolster robustness
and generalizability by some approximation techniques. Furthermore, we
introduce language-conditioned textual prompting based on decoupled attention
to naturally preserve the generalization of text input. Our approach is
flexible for both visual and textual modalities, making it easily extendable to
multi-modal prompt learning. By combining the proposed techniques, our approach
achieves state-of-the-art performance on three representative benchmarks
encompassing 15 image recognition datasets, while remaining
parameter-efficient. Moreover, our DPL does not rely on any auxiliary
regularization task or extra training data, further demonstrating its
remarkable generalization ability.
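The abstract's central idea, reformulating self-attention over the concatenated prompt-plus-instance sequence into four separate sub-processes, can be illustrated with a minimal sketch. This is a hypothetical reconstruction from the abstract alone, not the authors' implementation: the function names, shapes, and the way the attention matrix is partitioned are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def decoupled_attention(P, X, d):
    """Split self-attention over the concatenated [prompts; instances]
    sequence into its four sub-processes (illustrative sketch only):
      prompt->prompt, prompt->instance, instance->prompt, instance->instance.
    P: (n_p, d) learnable prompt tokens.
    X: (n_x, d) instance tokens (image patches or text tokens).
    """
    S = np.concatenate([P, X], axis=0)           # full sequence, (n_p + n_x, d)
    A = softmax(S @ S.T / np.sqrt(d), axis=-1)   # standard attention weights
    n_p = P.shape[0]
    # The four sub-blocks of the attention matrix that the paper analyzes
    # separately (each could then be strengthened or approximated on its own):
    pp = A[:n_p, :n_p]   # prompts attending to prompts
    px = A[:n_p, n_p:]   # prompts attending to instances
    xp = A[n_p:, :n_p]   # instances attending to prompts
    xx = A[n_p:, n_p:]   # instances attending to instances
    return pp, px, xp, xx

rng = np.random.default_rng(0)
d = 8
P = rng.normal(size=(2, d))   # 2 prompt tokens
X = rng.normal(size=(5, d))   # 5 instance tokens
pp, px, xp, xx = decoupled_attention(P, X, d)
print(pp.shape, px.shape, xp.shape, xx.shape)
```

The sketch only shows the decoupling itself; which sub-processes DPL strengthens, and which approximation techniques it applies, are detailed in the paper.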
Related papers
- Instructing Prompt-to-Prompt Generation for Zero-Shot Learning [116.33775552866476]
We propose a Prompt-to-Prompt generation methodology (P2P) to distill instructive visual prompts for transferable knowledge discovery.
The core of P2P is to mine semantic-related instruction from prompt-conditioned visual features and text instruction on modal-sharing semantic concepts.
arXiv Detail & Related papers (2024-06-05T07:59:48Z)
- Concept-Guided Prompt Learning for Generalization in Vision-Language Models [33.361744437967126]
We propose Concept-Guided Prompt Learning for vision-language models.
We leverage the well-learned knowledge of Contrastive Language-Image Pretraining to create a visual concept cache.
In order to refine the text features, we develop a projector that transforms multi-level visual features into text features.
arXiv Detail & Related papers (2024-01-15T04:04:47Z)
- COMMA: Co-Articulated Multi-Modal Learning [39.778958624066185]
We propose Co-Articulated Multi-Modal Learning (COMMA) to handle the limitations of previous methods.
Our method considers the prompts of both branches when generating new prompts, enhancing the representation alignment between the two branches.
We evaluate our method across three representative tasks of generalization to novel classes, new target datasets and unseen domain shifts.
arXiv Detail & Related papers (2023-12-30T15:47:36Z)
- APoLLo: Unified Adapter and Prompt Learning for Vision Language Models [58.9772868980283]
We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models.
APoLLo achieves a relative gain up to 6.03% over MaPLe (SOTA) on novel classes for 10 diverse image recognition datasets.
arXiv Detail & Related papers (2023-12-04T01:42:09Z)
- Text-driven Prompt Generation for Vision-Language Models in Federated Learning [24.005620820818756]
Our work proposes Federated Text-driven Prompt Generation (FedTPG).
FedTPG learns a unified prompt generation network across multiple remote clients in a scalable manner.
Our comprehensive empirical evaluations on nine diverse image classification datasets show that our method is superior to existing federated prompt learning methods.
arXiv Detail & Related papers (2023-10-09T19:57:24Z)
- CPL: Counterfactual Prompt Learning for Vision and Language Models [76.18024920393245]
This paper presents a novel Counterfactual Prompt Learning (CPL) method for vision and language models.
CPL simultaneously employs counterfactual generation and contrastive learning in a joint optimization framework.
Experiments demonstrate that CPL can obtain superior few-shot performance on different vision and language tasks.
arXiv Detail & Related papers (2022-10-19T08:06:39Z)
- MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z)
- LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of Vision & Language Models [67.19124099815645]
We propose a novel Language-Aware Soft Prompting (LASP) learning method to alleviate base class overfitting.
LASP is inherently amenable to including, during training, virtual classes, i.e. class names for which no visual samples are available.
LASP matches and surpasses, for the first time, the accuracy on novel classes obtained by hand-crafted prompts and CLIP for 8 out of 11 test datasets.
arXiv Detail & Related papers (2022-10-03T17:56:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.