Conditional Prompt Learning for Vision-Language Models
- URL: http://arxiv.org/abs/2203.05557v1
- Date: Thu, 10 Mar 2022 18:59:41 GMT
- Title: Conditional Prompt Learning for Vision-Language Models
- Authors: Kaiyang Zhou, Jingkang Yang, Chen Change Loy, Ziwei Liu
- Abstract summary: A recently proposed method named Context Optimization (CoOp) turns context words in a prompt into a set of learnable vectors.
We propose Conditional Context Optimization (CoCoOp), which extends CoOp by learning a lightweight neural network to generate an input-conditional token (vector) for each image.
Our experiments show that CoCoOp generalizes much better than CoOp to unseen classes, even showing promising transferability beyond a single dataset.
- Score: 107.06776396086471
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rise of powerful pre-trained vision-language models like CLIP, it
becomes essential to investigate ways to adapt these models to downstream
datasets. A recently proposed method named Context Optimization (CoOp)
introduces the concept of prompt learning -- a recent trend in NLP -- to the
vision domain for adapting pre-trained vision-language models. Specifically,
CoOp turns context words in a prompt into a set of learnable vectors and, with
only a few labeled images for learning, can achieve huge improvements over
intensively-tuned manual prompts. In our study we identify a critical problem
of CoOp: the learned context is not generalizable to wider unseen classes
within the same dataset, suggesting that CoOp overfits base classes observed
during training. To address the problem, we propose Conditional Context
Optimization (CoCoOp), which extends CoOp by further learning a lightweight
neural network to generate for each image an input-conditional token (vector).
Compared to CoOp's static prompts, our dynamic prompts adapt to each instance
and are thus less sensitive to class shift. Extensive experiments show that
CoCoOp generalizes much better than CoOp to unseen classes, even showing
promising transferability beyond a single dataset; and yields stronger domain
generalization performance as well. Code is available at
https://github.com/KaiyangZhou/CoOp.
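To make the mechanism in the abstract concrete, below is a minimal sketch of CoCoOp-style conditional prompt learning. It is a sketch under stated assumptions, not the released implementation: the frozen text/image encoders are stand-ins, and the dimensions, class-token buffer, and names such as `CoCoOpPromptLearner` and `meta_net` are illustrative.

```python
# Minimal sketch of CoCoOp-style conditional prompt learning (assumptions:
# stand-in frozen encoders; illustrative dimensions; the Meta-Net follows the
# paper's description of a lightweight bottleneck network).
import torch
import torch.nn as nn

class CoCoOpPromptLearner(nn.Module):
    def __init__(self, n_ctx=4, ctx_dim=512, vis_dim=512, n_classes=10):
        super().__init__()
        # Learnable context vectors shared across classes (as in CoOp).
        self.ctx = nn.Parameter(torch.randn(n_ctx, ctx_dim) * 0.02)
        # Lightweight Meta-Net: maps an image feature to a conditional token
        # that is added to every context vector (the CoCoOp extension).
        self.meta_net = nn.Sequential(
            nn.Linear(vis_dim, vis_dim // 16),
            nn.ReLU(inplace=True),
            nn.Linear(vis_dim // 16, ctx_dim),
        )
        # Stand-in for frozen class-name token embeddings (one per class).
        self.register_buffer("cls_tokens", torch.randn(n_classes, 1, ctx_dim))

    def forward(self, image_features):
        # image_features: (batch, vis_dim) from a frozen image encoder.
        pi = self.meta_net(image_features)             # (batch, ctx_dim)
        ctx = self.ctx.unsqueeze(0) + pi.unsqueeze(1)  # (batch, n_ctx, ctx_dim)
        # Build one prompt per (image, class) pair: [context | class token].
        b, c = ctx.size(0), self.cls_tokens.size(0)
        ctx = ctx.unsqueeze(1).expand(b, c, -1, -1)
        cls = self.cls_tokens.unsqueeze(0).expand(b, -1, -1, -1)
        return torch.cat([ctx, cls], dim=2)            # (batch, n_classes, n_ctx+1, ctx_dim)

# Usage: prompts = CoCoOpPromptLearner()(torch.randn(8, 512))
```

Because the conditional token depends on each image, the resulting prompts are instance-specific rather than static, which is the property the abstract credits for the better generalization to unseen classes.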
Related papers
- Adapting Vision-Language Models to Open Classes via Test-Time Prompt Tuning [50.26965628047682]
Adapting pre-trained models to open classes is a challenging problem in machine learning.
In this paper, we consider combining the advantages of both and come up with a test-time prompt tuning approach.
Our proposed method outperforms all comparison methods on average considering both base and new classes.
arXiv Detail & Related papers (2024-08-29T12:34:01Z) - IntCoOp: Interpretability-Aware Vision-Language Prompt Tuning [94.52149969720712]
IntCoOp learns to jointly align attribute-level inductive biases and class embeddings during prompt-tuning.
IntCoOp improves CoOp by 7.35% in average performance across 10 diverse datasets.
arXiv Detail & Related papers (2024-06-19T16:37:31Z) - AAPL: Adding Attributes to Prompt Learning for Vision-Language Models [6.32186874112557]
We propose adversarial token embedding to disentangle low-level visual augmentation features from high-level class information when inducing bias in learnable prompts.
We have conducted experiments across 11 datasets, and overall, AAPL shows favorable performances compared to the existing methods in few-shot learning, zero-shot learning, cross-dataset, and domain generalization tasks.
arXiv Detail & Related papers (2024-04-25T17:51:10Z) - Vocabulary-Defined Semantics: Latent Space Clustering for Improving In-Context Learning [32.178931149612644]
In-context learning enables language models to adapt to downstream data or new tasks using only a few samples provided as demonstrations within the prompt.
However, the performance of in-context learning can be unstable depending on the quality, format, or order of demonstrations.
We propose a novel approach termed "vocabulary-defined semantics".
arXiv Detail & Related papers (2024-01-29T14:29:48Z) - PRE: Vision-Language Prompt Learning with Reparameterization Encoder [24.855142164168605]
Large pre-trained vision-language models such as CLIP have demonstrated great potential in zero-shot transferability to downstream tasks.
Attaining optimal performance, however, requires manually selecting prompts that align the downstream image distribution with the textual class descriptions.
To avoid non-trivial prompt engineering, recent work Context Optimization (CoOp) introduced the concept of prompt learning to the vision domain using learnable textual tokens.
arXiv Detail & Related papers (2023-09-14T14:48:01Z) - Understanding and Mitigating Overfitting in Prompt Tuning for Vision-Language Models [108.13378788663196]
We propose Subspace Prompt Tuning (SubPT), which projects the gradients in back-propagation onto the low-rank subspace spanned by the early-stage gradient-flow eigenvectors throughout training; a rough sketch of this projection step appears after this list.
We equip CoOp with Novel Learner Feature (NFL) to enhance the generalization ability of the learned prompts onto novel categories beyond the training set.
arXiv Detail & Related papers (2022-11-04T02:06:22Z) - MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z) - Learning to Prompt for Vision-Language Models [82.25005817904027]
Vision-language pre-training has emerged as a promising alternative for representation learning.
It shifts from the tradition of using images and discrete labels for learning a fixed set of weights, seen as visual concepts, to aligning images and raw text for two separate encoders.
Such a paradigm benefits from a broader source of supervision and allows zero-shot transfer to downstream tasks.
arXiv Detail & Related papers (2021-09-02T17:57:31Z)
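As referenced in the SubPT entry above, here is a rough sketch of projecting gradients onto a low-rank subspace estimated from early-stage gradients. The SVD-based estimation, the rank, and names like `estimate_subspace` and `buffered_grads` are assumptions made for illustration, not the paper's exact procedure.

```python
# Sketch of gradient projection onto a low-rank subspace (assumption: the
# subspace is estimated once from a buffer of early-stage gradients via SVD).
import torch

def estimate_subspace(early_grads, rank=4):
    """Top-`rank` right singular vectors of stacked early-stage gradients."""
    g = torch.stack(early_grads)                        # (n_steps, dim)
    _, _, vh = torch.linalg.svd(g, full_matrices=False)
    return vh[:rank]                                    # (rank, dim), orthonormal rows

def project_gradient(grad, basis):
    """Project a flattened gradient onto span(basis) before the optimizer step."""
    coeffs = basis @ grad                               # (rank,)
    return basis.t() @ coeffs                           # (dim,)

# Usage with a learnable context tensor `ctx` (hypothetical training loop):
# basis = estimate_subspace(buffered_grads)
# ctx.grad = project_gradient(ctx.grad.flatten(), basis).view_as(ctx.grad)
```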
This list is automatically generated from the titles and abstracts of the papers on this site.