Learning to Prompt for Vision-Language Models
- URL: http://arxiv.org/abs/2109.01134v1
- Date: Thu, 2 Sep 2021 17:57:31 GMT
- Title: Learning to Prompt for Vision-Language Models
- Authors: Kaiyang Zhou, Jingkang Yang, Chen Change Loy, Ziwei Liu
- Abstract summary: Vision-language pre-training has emerged as a promising alternative for representation learning.
It shifts from the tradition of using images and discrete labels to learn a fixed set of weights, seen as visual concepts, to aligning images with raw text through two separate encoders.
Such a paradigm benefits from a broader source of supervision and allows zero-shot transfer to downstream tasks.
- Score: 82.25005817904027
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language pre-training has recently emerged as a promising alternative for representation learning. It shifts from the tradition of using images and discrete labels to learn a fixed set of weights, seen as visual concepts, to aligning images with raw text through two separate encoders. Such a paradigm benefits from a broader source of supervision and allows zero-shot transfer to downstream tasks, since visual concepts can be generated directly from natural language, known as prompts. In this paper, we identify prompt engineering as a major challenge of deploying such models in practice: designing a proper prompt, especially the context words surrounding a class name, requires domain expertise and typically takes a significant amount of time for word tuning, since a slight change in wording can have a huge impact on performance. Moreover, different downstream tasks require specific designs, further hampering deployment efficiency. To overcome this challenge, we propose a novel approach named Context Optimization (CoOp). The main idea is to model the context in prompts with continuous representations and to learn it end-to-end from data while keeping the pre-trained parameters fixed. In this way, the design of task-relevant prompts is fully automated. Experiments on 11 datasets show that CoOp effectively turns pre-trained vision-language models into data-efficient visual learners: as few as one or two shots are enough to beat hand-crafted prompts by a decent margin, and the gains grow with more shots (e.g., at 16 shots the average gain is around 17%, with the highest reaching over 50%). CoOp also exhibits strong robustness to distribution shift.
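The abstract's core mechanism can be made concrete with a short sketch. The following is a minimal, hypothetical PyTorch-style illustration of context optimization as described above, not the authors' implementation: a small set of learnable context vectors is prepended to the frozen token embeddings of each class name, the resulting continuous prompts are fed through a frozen text encoder, and classification follows CLIP-style cosine similarity with image features. The `PromptLearner` module, the `text_encoder` callable, and all tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PromptLearner(nn.Module):
    """Learnable context vectors shared across all classes (hypothetical sketch)."""

    def __init__(self, class_token_embeds: torch.Tensor, n_ctx: int = 16):
        super().__init__()
        ctx_dim = class_token_embeds.size(-1)
        # The only trainable parameters: n_ctx continuous "context" tokens.
        self.ctx = nn.Parameter(0.02 * torch.randn(n_ctx, ctx_dim))
        # Pre-computed token embeddings of each class name, kept fixed.
        # Shape: [n_classes, n_name_tokens, ctx_dim].
        self.register_buffer("class_token_embeds", class_token_embeds)

    def forward(self) -> torch.Tensor:
        n_cls = self.class_token_embeds.size(0)
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        # Continuous prompt = [context tokens][class-name tokens] for each class.
        return torch.cat([ctx, self.class_token_embeds], dim=1)


def classify(image_features, prompts, text_encoder, logit_scale=100.0):
    """Cosine-similarity classification, as in CLIP-style zero-shot inference."""
    text_features = text_encoder(prompts)                  # [n_classes, d]; encoder stays frozen
    image_features = F.normalize(image_features, dim=-1)   # [batch, d]
    text_features = F.normalize(text_features, dim=-1)
    return logit_scale * image_features @ text_features.t()
```

In this sketch, only `PromptLearner.ctx` would receive gradients during few-shot training (e.g., via cross-entropy on the logits), while both encoders stay frozen; the same learned context is then reused for every test image.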
Related papers
- Revisiting Prompt Pretraining of Vision-Language Models [13.888505919946578]
We propose a general framework termed Revisiting Prompt Pretraining (RPP).
RPP targets improving fitting and generalization ability from two aspects: prompt structure and prompt supervision.
We additionally utilize soft labels derived from zero-shot probability predictions provided by a pretrained Contrastive Language Image Pretraining (CLIP) teacher model.
arXiv Detail & Related papers (2024-09-10T02:36:13Z)
- IntCoOp: Interpretability-Aware Vision-Language Prompt Tuning [94.52149969720712]
IntCoOp learns to jointly align attribute-level inductive biases and class embeddings during prompt-tuning.
IntCoOp improves CoOp by 7.35% in average performance across 10 diverse datasets.
arXiv Detail & Related papers (2024-06-19T16:37:31Z)
- Text as Image: Learning Transferable Adapter for Multi-Label Classification [13.11583340598517]
We introduce an effective approach to employ large language models for multi-label instruction-following text generation.
In this way, a fully automated pipeline for visual label recognition is developed without relying on any manual data.
arXiv Detail & Related papers (2023-12-07T09:22:20Z)
- PRE: Vision-Language Prompt Learning with Reparameterization Encoder [24.855142164168605]
Large pre-trained vision-language models such as CLIP have demonstrated great potential in zero-shot transferability to downstream tasks.
To attain optimal performance, the manual selection of prompts is necessary to improve alignment between the downstream image distribution and the textual class descriptions.
To avoid non-trivial prompt engineering, the recent work Context Optimization (CoOp) introduced the concept of prompt learning to the vision domain using learnable textual tokens.
arXiv Detail & Related papers (2023-09-14T14:48:01Z)
- POUF: Prompt-oriented unsupervised fine-tuning for large pre-trained models [62.23255433487586]
We propose an unsupervised fine-tuning framework to fine-tune the model or prompt on the unlabeled target data.
We demonstrate how to apply our method to both language-augmented vision and masked-language models by aligning the discrete distributions extracted from the prompts and target data.
arXiv Detail & Related papers (2023-04-29T22:05:22Z)
- Task-Oriented Multi-Modal Mutual Learning for Vision-Language Models [52.3032592038514]
We propose a class-aware text prompt to enrich generated prompts with label-related image information.
We achieve an average improvement of 4.03% on new classes and 3.19% on harmonic-mean over eleven classification benchmarks.
arXiv Detail & Related papers (2023-03-30T06:02:40Z)
- Patch-Prompt Aligned Bayesian Prompt Tuning for Vision-Language Models [48.77653835765705]
We introduce a probabilistic resolution to prompt tuning, where the label-specific prompts are generated hierarchically by first sampling a latent vector from an underlying distribution and then employing a lightweight generative model.
We evaluate the effectiveness of our approach on four tasks: few-shot image recognition, base-to-new generalization, dataset transfer learning, and domain shifts.
arXiv Detail & Related papers (2023-03-16T06:09:15Z)
- MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z)