Generalizable Prompt Tuning for Vision-Language Models
- URL: http://arxiv.org/abs/2410.03189v3
- Date: Wed, 22 Jan 2025 07:58:59 GMT
- Title: Generalizable Prompt Tuning for Vision-Language Models
- Authors: Qian Zhang
- Abstract summary: Learnable soft prompts often perform well in downstream tasks but lack generalizability.
The study shows that by treating soft and hand-crafted prompts as dual views of the textual modality, we can better ensemble task-specific and general semantic information.
To generate more expressive prompts, the study introduces a class-wise augmentation from the visual modality, resulting in significant robustness to a wider range of unseen classes.
- Score: 3.1008306011364644
- Abstract: Prompt tuning for vision-language models such as CLIP involves optimizing the text prompts used to generate image-text pairs for specific downstream tasks. While hand-crafted or template-based prompts are generally applicable to a wider range of unseen classes, they tend to perform poorly in downstream tasks (i.e., seen classes). Learnable soft prompts, on the other hand, often perform well in downstream tasks but lack generalizability. Additionally, prior research has predominantly concentrated on the textual modality, with very few studies attempting to explore the prompt's generalization potential from the visual modality. Keeping these limitations in mind, we investigate how to conduct prompt tuning so as to obtain both competitive downstream performance and generalization. The study shows that by treating soft and hand-crafted prompts as dual views of the textual modality, and maximizing their mutual information, we can better ensemble task-specific and general semantic information. Moreover, to generate more expressive prompts, the study introduces a class-wise augmentation from the visual modality, resulting in significant robustness to a wider range of unseen classes. Extensive evaluations on several benchmarks show that the proposed approach achieves competitive results in terms of both task-specific performance and general abilities.
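The abstract does not spell out the training objective, so the following is a minimal PyTorch-style sketch of one plausible reading: class-wise text features produced by the soft prompt and by a hand-crafted template are treated as dual views and tied together with a symmetric InfoNCE surrogate for mutual information, while the class-wise visual augmentation is approximated by mixing image features that share a label. The function names, the InfoNCE surrogate, and the feature-mixing scheme are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch only: one plausible reading of the abstract, not the released code.
import torch
import torch.nn.functional as F


def info_nce(view_a: torch.Tensor, view_b: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two (C, D) views of the same C classes,
    used as a tractable surrogate for maximizing their mutual information."""
    a = F.normalize(view_a, dim=-1)
    b = F.normalize(view_b, dim=-1)
    logits = a @ b.t() / tau                                  # (C, C) similarities
    targets = torch.arange(a.size(0), device=a.device)        # matching classes on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def classwise_mix(feats: torch.Tensor, labels: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Toy class-wise visual augmentation: convexly mix image features within each class."""
    mixed = feats.clone()
    for c in labels.unique():
        idx = (labels == c).nonzero(as_tuple=True)[0]
        perm = idx[torch.randperm(idx.numel(), device=feats.device)]
        mixed[idx] = alpha * feats[idx] + (1.0 - alpha) * feats[perm]
    return mixed


def training_step(images, labels, soft_prompt_feats, handcrafted_feats, image_encoder, lam=1.0):
    """One step combining the downstream task loss with the dual-view MI surrogate."""
    img_feats = classwise_mix(image_encoder(images), labels)  # augmented visual view (B, D)
    img_feats = F.normalize(img_feats, dim=-1)
    txt_feats = F.normalize(soft_prompt_feats, dim=-1)        # soft-prompt class features (C, D)
    logits = 100.0 * img_feats @ txt_feats.t()                # CLIP-style scaled cosine logits
    task_loss = F.cross_entropy(logits, labels)               # seen-class (downstream) objective
    mi_loss = info_nce(soft_prompt_feats, handcrafted_feats)  # tie soft and hand-crafted views
    return task_loss + lam * mi_loss
```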
Related papers
- A Similarity Paradigm Through Textual Regularization Without Forgetting [17.251684463032433]
We propose a novel method called Similarity Paradigm with Textual Regularization (SPTR) for prompt learning without forgetting.
SPTR is a two-pronged design built on hand-crafted prompts, with the two components forming an inseparable framework.
Four representative tasks across 11 datasets demonstrate that SPTR outperforms existing prompt learning methods.
arXiv Detail & Related papers (2025-02-20T09:06:44Z) - DesCLIP: Robust Continual Adaptation via General Attribute Descriptions for Pretrained Vision-Language Models [13.917530818500481]
Continual adaptation of vision-language models (VLMs) focuses on leveraging cross-modal pretrained knowledge to incrementally adapt for expanding downstream tasks and datasets.
Existing research often focuses on connecting visual features with specific class text in downstream tasks, overlooking the latent relationships between general and specialized knowledge.
We propose DesCLIP, which leverages general attribute (GA) descriptions to guide the understanding of specific class objects.
arXiv Detail & Related papers (2025-02-02T01:06:02Z) - Revisiting Prompt Pretraining of Vision-Language Models [13.888505919946578]
We propose a general framework termed Revisiting Prompt Pretraining (RPP).
RPP aims to improve fitting and generalization ability from two aspects: prompt structure and prompt supervision.
We additionally utilize soft labels derived from zero-shot probability predictions provided by a pretrained Contrastive Language Image Pretraining (CLIP) teacher model.
arXiv Detail & Related papers (2024-09-10T02:36:13Z) - Tuning Multi-mode Token-level Prompt Alignment across Modalities [48.39511580746271]
We propose a multi-mode token-level tuning framework to learn and align a set of prompt tokens across modalities.
Specifically, we rely on two essential factors: 1) multi-mode prompts discovery, which guarantees diverse semantic representations, and 2) token-level alignment, which helps explore fine-grained similarity.
Experiments on popular image recognition benchmarks show the superior generalization and few-shot abilities of our approach.
arXiv Detail & Related papers (2023-09-25T03:20:09Z) - Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models [64.24227572048075]
We propose a Knowledge-Aware Prompt Tuning (KAPT) framework for vision-language models.
Our approach takes inspiration from human intelligence in which external knowledge is usually incorporated into recognizing novel categories of objects.
arXiv Detail & Related papers (2023-08-22T04:24:45Z) - CoPL: Contextual Prompt Learning for Vision-Language Understanding [21.709017504227823]
We propose a Contextual Prompt Learning (CoPL) framework, capable of aligning the prompts to the localized features of the image.
Our key innovations over earlier works include using local image features as part of the prompt learning process, and more crucially, learning to weight these prompts based on local features that are appropriate for the task at hand.
Our method produces substantially improved performance when compared to the current state of the art methods.
arXiv Detail & Related papers (2023-07-03T10:14:33Z) - On the Role of Attention in Prompt-tuning [90.97555030446563]
We study prompt-tuning for one-layer attention architectures in the setting of contextual mixture models.
We show that softmax-prompt-attention is provably more expressive than softmax-self-attention and linear-prompt-attention.
We also provide experiments that verify our theoretical insights on real datasets and demonstrate how prompt-tuning enables the model to attend to context-relevant information.
arXiv Detail & Related papers (2023-06-06T06:23:38Z) - Visual-Language Prompt Tuning with Knowledge-guided Context Optimization [96.27531485377871]
Representative CoOp-based work combines the learnable textual tokens with the class tokens to obtain specific textual knowledge.
We introduce a novel Knowledge-guided Context Optimization (KgCoOp) to enhance the generalization ability of the learnable prompt for unseen classes.
arXiv Detail & Related papers (2023-03-23T14:04:23Z) - Global Knowledge Calibration for Fast Open-Vocabulary Segmentation [124.74256749281625]
We introduce a text diversification strategy that generates a set of synonyms for each training category.
We also employ a text-guided knowledge distillation method to preserve the generalizable knowledge of CLIP.
Our proposed model achieves robust generalization performance across various datasets.
arXiv Detail & Related papers (2023-03-16T09:51:41Z) - MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z) - Prompt Learning with Optimal Transport for Vision-Language Models [25.928455328563402]
We learn multiple comprehensive prompts to describe diverse characteristics of categories, such as intrinsic attributes or extrinsic contexts.
To align these multiple prompts with the visual features, we propose to apply optimal transport to match the vision and text modalities.
In the inner loop, we optimize the optimal transport distance to align visual features and prompts via the Sinkhorn algorithm, while in the outer loop, we learn the prompts from the supervised data using this distance (a minimal Sinkhorn sketch follows at the end of this list).
arXiv Detail & Related papers (2022-10-03T22:21:07Z)
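As a companion to the "Prompt Learning with Optimal Transport for Vision-Language Models" entry above, here is a minimal sketch of an entropic-regularized Sinkhorn inner loop that aligns local visual features with prompt token features; the uniform marginals, iteration count, and cosine cost are assumptions for illustration, not that paper's exact configuration. Freezing the transport plan while backpropagating through the cost mirrors the described two-loop design: the plan is solved in the inner loop, and the prompts are learned from the resulting distance in the outer loop.

```python
# Hedged sketch of a Sinkhorn-based OT alignment between visual and prompt features.
import torch
import torch.nn.functional as F


def sinkhorn_plan(cost: torch.Tensor, eps: float = 0.1, n_iters: int = 50) -> torch.Tensor:
    """Entropic-regularized transport plan for an (M, N) cost matrix with uniform marginals."""
    M, N = cost.shape
    mu = torch.full((M,), 1.0 / M, device=cost.device)   # marginal over visual features
    nu = torch.full((N,), 1.0 / N, device=cost.device)   # marginal over prompt tokens
    K = torch.exp(-cost / eps)                           # Gibbs kernel
    u = torch.ones_like(mu)
    v = torch.ones_like(nu)
    for _ in range(n_iters):                             # alternating scaling updates
        v = nu / (K.t() @ u)
        u = mu / (K @ v)
    return torch.diag(u) @ K @ torch.diag(v)             # transport plan T of shape (M, N)


def ot_distance(visual_feats: torch.Tensor, prompt_feats: torch.Tensor) -> torch.Tensor:
    """OT distance between local visual features (M, D) and prompt token features (N, D)."""
    visual_feats = F.normalize(visual_feats, dim=-1)
    prompt_feats = F.normalize(prompt_feats, dim=-1)
    cost = 1.0 - visual_feats @ prompt_feats.t()         # (M, N) cosine cost
    with torch.no_grad():                                # inner loop: solve for the plan
        plan = sinkhorn_plan(cost)
    return (plan * cost).sum()                           # outer loop: gradients flow via the cost
```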