RESTORE: Towards Feature Shift for Vision-Language Prompt Learning
- URL: http://arxiv.org/abs/2403.06136v1
- Date: Sun, 10 Mar 2024 08:52:48 GMT
- Title: RESTORE: Towards Feature Shift for Vision-Language Prompt Learning
- Authors: Yuncheng Yang and Chuyan Zhang and Zuopeng Yang and Yuting Gao and
Yulei Qin and Ke Li and Xing Sun and Jie Yang and Yun Gu
- Abstract summary: We show that prompt tuning along only one branch of CLIP is the reason why the misalignment occurs.
Without proper regularization across the learnable parameters in different modalities, prompt learning violates the original pre-training constraints.
We propose RESTORE, a multi-modal prompt learning method that exerts explicit constraints on cross-modal consistency.
- Score: 33.13407089704543
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prompt learning is effective for fine-tuning foundation models to improve
their generalization across a variety of downstream tasks. However, prompts
that are independently optimized along a single modality path may sacrifice
the vision-language alignment of pre-trained models in return for improved
performance on specific tasks and classes, leading to poorer generalization. In
this paper, we first demonstrate that prompt tuning along only a single
branch of CLIP (e.g., language or vision) is the reason why the misalignment
occurs. Without proper regularization across the learnable parameters in
different modalities, prompt learning violates the original pre-training
constraints inherent in the two-tower architecture. To address such
misalignment, we first propose feature shift, which is defined as the variation
of embeddings after introducing the learned prompts, to serve as an explanatory
tool. We examine its relation to generalizability and then propose
RESTORE, a multi-modal prompt learning method that exerts explicit constraints
on cross-modal consistency. Specifically, to prevent feature
misalignment, a feature shift consistency constraint is introduced to synchronize
inter-modal feature shifts by measuring and regularizing the magnitude of
their discrepancy during prompt tuning. In addition, we propose a "surgery" block to
avoid short-cut hacking, where cross-modal misalignment can still be severe if
the feature shifts of both modalities vary drastically at the same rate. It is
implemented as feed-forward adapters on both modalities to alleviate the
misalignment problem. Extensive experiments on 15 datasets demonstrate that our
method outperforms the state-of-the-art prompt tuning methods without
compromising feature alignment.
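To make the ingredients above concrete, here is a minimal, hypothetical PyTorch-style sketch of how a feature shift, a shift-consistency penalty, and feed-forward "surgery" adapters could be wired together. All names (clip_image_encoder, visual_prompts, lambda_shift, etc.) are placeholders, and the exact RESTORE objective may differ from this illustration.

```python
import torch
import torch.nn as nn

class FeedForwardAdapter(nn.Module):
    """Illustrative 'surgery'-style adapter: a small bottleneck MLP with a residual path."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, x):
        return x + self.net(x)

def feature_shift(prompted_feats, frozen_feats):
    """Feature shift: the variation of embeddings after introducing the learned prompts."""
    return prompted_feats - frozen_feats

def shift_consistency_loss(img_shift, txt_shift):
    """Penalize the discrepancy between the average shift magnitudes of the two modalities."""
    return (img_shift.norm(dim=-1).mean() - txt_shift.norm(dim=-1).mean()).abs()

# Hypothetical training step (encoders, prompts, and data are placeholders):
# img_f0 = clip_image_encoder(images)                    # frozen branch, no prompts
# txt_f0 = clip_text_encoder(class_names)                # frozen branch, no prompts
# img_f  = clip_image_encoder(images, visual_prompts)    # with learned visual prompts
# txt_f  = clip_text_encoder(class_names, text_prompts)  # with learned text prompts
# img_f, txt_f = img_adapter(img_f), txt_adapter(txt_f)  # "surgery" adapters on both towers
# loss = task_loss(img_f, txt_f, labels) \
#        + lambda_shift * shift_consistency_loss(feature_shift(img_f, img_f0),
#                                                feature_shift(txt_f, txt_f0))
```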
Related papers
- Visual Prompt Tuning in Null Space for Continual Learning [51.96411454304625]
Existing prompt-tuning methods have demonstrated impressive performance in continual learning (CL).
This paper aims to learn each task by tuning the prompts in the direction orthogonal to the subspace spanned by previous tasks' features.
In practice, an effective null-space-based approximation has been proposed to implement the prompt gradient projection (a minimal sketch of this projection follows the related-papers list below).
arXiv Detail & Related papers (2024-06-09T05:57:40Z)
- One Size Fits All for Semantic Shifts: Adaptive Prompt Tuning for Continual Learning [41.395573635020604]
We propose an adaptive prompting approach that accommodates semantic shifts of varying degrees, where mild and abrupt shifts are mixed.
AdaPromptCL employs the assign-and-refine semantic grouping mechanism that dynamically manages prompt groups.
Experiment results demonstrate that AdaPromptCL outperforms existing prompting methods by up to 21.3%.
arXiv Detail & Related papers (2023-11-18T08:55:08Z)
- Self-regulating Prompts: Foundational Model Adaptation without Forgetting [112.66832145320434]
We introduce a self-regularization framework for prompting called PromptSRC.
PromptSRC guides the prompts to optimize for both task-specific and task-agnostic general representations.
arXiv Detail & Related papers (2023-07-13T17:59:35Z)
- Consistency-guided Prompt Learning for Vision-Language Models [23.4909421082857]
We propose Consistency-guided Prompt learning (CoPrompt), a new fine-tuning method for vision-language models.
Our approach improves the generalization of large foundation models when fine-tuned on downstream tasks in a few-shot setting.
arXiv Detail & Related papers (2023-06-01T23:20:47Z)
- Gradient-Regulated Meta-Prompt Learning for Generalizable Vision-Language Models [137.74524357614285]
We introduce a novel Gradient-RegulAted Meta-prompt learning framework.
It helps pre-trained models adapt to downstream tasks in a parameter- and data-efficient way.
GRAM can be easily incorporated into various prompt tuning methods in a model-agnostic way.
arXiv Detail & Related papers (2023-03-12T05:03:37Z)
- Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning [53.68371566336254]
We argue that the key to better performance lies in meaningful latent modality structures instead of perfect modality alignment.
Specifically, we design 1) a deep feature separation loss for intra-modality regularization; 2) a Brownian-bridge loss for inter-modality regularization; and 3) a geometric consistency loss for both intra- and inter-modality regularization.
arXiv Detail & Related papers (2023-03-10T14:38:49Z)
- Amortised Invariance Learning for Contrastive Self-Supervision [11.042648980854485]
We introduce the notion of amortised invariance learning for contrastive self-supervision.
We show that our amortised features provide a reliable way to learn diverse downstream tasks with different invariance requirements.
This provides an exciting perspective that opens up new horizons in the field of general purpose representation learning.
arXiv Detail & Related papers (2023-02-24T16:15:11Z)
- Bayesian Prompt Learning for Image-Language Model Generalization [64.50204877434878]
We use the regularization ability of Bayesian methods to frame prompt learning as a variational inference problem.
Our approach regularizes the prompt space, reduces overfitting to the seen prompts and improves the prompt generalization on unseen prompts.
We demonstrate empirically on 15 benchmarks that Bayesian prompt learning provides an appropriate coverage of the prompt space.
arXiv Detail & Related papers (2022-10-05T17:05:56Z)
- Progressively Guide to Attend: An Iterative Alignment Framework for Temporal Sentence Grounding [53.377028000325424]
We propose an Iterative Alignment Network (IA-Net) for temporal sentence grounding task.
We pad multi-modal features with learnable parameters to alleviate the nowhere-to-attend problem of non-matched frame-word pairs.
We also devise a calibration module following each attention module to refine the alignment knowledge.
arXiv Detail & Related papers (2021-09-14T02:08:23Z)
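As a side note on the first related paper above (Visual Prompt Tuning in Null Space), the following is a rough, hypothetical sketch of null-space gradient projection for prompts: the prompt gradient is kept only along directions in which previous tasks' features have (approximately) zero variance, so the update barely disturbs old-task outputs. The feature shapes and threshold are assumptions for illustration, not details taken from that paper.

```python
import torch

def project_gradient_to_null_space(grad, prev_feats, rel_eps=1e-3):
    """Project a flattened prompt gradient onto the approximate null space of
    previous tasks' features.

    grad:       (d,)   flattened prompt gradient
    prev_feats: (N, d) features collected from previous tasks
    """
    # Uncentered covariance of old-task features.
    cov = prev_feats.T @ prev_feats / prev_feats.shape[0]        # (d, d)
    # Eigen-directions with near-zero variance span the approximate null space.
    eigvals, eigvecs = torch.linalg.eigh(cov)                    # ascending eigenvalues
    null_basis = eigvecs[:, eigvals < rel_eps * eigvals.max()]   # (d, k)
    # Keep only the gradient component that lies in the null space.
    return null_basis @ (null_basis.T @ grad)
```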