PLPP: Prompt Learning with Perplexity Is Self-Distillation for Vision-Language Models
- URL: http://arxiv.org/abs/2412.15277v1
- Date: Wed, 18 Dec 2024 03:08:53 GMT
- Title: PLPP: Prompt Learning with Perplexity Is Self-Distillation for Vision-Language Models
- Authors: Biao Liu, Wenyi Fang, Xiaoyu Wu, Yang Zheng, Zheng Hu, Bo Yuan,
- Abstract summary: We propose a plug-in prompt-regularization method called PLPP, which use perplexity loss to regularize prompt learning.<n>Experiments conducted on four classification tasks indicate that PLPP exhibits superior performance compared to existing methods.
- Score: 8.480318790780037
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained Vision-Language (VL) models such as CLIP have demonstrated their excellent performance across numerous downstream tasks. A recent method, Context Optimization (CoOp), further improves the performance of VL models on downstream tasks by introducing prompt learning. CoOp optimizes a set of learnable vectors, aka prompt, and freezes the whole CLIP model. However, relying solely on CLIP loss to fine-tune prompts can lead to models that are prone to overfitting on downstream task. To address this issue, we propose a plug-in prompt-regularization method called PLPP (Prompt Learning with PerPlexity), which use perplexity loss to regularize prompt learning. PLPP designs a two-step operation to compute the perplexity for prompts: (a) calculating cosine similarity between the weight of the embedding layer and prompts to get labels, (b) introducing a language model (LM) head that requires no training behind text encoder to output word probability distribution. Meanwhile, we unveil that the essence of PLPP is inherently a form of self-distillation. To further prevent overfitting as well as to reduce the additional computation introduced by PLPP, we turn the hard label to soft label and choose top-$k$ values for calculating the perplexity loss. For accelerating model convergence, we introduce mutual self-distillation learning, that is perplexity and inverted perplexity loss. The experiments conducted on four classification tasks indicate that PLPP exhibits superior performance compared to existing methods.
Related papers
- FDBPL: Faster Distillation-Based Prompt Learning for Region-Aware Vision-Language Models Adaptation [17.51747913191231]
We propose large textbfFaster large textbfDistillation-large textbfBased large textbfPrompt large textbfLL (textbfFDBPL)<n>It addresses issues by sharing soft supervision contexts across multiple training stages and implementing accelerated I/O. Comprehensive evaluations across 11 datasets demonstrate superior performance in base-to-new generalization, cross-dataset transfer, and robustness tests, achieving $2.2times$ faster training speed.
arXiv Detail & Related papers (2025-05-23T15:57:16Z) - Post-pre-training for Modality Alignment in Vision-Language Foundation Models [12.110530026601968]
This paper presents CLIP-Refine, a post-pre-training method for CLIP models at a phase between pre-training and fine-tuning.
It aims to align the feature space with 1 epoch training on small image-text datasets without zero-shot performance degradations.
arXiv Detail & Related papers (2025-04-17T07:46:19Z) - Zero-Shot Class Unlearning in CLIP with Synthetic Samples [0.0]
We focus on unlearning within CLIP, a dual vision-language model trained on a massive dataset of image-text pairs.<n>We apply Lipschitz regularization to the multimodal context of CLIP.<n>Our forgetting procedure is iterative, where we track accuracy on a synthetic forget set and stop when accuracy falls below a chosen threshold.
arXiv Detail & Related papers (2024-07-10T09:16:14Z) - LP++: A Surprisingly Strong Linear Probe for Few-Shot CLIP [20.86307407685542]
Linear Probe (LP) has been often reported as a weak baseline for few-shot CLIP adaptation.
In this work, we examine from convex-optimization perspectives a generalization of the standard LP baseline.
Our image-language objective function, along with these non-trivial optimization insights and ingredients, yields, surprisingly, highly competitive few-shot CLIP performances.
arXiv Detail & Related papers (2024-04-02T20:23:10Z) - ECLIPSE: Efficient Continual Learning in Panoptic Segmentation with Visual Prompt Tuning [54.68180752416519]
Panoptic segmentation is a cutting-edge computer vision task.
We introduce a novel and efficient method for continual panoptic segmentation based on Visual Prompt Tuning, dubbed ECLIPSE.
Our approach involves freezing the base model parameters and fine-tuning only a small set of prompt embeddings, addressing both catastrophic forgetting and plasticity.
arXiv Detail & Related papers (2024-03-29T11:31:12Z) - Match me if you can: Semi-Supervised Semantic Correspondence Learning with Unpaired Images [76.47980643420375]
This paper builds on the hypothesis that there is an inherent data-hungry matter in learning semantic correspondences.
We demonstrate a simple machine annotator reliably enriches paired key points via machine supervision.
Our models surpass current state-of-the-art models on semantic correspondence learning benchmarks like SPair-71k, PF-PASCAL, and PF-WILLOW.
arXiv Detail & Related papers (2023-11-30T13:22:15Z) - Do Compressed LLMs Forget Knowledge? An Experimental Study with
Practical Implications [63.29358103217275]
Large Language Models (LLMs) often leads to reduced performance, especially for knowledge-intensive tasks.
We propose two conjectures on the nature of the damage: one is certain knowledge being forgotten (or erased) after compression.
We introduce a variant called Inference-time Dynamic Prompting (IDP) that can effectively increase prompt diversity without incurring any inference overhead.
arXiv Detail & Related papers (2023-10-02T03:12:06Z) - Self-regulating Prompts: Foundational Model Adaptation without
Forgetting [112.66832145320434]
We introduce a self-regularization framework for prompting called PromptSRC.
PromptSRC guides the prompts to optimize for both task-specific and task-agnostic general representations.
arXiv Detail & Related papers (2023-07-13T17:59:35Z) - Approximated Prompt Tuning for Vision-Language Pre-trained Models [54.326232586461614]
In vision-language pre-trained models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks.
We propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning.
arXiv Detail & Related papers (2023-06-27T05:43:47Z) - Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in
Vision-Language Models [76.410400238974]
We propose TTA with feedback to rectify the model output and prevent the model from becoming blindly confident.
A CLIP model is adopted as the reward model during TTA and provides feedback for the VLM.
The proposed textitreinforcement learning with CLIP feedback(RLCF) framework is highly flexible and universal.
arXiv Detail & Related papers (2023-05-29T11:03:59Z) - Black Box Few-Shot Adaptation for Vision-Language models [41.49584259596654]
Vision-Language (V-L) models trained with contrastive learning to align the visual and language modalities have been shown to be strong few-shot learners.
We describe a black-box method for V-L few-shot adaptation that operates on pre-computed image and text features.
We propose Linear Feature Alignment (LFA), a simple linear approach for V-L re-alignment in the target domain.
arXiv Detail & Related papers (2023-04-04T12:42:29Z) - CLIPood: Generalizing CLIP to Out-of-Distributions [73.86353105017076]
Contrastive language-image pre-training (CLIP) models have shown impressive zero-shot ability, but the further adaptation of CLIP on downstream tasks undesirably degrades OOD performances.
We propose CLIPood, a fine-tuning method that can adapt CLIP models to OOD situations where both domain shifts and open classes may occur on unseen test data.
Experiments on diverse datasets with different OOD scenarios show that CLIPood consistently outperforms existing generalization techniques.
arXiv Detail & Related papers (2023-02-02T04:27:54Z) - LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of
Vision & Language Models [67.19124099815645]
We propose a novel Language-Aware Soft Prompting (LASP) learning method to alleviate base class overfitting.
LASP is inherently amenable to including, during training, virtual classes, i.e. class names for which no visual samples are available.
LASP matches and surpasses, for the first time, the accuracy on novel classes obtained by hand-crafted prompts and CLIP for 8 out of 11 test datasets.
arXiv Detail & Related papers (2022-10-03T17:56:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.