Understanding Prompt Tuning for V-L Models Through the Lens of Neural Collapse
- URL: http://arxiv.org/abs/2306.15955v3
- Date: Thu, 7 Sep 2023 07:34:21 GMT
- Title: Understanding Prompt Tuning for V-L Models Through the Lens of Neural Collapse
- Authors: Didi Zhu, Zexi Li, Min Zhang, Junkun Yuan, Yunfeng Shao, Jiashuo Liu,
Kun Kuang, Yinchuan Li, Chao Wu
- Abstract summary: We propose Neural-collapse-anchored Prompt Tuning (NPT), a novel method that learns prompts so that text and image representations satisfy the same simplex ETF structure.
NPT incorporates two regularization terms, language-modality collapse and multi-modality isomorphism, and is compatible with other prompt tuning methods.
- Score: 47.89674843370092
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale vision-language (V-L) models have demonstrated remarkable
generalization capabilities for downstream tasks through prompt tuning.
However, the mechanisms behind the learned text representations are unknown,
limiting further generalization gains, especially under class imbalance
scenarios. Recent advances in the neural collapse (NC) phenomenon of
vision-only models suggest that the optimal representation structure is the
simplex ETF, which paves the way to study representations in V-L models. In
this paper, we make the first attempt to use NC for examining the
representations in V-L models via prompt tuning. We find that the NC optimality
of text-to-image representations correlates positively with downstream
generalizability, an effect that is even more pronounced under class imbalance settings. To
improve the representations, we propose Neural-collapse-anchored Prompt Tuning
(NPT), a novel method that learns prompts with text and image representations
that satisfy the same simplex ETF. NPT incorporates two regularization terms:
language-modality collapse and multi-modality isomorphism; and it is compatible
with other prompt tuning methods. Extensive experiments show that NPT can
consistently help to improve existing prompt tuning techniques across 11
datasets for both balanced and imbalanced settings.
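For concreteness, here is a minimal NumPy sketch of the geometry involved: it constructs a simplex ETF and computes two illustrative penalties in the spirit of NPT's language-modality collapse and multi-modality isomorphism terms. The function names and loss forms are assumptions for illustration; the paper defines the exact objectives.
```python
import numpy as np

def simplex_etf(d, K, seed=0):
    """A d-dimensional simplex ETF with K vertices (requires d >= K).
    Columns are unit vectors; every pair meets at cosine -1/(K-1),
    the structure neural collapse predicts for optimal class means."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((d, K)))   # orthonormal columns
    return np.sqrt(K / (K - 1)) * U @ (np.eye(K) - np.ones((K, K)) / K)

def npt_style_penalties(text_feats, image_feats, etf):
    """Hypothetical NPT-style regularizers (illustrative forms only):
    - language-modality collapse: pull per-class text features onto the ETF;
    - multi-modality isomorphism: anchor image features to the same ETF."""
    unit = lambda x: x / np.linalg.norm(x, axis=0, keepdims=True)
    t, v = unit(text_feats), unit(image_feats)
    lang_collapse = np.mean(1.0 - np.sum(t * etf, axis=0))  # 1 - cos per class
    isomorphism = np.mean(1.0 - np.sum(v * etf, axis=0))
    return lang_collapse, isomorphism

etf = simplex_etf(d=512, K=10)
G = etf.T @ etf                                       # Gram matrix of vertices
assert np.allclose(np.diag(G), 1.0)                   # unit-norm vertices
assert np.allclose(G[~np.eye(10, dtype=bool)], -1.0 / 9.0)  # equiangular
```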
Related papers
- Non-Contrastive Vision-Language Learning with Predictive Embedding Alignment [12.336161969869567]
We introduce NOVA, a NOn-contrastive Vision-language Alignment framework based on joint embedding prediction with distributional regularization.
We evaluate NOVA on zero-shot chest X-ray classification using ClinicalBERT as the text encoder and Vision Transformers trained from scratch on MIMIC-CXR.
Our results demonstrate that non-contrastive vision-language pretraining offers a simpler, more stable, and more effective alternative to contrastive methods.
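A minimal sketch of the non-contrastive recipe described above; the predictor and the VICReg-style variance term standing in for NOVA's distributional regularization are assumptions, not the published objective.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def predictive_alignment_loss(img_emb, txt_emb, predictor):
    """Hedged sketch of non-contrastive alignment: a predictor maps image
    embeddings into the text space and is trained to match them; a VICReg-
    style variance term (an assumption) stands in for the distributional
    regularizer that prevents collapse to a constant embedding."""
    pred = predictor(img_emb)                    # (B, d) predicted text embedding
    align = F.mse_loss(pred, txt_emb.detach())   # joint embedding prediction
    anti_collapse = F.relu(1.0 - pred.std(dim=0)).mean()  # keep per-dim spread
    return align + anti_collapse

loss = predictive_alignment_loss(torch.randn(8, 512), torch.randn(8, 512),
                                 nn.Linear(512, 512))
```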
arXiv Detail & Related papers (2026-01-31T10:57:46Z)
- Adaptive Prompt Tuning: Vision Guided Prompt Tuning with Cross-Attention for Fine-Grained Few-Shot Learning [5.242869847419834]
Few-shot, fine-grained classification in computer vision poses significant challenges due to the need to differentiate subtle class distinctions with limited data.
This paper presents a novel method that enhances the Contrastive Language-Image Pre-Training model through adaptive prompt tuning.
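A minimal sketch of the vision-guided idea named in the title, assuming learnable prompt tokens that attend over image patch features via cross-attention; module names and dimensions are illustrative, not the paper's implementation.
```python
import torch
import torch.nn as nn

class VisionGuidedPrompts(nn.Module):
    """Learnable prompt tokens attend over image patch features via
    cross-attention, so the prompts adapt per image (hypothetical sizes)."""
    def __init__(self, n_prompts=4, dim=512, n_heads=8):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, patch_feats):                       # (B, n_patches, dim)
        B = patch_feats.size(0)
        q = self.prompts.unsqueeze(0).expand(B, -1, -1)   # (B, n_prompts, dim)
        adapted, _ = self.cross_attn(q, patch_feats, patch_feats)
        return q + adapted                   # residual keeps the base prompts

adapted = VisionGuidedPrompts()(torch.randn(2, 196, 512))  # -> (2, 4, 512)
```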
arXiv Detail & Related papers (2024-12-19T08:51:01Z)
- Fine-Grained Verifiers: Preference Modeling as Next-token Prediction in Vision-Language Alignment [57.0121616203175]
We propose FiSAO, a novel self-alignment method that utilizes the model's own visual encoder as a fine-grained verifier to improve vision-language alignment.
By leveraging token-level feedback from the vision encoder, FiSAO significantly improves vision-language alignment, even surpassing traditional preference tuning methods that require additional data.
arXiv Detail & Related papers (2024-10-18T03:34:32Z)
- Visual Prompt Tuning in Null Space for Continual Learning [51.96411454304625]
Existing prompt-tuning methods have demonstrated impressive performance in continual learning (CL).
This paper aims to learn each task by tuning the prompts in the direction orthogonal to the subspace spanned by previous tasks' features.
In practice, an effective null-space-based approximation solution has been proposed to implement the prompt gradient projection.
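A minimal sketch of the projection idea, assuming stored features from previous tasks; the paper's null-space approximation is more refined.
```python
import torch

def null_space_project(grad, feats, eps=1e-5):
    """Remove from the prompt gradient every component lying in the span of
    previous tasks' features, so the update leaves old-task responses
    (approximately) unchanged.
    grad: (d,) prompt gradient; feats: (n, d) stored task features."""
    _, S, Vh = torch.linalg.svd(feats, full_matrices=False)
    basis = Vh[S > eps * S.max()]     # (r, d) significant feature directions
    return grad - basis.T @ (basis @ grad)

g_null = null_space_project(torch.randn(512), torch.randn(100, 512))
# feats @ g_null is ~0: updating along g_null does not disturb old features.
```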
arXiv Detail & Related papers (2024-06-09T05:57:40Z)
- DRPT: Disentangled and Recurrent Prompt Tuning for Compositional Zero-Shot Learning [15.580557941267095]
State and object primitives are treated as learnable vocabulary tokens embedded in prompts and tuned on seen compositions.
We develop a progressive fine-tuning procedure that allows for incremental updates to the prompts.
We quantify and analyze the entanglement in Compositional Zero-shot Learning.
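A minimal sketch of disentangled primitive prompts under these assumptions: separate learnable state and object token tables composed at lookup time, so unseen (state, object) pairs reuse seen primitives. Names and shapes are illustrative.
```python
import torch
import torch.nn as nn

class PrimitivePrompts(nn.Module):
    """Separate learnable token tables for state and object primitives,
    concatenated into a two-token prompt per composition."""
    def __init__(self, n_states, n_objects, dim=512):
        super().__init__()
        self.state_tokens = nn.Embedding(n_states, dim)
        self.object_tokens = nn.Embedding(n_objects, dim)

    def forward(self, state_ids, object_ids):             # (B,), (B,)
        return torch.stack([self.state_tokens(state_ids),
                            self.object_tokens(object_ids)], dim=1)  # (B, 2, dim)

prompt = PrimitivePrompts(n_states=10, n_objects=20)(
    torch.tensor([0, 3]), torch.tensor([5, 7]))           # -> (2, 2, 512)
```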
arXiv Detail & Related papers (2023-05-02T07:42:47Z)
- Unified Vision and Language Prompt Learning [86.1530128487077]
We present a systematic study on two representative prompt tuning methods, namely text prompt tuning and visual prompt tuning.
A major finding is that text prompt tuning fails on data with high intra-class visual variance, while visual prompt tuning cannot handle low inter-class variance.
To combine the best from both worlds, we propose a simple approach called Unified Prompt Tuning (UPT), which essentially learns a tiny neural network to jointly optimize prompts across different modalities.
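A minimal sketch of the unified idea, assuming a single shared layer over joint learnable tokens that are then split into text-side and vision-side prompts; sizes are illustrative, not UPT's exact architecture.
```python
import torch
import torch.nn as nn

class UnifiedPromptGenerator(nn.Module):
    """One tiny shared network produces both text and visual prompts from a
    joint set of learnable tokens (hypothetical layer sizes)."""
    def __init__(self, n_tokens=8, dim=512):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(n_tokens, dim) * 0.02)
        self.shared = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                                 batch_first=True)

    def forward(self):
        joint = self.shared(self.tokens.unsqueeze(0)).squeeze(0)
        half = joint.size(0) // 2
        return joint[:half], joint[half:]   # text prompts, visual prompts

text_prompts, visual_prompts = UnifiedPromptGenerator()()  # (4, 512) each
```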
arXiv Detail & Related papers (2022-10-13T17:50:24Z)
- Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model [39.722927180264584]
We propose a novel Dual-modality Prompt Tuning (DPT) paradigm through learning text and visual prompts simultaneously.
To make the final image feature concentrate more on the target visual concept, a Class-Aware Visual Prompt Tuning scheme is proposed.
arXiv Detail & Related papers (2022-08-17T15:06:36Z)
- Prompt Tuning for Generative Multimodal Pretrained Models [75.44457974275154]
We implement prompt tuning on a unified sequence-to-sequence pretrained model that adapts to both understanding and generation tasks.
Experimental results demonstrate that lightweight prompt tuning can achieve performance comparable to finetuning.
In comparison with finetuned models, the prompt-tuned models demonstrate improved robustness against adversarial attacks.
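A minimal sketch of soft-prompt tuning for a frozen sequence-to-sequence model; shapes and names are assumptions, not the paper's exact setup.
```python
import torch
import torch.nn as nn

class PrefixPrompt(nn.Module):
    """Prepend trainable soft-prompt embeddings to the input sequence.
    With the base model frozen, only self.soft_prompt receives gradients."""
    def __init__(self, n_prompt=16, dim=768):
        super().__init__()
        self.soft_prompt = nn.Parameter(torch.randn(n_prompt, dim) * 0.02)

    def forward(self, token_embeds):             # (B, T, dim) from frozen model
        B = token_embeds.size(0)
        prefix = self.soft_prompt.unsqueeze(0).expand(B, -1, -1)
        return torch.cat([prefix, token_embeds], dim=1)  # (B, n_prompt+T, dim)

prompted = PrefixPrompt()(torch.randn(2, 32, 768))       # -> (2, 48, 768)
```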
arXiv Detail & Related papers (2022-08-04T08:56:38Z)
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
In this paper, we present a novel prompt-based scheme to train the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.