Diversity Covariance-Aware Prompt Learning for Vision-Language Models
- URL: http://arxiv.org/abs/2503.01531v1
- Date: Mon, 03 Mar 2025 13:40:43 GMT
- Title: Diversity Covariance-Aware Prompt Learning for Vision-Language Models
- Authors: Songlin Dong, Zhengdong Zhou, Chenhao Ding, Xinyuan Gao, Alex Kot, Yihong Gong
- Abstract summary: We present a Diversity Covariance-Aware framework that learns distributional information from the data to enhance the few-shot ability of the prompt model. We propose a covariance-aware method that models the covariance relationships between visual features and uses anisotropic Mahalanobis distance, instead of the suboptimal cosine distance, to measure the similarity between two modalities. Then, we propose the diversity-aware method, which learns multiple diverse soft prompts to capture different attributes of categories and aligns them independently with visual modalities.
- Score: 12.40822956896241
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prompt tuning can further enhance the performance of visual-language models across various downstream tasks (e.g., few-shot learning), enabling them to better adapt to specific applications and needs. In this paper, we present a Diversity Covariance-Aware framework that learns distributional information from the data to enhance the few-shot ability of the prompt model. First, we propose a covariance-aware method that models the covariance relationships between visual features and uses anisotropic Mahalanobis distance, instead of the suboptimal cosine distance, to measure the similarity between two modalities. We rigorously derive and prove the validity of this modeling process. Then, we propose the diversity-aware method, which learns multiple diverse soft prompts to capture different attributes of categories and aligns them independently with visual modalities. This method achieves multi-centered covariance modeling, leading to more diverse decision boundaries. Extensive experiments on 11 datasets in various tasks demonstrate the effectiveness of our method.
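The sketch below illustrates the distance substitution described in the abstract: scoring a query feature against each class with an anisotropic Mahalanobis distance built from the per-class covariance of few-shot visual features, rather than an isotropic cosine similarity. This is a minimal illustration under assumed names, shapes, and a shrinkage constant; it is not the paper's implementation (which aligns visual features with learned soft-prompt text embeddings and adds the diversity-aware multi-prompt component).

```python
# Minimal sketch, assuming per-class statistics are estimated from few-shot
# visual features; all names and constants here are illustrative.
import numpy as np

def cosine_score(x, center):
    """Baseline: cosine similarity of feature x to a class center."""
    return float(x @ center / (np.linalg.norm(x) * np.linalg.norm(center) + 1e-8))

def class_statistics(features):
    """Mean and inverse covariance of one class's few-shot visual features.

    features: (n_shots, d) array. Shrinkage toward the identity keeps the
    covariance invertible when n_shots << d.
    """
    mu = features.mean(axis=0)
    diff = features - mu
    cov = diff.T @ diff / max(len(features) - 1, 1) + 1e-3 * np.eye(features.shape[1])
    return mu, np.linalg.inv(cov)

def mahalanobis_score(x, mu, cov_inv):
    """Negative squared Mahalanobis distance, so larger means more similar."""
    diff = x - mu
    return -float(diff @ cov_inv @ diff)

# Toy usage: two classes, four shots each, one query feature.
rng = np.random.default_rng(0)
d = 16
feats = {c: rng.normal(size=(4, d)) for c in ("cat", "dog")}
query = feats["cat"][0] + 0.1 * rng.normal(size=d)
stats = {c: class_statistics(f) for c, f in feats.items()}
for c in feats:
    print(c,
          "cosine:", round(cosine_score(query, stats[c][0]), 3),
          "mahalanobis:", round(mahalanobis_score(query, *stats[c]), 3))
```

Because the Mahalanobis score weights each feature direction by the class's estimated covariance, it can separate classes whose feature distributions are anisotropic, which is the motivation the abstract gives for replacing cosine distance.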
Related papers
- Evaluating the Diversity and Quality of LLM Generated Content [72.84945252821908]
We introduce a framework for measuring effective semantic diversity--diversity among outputs that meet quality thresholds.
Although preference-tuned models exhibit reduced lexical and syntactic diversity, they produce greater effective semantic diversity than SFT or base models.
These findings have important implications for applications that require diverse yet high-quality outputs.
arXiv Detail & Related papers (2025-04-16T23:02:23Z)
- XR-VLM: Cross-Relationship Modeling with Multi-part Prompts and Visual Features for Fine-Grained Recognition [20.989787824067143]
XR-VLM is a novel mechanism to discover subtle differences by modeling cross-relationships.
We develop a multi-part prompt learning module to capture multi-perspective descriptions.
Our method achieves significant improvements compared to current state-of-the-art approaches.
arXiv Detail & Related papers (2025-03-10T08:58:05Z)
- Measuring Data Diversity for Instruction Tuning: A Systematic Analysis and A Reliable Metric [48.81957145701228]
We propose NovelSum, a new diversity metric based on sample-level "novelty".
We show that NovelSum accurately captures diversity variations and achieves a 0.97 correlation with instruction-tuned model performance.
arXiv Detail & Related papers (2025-02-24T14:20:22Z)
- Prompt as Free Lunch: Enhancing Diversity in Source-Free Cross-domain Few-shot Learning through Semantic-Guided Prompting [9.116108409344177]
The source-free cross-domain few-shot learning task aims to transfer pretrained models to target domains utilizing minimal samples. We propose the SeGD-VPT framework, which is divided into two phases. The first step aims to increase feature diversity by adding diversity prompts to each support sample, thereby generating varying inputs and enhancing sample diversity.
arXiv Detail & Related papers (2024-12-01T11:00:38Z)
- OneDiff: A Generalist Model for Image Difference Captioning [5.71214984158106]
Image Difference Captioning (IDC) is crucial for accurately describing variations between closely related images.
OneDiff is a novel generalist approach that utilizes a robust vision-language model architecture.
OneDiff consistently outperforms existing state-of-the-art models in accuracy and adaptability.
arXiv Detail & Related papers (2024-07-08T06:14:37Z)
- Revisiting Demonstration Selection Strategies in In-Context Learning [66.11652803887284]
Large language models (LLMs) have shown an impressive ability to perform a wide range of tasks using in-context learning (ICL).
In this work, we first revisit the factors contributing to this variance from both data and model aspects, and find that the choice of demonstration is both data- and model-dependent.
We propose a data- and model-dependent demonstration selection method, TopK + ConE, based on the assumption that the performance of a demonstration positively correlates with its contribution to the model's understanding of the test samples.
arXiv Detail & Related papers (2024-01-22T16:25:27Z)
- Leveraging Diffusion Disentangled Representations to Mitigate Shortcuts in Underspecified Visual Tasks [92.32670915472099]
We propose an ensemble diversification framework exploiting the generation of synthetic counterfactuals using Diffusion Probabilistic Models (DPMs).
We show that diffusion-guided diversification can lead models to avert attention from shortcut cues, achieving ensemble diversity performance comparable to previous methods requiring additional data collection.
arXiv Detail & Related papers (2023-10-03T17:37:52Z)
- StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z)
- UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.