Distribution-Aware Prompt Tuning for Vision-Language Models
- URL: http://arxiv.org/abs/2309.03406v1
- Date: Wed, 6 Sep 2023 23:49:11 GMT
- Title: Distribution-Aware Prompt Tuning for Vision-Language Models
- Authors: Eulrang Cho, Jooyeon Kim, Hyunwoo J. Kim
- Abstract summary: A key to prompt tuning is the feature space alignment between two modalities via learnable vectors with model parameters fixed.
Inspired by this observation, we propose distribution-aware prompt tuning (DAPT) for vision-language models.
Our experiments on 11 benchmark datasets demonstrate that our method significantly improves generalizability.
- Score: 20.02599087680773
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained vision-language models (VLMs) have shown impressive performance
on various downstream tasks by utilizing knowledge learned from large data. In
general, the performance of VLMs on target tasks can be further improved by
prompt tuning, which adds context to the input image or text. By leveraging
data from target tasks, various prompt-tuning methods have been studied in the
literature. A key to prompt tuning is the feature space alignment between two
modalities via learnable vectors with model parameters fixed. We observed that
the alignment becomes more effective when embeddings of each modality are
'well-arranged' in the latent space. Inspired by this observation, we propose
distribution-aware prompt tuning (DAPT) for vision-language models, which is
simple yet effective. Specifically, the prompts are learned by maximizing
inter-dispersion, the distance between classes, as well as minimizing the
intra-dispersion measured by the distance between embeddings from the same
class. Our extensive experiments on 11 benchmark datasets demonstrate that our
method significantly improves generalizability. The code is available at
https://github.com/mlvlab/DAPT.
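As a rough illustration of the objective above, the two dispersion terms could look like the following PyTorch sketch. This is a minimal sketch, not the released DAPT code (see the repository above for the actual implementation); the per-class prototype computation, the squared-Euclidean distance, and the weighting shown in the trailing comment are assumptions.
```python
# Minimal sketch of the dispersion objectives described in the abstract; this is
# not the released DAPT code. Assumptions: features are L2-normalized, class
# prototypes are per-class means, and distances are squared Euclidean.
import torch
import torch.nn.functional as F


def dispersion_terms(embeddings: torch.Tensor, labels: torch.Tensor):
    """embeddings: (N, D) features; labels: (N,) integer class ids."""
    classes = labels.unique()  # sorted unique class ids
    prototypes = torch.stack([embeddings[labels == c].mean(dim=0) for c in classes])
    prototypes = F.normalize(prototypes, dim=-1)

    # Intra-dispersion: distance between embeddings from the same class,
    # measured here to the class prototype (to be minimized).
    idx = torch.searchsorted(classes, labels)
    intra = (embeddings - prototypes[idx]).pow(2).sum(dim=-1).mean()

    # Inter-dispersion: distance between classes, here the mean pairwise
    # distance between prototypes (to be maximized).
    dists = torch.cdist(prototypes, prototypes)
    k = prototypes.size(0)
    inter = dists.sum() / max(k * (k - 1), 1)

    return intra, inter


# A combined prompt-tuning objective could then look like
#   loss = task_loss + w_intra * intra - w_inter * inter
# where w_intra and w_inter are hypothetical weighting hyperparameters.
```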
Related papers
- Adapting Vision-Language Models to Open Classes via Test-Time Prompt Tuning [50.26965628047682]
Adapting pre-trained models to open classes is a challenging problem in machine learning.
In this paper, we consider combining the advantages of both and come up with a test-time prompt tuning approach.
Our proposed method outperforms all comparison methods on average considering both base and new classes.
arXiv Detail & Related papers (2024-08-29T12:34:01Z)
- Less is More: High-value Data Selection for Visual Instruction Tuning [127.38740043393527]
We propose TIVE, a high-value data selection approach, to eliminate redundancy within the visual instruction data and reduce the training cost.
Our approach using only about 15% data can achieve comparable average performance to the full-data fine-tuned model across eight benchmarks.
arXiv Detail & Related papers (2024-03-14T16:47:25Z)
- Learning Semantic Proxies from Visual Prompts for Parameter-Efficient Fine-Tuning in Deep Metric Learning [13.964106147449051]
Existing solutions concentrate on fine-tuning the pre-trained models on conventional image datasets.
We propose a novel and effective framework based on learning Visual Prompts (VPT) in the pre-trained Vision Transformers (ViT).
We demonstrate that our new approximations with semantic information offer superior representative capability.
arXiv Detail & Related papers (2024-02-04T04:42:05Z)
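For context on the visual-prompt mechanism named in the entry above, here is a minimal, generic sketch of prepending learnable prompt tokens to a frozen ViT. It is not the paper's framework; the timm-style backbone attributes (blocks, norm), token count, and insertion point are illustrative assumptions.
```python
# Minimal, generic sketch of visual prompt tuning: learnable tokens are prepended
# to the token sequence of a frozen Vision Transformer and only they are trained.
# The timm-style backbone attributes (blocks, norm), the token count, and the
# insertion point after [CLS] are illustrative assumptions.
import torch
import torch.nn as nn


class PromptedViT(nn.Module):
    def __init__(self, vit_backbone: nn.Module, embed_dim: int = 768, num_prompts: int = 10):
        super().__init__()
        self.backbone = vit_backbone
        for p in self.backbone.parameters():       # keep the pre-trained model frozen
            p.requires_grad_(False)
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, embed_dim) * 0.02)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, 1 + N, D) -- [CLS] token followed by patch embeddings
        prompts = self.prompts.expand(tokens.size(0), -1, -1)
        tokens = torch.cat([tokens[:, :1], prompts, tokens[:, 1:]], dim=1)
        for blk in self.backbone.blocks:           # frozen transformer blocks
            tokens = blk(tokens)
        return self.backbone.norm(tokens)[:, 0]    # [CLS] feature for the downstream head
```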
- Make Prompts Adaptable: Bayesian Modeling for Vision-Language Prompt Learning with Data-Dependent Prior [14.232144691524528]
Recent Vision-Language Pretrained models have become the backbone for many downstream tasks.
MLE training can lead the context vector to over-fit dominant image features in the training data.
This paper presents a Bayesian framework for prompt learning, which can alleviate overfitting in few-shot learning applications.
arXiv Detail & Related papers (2024-01-09T10:15:59Z)
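The entry above treats the context vectors probabilistically rather than as a point estimate. A minimal variational sketch of that idea follows; the standard-normal prior is a simplification, and the paper's data-dependent prior is not reproduced here.
```python
# Minimal sketch of Bayesian (variational) prompt learning: the context vectors are
# modeled as a Gaussian rather than a point estimate, sampled with the
# reparameterization trick, and regularized by a KL term. The standard-normal prior
# is a simplifying assumption; the paper's data-dependent prior is not shown.
import torch
import torch.nn as nn


class VariationalContext(nn.Module):
    def __init__(self, num_tokens: int = 16, dim: int = 512):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(num_tokens, dim))
        self.log_var = nn.Parameter(torch.zeros(num_tokens, dim))

    def forward(self):
        std = torch.exp(0.5 * self.log_var)
        eps = torch.randn_like(std)
        context = self.mu + eps * std          # sampled context vectors
        kl = 0.5 * (self.mu.pow(2) + self.log_var.exp() - self.log_var - 1).sum()
        return context, kl


# Training would add the KL term to the usual prompt-tuning loss, e.g.
#   loss = task_loss(context) + kl_weight * kl
```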
- LAMM: Label Alignment for Multi-Modal Prompt Learning [17.478967970736115]
We introduce an innovative label alignment method named LAMM, which can adjust the category embeddings of downstream datasets through end-to-end training.
Our method significantly improves the performance of existing multi-modal prompt learning models in few-shot scenarios.
Our methodology also outperforms other prompt tuning methods in continual learning.
arXiv Detail & Related papers (2023-12-13T15:29:52Z)
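The entry above hinges on making the category (label) embeddings trainable end-to-end. A minimal sketch of that idea follows; the text-feature initialization, cosine-similarity logits, and temperature are illustrative assumptions rather than the paper's exact recipe.
```python
# Minimal sketch of label alignment: instead of frozen class-name text embeddings,
# each downstream category gets a trainable embedding adjusted end-to-end.
# Initialization from encoded class-name prompts and the contrastive-style loss
# below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TrainableLabelEmbeddings(nn.Module):
    def __init__(self, init_text_features: torch.Tensor):
        super().__init__()
        # init_text_features: (num_classes, D), e.g. encoded class-name prompts
        self.label_emb = nn.Parameter(init_text_features.clone())

    def forward(self, image_features: torch.Tensor, labels: torch.Tensor, temperature: float = 0.07):
        img = F.normalize(image_features, dim=-1)
        lab = F.normalize(self.label_emb, dim=-1)
        logits = img @ lab.t() / temperature     # (B, num_classes)
        return F.cross_entropy(logits, labels)
```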
- M-Tuning: Prompt Tuning with Mitigated Label Bias in Open-Set Scenarios [103.6153593636399]
We propose a vision-language prompt tuning method with mitigated label bias (M-Tuning).
It introduces open words from WordNet to extend the prompt texts beyond the closed-set label words, so that prompts are tuned in a simulated open-set scenario.
Our method achieves the best performance on datasets of various scales, and extensive ablation studies also validate its effectiveness.
arXiv Detail & Related papers (2023-03-09T09:05:47Z)
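The entry above extends the prompt texts with open words drawn from WordNet. A minimal sketch of that step follows, using NLTK's WordNet interface; the sampling scheme and prompt template are illustrative assumptions, not the paper's exact procedure.
```python
# Minimal sketch of extending prompt texts with "open words": the closed-set label
# words are mixed with extra nouns sampled from WordNet so prompts are tuned in a
# simulated open-set setting. The sampling scheme and template are assumptions.
import random
from nltk.corpus import wordnet as wn   # requires: nltk.download("wordnet")


def build_prompt_texts(label_words, num_open_words=100, template="a photo of a {}."):
    open_words = {lemma.replace("_", " ")
                  for synset in wn.all_synsets(pos="n")
                  for lemma in synset.lemma_names()}
    open_words -= set(label_words)                      # keep only non-label words
    sampled = random.sample(sorted(open_words), num_open_words)
    return [template.format(w) for w in list(label_words) + sampled]
```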
- Dynamic Prompting: A Unified Framework for Prompt Tuning [33.175097465669374]
We present a unified dynamic prompt (DP) tuning strategy that dynamically determines different factors of prompts based on specific tasks and instances.
Experimental results underscore the significant performance improvement achieved by dynamic prompt tuning across a wide range of tasks.
We establish the universal applicability of our approach under full-data, few-shot, and multitask scenarios.
arXiv Detail & Related papers (2023-03-06T06:04:46Z)
- Unleashing Text-to-Image Diffusion Models for Visual Perception [84.41514649568094]
VPD (Visual Perception with a pre-trained diffusion model) is a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks.
We show that VPD can be adapted faster to downstream visual perception tasks.
arXiv Detail & Related papers (2023-03-03T18:59:47Z)
- Unified Vision and Language Prompt Learning [86.1530128487077]
We present a systematic study on two representative prompt tuning methods, namely text prompt tuning and visual prompt tuning.
A major finding is that text prompt tuning fails on data with high intra-class visual variances while visual prompt tuning cannot handle low inter-class variances.
To combine the best from both worlds, we propose a simple approach called Unified Prompt Tuning (UPT), which essentially learns a tiny neural network to jointly optimize prompts across different modalities.
arXiv Detail & Related papers (2022-10-13T17:50:24Z)
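The entry above learns a tiny network that produces prompts for both modalities. A minimal sketch of that idea follows; the two-layer MLP, token count, and dimensions are illustrative assumptions, not the paper's architecture.
```python
# Minimal sketch of unified prompt tuning: a tiny shared network maps one learnable
# prompt into both text-side and visual-side prompt tokens, so the two modalities
# are optimized jointly. Dimensions and the MLP shape are illustrative assumptions.
import torch
import torch.nn as nn


class UnifiedPrompt(nn.Module):
    def __init__(self, num_tokens=8, shared_dim=512, text_dim=512, visual_dim=768, hidden=256):
        super().__init__()
        self.shared = nn.Parameter(torch.randn(num_tokens, shared_dim) * 0.02)
        self.net = nn.Sequential(nn.Linear(shared_dim, hidden), nn.GELU())
        self.to_text = nn.Linear(hidden, text_dim)
        self.to_visual = nn.Linear(hidden, visual_dim)

    def forward(self):
        h = self.net(self.shared)
        # prompt tokens to be fed to the text encoder and the image encoder
        return self.to_text(h), self.to_visual(h)
```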
- Efficient Nearest Neighbor Language Models [114.40866461741795]
Non-parametric neural language models (NLMs) learn predictive distributions of text utilizing an external datastore.
We show how to achieve up to a 6x speed-up in inference while retaining comparable performance.
arXiv Detail & Related papers (2021-09-09T12:32:28Z)
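The entry above builds on nearest-neighbor language models, where the model's next-token distribution is interpolated with one derived from neighbors retrieved from an external datastore. A minimal sketch of that interpolation follows; the retrieval step itself is omitted, and the interpolation weight is an illustrative assumption.
```python
# Minimal sketch of kNN-LM style interpolation: the LM's next-token distribution is
# mixed with a distribution built from k retrieved (context, token) neighbors.
# The retrieval index is omitted; the lambda value is an illustrative assumption.
import torch
import torch.nn.functional as F


def knn_lm_distribution(p_lm, neighbor_dists, neighbor_tokens, vocab_size, lam=0.25):
    """p_lm: (V,) model distribution; neighbor_dists: (k,) distances;
    neighbor_tokens: (k,) long tensor of retrieved next-token ids."""
    weights = F.softmax(-neighbor_dists, dim=0)               # closer neighbors weigh more
    p_knn = torch.zeros(vocab_size).scatter_add_(0, neighbor_tokens, weights)
    return lam * p_knn + (1 - lam) * p_lm                     # interpolated distribution
```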
This list is automatically generated from the titles and abstracts of the papers on this site.