Prompt Learning with Optimal Transport for Vision-Language Models
- URL: http://arxiv.org/abs/2210.01253v1
- Date: Mon, 3 Oct 2022 22:21:07 GMT
- Title: Prompt Learning with Optimal Transport for Vision-Language Models
- Authors: Guangyi Chen, Weiran Yao, Xiangchen Song, Xinyue Li, Yongming Rao, Kun Zhang
- Abstract summary: We learn multiple comprehensive prompts to describe diverse characteristics of categories such as intrinsic attributes or extrinsic contexts.
To solve this problem, we propose to apply optimal transport to match the vision and text modalities.
In the inner loop, we optimize the optimal transport distance to align visual features and prompts by the Sinkhorn algorithm, while in the outer loop, we learn the prompts by this distance from the supervised data.
- Score: 25.928455328563402
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the increasing attention to large vision-language models such as CLIP,
there has been a significant amount of effort dedicated to building efficient
prompts. Unlike conventional methods of only learning one single prompt, we
propose to learn multiple comprehensive prompts to describe diverse
characteristics of categories such as intrinsic attributes or extrinsic
contexts. However, directly matching each prompt to the same visual feature is
problematic, as it pushes the prompts to converge to one point. To solve this
problem, we propose to apply optimal transport to match the vision and text
modalities. Specifically, we first model images and the categories with visual
and textual feature sets. Then, we apply a two-stage optimization strategy to
learn the prompts. In the inner loop, we optimize the optimal transport
distance to align visual features and prompts by the Sinkhorn algorithm, while
in the outer loop, we learn the prompts by this distance from the supervised
data. Extensive experiments are conducted on the few-shot recognition task, and
the improvements demonstrate the superiority of our method.
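To make the two-stage optimization concrete, below is a minimal PyTorch sketch of the idea described in the abstract: an inner loop that runs Sinkhorn iterations on a fixed cost matrix between local visual features and a class's prompt features, and an outer step that backpropagates the resulting transport distance to the prompts. The shapes, hyperparameters (eps, n_iters), and random stand-in features are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def sinkhorn(cost, eps=0.1, n_iters=50):
    """Entropic-regularized OT plan for one (M x N) cost matrix.
    Uniform marginals over M visual features and N prompts are assumed."""
    M, N = cost.shape
    mu = torch.full((M,), 1.0 / M)
    nu = torch.full((N,), 1.0 / N)
    K = torch.exp(-cost / eps)              # Gibbs kernel
    u = torch.ones_like(mu)
    for _ in range(n_iters):                # inner loop: Sinkhorn iterations
        v = nu / (K.t() @ u)
        u = mu / (K @ v)
    return torch.diag(u) @ K @ torch.diag(v)  # transport plan T

def ot_distance(visual_feats, prompt_feats):
    """OT distance between local visual features (M x D) and one class's
    prompt features (N x D); features are assumed L2-normalized, so the
    cost is 1 - cosine similarity."""
    cost = 1.0 - visual_feats @ prompt_feats.t()
    T = sinkhorn(cost.detach())             # plan computed with features fixed
    return (T * cost).sum()                 # outer step backpropagates through the cost

# Toy usage: class logits from negative OT distances.
M, N, D, C = 49, 4, 512, 10
visual = F.normalize(torch.randn(M, D), dim=-1)
prompts = F.normalize(torch.randn(C, N, D), dim=-1)
logits = torch.stack([-ot_distance(visual, prompts[c]) for c in range(C)])
```

In the full method, the prompt features would be produced by CLIP's text encoder from learnable prompt contexts and the negative distances would feed a cross-entropy loss on the supervised data; here random tensors stand in for both modalities.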
Related papers
- Analogist: Out-of-the-box Visual In-Context Learning with Image Diffusion Model [25.47573567479831]
We propose a novel inference-based visual ICL approach that exploits both visual and textual prompting techniques.
Our method is out-of-the-box and does not require fine-tuning or optimization.
arXiv Detail & Related papers (2024-05-16T17:59:21Z)
- Tuning Multi-mode Token-level Prompt Alignment across Modalities [48.39511580746271]
We propose a multi-mode token-level tuning framework to learn and align a set of prompt tokens across modalities.
Specifically, we rely on two essential factors: 1) multi-mode prompts discovery, which guarantees diverse semantic representations, and 2) token-level alignment, which helps explore fine-grained similarity.
Experiments on popular image recognition benchmarks show the superior generalization and few-shot abilities of our approach.
arXiv Detail & Related papers (2023-09-25T03:20:09Z)
- DPL: Decoupled Prompt Learning for Vision-Language Models [41.90997623029582]
We propose a new method, Decoupled Prompt Learning, which reformulates the attention in prompt learning to alleviate this problem.
Our approach is flexible for both visual and textual modalities, making it easily extendable to multi-modal prompt learning.
arXiv Detail & Related papers (2023-08-19T15:48:38Z)
- ConES: Concept Embedding Search for Parameter Efficient Tuning Large Vision Language Models [21.15548013842187]
We propose a Concept Embedding Search (ConES) approach by optimizing prompt embeddings.
By dropping the text encoder, we are able to significantly speed up the learning process.
Our approach outperforms prompt tuning and textual inversion methods on a variety of downstream tasks.
arXiv Detail & Related papers (2023-05-30T12:45:49Z)
- Multi-Prompt with Depth Partitioned Cross-Modal Learning [25.239388488952375]
Partitioned Multi-modal Prompt (PMPO) is a multi-modal prompting technique that extends the soft prompt from a single learnable prompt to multiple prompts.
Our method divides the visual encoder depths and connects learnable prompts to the separated visual depths, enabling different prompts to capture hierarchical contextual depths.
We evaluate the effectiveness of our approach on three challenging tasks: new class generalization, cross-dataset evaluation, and domain generalization.
arXiv Detail & Related papers (2023-05-10T14:54:29Z)
- Task-Oriented Multi-Modal Mutual Learning for Vision-Language Models [52.3032592038514]
We propose a class-aware text prompt to enrich generated prompts with label-related image information.
We achieve an average improvement of 4.03% on new classes and 3.19% on the harmonic mean over eleven classification benchmarks.
arXiv Detail & Related papers (2023-03-30T06:02:40Z)
- CPL: Counterfactual Prompt Learning for Vision and Language Models [76.18024920393245]
This paper presents a novel Counterfactual Prompt Learning (CPL) method for vision and language models.
CPL simultaneously employs counterfactual generation and contrastive learning in a joint optimization framework.
Experiments demonstrate that CPL can obtain superior few-shot performance on different vision and language tasks.
arXiv Detail & Related papers (2022-10-19T08:06:39Z)
- Unified Vision and Language Prompt Learning [86.1530128487077]
We present a systematic study on two representative prompt tuning methods, namely text prompt tuning and visual prompt tuning.
A major finding is that text prompt tuning fails on data with high intra-class visual variances while visual prompt tuning cannot handle low inter-class variances.
To combine the best from both worlds, we propose a simple approach called Unified Prompt Tuning (UPT), which essentially learns a tiny neural network to jointly optimize prompts across different modalities.
arXiv Detail & Related papers (2022-10-13T17:50:24Z)
- MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z)
- CLIP-Adapter: Better Vision-Language Models with Feature Adapters [79.52844563138493]
We show that there is an alternative path to achieve better vision-language models other than prompt tuning.
In this paper, we propose CLIP-Adapter to conduct fine-tuning with feature adapters on either visual or language branch.
Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2021-10-09T11:39:30Z)
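For contrast with prompt tuning, the adapter idea in the last entry can be sketched as a small residual bottleneck applied to frozen CLIP features. The hidden width, blend ratio, and random stand-in features below are illustrative assumptions rather than the settings used in the CLIP-Adapter paper.

```python
import torch
import torch.nn as nn

class FeatureAdapter(nn.Module):
    """Bottleneck MLP applied to frozen CLIP features, blended back residually.
    Hidden size and blend ratio are illustrative choices."""
    def __init__(self, dim=512, hidden=128, blend=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, dim), nn.ReLU(inplace=True),
        )
        self.blend = blend

    def forward(self, feats):
        adapted = self.net(feats)
        out = self.blend * adapted + (1 - self.blend) * feats
        return nn.functional.normalize(out, dim=-1)

# Usage: adapt image features while keeping text features fixed;
# random tensors stand in for frozen CLIP outputs.
image_feats = nn.functional.normalize(torch.randn(8, 512), dim=-1)
text_feats = nn.functional.normalize(torch.randn(10, 512), dim=-1)
logits = 100.0 * FeatureAdapter()(image_feats) @ text_feats.t()
```

Only the adapter's parameters would be trained, which is what makes this route an alternative to learning prompts.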