Tuning Multi-mode Token-level Prompt Alignment across Modalities
- URL: http://arxiv.org/abs/2309.13847v2
- Date: Thu, 26 Oct 2023 11:04:19 GMT
- Title: Tuning Multi-mode Token-level Prompt Alignment across Modalities
- Authors: Dongsheng Wang, Miaoge Li, Xinyang Liu, MingSheng Xu, Bo Chen, Hanwang Zhang
- Abstract summary: We propose a multi-mode token-level tuning framework to learn and align a set of prompt tokens across modalities.
Specifically, we rely on two essential factors: 1) multi-mode prompt discovery, which guarantees diverse semantic representations, and 2) token-level alignment, which helps explore fine-grained similarity.
Experiments on popular image recognition benchmarks show the superior generalization and few-shot abilities of our approach.
- Score: 48.39511580746271
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advancements in prompt tuning of vision-language models have underscored
their potential in enhancing open-world visual concept comprehension. However,
prior works focus primarily on single-mode (only one prompt per modality) and
holistic-level (image or sentence) semantic alignment, which fails to capture
sample diversity, leading to sub-optimal prompt discovery.
To address this limitation, we propose a multi-mode token-level tuning framework
that leverages optimal transport to learn and align a set of prompt tokens
across modalities. Specifically, we rely on two essential factors: 1) multi-mode
prompt discovery, which guarantees diverse semantic representations, and 2)
token-level alignment, which helps explore fine-grained similarity. The
cross-modal similarity can then be calculated as a hierarchical
transportation problem between the modality-specific sets. Extensive
experiments on popular image recognition benchmarks show the superior
generalization and few-shot abilities of our approach. Qualitative analysis
shows that the learned prompt tokens can capture diverse visual concepts.
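The hierarchical transport described above can be made concrete with a small sketch. The code below is only an illustration under our own assumptions (uniform mass on tokens, a cosine-distance cost, and invented function names), not the authors' implementation, and it covers only the inner, token-level step: entropic optimal transport solved with Sinkhorn iterations scores how well one prompt's tokens align with one image's visual tokens. The outer, set-level transport over multiple prompt modes would be layered on top of these token-level scores.

```python
# Minimal sketch (not the authors' code) of token-level alignment via
# entropic optimal transport: the cost matrix compares every prompt token
# with every visual token, and Sinkhorn iterations produce a transport plan
# whose total cost serves as the cross-modal similarity.
import numpy as np

def sinkhorn(cost, eps=0.1, n_iters=100):
    """Entropic-regularized OT between two uniform token distributions."""
    n, m = cost.shape
    mu = np.full(n, 1.0 / n)          # mass on prompt tokens
    nu = np.full(m, 1.0 / m)          # mass on visual tokens
    K = np.exp(-cost / eps)           # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iters):
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    plan = u[:, None] * K * v[None, :]    # transport plan
    return float(np.sum(plan * cost))     # OT distance

def token_level_similarity(prompt_tokens, visual_tokens):
    """Hypothetical score between one prompt (its token embeddings) and one
    image (its patch embeddings); higher means better aligned."""
    p = prompt_tokens / np.linalg.norm(prompt_tokens, axis=1, keepdims=True)
    v = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    cost = 1.0 - p @ v.T                  # cosine distance per token pair
    return -sinkhorn(cost)

# Toy usage: 4 prompt tokens and 6 visual tokens in a 16-dim space.
rng = np.random.default_rng(0)
print(token_level_similarity(rng.normal(size=(4, 16)),
                             rng.normal(size=(6, 16))))
```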
Related papers
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- Disentangling Multi-view Representations Beyond Inductive Bias [32.15900989696017]
We propose a novel multi-view representation disentangling method that ensures both interpretability and generalizability of the resulting representations.
Our experiments on four multi-view datasets demonstrate that our proposed method outperforms 12 comparison methods in terms of clustering and classification performance.
arXiv Detail & Related papers (2023-08-03T09:09:28Z)
- MuDPT: Multi-modal Deep-symphysis Prompt Tuning for Large Pre-trained Vision-Language Models [12.397136690734865]
We propose a novel approach called Multi-modal Deep-symphysis Prompt Tuning, dubbed MuDPT.
MuDPT extends independent multi-modal prompt tuning by learning a model-agnostic transformative network that allows deep hierarchical bi-directional prompt fusion.
Compared with state-of-the-art methods, MuDPT achieves better recognition and generalization ability by a clear margin.
arXiv Detail & Related papers (2023-06-20T09:15:52Z)
- Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens [76.40196364163663]
We propose a learning-based vision-language pre-training approach that, building on contrastive models such as CLIP, represents both modalities with finite discrete tokens.
We show that our method can learn more comprehensive representations and capture meaningful cross-modal correspondence.
arXiv Detail & Related papers (2023-03-27T00:58:39Z)
- Patch-Prompt Aligned Bayesian Prompt Tuning for Vision-Language Models [48.77653835765705]
We introduce a probabilistic resolution to prompt tuning, where the label-specific prompts are generated hierarchically by first sampling a latent vector from an underlying distribution and then employing a lightweight generative model.
We evaluate the effectiveness of our approach on four tasks: few-shot image recognition, base-to-new generalization, dataset transfer learning, and domain shifts.
arXiv Detail & Related papers (2023-03-16T06:09:15Z)
- Prompt Learning with Optimal Transport for Vision-Language Models [25.928455328563402]
We learn multiple comprehensive prompts to describe diverse characteristics of categories, such as intrinsic attributes or extrinsic contexts.
To match these prompts to an image's visual features, we propose to apply optimal transport between the vision and text modalities.
In the inner loop, we optimize the optimal transport distance to align visual features and prompts via the Sinkhorn algorithm, while in the outer loop we learn the prompts from the supervised data using this distance.
arXiv Detail & Related papers (2022-10-03T22:21:07Z)
- Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single-stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and pseudo-labeled keyword prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z)
- Cross-Modal Discrete Representation Learning [73.68393416984618]
We present a self-supervised learning framework that learns a representation that captures finer levels of granularity across different modalities.
Our framework relies on a discretized embedding space created via vector quantization that is shared across different modalities.
arXiv Detail & Related papers (2021-06-10T00:23:33Z)
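The discretized, modality-shared embedding space mentioned in the last entry can likewise be illustrated with a short vector-quantization sketch. This is a hedged example under our own assumptions (a random codebook, nearest-neighbour assignment, and an invented class name), not the paper's implementation: embeddings from either modality are snapped to the nearest codeword of a single shared codebook, so both modalities are expressed over the same discrete ids.

```python
# Minimal sketch (assumptions ours, not the paper's code) of a codebook
# shared across modalities: embeddings from either encoder are mapped to
# their nearest codeword, so image and text land in one discrete space.
import numpy as np

class SharedCodebook:
    def __init__(self, num_codes=512, dim=256, seed=0):
        rng = np.random.default_rng(seed)
        self.codes = rng.normal(size=(num_codes, dim))  # learned in practice

    def quantize(self, embeddings):
        """Return the nearest-codeword index and vector for each embedding row."""
        # Squared Euclidean distance from every embedding to every codeword.
        d = ((embeddings[:, None, :] - self.codes[None, :, :]) ** 2).sum(-1)
        idx = d.argmin(axis=1)
        return idx, self.codes[idx]

# Toy usage: image tokens and text tokens share the same discrete vocabulary.
cb = SharedCodebook()
img_ids, _ = cb.quantize(np.random.default_rng(1).normal(size=(10, 256)))
txt_ids, _ = cb.quantize(np.random.default_rng(2).normal(size=(7, 256)))
print(img_ids, txt_ids)
```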