MuDPT: Multi-modal Deep-symphysis Prompt Tuning for Large Pre-trained Vision-Language Models
- URL: http://arxiv.org/abs/2306.11400v2
- Date: Sun, 14 Jul 2024 08:08:13 GMT
- Title: MuDPT: Multi-modal Deep-symphysis Prompt Tuning for Large Pre-trained Vision-Language Models
- Authors: Yongzhu Miao, Shasha Li, Jintao Tang, Ting Wang
- Abstract summary: We propose a novel approach called Multi-modal Deep-symphysis Prompt Tuning, dubbed MuDPT.
MuDPT extends independent multi-modal prompt tuning by learning a model-agnostic transformative network to allow deep hierarchical bi-directional prompt fusion.
Compared with state-of-the-art methods, MuDPT achieves better recognition and generalization ability by a clear margin.
- Score: 12.397136690734865
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prompt tuning, like CoOp, has recently shown promising visual recognition and transfer learning ability on various downstream tasks with the emergence of large pre-trained vision-language models like CLIP. However, we identify that existing uni-modal prompt tuning approaches may result in sub-optimal performance, since the uni-modal design breaks the original alignment of textual and visual representations in the pre-trained model. Inspired by the nature of pre-trained vision-language models, we aim to achieve completeness in prompt tuning and propose a novel approach called Multi-modal Deep-symphysis Prompt Tuning, dubbed MuDPT, which extends independent multi-modal prompt tuning by additionally learning a model-agnostic transformative network to allow deep hierarchical bi-directional prompt fusion. We evaluate the effectiveness of MuDPT on few-shot vision recognition and out-of-domain generalization tasks. Compared with state-of-the-art methods, MuDPT achieves better recognition and generalization ability by a clear margin, thanks to the synergistic alignment of textual and visual representations. Our code is available at: https://github.com/Mechrev0/MuDPT.
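The bi-directional fusion idea in the abstract can be illustrated with a minimal numpy sketch. This is not the paper's implementation: the dimensions, the single-linear-map "transformative network", and all variable names are assumptions chosen for illustration; the actual MuDPT network operates inside CLIP's transformer layers.

```python
import numpy as np

# Illustrative dimensions (assumptions, not from the paper):
# CLIP-like text width, vision width, prompts per layer, fused depth.
D_TEXT, D_VIS, N_PROMPT, DEPTH = 512, 768, 4, 3

rng = np.random.default_rng(0)

# The "transformative network" sketched as one linear map per direction,
# shared across layers, so it does not depend on the backbone's weights.
W_t2v = rng.standard_normal((D_TEXT, D_VIS)) * 0.02   # text -> vision projection
W_v2t = rng.standard_normal((D_VIS, D_TEXT)) * 0.02   # vision -> text projection

def fuse(text_prompts, vis_prompts):
    """Bi-directional fusion: each branch's prompts are augmented with a
    projection of the other branch's prompts into its own space."""
    fused_text = text_prompts + vis_prompts @ W_v2t
    fused_vis = vis_prompts + text_prompts @ W_t2v
    return fused_text, fused_vis

# Deep hierarchical fusion: learnable prompts at several layers, fused per layer.
text_prompts = rng.standard_normal((DEPTH, N_PROMPT, D_TEXT))
vis_prompts = rng.standard_normal((DEPTH, N_PROMPT, D_VIS))
for layer in range(DEPTH):
    text_prompts[layer], vis_prompts[layer] = fuse(
        text_prompts[layer], vis_prompts[layer]
    )
```

The point of the sketch is that both branches receive information from the other at every prompted layer, which is what keeps the textual and visual representations aligned during tuning.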
Related papers
- Exploring the Transferability of Visual Prompting for Multimodal Large Language Models [47.162575147632396]
Transferable Visual Prompting (TVP) is a simple and effective approach to generate visual prompts that transfer to different models and improve their performance on downstream tasks after being trained on only one model.
We introduce two strategies to address the issue of cross-model feature corruption of existing visual prompting methods and enhance the transferability of the learned prompts.
arXiv Detail & Related papers (2024-04-17T09:39:07Z) - M2-CLIP: A Multimodal, Multi-task Adapting Framework for Video Action Recognition [39.92547393649842]
We introduce a novel Multimodal, Multi-task CLIP adapting framework named M2-CLIP to address these challenges.
We demonstrate exceptional performance in supervised learning while maintaining strong generalization in zero-shot scenarios.
arXiv Detail & Related papers (2024-01-22T02:03:31Z) - Concept-Guided Prompt Learning for Generalization in Vision-Language Models [33.361744437967126]
We propose Concept-Guided Prompt Learning for vision-language models.
We leverage the well-learned knowledge of Contrastive Language-Image Pretraining to create a visual concept cache.
In order to refine the text features, we develop a projector that transforms multi-level visual features into text features.
arXiv Detail & Related papers (2024-01-15T04:04:47Z) - Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation tasks on NYU Depth V2 and KITTI, and in the semantic segmentation task on Cityscapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z) - Instruction-ViT: Multi-Modal Prompts for Instruction Learning in ViT [58.70209492842953]
In this paper, we focus on adapting prompt design based on instruction tuning into a visual transformer model for image classification.
The key idea is to implement multi-modal prompts related to category information to guide the fine-tuning of the model.
Experiments on several image captioning tasks show improved performance and domain adaptability.
arXiv Detail & Related papers (2023-04-29T08:59:12Z) - Rethinking Visual Prompt Learning as Masked Visual Token Modeling [106.71983630652323]
We propose Visual Prompt learning as masked visual Token Modeling (VPTM) to transform the downstream visual classification into the pre-trained masked visual token prediction.
VPTM is the first visual prompt method on the generative pre-trained visual model, which achieves consistency between pre-training and downstream visual classification by task reformulation.
arXiv Detail & Related papers (2023-03-09T02:43:10Z) - Unified Vision and Language Prompt Learning [86.1530128487077]
We present a systematic study on two representative prompt tuning methods, namely text prompt tuning and visual prompt tuning.
A major finding is that text prompt tuning fails on data with high intra-class visual variances while visual prompt tuning cannot handle low inter-class variances.
To combine the best from both worlds, we propose a simple approach called Unified Prompt Tuning (UPT), which essentially learns a tiny neural network to jointly optimize prompts across different modalities.
arXiv Detail & Related papers (2022-10-13T17:50:24Z) - MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z) - Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model [39.722927180264584]
We propose a novel Dual-modality Prompt Tuning (DPT) paradigm through learning text and visual prompts simultaneously.
To make the final image feature concentrate more on the target visual concept, a Class-Aware Visual Prompt Tuning scheme is proposed.
arXiv Detail & Related papers (2022-08-17T15:06:36Z) - Prompt Tuning for Generative Multimodal Pretrained Models [75.44457974275154]
We implement prompt tuning on the unified sequence-to-sequence pretrained model, adapting it to both understanding and generation tasks.
Experimental results demonstrate that lightweight prompt tuning can achieve performance comparable to finetuning.
In comparison with finetuned models, the prompt-tuned models demonstrate improved robustness against adversarial attacks.
arXiv Detail & Related papers (2022-08-04T08:56:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.