DMPT: Decoupled Modality-aware Prompt Tuning for Multi-modal Object Re-identification
- URL: http://arxiv.org/abs/2504.10985v1
- Date: Tue, 15 Apr 2025 08:48:41 GMT
- Title: DMPT: Decoupled Modality-aware Prompt Tuning for Multi-modal Object Re-identification
- Authors: Minghui Lin, Shu Wang, Xiang Wang, Jianhua Tang, Longbin Fu, Zhengrong Zuo, Nong Sang,
- Abstract summary: We propose an efficient prompt-tuning framework tailored for multi-modal object re-identification.<n>Our framework freezes the main backbone and only optimize several newly added decoupled modality-aware parameters.<n>We show that our framework can achieve competitive results to existing state-of-the-art methods while requiring only 6.5% fine-tuning of the backbone parameters.
- Score: 28.391851855141976
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current multi-modal object re-identification approaches based on large-scale pre-trained backbones (i.e., ViT) have displayed remarkable progress and achieved excellent performance. However, these methods usually adopt the standard full fine-tuning paradigm, which requires the optimization of considerable backbone parameters, causing extensive computational and storage requirements. In this work, we propose an efficient prompt-tuning framework tailored for multi-modal object re-identification, dubbed DMPT, which freezes the main backbone and only optimizes several newly added decoupled modality-aware parameters. Specifically, we explicitly decouple the visual prompts into modality-specific prompts which leverage prior modality knowledge from a powerful text encoder and modality-independent semantic prompts which extract semantic information from multi-modal inputs, such as visible, near-infrared, and thermal-infrared. Built upon the extracted features, we further design a Prompt Inverse Bind (PromptIBind) strategy that employs bind prompts as a medium to connect the semantic prompt tokens of different modalities and facilitates the exchange of complementary multi-modal information, boosting final re-identification results. Experimental results on multiple common benchmarks demonstrate that our DMPT can achieve competitive results to existing state-of-the-art methods while requiring only 6.5% fine-tuning of the backbone parameters.
Related papers
- IDEA: Inverted Text with Cooperative Deformable Aggregation for Multi-modal Object Re-Identification [60.38841251693781]
We propose a novel framework to generate robust multi-modal object ReIDs.
Our framework uses Modal Prefixes and InverseNet to integrate multi-modal information with semantic guidance from inverted text.
Experiments on three multi-modal object ReID benchmarks demonstrate the effectiveness of our proposed method.
arXiv Detail & Related papers (2025-03-13T13:00:31Z) - MambaPro: Multi-Modal Object Re-Identification with Mamba Aggregation and Synergistic Prompt [60.10555128510744]
Multi-modal object Re-IDentification (ReID) aims to retrieve specific objects by utilizing complementary image information from different modalities.<n>Recently, large-scale pre-trained models like CLIP have demonstrated impressive performance in traditional single-modal object ReID tasks.<n>We introduce a novel framework called MambaPro for multi-modal object ReID.
arXiv Detail & Related papers (2024-12-14T06:33:53Z) - Mixture-of-Prompt-Experts for Multi-modal Semantic Understanding [7.329728566839757]
We propose Mixture-of-Prompt-Experts with Block-Aware Prompt Fusion (MoPE-BAF)
MoPE-BAF is a novel multi-modal soft prompt framework based on the unified vision-language model (VLM)
arXiv Detail & Related papers (2024-03-17T19:12:26Z) - MoPE: Mixture of Prompt Experts for Parameter-Efficient and Scalable Multimodal Fusion [29.46189153751869]
Mixture of Prompt Experts (MoPE) is first technique designed to overcome limitations by decomposing standard prompts.<n>Our MoPE-based fusion method exhibits greater expressiveness, scaling more effectively with the training data and the overall number of trainable parameters.
arXiv Detail & Related papers (2024-03-14T17:47:10Z) - DialCLIP: Empowering CLIP as Multi-Modal Dialog Retriever [83.33209603041013]
We propose a parameter-efficient prompt-tuning method named DialCLIP for multi-modal dialog retrieval.
Our approach introduces a multi-modal context generator to learn context features which are distilled into prompts within the pre-trained vision-language model CLIP.
To facilitate various types of retrieval, we also design multiple experts to learn mappings from CLIP outputs to multi-modal representation space.
arXiv Detail & Related papers (2024-01-02T07:40:12Z) - Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z) - Conditional Prompt Tuning for Multimodal Fusion [33.11221356852871]
We show that the representation of one modality can effectively guide the prompting of another modality for parameter-efficient multimodal fusion.
This is achieved by disentangling the vanilla prompt vectors into three types of specialized prompts that adaptively capture global-level and instance-level features.
Our method can effectively transfer the pretrained knowledge in unimodal encoders for downstream multimodal tasks.
arXiv Detail & Related papers (2023-11-28T11:05:20Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation
Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - Diversity-Aware Meta Visual Prompting [111.75306320834629]
We present Diversity-Aware Meta Visual Prompting(DAM-VP), an efficient prompting method for transferring pre-trained models to downstream tasks with frozen backbone.
We cluster the downstream dataset into small subsets in a diversity-strapped way, with each subset has its own prompt separately.
All the prompts are optimized with a meta-prompt, which is learned across several datasets.
arXiv Detail & Related papers (2023-03-14T17:59:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.