Retrieval-Augmented Dynamic Prompt Tuning for Incomplete Multimodal Learning
- URL: http://arxiv.org/abs/2501.01120v1
- Date: Thu, 02 Jan 2025 07:39:48 GMT
- Title: Retrieval-Augmented Dynamic Prompt Tuning for Incomplete Multimodal Learning
- Authors: Jian Lang, Zhangtao Cheng, Ting Zhong, Fan Zhou
- Abstract summary: We propose RAGPT, a novel Retrieval-AuGmented dynamic Prompt Tuning framework.
RAGPT comprises three modules: (I) the multi-channel retriever, (II) the missing modality generator, and (III) the context-aware prompter.
Experiments conducted on three real-world datasets show that RAGPT consistently outperforms all competitive baselines in handling incomplete modality problems.
- Score: 27.867369806400834
- License:
- Abstract: Multimodal learning with incomplete modality is practical and challenging. Recently, researchers have focused on enhancing the robustness of pre-trained MultiModal Transformers (MMTs) under missing modality conditions by applying learnable prompts. However, these prompt-based methods face several limitations: (1) incomplete modalities provide restricted modal cues for task-specific inference, (2) dummy imputation for missing content causes information loss and introduces noise, and (3) static prompts are instance-agnostic, offering limited knowledge for instances with various missing conditions. To address these issues, we propose RAGPT, a novel Retrieval-AuGmented dynamic Prompt Tuning framework. RAGPT comprises three modules: (I) the multi-channel retriever, which identifies similar instances through a within-modality retrieval strategy, (II) the missing modality generator, which recovers missing information using retrieved contexts, and (III) the context-aware prompter, which captures contextual knowledge from relevant instances and generates dynamic prompts to largely enhance the MMT's robustness. Extensive experiments conducted on three real-world datasets show that RAGPT consistently outperforms all competitive baselines in handling incomplete modality problems. The code of our work and prompt-based baselines is available at https://github.com/Jian-Lang/RAGPT.
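The abstract describes a three-module pipeline: retrieve similar instances within each observed modality, impute the missing modality from the retrieved context, and condition dynamic prompts on that context before feeding the multimodal transformer. Below is a minimal, illustrative sketch of how such a pipeline could be wired together. The interfaces, the cosine-similarity retrieval, and the weighted-average imputation are assumptions made for illustration, not the authors' implementation; the repository linked above is authoritative.

```python
# Minimal sketch of a retrieval-augmented dynamic prompt-tuning pipeline.
# All interfaces below (retriever, generator, prompter) are hypothetical
# illustrations of the three modules named in the abstract, not the
# official RAGPT code (see https://github.com/Jian-Lang/RAGPT).
import torch
import torch.nn.functional as F


def retrieve_neighbors(query_feat, memory_feats, k=5):
    """(I) Multi-channel retriever: within-modality nearest neighbors by cosine similarity."""
    sims = F.cosine_similarity(query_feat.unsqueeze(0), memory_feats, dim=-1)
    topk = sims.topk(k)
    return topk.indices, topk.values


def generate_missing_modality(neighbor_feats, weights):
    """(II) Missing-modality generator: impute the absent modality as a
    similarity-weighted average of the retrieved neighbors' features."""
    w = torch.softmax(weights, dim=0)
    return (w.unsqueeze(-1) * neighbor_feats).sum(dim=0)


class ContextAwarePrompter(torch.nn.Module):
    """(III) Context-aware prompter: map retrieved context into dynamic prompt tokens."""
    def __init__(self, dim, prompt_len=4):
        super().__init__()
        self.proj = torch.nn.Linear(dim, prompt_len * dim)
        self.prompt_len = prompt_len

    def forward(self, context_feat):
        return self.proj(context_feat).view(self.prompt_len, -1)


# Toy usage: a sample whose image modality is missing.
dim = 64
text_feat = torch.randn(dim)            # available text feature of the query instance
text_memory = torch.randn(100, dim)     # text features of stored instances
image_memory = torch.randn(100, dim)    # paired image features of the same instances

idx, sims = retrieve_neighbors(text_feat, text_memory)       # retrieve by the observed modality
imputed_image = generate_missing_modality(image_memory[idx], sims)
prompter = ContextAwarePrompter(dim)
dynamic_prompts = prompter(text_memory[idx].mean(dim=0))     # prompts conditioned on retrieved context
mmt_input = torch.cat([dynamic_prompts, text_feat.unsqueeze(0),
                       imputed_image.unsqueeze(0)], dim=0)   # prepend prompts to the MMT input sequence
```

In practice the retrieval memory would be built from training-set features and the generator and prompter trained end-to-end with the downstream task head; those details are specified by the official code, not this sketch.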
Related papers
- Towards Modality Generalization: A Benchmark and Prospective Analysis [56.84045461854789]
This paper introduces Modality Generalization (MG), which focuses on enabling models to generalize to unseen modalities.
We propose a comprehensive benchmark featuring multi-modal algorithms and adapt existing methods that focus on generalization.
Our work provides a foundation for advancing robust and adaptable multi-modal models, enabling them to handle unseen modalities in realistic scenarios.
arXiv Detail & Related papers (2024-12-24T08:38:35Z)
- Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent [102.31558123570437]
Multimodal Retrieval Augmented Generation (mRAG) plays an important role in mitigating the "hallucination" issue inherent in multimodal large language models (MLLMs).
We propose the first self-adaptive planning agent for multimodal retrieval, OmniSearch.
arXiv Detail & Related papers (2024-11-05T09:27:21Z)
- RoRA-VLM: Robust Retrieval-Augmented Vision Language Models [41.09545760534495]
RORA-VLM is a novel and robust retrieval augmentation framework specifically tailored for vision-language models.
We conduct extensive experiments to validate the effectiveness and robustness of our proposed methods on three widely adopted benchmark datasets.
arXiv Detail & Related papers (2024-10-11T14:51:00Z)
- MuAP: Multi-step Adaptive Prompt Learning for Vision-Language Model with Missing Modality [11.03329286331929]
We present the first comprehensive investigation into prompt learning behavior when modalities are incomplete.
We propose a novel Multi-step Adaptive Prompt Learning framework, aiming to generate multimodal prompts and perform multi-step prompt tuning.
arXiv Detail & Related papers (2024-09-07T03:33:46Z)
- Multimodal Prompt Learning with Missing Modalities for Sentiment Analysis and Emotion Recognition [52.522244807811894]
We propose a novel multimodal Transformer framework using prompt learning to address the issue of missing modalities.
Our method introduces three types of prompts: generative prompts, missing-signal prompts, and missing-type prompts.
Through prompt learning, we achieve a substantial reduction in the number of trainable parameters.
arXiv Detail & Related papers (2024-07-07T13:55:56Z)
- Modality-missing RGBT Tracking: Invertible Prompt Learning and High-quality Benchmarks [21.139161163767884]
Modality information can be missing due to factors such as thermal sensor self-calibration and data transmission errors.
We propose a novel invertible prompt learning approach, which integrates the content-preserving prompts into a well-trained tracking model.
Our method achieves significant performance improvements compared with state-of-the-art methods.
arXiv Detail & Related papers (2023-12-25T11:39:00Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- End-to-end Knowledge Retrieval with Multi-modal Queries [50.01264794081951]
ReMuQ requires a system to retrieve knowledge from a large corpus by integrating content from both text and image queries.
We introduce a retriever model, ReViz, that can directly process input text and images to retrieve relevant knowledge in an end-to-end fashion.
We demonstrate superior performance in retrieval on two datasets under zero-shot settings.
arXiv Detail & Related papers (2023-06-01T08:04:12Z)
- Multimodal Prompting with Missing Modalities for Visual Recognition [40.961534960897595]
We tackle two challenges in multimodal learning for visual recognition: 1) modalities missing during training or testing in real-world situations; and 2) computation resources that are insufficient to finetune heavy transformer models.
Specifically, our modality-missing-aware prompts can be plugged into multimodal transformers to handle general missing-modality cases, while requiring less than 1% of the learnable parameters needed to train the entire model (a minimal sketch of this prompting mechanism appears after the list below).
arXiv Detail & Related papers (2023-03-06T18:54:46Z)
- Exploiting modality-invariant feature for robust multimodal emotion recognition with missing modalities [76.08541852988536]
We propose to use invariant features for a missing-modality imagination network (IF-MMIN).
We show that the proposed model outperforms all baselines and consistently improves overall emotion recognition performance under uncertain missing-modality conditions.
arXiv Detail & Related papers (2022-10-27T12:16:25Z)
- MM-Align: Learning Optimal Transport-based Alignment Dynamics for Fast and Accurate Inference on Missing Modality Sequences [32.42505193560884]
We present a novel approach named MM-Align to address the missing-modality inference problem.
MM-Align learns to capture and imitate the alignment dynamics between modality sequences.
Our method can perform more accurate and faster inference and relieve overfitting under various missing conditions.
arXiv Detail & Related papers (2022-10-23T17:44:56Z)
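Several of the entries above, and the prompt-based baselines compared against in the main paper, rely on missing-modality-aware prompting: a small bank of learnable prompt tokens is selected according to which modality is absent and prepended to a frozen multimodal transformer's input. The sketch below illustrates that general mechanism only; the class name, prompt length, and case labels are assumptions, not the exact implementation of any paper listed here.

```python
# Minimal sketch of missing-modality-aware prompting: a small set of learnable
# prompt tokens is chosen per missing-modality case and prepended to the token
# sequence of a frozen multimodal transformer. Names and interfaces are
# illustrative assumptions, not any listed paper's exact implementation.
import torch


class MissingAwarePrompts(torch.nn.Module):
    def __init__(self, dim, prompt_len=4,
                 cases=("complete", "missing_text", "missing_image")):
        super().__init__()
        # One learnable prompt bank per missing-modality case; only these
        # parameters are trained, keeping the footprint tiny relative to the backbone.
        self.prompts = torch.nn.ParameterDict({
            c: torch.nn.Parameter(torch.randn(prompt_len, dim) * 0.02) for c in cases
        })

    def forward(self, tokens, case):
        # Prepend the case-specific prompts to each sequence in the batch.
        bank = self.prompts[case].unsqueeze(0).expand(tokens.size(0), -1, -1)
        return torch.cat([bank, tokens], dim=1)


# Toy usage: a batch of 2 sequences of 16 tokens (width 64) with the image modality missing.
tokens = torch.randn(2, 16, 64)
prompter = MissingAwarePrompts(dim=64)
augmented = prompter(tokens, case="missing_image")   # shape: (2, 20, 64)
```

Because only the prompt banks are trainable, the added parameter count stays far below full fine-tuning of the backbone, which is the efficiency argument made in the "Multimodal Prompting with Missing Modalities for Visual Recognition" entry above.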