Synergistic Prompting for Robust Visual Recognition with Missing Modalities
- URL: http://arxiv.org/abs/2507.07802v2
- Date: Fri, 11 Jul 2025 16:27:52 GMT
- Title: Synergistic Prompting for Robust Visual Recognition with Missing Modalities
- Authors: Zhihui Zhang, Luanyuan Dai, Qika Lin, Yunfeng Diao, Guangyin Jin, Yufei Guo, Jing Zhang, Xiaoshuai Hao,
- Abstract summary: Large-scale multi-modal models have demonstrated remarkable performance across various visual recognition tasks. The presence of missing or incomplete modality inputs often leads to significant performance degradation. We propose a novel Synergistic Prompting framework for robust visual recognition with missing modalities.
- Score: 13.821274074204082
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large-scale multi-modal models have demonstrated remarkable performance across various visual recognition tasks by leveraging extensive paired multi-modal training data. However, in real-world applications, the presence of missing or incomplete modality inputs often leads to significant performance degradation. Recent research has focused on prompt-based strategies to tackle this issue; however, existing methods are hindered by two major limitations: (1) static prompts lack the flexibility to adapt to varying missing-data conditions, and (2) basic prompt-tuning methods struggle to ensure reliable performance when critical modalities are missing. To address these challenges, we propose a novel Synergistic Prompting (SyP) framework for robust visual recognition with missing modalities. The proposed SyP introduces two key innovations: (I) a Dynamic Adapter, which computes adaptive scaling factors to dynamically generate prompts, replacing static parameters for flexible multi-modal adaptation, and (II) a Synergistic Prompting Strategy, which combines static and dynamic prompts to balance information across modalities, ensuring robust reasoning even when key modalities are missing. The proposed SyP achieves significant performance improvements over existing approaches across three widely-used visual recognition datasets, demonstrating robustness under diverse missing rates and conditions. Extensive experiments and ablation studies validate its effectiveness in handling missing modalities, highlighting its superior adaptability and reliability.
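The following is a minimal, illustrative PyTorch sketch of the two ideas named in the abstract: a Dynamic Adapter that derives per-sample scaling factors to generate dynamic prompts, and a synergistic combination of those dynamic prompts with a static learnable prompt. All module names, dimensions, and the fusion rule below are assumptions made for illustration, not the paper's actual implementation.

```python
# Hedged sketch of the SyP idea described in the abstract; names, shapes,
# and the combination rule are illustrative assumptions.
import torch
import torch.nn as nn


class DynamicAdapter(nn.Module):
    """Computes per-sample scaling factors from pooled features of the
    available modalities and uses them to generate dynamic prompts."""

    def __init__(self, feat_dim: int, prompt_len: int, prompt_dim: int):
        super().__init__()
        self.scale = nn.Sequential(          # adaptive scaling factors
            nn.Linear(feat_dim, prompt_len),
            nn.Sigmoid(),
        )
        self.proj = nn.Linear(feat_dim, prompt_len * prompt_dim)
        self.prompt_len, self.prompt_dim = prompt_len, prompt_dim

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, feat_dim) pooled features of whichever modalities are present
        scale = self.scale(feats).unsqueeze(-1)                           # (B, L, 1)
        prompts = self.proj(feats).view(-1, self.prompt_len, self.prompt_dim)
        return scale * prompts                                            # (B, L, D)


class SynergisticPrompting(nn.Module):
    """Combines a static (learnable) prompt with the dynamic prompt so the
    model still receives usable conditioning when a key modality is missing."""

    def __init__(self, feat_dim: int = 512, prompt_len: int = 8, prompt_dim: int = 768):
        super().__init__()
        self.static_prompt = nn.Parameter(torch.randn(prompt_len, prompt_dim) * 0.02)
        self.adapter = DynamicAdapter(feat_dim, prompt_len, prompt_dim)

    def forward(self, feats: torch.Tensor, available: torch.Tensor) -> torch.Tensor:
        # available: (B,) fraction of modalities actually observed, in [0, 1]
        dynamic = self.adapter(feats)
        static = self.static_prompt.unsqueeze(0).expand_as(dynamic)
        w = available.view(-1, 1, 1)
        # lean on the static prompt as more modalities are missing (assumed rule)
        return w * dynamic + (1.0 - w) * static


if __name__ == "__main__":
    syp = SynergisticPrompting()
    pooled = torch.randn(4, 512)                    # fused features of available modalities
    available = torch.tensor([1.0, 0.5, 0.5, 0.0])  # both / one / one / no modality present
    print(syp(pooled, available).shape)             # torch.Size([4, 8, 768])
```

The resulting prompts would typically be prepended to the token sequence of a frozen multi-modal backbone; that integration step is omitted here.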
Related papers
- Advancing Reliable Test-Time Adaptation of Vision-Language Models under Visual Variations [67.35596444651037]
Vision-language models (VLMs) exhibit remarkable zero-shot capabilities but struggle with distribution shifts in downstream tasks when labeled data is unavailable.
We propose a Reliable Test-time Adaptation (ReTA) method that enhances reliability from two perspectives.
arXiv Detail & Related papers (2025-07-13T05:37:33Z) - Disentangling and Generating Modalities for Recommendation in Missing Modality Scenarios [21.73914052076956]
We propose Disentangling and Generating Modality Recommender (DGMRec) for missing modality scenarios.
DGMRec disentangles modality features into general and specific modality features from an information-based perspective.
It consistently outperforms state-of-the-art MRSs in challenging scenarios.
arXiv Detail & Related papers (2025-04-23T02:04:14Z) - Retrieval-Augmented Dynamic Prompt Tuning for Incomplete Multimodal Learning [27.867369806400834]
We propose RAGPT, a novel Retrieval-AuGmented dynamic Prompt Tuning framework.
RAGPT comprises three modules: (I) the multi-channel retriever, (II) the missing modality generator, and (III) the context-aware prompter.
Experiments conducted on three real-world datasets show that RAGPT consistently outperforms all competitive baselines in handling incomplete modality problems.
arXiv Detail & Related papers (2025-01-02T07:39:48Z) - Multimodal Prompt Learning with Missing Modalities for Sentiment Analysis and Emotion Recognition [52.522244807811894]
We propose a novel multimodal Transformer framework using prompt learning to address the issue of missing modalities.
Our method introduces three types of prompts: generative prompts, missing-signal prompts, and missing-type prompts.
Through prompt learning, we achieve a substantial reduction in the number of trainable parameters.
arXiv Detail & Related papers (2024-07-07T13:55:56Z) - Exploring Missing Modality in Multimodal Egocentric Datasets [89.76463983679058]
We introduce a novel concept, the Missing Modality Token (MMT), to maintain performance even when modalities are absent (an illustrative sketch of this idea appears after this list).
Our method mitigates the performance loss, reducing it from its original $\sim 30\%$ drop to only $\sim 10\%$ when half of the test set is modal-incomplete.
arXiv Detail & Related papers (2024-01-21T11:55:42Z) - Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z) - Visual Prompt Flexible-Modal Face Anti-Spoofing [23.58674017653937]
Multimodal face data collected from the real world is often imperfect due to missing modalities from various imaging sensors.
We propose flexible-modal FAS, which learns modal-relevant prompts to adapt the frozen pre-trained foundation model to the downstream flexible-modal FAS task.
Experiments conducted on two multimodal FAS benchmark datasets demonstrate the effectiveness of our VP-FAS framework.
arXiv Detail & Related papers (2023-07-26T05:06:41Z) - Flexible-modal Deception Detection with Audio-Visual Adapter [20.6514221670249]
We propose a novel framework to fuse temporal features across two modalities efficiently.
Experiments conducted on two benchmark datasets demonstrate that the proposed method can achieve superior performance.
arXiv Detail & Related papers (2023-02-11T15:47:20Z) - Exploiting modality-invariant feature for robust multimodal emotion recognition with missing modalities [76.08541852988536]
We propose to use invariant features for a missing modality imagination network (IF-MMIN).
We show that the proposed model outperforms all baselines and invariantly improves the overall emotion recognition performance under uncertain missing-modality conditions.
arXiv Detail & Related papers (2022-10-27T12:16:25Z) - Self-attention fusion for audiovisual emotion recognition with incomplete data [103.70855797025689]
We consider the problem of multimodal data analysis with a use case of audiovisual emotion recognition.
We propose an architecture capable of learning from raw data and describe three variants of it with distinct modality fusion mechanisms.
arXiv Detail & Related papers (2022-01-26T18:04:29Z)
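The Missing Modality Token entry above describes a simple mechanism for tolerating absent inputs. The PyTorch sketch below illustrates that general idea under the assumption that a learnable per-modality embedding is substituted for the features of any missing modality before fusion; the shapes and the substitution step are illustrative and are not the cited paper's implementation.

```python
# Hedged sketch of a missing-modality-token mechanism: a learnable embedding
# stands in for the features of an absent modality. Details are assumptions.
import torch
import torch.nn as nn


class MissingModalityToken(nn.Module):
    def __init__(self, num_modalities: int = 2, dim: int = 768):
        super().__init__()
        # one learnable token per modality, used when that modality is absent
        self.tokens = nn.Parameter(torch.randn(num_modalities, dim) * 0.02)

    def forward(self, feats: torch.Tensor, present: torch.Tensor) -> torch.Tensor:
        # feats:   (B, M, D) per-modality features (arbitrary values where missing)
        # present: (B, M) boolean mask, True where the modality was observed
        fill = self.tokens.unsqueeze(0).expand_as(feats)
        return torch.where(present.unsqueeze(-1), feats, fill)


if __name__ == "__main__":
    mmt = MissingModalityToken()
    feats = torch.randn(3, 2, 768)
    present = torch.tensor([[True, True], [True, False], [False, True]])
    completed = mmt(feats, present)     # missing slots replaced by learnable tokens
    print(completed.shape)              # torch.Size([3, 2, 768])
```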