Multimodal Prompting with Missing Modalities for Visual Recognition
- URL: http://arxiv.org/abs/2303.03369v2
- Date: Thu, 9 Mar 2023 18:52:25 GMT
- Title: Multimodal Prompting with Missing Modalities for Visual Recognition
- Authors: Yi-Lun Lee, Yi-Hsuan Tsai, Wei-Chen Chiu, Chen-Yu Lee
- Abstract summary: We tackle two challenges in multimodal learning for visual recognition: 1) when missing-modality occurs during training or testing in real-world situations; and 2) when computation resources are not available to finetune on heavy transformer models.
Specifically, our modality-missing-aware prompts can be plugged into multimodal transformers to handle general missing-modality cases, while requiring less than 1% of the learnable parameters compared to training the entire model.
- Score: 40.961534960897595
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we tackle two challenges in multimodal learning for visual
recognition: 1) when missing-modality occurs either during training or testing
in real-world situations; and 2) when the computation resources are not
available to finetune on heavy transformer models. To this end, we propose to
utilize prompt learning and mitigate the above two challenges together.
Specifically, our modality-missing-aware prompts can be plugged into multimodal
transformers to handle general missing-modality cases, while requiring less
than 1% of the learnable parameters compared to training the entire model. We
further explore the effect of different prompt configurations and analyze the
robustness to missing modality. Extensive experiments are conducted to show the
effectiveness of our prompt learning framework that improves the performance
under various missing-modality cases, while alleviating the requirement of
heavy model re-training. Code is available.
Related papers
- Chameleon: Images Are What You Need For Multimodal Learning Robust To Missing Modalities [17.723207830420996]
Multimodal learning methods often exhibit degraded performance when one or more modalities are missing.
We propose a robust textual-visual multimodal learning method, Chameleon, that completely deviates from the conventional multi-branch design.
Experiments are performed on four popular datasets including Hateful Memes, UPMC Food-101, MM-IMDb, and Ferramenta.
arXiv Detail & Related papers (2024-07-23T07:29:57Z)
- Encapsulating Knowledge in One Prompt [56.31088116526825]
KiOP encapsulates knowledge from various models into a solitary prompt without altering the original models or requiring access to the training data.
From a practical standpoint, this paradigm demonstrates the effectiveness of Visual Prompt in data-inaccessible contexts.
Experiments across various datasets and models demonstrate the efficacy of the proposed KiOP knowledge transfer paradigm.
arXiv Detail & Related papers (2024-07-16T16:35:23Z)
- Multimodal Prompt Learning with Missing Modalities for Sentiment Analysis and Emotion Recognition [52.522244807811894]
We propose a novel multimodal Transformer framework using prompt learning to address the issue of missing modalities.
Our method introduces three types of prompts: generative prompts, missing-signal prompts, and missing-type prompts.
Through prompt learning, we achieve a substantial reduction in the number of trainable parameters.
arXiv Detail & Related papers (2024-07-07T13:55:56Z)
- Combating Missing Modalities in Egocentric Videos at Test Time [92.38662956154256]
Real-world applications often face challenges with incomplete modalities due to privacy concerns, efficiency needs, or hardware issues.
We propose a novel approach to address this issue at test time without requiring retraining.
MiDl represents the first self-supervised, online solution for handling missing modalities exclusively at test time.
arXiv Detail & Related papers (2024-04-23T16:01:33Z)
- Borrowing Treasures from Neighbors: In-Context Learning for Multimodal Learning with Missing Modalities and Data Scarcity [9.811378971225727]
This paper extends the current research into missing modalities to the low-data regime.
It is often expensive to get full-modality data and sufficient annotated training samples.
We propose to use retrieval-augmented in-context learning to address these two crucial issues.
arXiv Detail & Related papers (2024-03-14T14:19:48Z)
- Can Text-to-image Model Assist Multi-modal Learning for Visual Recognition with Visual Modality Missing? [37.73329106465031]
We propose a text-to-image framework GTI-MM to enhance the data efficiency and model robustness against missing visual modality.
Our findings reveal that synthetic images improve training data efficiency when visual data is missing during training, and improve model robustness when visual data is missing during both training and testing.
arXiv Detail & Related papers (2024-02-14T09:21:00Z)
- Generative Multimodal Models are In-Context Learners [60.50927925426832]
We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences.
Emu2 exhibits strong multimodal in-context learning abilities, including emergent abilities to solve tasks that require on-the-fly reasoning.
arXiv Detail & Related papers (2023-12-20T18:59:58Z)
- Visual Prompt Flexible-Modal Face Anti-Spoofing [23.58674017653937]
Multimodal face data collected from the real world is often imperfect due to missing modalities from various imaging sensors.
We propose flexible-modal FAS, which learns the modal-relevant prompts to adapt the frozen pre-trained foundation model to downstream flexible-modal FAS task.
Experiments conducted on two multimodal FAS benchmark datasets demonstrate the effectiveness of our VP-FAS framework.
arXiv Detail & Related papers (2023-07-26T05:06:41Z)
- Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.