What Makes for Robust Multi-Modal Models in the Face of Missing Modalities?
- URL: http://arxiv.org/abs/2310.06383v1
- Date: Tue, 10 Oct 2023 07:47:57 GMT
- Title: What Makes for Robust Multi-Modal Models in the Face of Missing Modalities?
- Authors: Siting Li, Chenzhuang Du, Yue Zhao, Yu Huang, Hang Zhao
- Abstract summary: We model the scenarios of multi-modal models encountering missing modalities from an information-theoretic perspective.
We introduce Uni-Modal Ensemble with Missing Modality Adaptation (UME-MMA).
UME-MMA employs uni-modal pre-trained weights for the multi-modal model to enhance feature extraction and utilizes missing modality data augmentation techniques to better adapt to situations with missing modalities.
- Score: 35.19295402483624
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the growing success of multi-modal learning, research on the robustness
of multi-modal models, especially when facing situations with missing
modalities, is receiving increased attention. Nevertheless, previous studies in
this domain exhibit certain limitations, as they often lack theoretical
insights or their methodologies are tied to specific network architectures or
modalities. We model the scenarios of multi-modal models encountering missing
modalities from an information-theoretic perspective and illustrate that the
performance ceiling in such scenarios can be approached by efficiently
utilizing the information inherent in non-missing modalities. In practice,
there are two key aspects: (1) The encoder should be able to extract
sufficiently good features from the non-missing modality; (2) The extracted
features should be robust enough not to be influenced by noise during the
fusion process across modalities. To this end, we introduce Uni-Modal Ensemble
with Missing Modality Adaptation (UME-MMA). UME-MMA employs uni-modal
pre-trained weights for the multi-modal model to enhance feature extraction and
utilizes missing modality data augmentation techniques to better adapt to
situations with missing modalities. Moreover, UME-MMA is built on a
late-fusion learning framework, which allows for the plug-and-play use of
various encoders, making it suitable for a wide range of modalities and
enabling seamless integration of large-scale pre-trained encoders to further
enhance performance. We demonstrate UME-MMA's effectiveness on audio-visual
datasets (e.g., AV-MNIST, Kinetics-Sound, AVE) and vision-language
datasets (e.g., MM-IMDB, UPMC Food101).
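The abstract points to two concrete ingredients: late fusion of uni-modal pre-trained encoders, and missing-modality data augmentation so the fusion stage is not misled by an absent input. Below is a minimal PyTorch-style sketch of how those two pieces could fit together. It is an illustration under my own assumptions, not the paper's code: the class names, the zeroing-based modality dropout, and the hyper-parameters are all hypothetical.

```python
# Sketch of a late-fusion model with missing-modality augmentation.
# Encoders are assumed to be initialized from uni-modal pre-trained weights.
import torch
import torch.nn as nn


class LateFusionClassifier(nn.Module):
    def __init__(self, audio_encoder, visual_encoder, feat_dim, num_classes,
                 p_drop_modality=0.3):
        super().__init__()
        self.audio_encoder = audio_encoder
        self.visual_encoder = visual_encoder
        self.head = nn.Linear(2 * feat_dim, num_classes)
        self.p_drop = p_drop_modality

    def forward(self, audio, visual, audio_missing=False, visual_missing=False):
        feat_a = self.audio_encoder(audio)
        feat_v = self.visual_encoder(visual)

        # Missing-modality data augmentation: during training, randomly zero
        # out one modality's features so the fusion head learns to rely on
        # whichever modality is actually present. At test time, zero the
        # modality that is truly missing.
        if self.training:
            if torch.rand(()) < self.p_drop:
                if torch.rand(()) < 0.5:
                    feat_a = torch.zeros_like(feat_a)
                else:
                    feat_v = torch.zeros_like(feat_v)
        else:
            if audio_missing:
                feat_a = torch.zeros_like(feat_a)
            if visual_missing:
                feat_v = torch.zeros_like(feat_v)

        # Late fusion: concatenate uni-modal features and classify.
        return self.head(torch.cat([feat_a, feat_v], dim=-1))


if __name__ == "__main__":
    # Stand-in encoders for a smoke test; real ones would be uni-modal
    # pre-trained networks (e.g., an audio CNN and an image backbone).
    audio_enc = nn.Sequential(nn.Flatten(), nn.Linear(128, 64))
    visual_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))
    model = LateFusionClassifier(audio_enc, visual_enc, feat_dim=64, num_classes=10)
    logits = model(torch.randn(4, 128), torch.randn(4, 3, 32, 32))
    print(logits.shape)  # torch.Size([4, 10])
```

Because the fusion happens only at the feature level, the two encoders can be swapped for any pre-trained backbones of the appropriate modality, which is the plug-and-play property the abstract describes.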
Related papers
- LLMs Can Evolve Continually on Modality for X-Modal Reasoning [62.2874638875554]
Existing methods rely heavily on modal-specific pretraining and joint-modal tuning, leading to significant computational burdens when expanding to new modalities.
We propose PathWeave, a flexible and scalable framework with modal-Path sWitching and ExpAnsion abilities.
PathWeave performs comparably to state-of-the-art MLLMs while concurrently reducing parameter training burdens by 98.73%.
arXiv Detail & Related papers (2024-10-26T13:19:57Z)
- Alt-MoE: Multimodal Alignment via Alternating Optimization of Multi-directional MoE with Unimodal Models [7.134682404460003]
We introduce a novel training framework, Alt-MoE, which employs the Mixture of Experts (MoE) as a unified multi-directional connector across modalities.
Our methodology has been validated on several well-performing uni-modal models.
arXiv Detail & Related papers (2024-09-09T10:40:50Z) - U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M, an Unbiased Multiscale Modal Fusion Model for multimodal semantic segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z) - When Parameter-efficient Tuning Meets General-purpose Vision-language
Models [65.19127815275307]
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique.
Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z) - Unified Multi-modal Unsupervised Representation Learning for
Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z) - Improving Discriminative Multi-Modal Learning with Large-Scale
Pre-Trained Models [51.5543321122664]
This paper investigates how to better leverage large-scale pre-trained uni-modal models to enhance discriminative multi-modal learning.
We introduce Multi-Modal Low-Rank Adaptation learning (MMLoRA).
arXiv Detail & Related papers (2023-10-08T15:01:54Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation
Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)