SimMLM: A Simple Framework for Multi-modal Learning with Missing Modality
- URL: http://arxiv.org/abs/2507.19264v2
- Date: Wed, 06 Aug 2025 17:01:13 GMT
- Title: SimMLM: A Simple Framework for Multi-modal Learning with Missing Modality
- Authors: Sijie Li, Chen Chen, Jungong Han
- Abstract summary: We propose SimMLM, a simple yet powerful framework for multimodal learning with missing modalities. SimMLM consists of a generic Dynamic Mixture of Modality Experts (DMoME) architecture, featuring a dynamic, learnable gating mechanism. A key innovation of SimMLM is the proposed More vs. Fewer (MoFe) ranking loss, which ensures that task accuracy improves or remains stable as more modalities are made available.
- Score: 52.948791050405525
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose SimMLM, a simple yet powerful framework for multimodal learning with missing modalities. Unlike existing approaches that rely on sophisticated network architectures or complex data imputation techniques, SimMLM provides a generic and effective solution that can adapt to various missing modality scenarios with improved accuracy and robustness. Specifically, SimMLM consists of a generic Dynamic Mixture of Modality Experts (DMoME) architecture, featuring a dynamic, learnable gating mechanism that automatically adjusts each modality's contribution in both full and partial modality settings. A key innovation of SimMLM is the proposed More vs. Fewer (MoFe) ranking loss, which ensures that task accuracy improves or remains stable as more modalities are made available. This aligns the model with an intuitive principle: removing one or more modalities should not increase accuracy. We validate SimMLM on multimodal medical image segmentation (BraTS 2018) and multimodal classification (UPMC Food-101, avMNIST) tasks, where it consistently surpasses competitive methods, demonstrating superior accuracy, interpretability, robustness, and reliability across both complete and missing modality scenarios at test time.
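A minimal sketch may help make the two components named in the abstract concrete: a learnable gate that mixes per-modality experts while masking out absent modalities, and a More vs. Fewer (MoFe) style ranking term that penalizes the model whenever a fewer-modality prediction is more confident on the true class than the more-modality one. Everything below (module names, dimensions, the margin formulation) is an illustrative assumption based on the abstract, not the authors' implementation.

```python
# Illustrative sketch only: a gated mixture of per-modality experts that
# masks out absent modalities, plus a More-vs-Fewer (MoFe) style ranking
# penalty. Names, dimensions, and the margin formulation are assumptions
# for exposition, not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedModalityExperts(nn.Module):
    def __init__(self, in_dim: int, hidden: int, num_classes: int, num_modalities: int):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU()) for _ in range(num_modalities)]
        )
        self.gate = nn.Linear(in_dim * num_modalities, num_modalities)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, feats: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # feats: (B, M, D) per-modality features; mask: (B, M), 1 = modality present.
        # Assumes at least one modality is present in every sample.
        gate_logits = self.gate((feats * mask.unsqueeze(-1)).flatten(1))
        gate_logits = gate_logits.masked_fill(mask == 0, float("-inf"))
        weights = torch.softmax(gate_logits, dim=-1)  # absent modalities receive weight 0
        expert_out = torch.stack([e(feats[:, i]) for i, e in enumerate(self.experts)], dim=1)
        fused = (weights.unsqueeze(-1) * expert_out).sum(dim=1)
        return self.head(fused)


def mofe_ranking_loss(logits_more, logits_fewer, targets, margin: float = 0.0):
    # Penalize cases where the true-class confidence obtained with fewer
    # modalities exceeds the confidence obtained with more modalities.
    p_more = F.softmax(logits_more, dim=-1).gather(1, targets.unsqueeze(1)).squeeze(1)
    p_fewer = F.softmax(logits_fewer, dim=-1).gather(1, targets.unsqueeze(1)).squeeze(1)
    return F.relu(p_fewer - p_more + margin).mean()
```

In training, logits_more would come from a forward pass with the full availability mask and logits_fewer from the same batch with some modalities masked out; the ranking term could then be added to the ordinary task loss so that removing modalities is never rewarded.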
Related papers
- Learning to Fuse: Modality-Aware Adaptive Scheduling for Robust Multimodal Foundation Models [0.0]
Modality-Aware Adaptive Fusion Scheduling (MA-AFS) learns to dynamically modulate the contribution of each modality on a per-instance basis. Our work highlights the importance of adaptive fusion and opens a promising direction toward reliable and uncertainty-aware multimodal learning.
arXiv Detail & Related papers (2025-06-15T05:57:45Z) - Dynamic Modality Scheduling for Multimodal Large Models via Confidence, Uncertainty, and Semantic Consistency [0.0]
We propose Dynamic Modality Scheduling (DMS), a novel framework that adaptively adjusts the contribution of each modality at a per-sample level. Experimental results on VQA, image-text retrieval, and captioning tasks show that DMS significantly improves both clean and robust performance.
arXiv Detail & Related papers (2025-06-15T05:15:52Z) - BiXFormer: A Robust Framework for Maximizing Modality Effectiveness in Multi-Modal Semantic Segmentation [55.486872677160015]
We reformulate multi-modal semantic segmentation as a mask-level classification task. We propose BiXFormer, which integrates Unified Modality Matching (UMM) and Cross Modality Alignment (CMA). Experiments on both synthetic and real-world multi-modal benchmarks demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2025-06-04T08:04:58Z) - Robust Multimodal Learning via Cross-Modal Proxy Tokens [11.704477276235847]
Cross-modal proxy tokens (CMPTs) approximate the class token of a missing modality without requiring explicit modality generation or auxiliary networks. We employ low-rank adapters in frozen unimodal encoders and jointly optimize an alignment loss with a task-specific loss. Our method offers a flexible and efficient solution for robust multimodal learning.
arXiv Detail & Related papers (2025-01-29T18:15:49Z) - Progressive Multimodal Reasoning via Active Retrieval [64.74746997923967]
Multi-step multimodal reasoning tasks pose significant challenges for multimodal large language models (MLLMs). We propose AR-MCTS, a universal framework designed to progressively improve the reasoning capabilities of MLLMs. We show that AR-MCTS can optimize sampling diversity and accuracy, yielding reliable multimodal reasoning.
arXiv Detail & Related papers (2024-12-19T13:25:39Z) - LLMs Can Evolve Continually on Modality for X-Modal Reasoning [62.2874638875554]
Existing methods rely heavily on modal-specific pretraining and joint-modal tuning, leading to significant computational burdens when expanding to new modalities.
We propose PathWeave, a flexible and scalable framework with modal-Path sWitching and ExpAnsion abilities.
PathWeave performs comparably to state-of-the-art MLLMs while concurrently reducing parameter training burdens by 98.73%.
arXiv Detail & Related papers (2024-10-26T13:19:57Z) - MetaOpenFOAM: an LLM-based multi-agent framework for CFD [11.508919041921942]
MetaOpenFOAM is a novel multi-agent collaboration framework.
It aims to complete CFD simulation tasks with only natural language as input.
It harnesses the power of MetaGPT's assembly line paradigm.
arXiv Detail & Related papers (2024-07-31T04:01:08Z) - Model Composition for Multimodal Large Language Models [71.5729418523411]
We propose a new paradigm through the model composition of existing MLLMs to create a new model that retains the modal understanding capabilities of each original model.
Our basic implementation, NaiveMC, demonstrates the effectiveness of this paradigm by reusing modality encoders and merging LLM parameters.
arXiv Detail & Related papers (2024-02-20T06:38:10Z) - Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z) - MA-ViT: Modality-Agnostic Vision Transformers for Face Anti-Spoofing [3.3031006227198003]
We present Modality-Agnostic Vision Transformer (MA-ViT), which aims to improve anti-spoofing performance on attacks from arbitrary modalities with the help of multi-modal data.
Specifically, MA-ViT adopts early fusion to aggregate all available training modality data and enables flexible testing of any given modal samples.
Experiments demonstrate that the single model trained on MA-ViT can not only flexibly evaluate different modal samples, but also outperforms existing single-modal frameworks by a large margin.
arXiv Detail & Related papers (2023-04-15T13:03:44Z) - Missing Modality meets Meta Sampling (M3S): An Efficient Universal Approach for Multimodal Sentiment Analysis with Missing Modality [5.171058506312429]
We propose a simple yet effective meta-sampling approach for multimodal sentiment analysis with missing modalities, namely Missing Modality-based Meta Sampling (M3S).
M3S formulates a missing-modality sampling strategy within a modality-agnostic meta-learning framework.
We conduct experiments on the IEMOCAP, SIMS and CMU-MOSI datasets and achieve superior performance compared with recent state-of-the-art methods.
arXiv Detail & Related papers (2022-10-07T09:54:05Z)