Learning to Fuse: Modality-Aware Adaptive Scheduling for Robust Multimodal Foundation Models
- URL: http://arxiv.org/abs/2506.12733v1
- Date: Sun, 15 Jun 2025 05:57:45 GMT
- Title: Learning to Fuse: Modality-Aware Adaptive Scheduling for Robust Multimodal Foundation Models
- Authors: Liam Bennett, Mason Clark, Lucas Anderson, Hana Satou, Olivia Martinez
- Abstract summary: Modality-Aware Adaptive Fusion Scheduling (MA-AFS) learns to dynamically modulate the contribution of each modality on a per-instance basis. Our work highlights the importance of adaptive fusion and opens a promising direction toward reliable and uncertainty-aware multimodal learning.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Multimodal foundation models have achieved impressive progress across a wide range of vision-language tasks. However, existing approaches often adopt fixed or task-specific fusion strategies, neglecting the intrinsic variability of modality reliability and sample complexity. In this paper, we propose Modality-Aware Adaptive Fusion Scheduling (MA-AFS), a general framework that learns to dynamically modulate the contribution of each modality on a per-instance basis. MA-AFS introduces a lightweight neural scheduler that predicts modality fusion weights by integrating visual and textual entropy signals along with cross-modal agreement cues. This enables the model to adaptively emphasize more reliable modalities, especially under noisy, missing, or misaligned inputs. We formulate the fusion process as a differentiable scheduling mechanism, analyze its theoretical consistency and regularization effect, and demonstrate that it improves robustness without increasing model capacity significantly. Extensive experiments on image-text retrieval, captioning, and visual question answering show that MA-AFS achieves consistent performance gains over strong baselines such as CLIP, ALBEF, and BLIP. Moreover, MA-AFS exhibits improved robustness under modality corruption and enhanced generalization under domain shifts. Our work highlights the importance of adaptive fusion and opens a promising direction toward reliable and uncertainty-aware multimodal learning.
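The abstract describes the scheduler only at a high level. As a rough illustration, the minimal PyTorch sketch below shows one plausible reading of that description: a lightweight MLP maps per-instance visual entropy, textual entropy, and a cross-modal agreement cue to softmax fusion weights, which then form a differentiable convex combination of the two modality embeddings. The entropy estimator (softmax over per-modality logits), the cosine-similarity agreement cue, and all names and dimensions are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch in the spirit of MA-AFS (not the authors' implementation).
# All module names, dimensions, and cue estimators are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def distribution_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy of a softmax distribution, one value per instance."""
    p = F.softmax(logits, dim=-1)
    return -(p * p.clamp_min(1e-12).log()).sum(dim=-1)


class FusionScheduler(nn.Module):
    """Lightweight MLP mapping (visual entropy, textual entropy, agreement)
    to per-instance fusion weights for the two modalities."""

    def __init__(self, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, img_emb, txt_emb, img_logits, txt_logits):
        h_img = distribution_entropy(img_logits)                    # (B,) visual uncertainty
        h_txt = distribution_entropy(txt_logits)                    # (B,) textual uncertainty
        agree = F.cosine_similarity(img_emb, txt_emb, dim=-1)       # (B,) cross-modal agreement cue
        cues = torch.stack([h_img, h_txt, agree], dim=-1)           # (B, 3)
        w = F.softmax(self.net(cues), dim=-1)                       # (B, 2) differentiable weights
        fused = w[:, :1] * img_emb + w[:, 1:] * txt_emb             # per-instance convex combination
        return fused, w


if __name__ == "__main__":
    B, D, C = 4, 256, 1000
    scheduler = FusionScheduler()
    fused, weights = scheduler(
        torch.randn(B, D), torch.randn(B, D),   # image / text embeddings
        torch.randn(B, C), torch.randn(B, C),   # per-modality prediction logits
    )
    print(fused.shape, weights)                 # torch.Size([4, 256]) and per-instance weights
```

Under these assumptions, instances with high visual entropy or low cross-modal agreement would be pushed toward the textual branch (and vice versa), while the softmax keeps the scheduling fully differentiable end to end.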
Related papers
- Dynamic Modality Scheduling for Multimodal Large Models via Confidence, Uncertainty, and Semantic Consistency [0.0]
We propose Dynamic Modality Scheduling (DMS), a novel framework that adaptively adjusts the contribution of each modality at a per-sample level. Experimental results on VQA, image-text retrieval, and captioning tasks show that DMS significantly improves both clean and robust performance.
arXiv Detail & Related papers (2025-06-15T05:15:52Z) - Modality Equilibrium Matters: Minor-Modality-Aware Adaptive Alternating for Cross-Modal Memory Enhancement [13.424541949553964]
We propose a Shapley-guided alternating training framework that adaptively prioritizes minor modalities to balance and thus enhance the fusion. We evaluate the performance in both balance and accuracy across four multimodal benchmark datasets, where our method achieves state-of-the-art (SOTA) results.
arXiv Detail & Related papers (2025-05-26T02:02:57Z) - Asymmetric Reinforcing against Multi-modal Representation Bias [59.685072206359855]
We propose an Asymmetric Reinforcing method against Multimodal representation bias (ARM). ARM dynamically reinforces the weak modalities while maintaining the ability to represent dominant modalities through conditional mutual information. The method significantly improves multimodal learning performance, making notable progress in mitigating imbalanced multimodal learning.
arXiv Detail & Related papers (2025-01-02T13:00:06Z) - On-the-fly Modulation for Balanced Multimodal Learning [53.616094855778954]
Multimodal learning is expected to boost model performance by integrating information from different modalities.
The widely-used joint training strategy leads to imbalanced and under-optimized uni-modal representations.
We propose On-the-fly Prediction Modulation (OPM) and On-the-fly Gradient Modulation (OGM) strategies to modulate the optimization of each modality.
arXiv Detail & Related papers (2024-10-15T13:15:50Z) - Progressively Modality Freezing for Multi-Modal Entity Alignment [27.77877721548588]
We propose a novel strategy of progressive modality freezing, called PMF, that focuses on alignment-relevant features.
Notably, our approach introduces a pioneering cross-modal association loss to foster modal consistency.
Empirical evaluations across nine datasets confirm PMF's superiority.
arXiv Detail & Related papers (2024-07-23T04:22:30Z) - Modality Prompts for Arbitrary Modality Salient Object Detection [57.610000247519196]
This paper delves into the task of arbitrary modality salient object detection (AM SOD).
It aims to detect salient objects from arbitrary modalities, e.g., RGB images, RGB-D images, and RGB-D-T images.
A novel modality-adaptive Transformer (MAT) is proposed to investigate two fundamental challenges of AM SOD.
arXiv Detail & Related papers (2024-05-06T11:02:02Z) - MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild [81.32127423981426]
Multimodal emotion recognition based on audio and video data is important for real-world applications.
Recent methods have focused on exploiting advances of self-supervised learning (SSL) for pre-training of strong multimodal encoders.
We propose a different perspective on the problem and investigate the advancement of multimodal DFER performance by adapting SSL-pre-trained disjoint unimodal encoders.
arXiv Detail & Related papers (2024-04-13T13:39:26Z) - Unified Multi-modal Unsupervised Representation Learning for
Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z) - Exploiting modality-invariant feature for robust multimodal emotion
recognition with missing modalities [76.08541852988536]
We propose to use invariant features for a missing modality imagination network (IF-MMIN).
We show that the proposed model outperforms all baselines and invariantly improves the overall emotion recognition performance under uncertain missing-modality conditions.
arXiv Detail & Related papers (2022-10-27T12:16:25Z)