Related papers: MILES: Modality-Informed Learning Rate Scheduler for Balancing Multimodal Learning

URL: http://arxiv.org/abs/2510.17394v1
Date: Mon, 20 Oct 2025 10:34:59 GMT
Title: MILES: Modality-Informed Learning Rate Scheduler for Balancing Multimodal Learning
Authors: Alejandro Guerra-Manzanares, Farah E. Shamout
Abstract summary: We present Modality-Informed Learning ratE Scheduler (MILES) for training multimodal joint fusion models. MILES balances modality-wise conditional utilization rates during training to effectively balance multimodal learning. Our results show that MILES outperforms all baselines across all tasks and fusion methods considered in our study.
Score: 47.487732221767196
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The aim of multimodal neural networks is to combine diverse data sources, referred to as modalities, to achieve enhanced performance compared to relying on a single modality. However, training of multimodal networks is typically hindered by modality overfitting, where the network relies excessively on one of the available modalities. This often yields sub-optimal performance, hindering the potential of multimodal learning and resulting in marginal improvements relative to unimodal models. In this work, we present the Modality-Informed Learning ratE Scheduler (MILES) for training multimodal joint fusion models in a balanced manner. MILES leverages the differences in modality-wise conditional utilization rates during training to effectively balance multimodal learning. The learning rate is dynamically adjusted during training to balance the speed of learning from each modality by the multimodal model, aiming for enhanced performance in both multimodal and unimodal predictions. We extensively evaluate MILES on four multimodal joint fusion tasks and compare its performance to seven state-of-the-art baselines. Our results show that MILES outperforms all baselines across all tasks and fusion methods considered in our study, effectively balancing modality usage during training. This results in improved multimodal performance and stronger modality encoders, which can be leveraged when dealing with unimodal samples or absent modalities. Overall, our work highlights the impact of balancing multimodal learning on improving model performance.
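The abstract describes the mechanism only at a high level, so the following is a minimal sketch of the general idea and not the authors' algorithm. It assumes two modalities, defines a modality's conditional utilization rate as the relative accuracy gain from adding it to the other modality, and halves the learning rate of whichever modality is used much more than the other. The function names, the `threshold` and `reduction` parameters, and the halving rule are all hypothetical choices made for illustration.

```python
def conditional_utilization(acc_multimodal: float, acc_other_only: float) -> float:
    """Relative accuracy gain from adding a modality on top of the other one.

    Assumed definition for this sketch; the paper may define the rate differently.
    """
    if acc_multimodal <= 0.0:
        return 0.0
    return (acc_multimodal - acc_other_only) / acc_multimodal


def balance_learning_rates(
    base_lr: float,
    acc_multimodal: float,
    acc_a_only: float,
    acc_b_only: float,
    threshold: float = 0.1,   # hypothetical tolerance on the utilization gap
    reduction: float = 0.5,   # hypothetical slow-down factor for the dominant modality
) -> dict:
    """Return per-modality learning rates for the next training epoch."""
    # Utilization of A is measured against a model that sees only B, and vice versa.
    u_a = conditional_utilization(acc_multimodal, acc_b_only)
    u_b = conditional_utilization(acc_multimodal, acc_a_only)
    gap = u_a - u_b

    lr_a = lr_b = base_lr
    if gap > threshold:
        # Modality A dominates: slow its encoder so B can catch up.
        lr_a = base_lr * reduction
    elif gap < -threshold:
        # Modality B dominates: slow its encoder so A can catch up.
        lr_b = base_lr * reduction
    return {"a": lr_a, "b": lr_b}


if __name__ == "__main__":
    # Illustrative accuracies only, not results from the paper.
    print(balance_learning_rates(0.01, acc_multimodal=0.80, acc_a_only=0.75, acc_b_only=0.50))
```

In a framework such as PyTorch, the returned values could be written into separate optimizer parameter groups, one per modality encoder, at the end of each epoch.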

Related papers

Mixup Helps Understanding Multimodal Video Better [12.281180208753021]
Multimodal models are prone to overfitting strong modalities, which can dominate learning and suppress the contributions of weaker ones. We propose Multimodal Mixup (MM), which applies the Mixup strategy at the aggregated multimodal feature level to mitigate overfitting. We also introduce Balanced Multimodal Mixup (B-MM), which dynamically adjusts the mixing ratios for each modality based on their relative contributions to the learning objective.
arXiv Detail & Related papers (2025-10-13T03:53:25Z)
DynCIM: Dynamic Curriculum for Imbalanced Multimodal Learning [27.20479303843989]
DynCIM is a novel dynamic curriculum learning framework designed to quantify the inherent imbalances from both sample and modality perspectives. DynCIM employs a sample-level curriculum to dynamically assess each sample's difficulty according to prediction deviation, consistency, and stability. A modality-level curriculum measures modality contributions from global and local perspectives.
arXiv Detail & Related papers (2025-03-09T05:30:15Z)
Asymmetric Reinforcing against Multi-modal Representation Bias [59.685072206359855]
We propose an Asymmetric Reinforcing method against Multimodal representation bias (ARM). Our ARM dynamically reinforces the weak modalities while maintaining the ability to represent dominant modalities through conditional mutual information. We have significantly improved the performance of multimodal learning, making notable progress in mitigating imbalanced multimodal learning.
arXiv Detail & Related papers (2025-01-02T13:00:06Z)
Balancing Multimodal Training Through Game-Theoretic Regularization [26.900302082724295]
Multimodal learning holds promise for richer information extraction by capturing dependencies across data sources. Yet, current training methods often underperform due to modality competition. This paper proposes the Multimodal Competition Regularizer (MCR), inspired by a mutual information (MI) decomposition.
arXiv Detail & Related papers (2024-11-11T19:53:05Z)
On-the-fly Modulation for Balanced Multimodal Learning [53.616094855778954]
Multimodal learning is expected to boost model performance by integrating information from different modalities. The widely-used joint training strategy leads to imbalanced and under-optimized uni-modal representations. We propose On-the-fly Prediction Modulation (OPM) and On-the-fly Gradient Modulation (OGM) strategies to modulate the optimization of each modality.
arXiv Detail & Related papers (2024-10-15T13:15:50Z)
Modality Invariant Multimodal Learning to Handle Missing Modalities: A Single-Branch Approach [29.428067329993173]
We propose a modality invariant multimodal learning method, which is less susceptible to the impact of missing modalities. It consists of a single-branch network sharing weights across multiple modalities to learn inter-modality representations to maximize performance. Our proposed method achieves superior performance when all modalities are present as well as in the case of missing modalities during training or testing compared to the existing state-of-the-art methods.
arXiv Detail & Related papers (2024-08-14T10:32:16Z)
Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding. We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL. UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
Improving Discriminative Multi-Modal Learning with Large-Scale Pre-Trained Models [51.5543321122664]
This paper investigates how to better leverage large-scale pre-trained uni-modal models to enhance discriminative multi-modal learning. We introduce Multi-Modal Low-Rank Adaptation learning (MMLoRA).
arXiv Detail & Related papers (2023-10-08T15:01:54Z)
Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences. We pose the problem of unseen modality interaction and introduce a first solution. It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z)
Towards Balanced Active Learning for Multimodal Classification [15.338417969382212]
Training multimodal networks requires a vast amount of data due to their larger parameter space compared to unimodal networks. Active learning is a widely used technique for reducing data annotation costs by selecting only those samples that could contribute to improving model performance. Current active learning strategies are mostly designed for unimodal tasks, and when applied to multimodal data, they often result in biased sample selection from the dominant modality.
arXiv Detail & Related papers (2023-06-14T07:23:36Z)

This list is automatically generated from the titles and abstracts of the papers on this site.