Improving Multimodal Learning Balance and Sufficiency through Data Remixing
- URL: http://arxiv.org/abs/2506.11550v2
- Date: Mon, 16 Jun 2025 02:50:29 GMT
- Title: Improving Multimodal Learning Balance and Sufficiency through Data Remixing
- Authors: Xiaoyu Ma, Hao Chen, Yongjian Deng
- Abstract summary: Existing methods for enforcing the weak modality fail to achieve both unimodal sufficiency and multimodal balance. We propose multimodal Data Remixing, which decouples multimodal data and filters hard samples for each modality to mitigate modality imbalance. Our method can be seamlessly integrated with existing approaches, improving accuracy by approximately 6.50%$\uparrow$ on CREMA-D and 3.41%$\uparrow$ on Kinetics-Sounds.
- Score: 14.282792733217653
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Different modalities exhibit considerable gaps in their optimization trajectories, including speeds and paths, which lead to modality laziness and modality clash when multimodal models are trained jointly, resulting in insufficient and imbalanced multimodal learning. Existing methods focus on enforcing the weak modality by adding modality-specific optimization objectives, aligning optimization speeds, or decomposing multimodal learning to enhance unimodal learning; they fail to achieve both unimodal sufficiency and multimodal balance. In this paper, we address both concerns for the first time by proposing multimodal Data Remixing, which decouples multimodal data and filters hard samples for each modality to mitigate modality imbalance, and then reassembles batches so as to align gradient directions and avoid cross-modal interference, thus enhancing unimodal learning sufficiency. Experimental results demonstrate that our method can be seamlessly integrated with existing approaches, improving accuracy by approximately 6.50%$\uparrow$ on CREMA-D and 3.41%$\uparrow$ on Kinetics-Sounds, without training set expansion or additional computational overhead during inference. The source code is available at https://github.com/MatthewMaxy/Remix_ICML2025.
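As a rough illustration of the remixing idea described in the abstract, the sketch below decouples an audio-visual batch into unimodal pools, keeps the hard samples of each modality (ranked by per-sample unimodal loss, an assumed criterion), and rebuilds single-modality batches. The function names, the loss-based filter, and the keep ratio are assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

def filter_hard_samples(unimodal_logits, labels, keep_ratio=0.5):
    """Keep the hardest samples for one modality, ranked by per-sample loss.
    The loss-based criterion and keep_ratio are illustrative assumptions."""
    losses = F.cross_entropy(unimodal_logits, labels, reduction="none")
    k = max(1, int(keep_ratio * len(losses)))
    return torch.topk(losses, k).indices

def remix_batches(audio_feats, visual_feats, labels, audio_logits, visual_logits,
                  batch_size=32):
    """Decouple the two modalities, filter hard samples per modality, and
    reassemble single-modality batches so each batch's gradients come from
    one modality only (avoiding cross-modal interference)."""
    a_idx = filter_hard_samples(audio_logits, labels)
    v_idx = filter_hard_samples(visual_logits, labels)
    audio_loader = DataLoader(TensorDataset(audio_feats[a_idx], labels[a_idx]),
                              batch_size=batch_size, shuffle=True)
    visual_loader = DataLoader(TensorDataset(visual_feats[v_idx], labels[v_idx]),
                               batch_size=batch_size, shuffle=True)
    return audio_loader, visual_loader
```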
Related papers
- G$^{2}$D: Boosting Multimodal Learning with Gradient-Guided Distillation [0.7673339435080445]
We introduce Gradient-Guided Distillation (G$^{2}$D), a knowledge distillation framework that optimizes the multimodal model with a custom-built loss function. We show that G$^{2}$D amplifies the significance of weak modalities during training and outperforms state-of-the-art methods in classification and regression tasks.
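The entry only names a custom distillation loss, so the following is a hedged sketch of one gradient-guided-style objective: a task loss plus teacher-student KL terms whose weights grow for the modality the student currently matches worst. The weighting rule, temperature, and function names are assumptions, not G$^{2}$D itself.

```python
import torch
import torch.nn.functional as F

def g2d_style_loss(student_logits, teacher_logits_per_mod, labels, T=2.0):
    """Task loss plus distillation from unimodal teachers; a teacher's weight grows
    with how much the student currently under-fits it, a stand-in for
    'amplifying the weak modality' (the weighting rule is assumed)."""
    task = F.cross_entropy(student_logits, labels)
    kd_terms = []
    for t_logits in teacher_logits_per_mod:
        kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                      F.softmax(t_logits / T, dim=1),
                      reduction="batchmean") * T * T
        kd_terms.append(kd)
    kd_terms = torch.stack(kd_terms)
    weights = torch.softmax(kd_terms.detach(), dim=0)  # heavier weight on the worse-matched (weaker) modality
    return task + (weights * kd_terms).sum()
```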
arXiv Detail & Related papers (2025-06-26T17:37:36Z) - Continual Multimodal Contrastive Learning [70.60542106731813]
Multimodal contrastive learning (MCL) has advanced the alignment of different modalities and the generation of multimodal representations in a joint space. However, a critical yet often overlooked challenge remains: multimodal data is rarely collected in a single process, and training from scratch is computationally expensive. In this paper, we formulate continual multimodal contrastive learning (CMCL) through two specialized principles of stability and plasticity. We theoretically derive a novel optimization-based method, which projects updated gradients from dual sides onto subspaces where any gradient is prevented from interfering with previously learned knowledge.
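A minimal sketch of the gradient-subspace idea mentioned above, assuming the stored directions are unit-norm flattened gradients from earlier data; simple orthogonal projection is used here as a stand-in for the paper's dual-sided derivation.

```python
import torch

def project_out(grad, basis):
    """Remove from `grad` its components along previously stored gradient
    directions (`basis`: list of unit-norm 1-D tensors), so the update does not
    interfere with earlier knowledge. Orthogonal projection is one simple
    instance of the subspace idea, not the paper's exact method."""
    g = grad.clone()
    for b in basis:
        g = g - torch.dot(g, b) * b
    return g

# usage: flatten the model's gradients, project, then unflatten before optimizer.step()
```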
arXiv Detail & Related papers (2025-03-19T07:57:08Z) - Rebalanced Multimodal Learning with Data-aware Unimodal Sampling [39.77348232514481]
We propose a novel MML approach called Data-aware Unimodal Sampling. Based on the learning status, we propose a reinforcement learning (RL)-based data-aware unimodal sampling approach. Our method can be seamlessly incorporated into almost all existing multimodal learning approaches as a plugin.
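A hedged sketch of the data-aware sampling idea as described: each modality's recent learning status is turned into a probability of drawing a unimodal update for it. The status measure (loss improvement) and the softmax rule are illustrative stand-ins, not the paper's RL policy.

```python
import torch

def unimodal_sampling_probs(loss_history, temperature=1.0):
    """Turn each modality's recent learning status (here: how little its loss has
    improved) into a probability of sampling a unimodal update for it."""
    status = []
    for losses in loss_history.values():          # e.g. {"audio": [2.1, 1.2], "visual": [2.0, 1.9]}
        improvement = losses[0] - losses[-1]
        status.append(-improvement)                # less improvement -> higher score
    probs = torch.softmax(torch.tensor(status) / temperature, dim=0)
    return dict(zip(loss_history.keys(), probs.tolist()))

# e.g. unimodal_sampling_probs({"audio": [2.1, 1.2], "visual": [2.0, 1.9]})
```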
arXiv Detail & Related papers (2025-03-05T08:19:31Z) - Multimodal Fusion Balancing Through Game-Theoretic Regularization [22.959030061257533]
We show that current balancing methods struggle to train multimodal models that surpass even simple baselines, such as ensembles. This raises the question: how can we ensure that all modalities in multimodal training are sufficiently trained, and that learning from new modalities consistently improves performance? This paper proposes the Multimodal Competition Regularizer (MCR), a new loss component inspired by mutual information (MI) decomposition.
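Since the entry gives only the intuition, the snippet below shows a generic competition-style penalty loosely inspired by the description: a modality whose removal does not hurt the fused prediction is penalized. This is an assumed illustration, not the paper's MI decomposition.

```python
import torch
import torch.nn.functional as F

def competition_regularizer(fused_logits, logits_without_mod, labels):
    """Illustrative competition-style penalty: reward each modality whose removal
    hurts the fused prediction, so no modality can free-ride.
    `logits_without_mod` maps modality name -> logits with that modality masked."""
    full_loss = F.cross_entropy(fused_logits, labels)
    penalty = fused_logits.new_zeros(())
    for _, ablated in logits_without_mod.items():
        ablated_loss = F.cross_entropy(ablated, labels)
        contribution = ablated_loss - full_loss        # > 0 if the modality helps
        penalty = penalty + F.relu(-contribution)      # penalize non-contributing modalities
    return penalty
```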
arXiv Detail & Related papers (2024-11-11T19:53:05Z) - LLMs Can Evolve Continually on Modality for X-Modal Reasoning [62.2874638875554]
Existing methods rely heavily on modal-specific pretraining and joint-modal tuning, leading to significant computational burdens when expanding to new modalities.
We propose PathWeave, a flexible and scalable framework with modal-Path sWitching and ExpAnsion abilities.
PathWeave performs comparably to state-of-the-art MLLMs while concurrently reducing parameter training burdens by 98.73%.
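A minimal sketch of how a modality path could be expanded while keeping almost all parameters frozen, which is the general adapter-style mechanism the entry suggests; module names, sizes, and placement are assumptions, not PathWeave's actual architecture.

```python
import torch.nn as nn

class ModalityAdapter(nn.Module):
    """Small bottleneck adapter added per modality; the frozen shared backbone stays
    untouched, so expanding to a new modality trains only adapter weights."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down, self.up = nn.Linear(dim, bottleneck), nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

def add_modality(backbone, adapters, name, dim):
    """Freeze everything already learned, then register a fresh adapter path."""
    for p in backbone.parameters():
        p.requires_grad = False
    for a in adapters.values():
        for p in a.parameters():
            p.requires_grad = False
    adapters[name] = ModalityAdapter(dim)
    return adapters[name]
```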
arXiv Detail & Related papers (2024-10-26T13:19:57Z) - On-the-fly Modulation for Balanced Multimodal Learning [53.616094855778954]
Multimodal learning is expected to boost model performance by integrating information from different modalities.
The widely-used joint training strategy leads to imbalanced and under-optimized uni-modal representations.
We propose On-the-fly Prediction Modulation (OPM) and On-the-fly Gradient Modulation (OGM) strategies to modulate the optimization of each modality.
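A hedged sketch of on-the-fly gradient modulation: the dominant modality's gradients are scaled down according to the ratio of unimodal confidences in the current batch. The tanh-based schedule and hyperparameter are illustrative, not the exact OPM/OGM rules.

```python
import torch

def modulation_coefficients(conf_audio, conf_visual, alpha=0.5):
    """Coefficients that slow down the currently dominant modality; `conf_*` are
    batch-mean confidences of each unimodal prediction on the true class."""
    conf_audio = torch.as_tensor(conf_audio, dtype=torch.float32)
    conf_visual = torch.as_tensor(conf_visual, dtype=torch.float32)
    ratio_a = conf_audio / (conf_visual + 1e-8)
    ratio_v = conf_visual / (conf_audio + 1e-8)
    coef_a = 1.0 - alpha * torch.tanh(torch.clamp(ratio_a - 1.0, min=0.0))
    coef_v = 1.0 - alpha * torch.tanh(torch.clamp(ratio_v - 1.0, min=0.0))
    return coef_a.item(), coef_v.item()

# multiply each encoder's gradients by its coefficient before optimizer.step()
```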
arXiv Detail & Related papers (2024-10-15T13:15:50Z) - Modality-Balanced Learning for Multimedia Recommendation [21.772064939915214]
We propose a Counterfactual Knowledge Distillation method to solve the imbalance problem and make the best use of all modalities.
We also design a novel generic-and-specific distillation loss to guide the multimodal student to learn wider-and-deeper knowledge from teachers.
Our method could serve as a plug-and-play module for both late-fusion and early-fusion backbones.
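The snippet below sketches one way a "generic-and-specific" distillation loss could look, assuming unimodal teachers: a generic term distills from the teachers' averaged distribution and specific terms from each teacher individually. The term weighting is an assumption, and the counterfactual component is omitted.

```python
import torch
import torch.nn.functional as F

def generic_and_specific_kd(student_logits, teacher_logits_list, T=2.0):
    """Generic term: distill from the teachers' averaged distribution.
    Specific terms: distill from each unimodal teacher individually."""
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    teacher_probs = [F.softmax(t / T, dim=1) for t in teacher_logits_list]
    generic = F.kl_div(log_p_s, torch.stack(teacher_probs).mean(0), reduction="batchmean")
    specific = sum(F.kl_div(log_p_s, p, reduction="batchmean") for p in teacher_probs)
    return (T * T) * (generic + specific / len(teacher_probs))
```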
arXiv Detail & Related papers (2024-07-26T07:53:01Z) - Diagnosing and Re-learning for Balanced Multimodal Learning [8.779005254634857]
We propose the Diagnosing & Re-learning method to overcome the imbalanced multimodal learning problem.
The learning state of each modality is estimated based on the separability of its uni-modal representation space.
In this way, over-emphasis on scarcely informative modalities is avoided.
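A small sketch of estimating a modality's learning state from the separability of its uni-modal representation space, here via a Fisher-style between/within-class variance ratio; this proxy is an assumption, not necessarily the paper's estimator.

```python
import torch

def separability_score(features, labels):
    """Rough Fisher-style separability of a unimodal representation space:
    between-class variance over within-class variance, used as an assumed proxy
    for the modality's learning state."""
    classes = labels.unique()
    global_mean = features.mean(0)
    between, within = 0.0, 0.0
    for c in classes:
        fc = features[labels == c]
        between = between + len(fc) * (fc.mean(0) - global_mean).pow(2).sum()
        within = within + (fc - fc.mean(0)).pow(2).sum()
    return (between / (within + 1e-8)).item()

# the worse-separated (under-trained) modality could then be softly re-initialized
```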
arXiv Detail & Related papers (2024-07-12T22:12:03Z) - Multimodal Representation Learning by Alternating Unimodal Adaptation [73.15829571740866]
We propose MLA (Multimodal Learning with Alternating Unimodal Adaptation) to overcome challenges where some modalities appear more dominant than others during multimodal learning.
MLA reframes the conventional joint multimodal learning process by transforming it into an alternating unimodal learning process.
It captures cross-modal interactions through a shared head, which undergoes continuous optimization across different modalities.
Experiments are conducted on five diverse datasets, encompassing scenarios with complete modalities and scenarios with missing modalities.
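A minimal sketch of the alternating setup described above: modality-specific encoders feed one shared head, and each training step forwards a single modality so the encoders never compete within the same backward pass. Dimensions and module names are illustrative.

```python
import torch.nn as nn

class AlternatingUnimodalModel(nn.Module):
    """Modality-specific encoders with one shared classification head; the shared
    head is continuously optimized by every modality, capturing cross-modal
    interactions without joint forward passes."""
    def __init__(self, dims, hidden=256, num_classes=10):
        super().__init__()
        self.encoders = nn.ModuleDict({m: nn.Linear(d, hidden) for m, d in dims.items()})
        self.shared_head = nn.Linear(hidden, num_classes)

    def forward(self, x, modality):
        return self.shared_head(self.encoders[modality](x).relu())

# training alternates: step on ("audio", batch_a), then ("visual", batch_v), ...
```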
arXiv Detail & Related papers (2023-11-17T18:57:40Z) - Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
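A hedged sketch of single-stream early fusion for skeleton modalities (e.g., joints, bones, motion): each modality is projected to a common dimension, concatenated along the token axis, and encoded jointly by one transformer. Sizes and the pooling choice are assumptions, not UmURL's exact design.

```python
import torch
import torch.nn as nn

class EarlyFusionEncoder(nn.Module):
    """Single-stream early fusion: per-modality projections, one shared encoder."""
    def __init__(self, in_dims, dim=128, heads=4, layers=2):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, dim) for d in in_dims])
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)

    def forward(self, modalities):          # list of (B, T, in_dim_i) tensors
        tokens = torch.cat([p(x) for p, x in zip(self.proj, modalities)], dim=1)
        return self.encoder(tokens).mean(dim=1)   # one joint multi-modal embedding
```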
arXiv Detail & Related papers (2023-11-06T13:56:57Z) - Improving Discriminative Multi-Modal Learning with Large-Scale Pre-Trained Models [51.5543321122664]
This paper investigates how to better leverage large-scale pre-trained uni-modal models to enhance discriminative multi-modal learning.
We introduce Multi-Modal Low-Rank Adaptation learning (MMLoRA).
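MMLoRA's details are not in the entry, so the snippet shows a generic low-rank adaptation wrapper around a frozen pretrained linear layer, one plausible building block for adapting large uni-modal models cheaply; rank, scaling, and initialization follow standard LoRA conventions rather than the paper's code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Low-rank adaptation of a frozen pretrained linear layer: only the small
    A/B matrices are trained."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # keep pretrained weights frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```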
arXiv Detail & Related papers (2023-10-08T15:01:54Z)