Boosting Multimodal Learning via Disentangled Gradient Learning
- URL: http://arxiv.org/abs/2507.10213v1
- Date: Mon, 14 Jul 2025 12:31:28 GMT
- Title: Boosting Multimodal Learning via Disentangled Gradient Learning
- Authors: Shicai Wei, Chunbo Luo, Yang Luo
- Abstract summary: Multimodal learning often suffers from under-optimization and may perform worse than unimodal learning. We reveal an optimization conflict between the modality encoder and the modality fusion module in multimodal models. We propose a disentangled gradient learning (DGL) framework to decouple the optimization of the modality encoder and the modality fusion module.
- Score: 6.93254775445168
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal learning often suffers from under-optimization and may perform worse than unimodal learning. Existing methods attribute this problem to imbalanced learning between modalities and rebalance them through gradient modulation. However, they fail to explain why the dominant modality in a multimodal model also underperforms its unimodal counterpart. In this work, we reveal an optimization conflict between the modality encoder and the modality fusion module in multimodal models. Specifically, we prove that cross-modal fusion in multimodal models decreases the gradient passed back to each modality encoder compared with unimodal models. Consequently, the performance of each modality in the multimodal model is inferior to that in the unimodal model. To address this, we propose a disentangled gradient learning (DGL) framework that decouples the optimization of the modality encoder and the modality fusion module. DGL truncates the gradient back-propagated from the multimodal loss to the modality encoder and replaces it with the gradient from the unimodal loss. In addition, DGL removes the gradient back-propagated from the unimodal loss to the modality fusion module. This eliminates gradient interference between the modality encoder and the modality fusion module while preserving their respective optimization processes. Finally, extensive experiments on multiple types of modalities, tasks, and frameworks with dense cross-modal interaction demonstrate the effectiveness and versatility of the proposed DGL. Code is available at https://github.com/shicaiwei123/ICCV2025-GDL
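The gradient routing described in the abstract maps directly onto stop-gradient operations. Below is a minimal two-modality PyTorch sketch of the DGL idea, assuming concatenation fusion and separate unimodal heads; these are illustrative choices, not necessarily the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DGLModel(nn.Module):
    """Minimal sketch of disentangled gradient learning (DGL) for two
    modalities. The encoders, concat fusion, and separate unimodal heads
    are illustrative assumptions, not the paper's exact setup."""

    def __init__(self, enc_a, enc_b, dim, num_classes):
        super().__init__()
        self.enc_a, self.enc_b = enc_a, enc_b          # modality encoders
        self.head_a = nn.Linear(dim, num_classes)      # unimodal classifier A
        self.head_b = nn.Linear(dim, num_classes)      # unimodal classifier B
        self.fusion = nn.Linear(2 * dim, num_classes)  # fusion module

    def forward(self, x_a, x_b, y):
        ce = nn.functional.cross_entropy
        f_a, f_b = self.enc_a(x_a), self.enc_b(x_b)

        # Unimodal losses: their gradients reach the encoders only,
        # replacing the (attenuated) gradient from the multimodal loss.
        loss_uni = ce(self.head_a(f_a), y) + ce(self.head_b(f_b), y)

        # Multimodal loss: detach() truncates its gradient so it updates
        # the fusion module but never reaches the encoders.
        fused = self.fusion(torch.cat([f_a.detach(), f_b.detach()], dim=1))
        loss_multi = ce(fused, y)

        # One backward pass optimizes both parts without interference.
        return loss_uni + loss_multi
```

Each loss reaches a disjoint set of modules (the unimodal losses touch only the encoders and unimodal heads; the multimodal loss touches only the fusion module), so a single backward pass and optimizer step applies both updates without interference.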
Related papers
- G$^{2}$D: Boosting Multimodal Learning with Gradient-Guided Distillation [0.7673339435080445]
We introduce Gradient-Guided Distillation (G$^2$D), a knowledge distillation framework that optimizes the multimodal model with a custom-built loss function. We show that G$^2$D amplifies the significance of weak modalities during training and outperforms state-of-the-art methods in classification and regression tasks.
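The summary does not spell out the custom loss; as one heavily hedged illustration, the sketch below distills per-modality unimodal teacher logits into a multimodal student, with per-modality weights meant to amplify weak modalities. The function name, the weight placeholders, and the KD form are assumptions, not G$^2$D's published loss.

```python
import torch
import torch.nn.functional as F

def weighted_distillation_loss(student_logits, teacher_logits_list, weights, T=2.0):
    """Illustrative weighted KD loss in the spirit of G^2D: a multimodal
    student mimics per-modality unimodal teachers, with larger weights on
    weaker modalities. How G^2D actually derives `weights` (from gradient
    statistics) is not reproduced here; they are placeholders."""
    log_p = F.log_softmax(student_logits / T, dim=1)
    loss = 0.0
    for w, t in zip(weights, teacher_logits_list):
        q = F.softmax(t.detach() / T, dim=1)       # frozen teacher targets
        loss = loss + w * F.kl_div(log_p, q, reduction="batchmean") * (T * T)
    return loss
```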
arXiv Detail & Related papers (2025-06-26T17:37:36Z)
- Improving Multimodal Learning Balance and Sufficiency through Data Remixing [14.282792733217653]
Existing methods for boosting the weak modality fail to achieve both unimodal sufficiency and multimodal balance. We propose multimodal Data Remixing, which decouples multimodal data and filters hard samples for each modality to mitigate modality imbalance. Our method can be seamlessly integrated with existing approaches, improving accuracy by approximately 6.50% on CREMA-D and 3.41% on Kinetics-Sounds.
arXiv Detail & Related papers (2025-06-13T08:01:29Z)
- Continual Multimodal Contrastive Learning [70.60542106731813]
Multimodal contrastive learning (MCL) advances the alignment of different modalities and generates multimodal representations in a joint space. However, a critical yet often overlooked challenge remains: multimodal data is rarely collected in a single process, and training from scratch is computationally expensive. In this paper, we formulate continual multimodal contrastive learning (CMCL) through two specialized principles of stability and plasticity. We theoretically derive a novel optimization-based method that projects updated gradients from dual sides onto subspaces where no gradient can interfere with previously learned knowledge.
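The projection step has a compact form: given an orthonormal basis of the directions tied to previously learned knowledge, the gradient is projected onto their orthogonal complement. A minimal single-sided sketch, assuming such a basis is already stored; the paper's dual-sided projection and basis construction are not reproduced here.

```python
import torch

def project_out(grad: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
    """Remove the components of `grad` that lie in the span of `basis`.

    basis: (d, k) matrix whose orthonormal columns span directions tied to
    previously learned knowledge. The returned gradient is orthogonal to
    all of them, so the update cannot interfere with that knowledge.
    (Single-sided sketch; CMCL projects from dual sides.)
    """
    return grad - basis @ (basis.T @ grad)

# Usage sketch: g = param.grad.flatten()
#               param.grad = project_out(g, U).view_as(param.grad)
```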
arXiv Detail & Related papers (2025-03-19T07:57:08Z)
- Classifier-guided Gradient Modulation for Enhanced Multimodal Learning [50.7008456698935]
Classifier-Guided Gradient Modulation (CGGM) is a novel method for balancing multimodal learning with gradients.
We conduct extensive experiments on four multimodal datasets: UPMC Food-101, CMU-MOSI, IEMOCAP, and BraTS.
CGGM consistently outperforms all baselines and other state-of-the-art methods.
arXiv Detail & Related papers (2024-11-03T02:38:43Z)
- SurgeryV2: Bridging the Gap Between Model Merging and Multi-Task Learning with Deep Representation Surgery [54.866490321241905]
Model merging offers a promising approach to multi-task learning (MTL) by combining multiple expert models.
In this paper, we examine the merged model's representation distribution and uncover a critical issue of "representation bias".
This bias arises from a significant distribution gap between the representations of the merged model and those of the expert models, leading to suboptimal performance of the merged MTL model.
arXiv Detail & Related papers (2024-10-18T11:49:40Z)
- ReconBoost: Boosting Can Achieve Modality Reconcilement [89.4377895465204]
We study the modality-alternating learning paradigm to achieve reconcilement.
We propose a new method, ReconBoost, which updates one fixed modality at a time.
We show that the proposed method resembles Friedman's Gradient Boosting (GB) algorithm, where each newly updated learner can correct errors made by the others.
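A boosting-flavored loop makes the GB analogy concrete: each round trains one modality's learner against the residual errors of the frozen ensemble of the others. A minimal sketch, assuming logit-additive learners and a round-robin schedule; this is not ReconBoost's exact objective.

```python
import torch
import torch.nn.functional as F

def train_one_modality(learners, optimizer_m, loader, m):
    """Boosting-style round: train learner `m` against the residual errors
    of the frozen ensemble of all other modality learners. Cycling `m`
    over modalities gives the alternating schedule. Illustrative sketch."""
    for xs, y in loader:                    # xs: list of per-modality inputs
        with torch.no_grad():               # freeze the other learners
            ensemble = sum(l(x) for k, (l, x) in enumerate(zip(learners, xs))
                           if k != m)
        logits = ensemble + learners[m](xs[m])   # only learner m gets gradients
        loss = F.cross_entropy(logits, y)
        optimizer_m.zero_grad()
        loss.backward()
        optimizer_m.step()
```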
arXiv Detail & Related papers (2024-05-15T13:22:39Z)
- Multimodal Representation Learning by Alternating Unimodal Adaptation [73.15829571740866]
We propose MLA (Multimodal Learning with Alternating Unimodal Adaptation) to overcome challenges where some modalities appear more dominant than others during multimodal learning.
MLA reframes the conventional joint multimodal learning process by transforming it into an alternating unimodal learning process.
It captures cross-modal interactions through a shared head, which undergoes continuous optimization across different modalities.
Experiments are conducted on five diverse datasets, encompassing scenarios with complete modalities and scenarios with missing modalities.
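A minimal sketch of the alternating scheme: each step optimizes one modality's encoder together with the shared head, so cross-modal structure accumulates in the head while the encoders are trained unimodally. The round-robin schedule and names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def mla_style_step(encoders, shared_head, optimizers, batch, step):
    """One alternating-unimodal step: pick a modality round-robin and train
    its encoder plus the shared head on that modality alone. The shared
    head is optimized continuously across modalities, capturing
    cross-modal interactions. Illustrative sketch of the MLA idea."""
    xs, y = batch                       # xs: list of per-modality inputs
    m = step % len(encoders)            # round-robin over modalities
    logits = shared_head(encoders[m](xs[m]))
    loss = F.cross_entropy(logits, y)
    optimizers[m].zero_grad()
    loss.backward()                     # grads flow to encoder m + shared head
    optimizers[m].step()                # optimizer m covers encoder m + head
    return loss.item()
```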
arXiv Detail & Related papers (2023-11-17T18:57:40Z)
- Improving Discriminative Multi-Modal Learning with Large-Scale Pre-Trained Models [51.5543321122664]
This paper investigates how to better leverage large-scale pre-trained uni-modal models to enhance discriminative multi-modal learning.
We introduce Multi-Modal Low-Rank Adaptation learning (MMLoRA).
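LoRA itself is standard: a frozen pre-trained weight plus a trainable low-rank residual. Below is a generic LoRA linear layer of the kind such a method would attach to pre-trained uni-modal backbones; how MMLoRA places and combines adapters across modalities is not reproduced here.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Generic low-rank adaptation of a frozen linear layer:
    y = W x + (alpha / r) * B(A x), with only A and B trainable."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)     # freeze pre-trained W
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.A = nn.Linear(base.in_features, r, bias=False)
        self.B = nn.Linear(r, base.out_features, bias=False)
        nn.init.normal_(self.A.weight, std=0.02)
        nn.init.zeros_(self.B.weight)              # adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.B(self.A(x))
```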
arXiv Detail & Related papers (2023-10-08T15:01:54Z)
- Mitigating Modality Collapse in Multimodal VAEs via Impartial Optimization [7.4262579052708535]
We argue that modality collapse is a consequence of conflicting gradients during multimodal VAE training.
We show how to detect the sub-graphs of the computational graph where gradients conflict.
We empirically show that our framework significantly improves the reconstruction performance, conditional generation, and coherence of the latent space across modalities.
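Gradient conflict on shared parameters can be detected with a sign test on the dot product of the two loss gradients. A minimal detection sketch; the framework's impartial resolution step is omitted.

```python
import torch

def grads_conflict(loss_a, loss_b, shared_params) -> bool:
    """Detect conflicting gradients on shared parameters: a negative dot
    product between the two loss gradients means the updates pull the
    parameters in opposing directions. Detection only; the paper's
    impartial resolution is not shown."""
    g_a = torch.autograd.grad(loss_a, shared_params, retain_graph=True)
    g_b = torch.autograd.grad(loss_b, shared_params, retain_graph=True)
    dot = sum((a * b).sum() for a, b in zip(g_a, g_b))
    return dot.item() < 0.0
```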
arXiv Detail & Related papers (2022-06-09T13:29:25Z)
- Balanced Multimodal Learning via On-the-fly Gradient Modulation [10.5602074277814]
Multimodal learning helps to comprehensively understand the world by integrating different senses.
We propose on-the-fly gradient modulation to adaptively control the optimization of each modality, by monitoring the discrepancy between their contributions to the learning objective.
arXiv Detail & Related papers (2022-03-29T08:26:38Z)
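The modulation can be phrased as a per-modality gradient scale computed from a contribution ratio, e.g. the softmax probability each unimodal branch assigns to the ground-truth class. A simplified sketch; the exact discrepancy measure and scaling schedule follow the spirit, not the letter, of the method.

```python
import math
import torch

@torch.no_grad()
def modulate_gradients(params_a, params_b, score_a, score_b, alpha=0.1):
    """Simplified on-the-fly gradient modulation: after loss.backward(),
    shrink the gradients of whichever modality currently dominates.
    `score_a`/`score_b` are scalar contribution scores, e.g. the mean
    softmax probability each unimodal branch assigns to the true class;
    the tanh schedule is a simplified assumption."""
    rho_a = score_a / (score_b + 1e-8)       # >1 means modality A dominates
    rho_b = score_b / (score_a + 1e-8)
    coef_a = 1.0 - math.tanh(alpha * rho_a) if rho_a > 1 else 1.0
    coef_b = 1.0 - math.tanh(alpha * rho_b) if rho_b > 1 else 1.0
    for p in params_a:
        if p.grad is not None:
            p.grad.mul_(coef_a)
    for p in params_b:
        if p.grad is not None:
            p.grad.mul_(coef_b)
```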