Quantifying Multimodal Imbalance: A GMM-Guided Adaptive Loss for Audio-Visual Learning
- URL: http://arxiv.org/abs/2510.21797v2
- Date: Wed, 29 Oct 2025 06:31:09 GMT
- Title: Quantifying Multimodal Imbalance: A GMM-Guided Adaptive Loss for Audio-Visual Learning
- Authors: Zhaocheng Liu, Zhiwen Yu, Xiaoqing Liu
- Abstract summary: Existing solutions mainly focus on optimization- or data-based strategies but rarely exploit the information inherent in multimodal imbalance.
We propose a novel quantitative analysis framework for Multimodal Imbalance and design a sample-level adaptive loss function.
- Score: 12.236332735708473
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The heterogeneity of multimodal data leads to inconsistencies and imbalance, allowing a dominant modality to steer gradient updates. Existing solutions mainly focus on optimization- or data-based strategies but rarely exploit the information inherent in multimodal imbalance or conduct its quantitative analysis. To address this gap, we propose a novel quantitative analysis framework for Multimodal Imbalance and design a sample-level adaptive loss function. We define the Modality Gap as the Softmax score difference between modalities for the correct class and model its distribution using a bimodal Gaussian Mixture Model (GMM), representing balanced and imbalanced samples. Using Bayes' theorem, we estimate each sample's posterior probability of belonging to these two groups. Based on this, our adaptive loss (1) minimizes the overall Modality Gap, (2) aligns imbalanced samples with balanced ones, and (3) adaptively penalizes each sample according to its imbalance degree. A two-stage training strategy (warm-up and adaptive phases) yields state-of-the-art performance on CREMA-D (80.65%), AVE (70.40%), and KineticSound (72.42%). Fine-tuning with high-quality samples identified by the GMM further improves results, highlighting their value for effective multimodal fusion.
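The bimodal-GMM-plus-Bayes step described in the abstract can be sketched as follows. This is a minimal illustration under our own assumptions, not the paper's implementation: 1-D gap values, a hand-rolled two-component EM fit, and invented function names (`fit_bimodal_gmm`, `imbalance_posterior`).

```python
import numpy as np

def norm_pdf(x, mu, sigma):
    """Gaussian density, broadcast over components."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def fit_bimodal_gmm(gaps, n_iter=100):
    """Fit a 2-component 1-D Gaussian mixture to modality-gap values via EM."""
    # crude init: components around the 25th and 75th percentiles
    mu = np.array([np.percentile(gaps, 25), np.percentile(gaps, 75)])
    sigma = np.full(2, gaps.std()) + 1e-6
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: per-sample responsibilities via Bayes' theorem
        dens = pi * norm_pdf(gaps[:, None], mu, sigma)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate means, stds, and mixing weights
        nk = resp.sum(axis=0)
        mu = (resp * gaps[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (gaps[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
        pi = nk / len(gaps)
    return mu, sigma, pi

def imbalance_posterior(gap, mu, sigma, pi):
    """Posterior P(imbalanced | gap), taking the component with the larger
    |mean| as the 'imbalanced' group."""
    dens = pi * norm_pdf(np.atleast_1d(gap)[:, None], mu, sigma)
    post = dens / dens.sum(axis=1, keepdims=True)
    return post[:, np.argmax(np.abs(mu))]
```

The posterior can then act as the per-sample weight that scales the adaptive penalty.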
Related papers
- MAPLE: Modality-Aware Post-training and Learning Ecosystem [6.41025301801655]
Existing RL post-training pipelines treat all input signals as equally relevant, ignoring which modalities each task actually requires.
We introduce MAPLE, a complete modality-aware post-training and learning ecosystem.
MAPLE narrows uni/multi-modal accuracy gaps by 30.24%, converges 3.18x faster, and maintains stability across all modality combinations.
arXiv Detail & Related papers (2026-02-12T05:26:36Z)
- Revisit Modality Imbalance at the Decision Layer [11.94300606032047]
Multimodal learning integrates information from different modalities to enhance model performance.
It often suffers from modality imbalance, where dominant modalities overshadow weaker ones during joint optimization.
This paper reveals that such an imbalance not only occurs during representation learning but also manifests significantly at the decision layer.
arXiv Detail & Related papers (2025-10-16T08:11:24Z)
- MIDAS: Misalignment-based Data Augmentation Strategy for Imbalanced Multimodal Learning [14.06705718861471]
Multimodal models often over-rely on dominant modalities, failing to achieve optimal performance.
We propose MIDAS, a novel data augmentation strategy that generates misaligned samples with semantically inconsistent cross-modal information.
Experiments on multiple multimodal classification benchmarks demonstrate that MIDAS significantly outperforms related baselines in addressing modality imbalance.
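One generic way to produce misaligned cross-modal samples in the spirit of the summary above is to re-pair a fraction of one modality with features drawn from other samples. This is a rough sketch with an invented helper name (`make_misaligned`), not the MIDAS method itself.

```python
import numpy as np

def make_misaligned(audio, video, frac=0.5, seed=0):
    """Re-pair a fraction of audio clips with video features from other,
    randomly chosen samples; returns augmented video and an alignment mask."""
    rng = np.random.default_rng(seed)
    n = len(audio)
    idx = rng.choice(n, size=int(frac * n), replace=False)
    perm = rng.permutation(idx)           # shuffle the chosen indices
    video_aug = video.copy()
    video_aug[idx] = video[perm]          # audio[idx] now sees another sample's video
    aligned = np.ones(n, dtype=bool)
    aligned[idx] = perm == idx            # a few may map back to themselves
    return audio, video_aug, aligned
```

The `aligned` mask could then serve as the label for a misalignment-aware training objective.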
arXiv Detail & Related papers (2025-09-30T06:13:17Z)
- MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources [113.33902847941941]
Variance-Aware Sampling (VAS) is a data selection strategy guided by the Variance Promotion Score (VPS).
We release large-scale, carefully curated resources containing 1.6M long CoT cold-start data and 15k RL QA pairs.
Experiments across mathematical reasoning benchmarks demonstrate the effectiveness of both the curated data and the proposed VAS.
arXiv Detail & Related papers (2025-09-25T14:58:29Z)
- Importance Weighted Score Matching for Diffusion Samplers with Enhanced Mode Coverage [16.94974733994214]
Prevailing methods often circumvent the lack of target data by optimizing reverse KL-based objectives.
We propose a principled approach for training diffusion-based samplers by directly targeting an objective analogous to the forward KL divergence.
Our approach consistently outperforms existing neural samplers across all distributional distance metrics.
arXiv Detail & Related papers (2025-05-26T02:48:26Z)
- Benchmarking Multi-modal Semantic Segmentation under Sensor Failures: Missing and Noisy Modality Robustness [61.87055159919641]
Multi-modal semantic segmentation (MMSS) addresses the limitations of single-modality data by integrating complementary information across modalities.
Despite notable progress, a significant gap persists between research and real-world deployment due to variability and uncertainty in multi-modal data quality.
We introduce a robustness benchmark that evaluates MMSS models under three scenarios: Entire-Missing Modality (EMM), Random-Missing Modality (RMM), and Noisy Modality (NM).
arXiv Detail & Related papers (2025-03-24T08:46:52Z)
- Balance-aware Sequence Sampling Makes Multi-modal Learning Better [0.5439020425819]
We propose Balance-aware Sequence Sampling (BSS) to enhance the robustness of MML.
We first define a multi-perspective measurer to evaluate the balance degree of each sample.
We then employ a scheduler based on curriculum learning (CL) that incrementally provides training subsets, progressing from balanced to imbalanced samples to rebalance MML.
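A curriculum scheduler of the kind summarized above can be sketched in a few lines, assuming each sample already has a scalar balance score (higher means more balanced); the function name `curriculum_subsets` is our own, not BSS's API.

```python
def curriculum_subsets(balance_scores, n_stages=4):
    """Yield cumulative training subsets (as index lists) ordered from most
    to least balanced, so later stages gradually mix in imbalanced samples."""
    order = sorted(range(len(balance_scores)),
                   key=lambda i: balance_scores[i], reverse=True)
    step = max(1, len(order) // n_stages)
    for k in range(1, n_stages + 1):
        end = len(order) if k == n_stages else k * step
        yield order[:end]
```

Each yielded subset is a superset of the previous one, matching the "progressing from balanced to imbalanced" schedule.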
arXiv Detail & Related papers (2025-01-01T06:19:55Z)
- Generative Lines Matching Models [2.6089354079273512]
We propose a new probability flow model, which matches globally straight lines interpolating the two distributions.
We demonstrate that the flow fields produced by the LMM exhibit notable temporal consistency, resulting in trajectories with excellent straightness scores.
Overall, the LMM achieves state-of-the-art FID scores with minimal NFEs on established benchmark datasets.
arXiv Detail & Related papers (2024-12-09T11:33:38Z)
- On-the-fly Modulation for Balanced Multimodal Learning [53.616094855778954]
Multimodal learning is expected to boost model performance by integrating information from different modalities.
The widely-used joint training strategy leads to imbalanced and under-optimized uni-modal representations.
We propose On-the-fly Prediction Modulation (OPM) and On-the-fly Gradient Modulation (OGM) strategies to modulate the optimization of each modality.
arXiv Detail & Related papers (2024-10-15T13:15:50Z)
- Adaptive Weighted Co-Learning for Cross-Domain Few-Shot Learning [23.615250207134004]
Cross-domain few-shot learning (CDFSL) induces a very challenging adaptation problem.
We propose a simple Adaptive Weighted Co-Learning (AWCoL) method to address the CDFSL challenge.
Comprehensive experiments are conducted on multiple benchmark datasets and the empirical results demonstrate that the proposed method produces state-of-the-art CDFSL performance.
arXiv Detail & Related papers (2023-12-06T22:09:52Z)
- Ensemble Modeling for Multimodal Visual Action Recognition [50.38638300332429]
We propose an ensemble modeling approach for multimodal action recognition.
We independently train individual modality models using a variant of focal loss tailored to handle the long-tailed distribution of the MECCANO [21] dataset.
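For reference, the standard focal loss that this entry's variant builds on (the generic formulation, not necessarily the paper's tailored version) down-weights easy, high-confidence examples:

```python
import numpy as np

def focal_loss(probs, targets, gamma=2.0):
    """Standard focal loss FL(p_t) = -(1 - p_t)^gamma * log(p_t); gamma=0
    recovers plain cross-entropy, larger gamma down-weights easy examples."""
    p_t = probs[np.arange(len(targets)), targets]   # probability of true class
    p_t = np.clip(p_t, 1e-12, 1.0)
    return -((1.0 - p_t) ** gamma) * np.log(p_t)
```

With long-tailed data, the `(1 - p_t)^gamma` factor keeps frequent, easily classified head classes from dominating the gradient.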
arXiv Detail & Related papers (2023-08-10T08:43:20Z)
- Learning to Re-weight Examples with Optimal Transport for Imbalanced Classification [74.62203971625173]
Imbalanced data pose challenges for deep learning based classification models.
One of the most widely-used approaches for tackling imbalanced data is re-weighting.
We propose a novel re-weighting method based on optimal transport (OT) from a distributional point of view.
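As a generic illustration of the OT machinery such re-weighting methods build on (a textbook entropic Sinkhorn sketch, not this paper's scheme):

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, n_iter=500):
    """Entropic-regularized optimal transport between histograms a and b
    under cost matrix C; returns a plan P with row sums a and column sums b."""
    K = np.exp(-C / eps)                  # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):               # alternating marginal projections
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]
```

The resulting plan tells how much probability mass moves between source and target bins, which distribution-level re-weighting schemes can exploit.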
arXiv Detail & Related papers (2022-08-05T01:23:54Z)
- Causal Balancing for Domain Generalization [95.97046583437145]
We propose a balanced mini-batch sampling strategy to reduce the domain-specific spurious correlations in observed training distributions.
We provide an identifiability guarantee of the source of spuriousness and show that our proposed approach provably samples from a balanced, spurious-free distribution.
arXiv Detail & Related papers (2022-06-10T17:59:11Z)
- DRFLM: Distributionally Robust Federated Learning with Inter-client Noise via Local Mixup [58.894901088797376]
Federated learning has emerged as a promising approach for training a global model using data from multiple organizations without leaking their raw data.
We propose a general framework to solve the above two challenges simultaneously.
We provide comprehensive theoretical analysis including robustness analysis, convergence analysis, and generalization ability.
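The local mixup named in this entry's title follows the standard mixup recipe; below is a generic sketch of that recipe applied within a client's local batch, not DRFLM's full framework.

```python
import numpy as np

def local_mixup(x, y, alpha=0.2, seed=0):
    """Standard mixup on a local batch: x_mix = lam*x + (1-lam)*x[perm],
    with lam ~ Beta(alpha, alpha) and (one-hot) labels mixed the same way."""
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(x))
    return lam * x + (1 - lam) * x[perm], lam * y + (1 - lam) * y[perm]
```

Mixing within each client smooths the local empirical distribution, which is the noise-robustness lever the summary alludes to.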
arXiv Detail & Related papers (2022-04-16T08:08:29Z)
- Robust Finite Mixture Regression for Heterogeneous Targets [70.19798470463378]
We propose an FMR model that finds sample clusters and jointly models multiple incomplete mixed-type targets.
We provide non-asymptotic oracle performance bounds for our model under a high-dimensional learning framework.
The results show that our model can achieve state-of-the-art performance.
arXiv Detail & Related papers (2020-10-12T03:27:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.