Related papers: Reducing Unimodal Bias in Multi-Modal Semantic Segmentation with Multi-Scale Functional Entropy Regularization

Reducing Unimodal Bias in Multi-Modal Semantic Segmentation with Multi-Scale Functional Entropy Regularization

URL: http://arxiv.org/abs/2505.06635v1
Date: Sat, 10 May 2025 12:58:15 GMT
Title: Reducing Unimodal Bias in Multi-Modal Semantic Segmentation with Multi-Scale Functional Entropy Regularization
Authors: Xu Zheng, Yuanhuiyi Lyu, Lutao Jiang, Danda Pani Paudel, Luc Van Gool, Xuming Hu,
Abstract summary: Fusing and balancing multi-modal inputs from novel sensors for dense prediction tasks is critically important.<n>One major limitation is the tendency of multi-modal frameworks to over-rely on easily learnable modalities.<n>We propose a plug-and-play regularization term based on functional entropy, which introduces no additional parameters.
Score: 66.10528870853324
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Fusing and balancing multi-modal inputs from novel sensors for dense prediction tasks, particularly semantic segmentation, is critically important yet remains a significant challenge. One major limitation is the tendency of multi-modal frameworks to over-rely on easily learnable modalities, a phenomenon referred to as unimodal dominance or bias. This issue becomes especially problematic in real-world scenarios where the dominant modality may be unavailable, resulting in severe performance degradation. To this end, we apply a simple but effective plug-and-play regularization term based on functional entropy, which introduces no additional parameters or modules. This term is designed to intuitively balance the contribution of each visual modality to the segmentation results. Specifically, we leverage the log-Sobolev inequality to bound functional entropy using functional-Fisher-information. By maximizing the information contributed by each visual modality, our approach mitigates unimodal dominance and establishes a more balanced and robust segmentation framework. A multi-scale regularization module is proposed to apply our proposed plug-and-play term on high-level features and also segmentation predictions for more balanced multi-modal learning. Extensive experiments on three datasets demonstrate that our proposed method achieves superior performance, i.e., +13.94%, +3.25%, and +3.64%, without introducing any additional parameters.

Related papers

Plug, Play, and Fortify: A Low-Cost Module for Robust Multimodal Image Understanding Models [6.350443894942629]
Multimodal Weight Allocation Module (MWAM) is a plug-and-play component that dynamically re-balances the contribution of each branch during training.<n>MWAM delivers consistent performance gains across a wide range of tasks and modality combinations.
arXiv Detail & Related papers (2026-02-26T05:51:41Z)
FDRMFL:Multi-modal Federated Feature Extraction Model Based on Information Maximization and Contrastive Learning [4.453671369861554]
This study focuses on the feature extraction problem in multi-modal data regression.<n>It addresses three core challenges in real-world scenarios: limited and non-IID data, effective extraction and fusion of multi-modal information, and susceptibility to catastrophic forgetting in model learning.
arXiv Detail & Related papers (2025-11-30T17:13:35Z)
I$^3$-MRec: Invariant Learning with Information Bottleneck for Incomplete Modality Recommendation [56.55935146424585]
We introduce textbfI$3$-MRec, which learns with textbfInformation bottleneck principle for textbfIncomplete textbfModality textbfRecommendation.<n>By treating each modality as a distinct semantic environment, I$3$-MRec employs invariant risk minimization (IRM) to learn preference-oriented representations.<n>I$3$-MRec consistently outperforms existing state-of-the-art MRS methods across various modality-missing scenarios
arXiv Detail & Related papers (2025-08-06T09:29:50Z)
Generalized Multimodal Fusion via Poisson-Nernst-Planck Equation [5.022049774600693]
This paper proposes a generalized multimodal fusion method (GMF) via the Poisson-Nernst-Planck (PNP) equation. We show that the proposed GMF achieves close to the state-of-the-art (SOTA) accuracy while utilizing fewer parameters and computational resources.
arXiv Detail & Related papers (2024-10-20T19:15:28Z)
On-the-fly Modulation for Balanced Multimodal Learning [53.616094855778954]
Multimodal learning is expected to boost model performance by integrating information from different modalities. The widely-used joint training strategy leads to imbalanced and under-optimized uni-modal representations. We propose On-the-fly Prediction Modulation (OPM) and On-the-fly Gradient Modulation (OGM) strategies to modulate the optimization of each modality.
arXiv Detail & Related papers (2024-10-15T13:15:50Z)
U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: An Unbiased Multiscale Modal Fusion Model for Multimodal Semantics. We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features. Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z)
Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding. We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL. UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks. Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment. We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
Revisiting Modality Imbalance In Multimodal Pedestrian Detection [6.7841188753203046]
We introduce a novel training setup with regularizer in the multimodal architecture to resolve the problem of this disparity between the modalities. Specifically, our regularizer term helps to make the feature fusion method more robust by considering both the feature extractors equivalently important during the training.
arXiv Detail & Related papers (2023-02-24T11:56:57Z)
Dynamic Feature Regularized Loss for Weakly Supervised Semantic Segmentation [37.43674181562307]
We propose a new regularized loss which utilizes both shallow and deep features that are dynamically updated. Our approach achieves new state-of-the-art performances, outperforming other approaches by a significant margin with more than 6% mIoU increase.
arXiv Detail & Related papers (2021-08-03T05:11:00Z)
Removing Bias in Multi-modal Classifiers: Regularization by Maximizing Functional Entropies [88.0813215220342]
Some modalities can more easily contribute to the classification results than others. We develop a method based on the log-Sobolev inequality, which bounds the functional entropy with the functional-Fisher-information. On the two challenging multi-modal datasets VQA-CPv2 and SocialIQ, we obtain state-of-the-art results while more uniformly exploiting the modalities.
arXiv Detail & Related papers (2020-10-21T07:40:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.