Countering Multi-modal Representation Collapse through Rank-targeted Fusion
- URL: http://arxiv.org/abs/2511.06450v1
- Date: Sun, 09 Nov 2025 16:34:19 GMT
- Title: Countering Multi-modal Representation Collapse through Rank-targeted Fusion
- Authors: Seulgi Kim, Kiran Kokilepersaud, Mohit Prabhushankar, Ghassan AlRegib
- Abstract summary: Multi-modal fusion methods often suffer from two types of representation collapse: feature collapse and modality collapse.
We propose a theoretically grounded fusion framework that selectively blends less informative features from one modality with complementary features from another modality.
Our approach significantly outperforms prior state-of-the-art methods by up to 3.74%.
- Score: 13.12918046927018
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-modal fusion methods often suffer from two types of representation collapse: feature collapse where individual dimensions lose their discriminative power (as measured by eigenspectra), and modality collapse where one dominant modality overwhelms the other. Applications like human action anticipation that require fusing multifarious sensor data are hindered by both feature and modality collapse. However, existing methods attempt to counter feature collapse and modality collapse separately. This is because there is no unifying framework that efficiently addresses feature and modality collapse in conjunction. In this paper, we posit the utility of effective rank as an informative measure that can be utilized to quantify and counter both the representation collapses. We propose \textit{Rank-enhancing Token Fuser}, a theoretically grounded fusion framework that selectively blends less informative features from one modality with complementary features from another modality. We show that our method increases the effective rank of the fused representation. To address modality collapse, we evaluate modality combinations that mutually increase each others' effective rank. We show that depth maintains representational balance when fused with RGB, avoiding modality collapse. We validate our method on action anticipation, where we present \texttt{R3D}, a depth-informed fusion framework. Extensive experiments on NTURGBD, UTKinect, and DARai demonstrate that our approach significantly outperforms prior state-of-the-art methods by up to 3.74\%. Our code is available at: \href{https://github.com/olivesgatech/R3D}{https://github.com/olivesgatech/R3D}.
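The abstract centers on effective rank as the single measure that quantifies both feature and modality collapse. As a rough illustration (not the paper's implementation), effective rank is commonly computed as the exponential of the Shannon entropy of the normalized singular-value distribution of a feature matrix: a full-rank, well-spread representation scores near the ambient dimension, while a collapsed one scores near 1.

```python
import numpy as np

def effective_rank(features: np.ndarray, eps: float = 1e-12) -> float:
    """Effective rank: exp of the Shannon entropy of the normalized
    singular-value distribution of a (samples x dim) feature matrix."""
    s = np.linalg.svd(features, compute_uv=False)
    p = s / (s.sum() + eps)   # normalize singular values into a distribution
    p = p[p > eps]            # drop numerically-zero mass before taking logs
    return float(np.exp(-(p * np.log(p)).sum()))

# A full-rank identity batch spreads energy over all 4 dimensions (erank ~ 4.0),
# while a rank-1 batch concentrates on one direction (erank ~ 1.0).
print(effective_rank(np.eye(4)))
print(effective_rank(np.ones((4, 4))))
```

Under this measure, "feature collapse" shows up as a shrinking effective rank within one modality, and "modality collapse" as the fused representation's effective rank tracking only the dominant modality.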
Related papers
- Splat Feature Solver [2.385329252971734]
We present a kernel- and feature-agnostic formulation of the feature lifting problem as a sparse linear inverse problem.
We introduce two complementary regularization strategies to stabilize the solution and enhance semantic fidelity.
Our approach achieves state-of-the-art performance on open-vocabulary 3D segmentation benchmarks.
arXiv Detail & Related papers (2025-08-17T03:13:06Z)
- A Closer Look at Multimodal Representation Collapse [12.399005128036746]
We show that modality collapse happens when noisy features from one modality are entangled, via a shared set of neurons in the fusion head, with predictive features from another.
We propose an algorithm that prevents modality collapse through explicit basis reallocation, with applications in dealing with missing modalities.
arXiv Detail & Related papers (2025-05-28T15:31:53Z)
- DADM: Dual Alignment of Domain and Modality for Face Anti-spoofing [58.62312400472865]
Multi-modal face anti-spoofing (FAS) has emerged as a prominent research focus.
We propose an alignment module between modalities based on mutual information.
We employ a dual alignment optimization method that aligns both sub-domain hyperplanes and modality angle margins.
arXiv Detail & Related papers (2025-03-01T10:12:00Z)
- $\textrm{A}^{\textrm{2}}$RNet: Adversarial Attack Resilient Network for Robust Infrared and Visible Image Fusion [22.666969931655043]
Infrared and visible image fusion (IVIF) is a crucial technique for enhancing visual performance.
We propose a novel adversarial attack resilient network, called $\textrm{A}^{\textrm{2}}$RNet.
arXiv Detail & Related papers (2024-12-13T08:24:12Z)
- Progressively Modality Freezing for Multi-Modal Entity Alignment [27.77877721548588]
We propose a novel strategy of progressive modality freezing, called PMF, that focuses on alignment-relevant features.
Notably, our approach introduces a pioneering cross-modal association loss to foster modal consistency.
Empirical evaluations across nine datasets confirm PMF's superiority.
arXiv Detail & Related papers (2024-07-23T04:22:30Z)
- Asymptotic Midpoint Mixup for Margin Balancing and Moderate Broadening [4.604003661048267]
In the feature space, the collapse between features invokes critical problems in representation learning.
We propose a better feature augmentation method, midpoint mixup.
We empirically analyze the collapse effects by measuring alignment and uniformity with visualizing representations.
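As a hypothetical sketch (the paper's exact formulation may differ), the midpoint variant fixes the interpolation coefficient at 0.5 instead of sampling it from a Beta distribution as in standard mixup, so every augmented sample sits exactly between its two parents in feature space:

```python
import numpy as np

def standard_mixup(x1, y1, x2, y2, alpha=1.0, rng=None):
    # Vanilla mixup: interpolation weight lambda ~ Beta(alpha, alpha).
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

def midpoint_mixup(x1, y1, x2, y2):
    # Midpoint variant: lambda fixed at 0.5, placing the augmented
    # feature (and label) exactly between the two inputs.
    return 0.5 * (x1 + x2), 0.5 * (y1 + y2)

# The midpoint of [0, 0] and [2, 2] is [1, 1]; the mixed label is 0.5.
x, y = midpoint_mixup(np.array([0.0, 0.0]), 0.0, np.array([2.0, 2.0]), 1.0)
```

Pinning the mixed point at the midpoint pushes augmented samples toward inter-class boundaries, which is how a method of this kind can balance margins between classes.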
arXiv Detail & Related papers (2024-01-26T07:36:57Z)
- Explicit Attention-Enhanced Fusion for RGB-Thermal Perception Tasks [13.742299383836256]
We propose a novel fusion method named Explicit Attention-Enhanced Fusion (EAEF) that fully takes advantage of each type of data.
The proposed fusion method outperforms state-of-the-art by 1.6% in mIoU on semantic segmentation, 3.1% in MAE on salient object detection, 2.3% in mAP on object detection, and 8.1% in MAE on crowd counting.
arXiv Detail & Related papers (2023-03-28T03:37:27Z)
- Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning [53.68371566336254]
We argue that the key to better performance lies in meaningful latent modality structures instead of perfect modality alignment.
Specifically, we design 1) a deep feature separation loss for intra-modality regularization; 2) a Brownian-bridge loss for inter-modality regularization; and 3) a geometric consistency loss for both intra- and inter-modality regularization.
arXiv Detail & Related papers (2023-03-10T14:38:49Z)
- Composed Image Retrieval with Text Feedback via Multi-grained Uncertainty Regularization [73.04187954213471]
We introduce a unified learning approach to simultaneously modeling the coarse- and fine-grained retrieval.
The proposed method has achieved +4.03%, +3.38%, and +2.40% Recall@50 accuracy over a strong baseline.
arXiv Detail & Related papers (2022-11-14T14:25:40Z)
- SFusion: Self-attention based N-to-One Multimodal Fusion Block [6.059397373352718]
We propose a self-attention based fusion block called SFusion.
It learns to fuse available modalities without synthesizing or zero-padding missing ones.
In this work, we apply SFusion to different backbone networks for human activity recognition and brain tumor segmentation tasks.
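As a minimal illustration (not SFusion's actual architecture), an N-to-one attention fusion can score each available modality token against a shared query and return their weighted sum, so the block accepts however many modalities are present without synthesizing or zero-padding missing ones:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())  # shift by the max for numerical stability
    return e / e.sum()

def attention_fuse(tokens: list[np.ndarray]) -> np.ndarray:
    """Fuse any number of per-modality feature vectors into one vector.

    Each token is scored against the mean of the available tokens (a
    shared query); the fused output is the attention-weighted sum, so
    missing modalities are simply absent rather than zero-padded.
    """
    T = np.stack(tokens)                      # (n_modalities, dim)
    q = T.mean(axis=0)                        # shared query
    weights = softmax(T @ q / np.sqrt(T.shape[1]))
    return weights @ T

rgb, depth = np.array([1.0, 0.0, 1.0]), np.array([0.0, 1.0, 1.0])
fused_two = attention_fuse([rgb, depth])  # fuses 2 modalities...
fused_one = attention_fuse([rgb])         # ...or degrades gracefully to 1
```

Because the weights are computed only over the tokens actually passed in, the same block serves 1-to-one, 2-to-one, or N-to-one fusion without architectural changes.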
arXiv Detail & Related papers (2022-08-26T16:42:14Z)
- Weakly Aligned Feature Fusion for Multimodal Object Detection [52.15436349488198]
Multimodal data often suffer from the position shift problem, i.e., the image pair is not strictly aligned.
This problem makes it difficult to fuse multimodal features and complicates convolutional neural network (CNN) training.
In this article, we propose a general multimodal detector named aligned region CNN (AR-CNN) to tackle the position shift problem.
arXiv Detail & Related papers (2022-04-21T02:35:23Z)
- DARTS-: Robustly Stepping out of Performance Collapse Without Indicators [74.21019737169675]
Differentiable architecture search suffers from long-standing performance instability.
Indicators such as Hessian eigenvalues have been proposed as a signal to stop searching before the performance collapses.
In this paper, we undertake a more subtle and direct approach to resolve the collapse.
arXiv Detail & Related papers (2020-09-02T12:54:13Z)
- Towards Certified Robustness of Distance Metric Learning [53.96113074344632]
We advocate imposing an adversarial margin in the input space so as to improve the generalization and robustness of metric learning algorithms.
We show that the enlarged margin is beneficial to the generalization ability by using the theoretical technique of algorithmic robustness.
arXiv Detail & Related papers (2020-06-10T16:51:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.