Towards Uniformity and Alignment for Multimodal Representation Learning
- URL: http://arxiv.org/abs/2602.09507v1
- Date: Tue, 10 Feb 2026 08:08:07 GMT
- Title: Towards Uniformity and Alignment for Multimodal Representation Learning
- Authors: Wenzhe Yin, Pan Zhou, Zehao Xiao, Jie Liu, Shujian Yu, Jan-Jakob Sonke, Efstratios Gavves
- Abstract summary: We identify two conflicts in the multimodal regime, both exacerbated as the number of modalities increases. We propose a principled decoupling of alignment and uniformity for multimodal representations. We then provide a theoretical guarantee that our method acts as an efficient proxy for a global Hölder divergence over multiple modality distributions.
- Score: 66.87764574237532
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Multimodal representation learning aims to construct a shared embedding space in which heterogeneous modalities are semantically aligned. Despite strong empirical results, InfoNCE-based objectives introduce inherent conflicts that yield distribution gaps across modalities. In this work, we identify two conflicts in the multimodal regime, both exacerbated as the number of modalities increases: (i) an alignment-uniformity conflict, whereby the repulsion of uniformity undermines pairwise alignment, and (ii) an intra-alignment conflict, where aligning multiple modalities induces competing alignment directions. To address these issues, we propose a principled decoupling of alignment and uniformity for multimodal representations, providing a conflict-free recipe for multimodal learning that simultaneously supports discriminative and generative use cases without task-specific modules. We then provide a theoretical guarantee that our method acts as an efficient proxy for a global Hölder divergence over multiple modality distributions, and thus reduces the distribution gap among modalities. Extensive experiments on retrieval and UnCLIP-style generation demonstrate consistent gains.
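The alignment-uniformity conflict described in the abstract can be made concrete with a small numerical sketch. The code below is not the paper's method: it assumes the widely used hypersphere formulation, in which alignment is the mean squared distance between paired embeddings and uniformity is the log of the mean Gaussian potential over pairs within one modality, and it computes a one-directional InfoNCE loss next to them for comparison. All function names, the toy data, and the parameters `t` and `temperature` are hypothetical choices made for illustration.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit L2 norm so it lies on the hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def alignment_loss(xs, ys):
    """Mean squared distance between paired embeddings (lower = better aligned)."""
    return sum(sq_dist(x, y) for x, y in zip(xs, ys)) / len(xs)


def uniformity_loss(xs, t=2.0):
    """Log of the mean Gaussian potential over distinct pairs (lower = more uniform)."""
    n = len(xs)
    total = sum(math.exp(-t * sq_dist(xs[i], xs[j]))
                for i in range(n) for j in range(i + 1, n))
    return math.log(total / (n * (n - 1) / 2))


def info_nce(xs, ys, temperature=0.1):
    """One-directional InfoNCE: each x_i must pick out y_i among all y_j."""
    loss = 0.0
    for i, x in enumerate(xs):
        logits = [sum(a * b for a, b in zip(x, y)) / temperature for y in ys]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]
    return loss / len(xs)


random.seed(0)
# Hypothetical toy data: 8 samples, 4 dimensions, two "modalities".
image = [normalize([random.gauss(0, 1) for _ in range(4)]) for _ in range(8)]
text = [normalize([a + 0.1 * random.gauss(0, 1) for a in v]) for v in image]

print("InfoNCE:", info_nce(image, text))
print("alignment:", alignment_loss(image, text))
print("uniformity:", uniformity_loss(image) + uniformity_loss(text))
```

In InfoNCE the attraction of positive pairs and the repulsion of negatives are coupled inside one softmax, whereas the two separate terms can be weighted and optimized independently. With more than two modalities, one alignment term would be needed per modality pair, which is where the competing alignment directions mentioned in the abstract would arise.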
Related papers
- From Sparse Decisions to Dense Reasoning: A Multi-attribute Trajectory Paradigm for Multimodal Moderation [59.27094165576015]
We propose a novel learning paradigm (UniMod) that transitions from sparse decision-making to dense reasoning traces.
By constructing structured trajectories encompassing evidence grounding, modality assessment, risk mapping, policy decision, and response generation, we reformulate monolithic decision tasks into a multi-dimensional boundary learning process.
We introduce specialized optimization strategies to decouple task-specific parameters and rebalance training dynamics, effectively resolving interference between diverse objectives in multi-task learning.
arXiv Detail & Related papers (2026-01-28T09:29:40Z)
- DCG ReID: Disentangling Collaboration and Guidance Fusion Representations for Multi-modal Vehicle Re-Identification [26.52384456327148]
We propose Disentangle Collaboration and Guidance Fusion Representations for Multi-modal Vehicle ReID (DCG-ReID).
Specifically, to disentangle heterogeneous quality-distributed modal data without mutual interference, we first design the Dynamic Confidence-based Disentangling Weighting (DCDW) mechanism.
We develop two scenario-specific fusion strategies: (1) for balanced quality distributions, Collaboration Fusion Module (CFM) mines pairwise consensus features to capture shared discriminative information and boost intra-class consistency; (2) for unbalanced distributions, Guidance Fusion Module (GFM) implements differential amplification of modal discriminative disparities to reinforce dominant modality
arXiv Detail & Related papers (2026-01-06T11:09:19Z)
- Dual-Stream Cross-Modal Representation Learning via Residual Semantic Decorrelation [5.272868130772015]
Cross-modal representations often suffer from modality dominance, redundant information coupling, and spurious cross-modal correlations.
We propose a Dual-Stream Residual Semantic Decorrelation Network (DSRSD-Net) to disentangle modality-specific and modality-shared information.
arXiv Detail & Related papers (2025-12-08T14:01:16Z)
- UniAlignment: Semantic Alignment for Unified Image Generation, Understanding, Manipulation and Perception [54.53657134205492]
UniAlignment is a unified multimodal generation framework within a single diffusion transformer.
It incorporates both intrinsic-modal semantic alignment and cross-modal semantic alignment, thereby enhancing the model's cross-modal consistency and instruction-following robustness.
We present SemGen-Bench, a new benchmark specifically designed to evaluate multimodal semantic consistency under complex textual instructions.
arXiv Detail & Related papers (2025-09-28T09:11:30Z)
- Principled Multimodal Representation Learning [99.53621521696051]
Multimodal representation learning seeks to create a unified representation space by integrating diverse data modalities.
Recent advances have investigated the simultaneous alignment of multiple modalities, yet several challenges remain.
We propose Principled Multimodal Representation Learning (PMRL), a novel framework that achieves simultaneous alignment of multiple modalities.
arXiv Detail & Related papers (2025-07-23T09:12:25Z)
- Offline Multi-agent Reinforcement Learning via Score Decomposition [51.23590397383217]
Offline cooperative multi-agent reinforcement learning (MARL) faces unique challenges due to distributional shifts.
This work is the first to explicitly address the distributional gap between offline and online MARL.
arXiv Detail & Related papers (2025-05-09T11:42:31Z)
- DecAlign: Hierarchical Cross-Modal Alignment for Decoupled Multimodal Representation Learning [18.066105354135058]
Multimodal representation learning aims to capture both shared and complementary semantic information across multiple modalities.
We introduce DecAlign, a novel hierarchical cross-modal alignment framework designed to decouple multimodal representations into modality-unique (heterogeneous) and modality-common (homogeneous) features.
Our experiments on four widely used multimodal benchmarks demonstrate that DecAlign consistently outperforms existing state-of-the-art methods.
arXiv Detail & Related papers (2025-03-14T21:47:48Z)
- Distributional Vision-Language Alignment by Cauchy-Schwarz Divergence [83.15764564701706]
We propose a novel framework that performs vision-language alignment by integrating Cauchy-Schwarz divergence with mutual information.
We find that the CS divergence seamlessly addresses the InfoNCE's alignment-uniformity conflict and serves complementary roles with InfoNCE.
Experiments on text-to-image generation and cross-modality retrieval tasks demonstrate the effectiveness of our method on vision-language alignment.
arXiv Detail & Related papers (2025-02-24T10:29:15Z)
- Mutual Information-based Representations Disentanglement for Unaligned Multimodal Language Sequences [25.73415065546444]
A key challenge in unaligned multimodal language sequences is to integrate information from various modalities to obtain a refined multimodal joint representation.
We propose a Mutual Information-based Representations Disentanglement (MIRD) method for unaligned multimodal language sequences.
arXiv Detail & Related papers (2024-09-19T02:12:26Z)
- Set-CLIP: Exploring Aligned Semantic From Low-Alignment Multimodal Data Through A Distribution View [35.389116270077324]
Multimodal fusion breaks through the boundaries between diverse modalities and has already achieved notable performances.
In many specialized fields, however, it is difficult to obtain sufficient aligned data for training.
We propose a new methodology based on CLIP, termed Set-CLIP.
arXiv Detail & Related papers (2024-06-09T12:41:14Z)
- Generalizable Heterogeneous Federated Cross-Correlation and Instance Similarity Learning [60.058083574671834]
This paper presents a novel FCCL+, federated correlation and similarity learning with non-target distillation.
For the heterogeneity issue, we leverage irrelevant unlabeled public data for communication.
For catastrophic forgetting in the local updating stage, FCCL+ introduces Federated Non Target Distillation.
arXiv Detail & Related papers (2023-09-28T09:32:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.