When Gradient Optimization Is Not Enough: $\dagger$ Dispersive and Anchoring Geometric Regularizer for Multimodal Learning
- URL: http://arxiv.org/abs/2601.21670v1
- Date: Thu, 29 Jan 2026 13:03:50 GMT
- Title: When Gradient Optimization Is Not Enough: $\dagger$ Dispersive and Anchoring Geometric Regularizer for Multimodal Learning
- Authors: Zixuan Xia, Hao Wang, Pengcheng Weng, Yanyu Qian, Yangxin Xu, William Dan, Fei Wang
- Abstract summary: We identify representation geometry as a missing control axis in multimodal learning and propose regName, a lightweight geometry-aware regularization framework. regName enforces two complementary constraints on intermediate embeddings: an intra-modal dispersive regularization that promotes representation diversity, and an inter-modal anchoring regularization that bounds sample-level cross-modal drift without rigid alignment. Experiments across multiple multimodal benchmarks demonstrate consistent improvements in both multimodal and unimodal performance, showing that explicitly regulating representation geometry effectively mitigates modality trade-offs.
- Score: 7.598111859541752
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal learning aims to integrate complementary information from heterogeneous modalities, yet strong optimization alone does not guarantee well-structured representations. Even under carefully balanced training schemes, multimodal models often exhibit geometric pathologies, including intra-modal representation collapse and sample-level cross-modal inconsistency, which degrade both unimodal robustness and multimodal fusion. We identify representation geometry as a missing control axis in multimodal learning and propose \regName, a lightweight geometry-aware regularization framework. \regName enforces two complementary constraints on intermediate embeddings: an intra-modal dispersive regularization that promotes representation diversity, and an inter-modal anchoring regularization that bounds sample-level cross-modal drift without rigid alignment. The proposed regularizer is plug-and-play, requires no architectural modifications, and is compatible with various training paradigms. Extensive experiments across multiple multimodal benchmarks demonstrate consistent improvements in both multimodal and unimodal performance, showing that explicitly regulating representation geometry effectively mitigates modality trade-offs.
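The two constraints in the abstract can be illustrated with a minimal sketch. The function names (`dispersive_loss`, `anchoring_loss`), the temperature and margin hyperparameters, and the specific cosine/hinge formulations below are illustrative assumptions, not the paper's actual formulation:

```python
import math
import torch
import torch.nn.functional as F

def dispersive_loss(z: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Intra-modal dispersion: discourage representation collapse by
    penalizing high pairwise cosine similarity within a batch (assumed form)."""
    z = F.normalize(z, dim=-1)
    sim = z @ z.T  # (B, B) cosine similarities
    off_diag = sim[~torch.eye(z.size(0), dtype=torch.bool, device=z.device)]
    # log-mean-exp over off-diagonal pairs: small when embeddings spread out
    return torch.logsumexp(off_diag / tau, dim=0) - math.log(off_diag.numel())

def anchoring_loss(za: torch.Tensor, zb: torch.Tensor,
                   margin: float = 0.2) -> torch.Tensor:
    """Inter-modal anchoring: bound sample-level cross-modal drift.
    Only pairs farther apart than the margin are penalized, so the two
    modality spaces are not forced into rigid alignment (assumed form)."""
    za, zb = F.normalize(za, dim=-1), F.normalize(zb, dim=-1)
    drift = (za - zb).norm(dim=-1)  # per-sample cross-modal distance
    return F.relu(drift - margin).mean()

# Plug-and-play usage: add both terms to the existing task loss, e.g.
# total = task_loss + lam_d * (dispersive_loss(za) + dispersive_loss(zb)) \
#                   + lam_a * anchoring_loss(za, zb)
```

The hinge in `anchoring_loss` is what makes the anchoring a bound rather than an alignment: paired embeddings within the margin contribute zero gradient.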
Related papers
- ITO: Images and Texts as One via Synergizing Multiple Alignment and Training-Time Fusion [14.791563751107502]
ITO is a framework addressing the limitation through two synergistic mechanisms. We show that ITO consistently outperforms strong baselines across classification, retrieval, and multimodal benchmarks. Our analysis reveals that while multiple alignment drives discriminative power, training-time fusion acts as a critical structural regularizer.
arXiv Detail & Related papers (2026-03-03T09:08:53Z) - CLEAR: Null-Space Projection for Cross-Modal De-Redundancy in Multimodal Recommendation [22.71702128773632]
Multimodal recommendation has emerged as an effective paradigm for enhancing collaborative filtering by incorporating heterogeneous content modalities. We propose CLEAR, a cross-modal de-redundancy approach for multimodal recommendation. CLEAR reshapes the representation space by suppressing redundant cross-modal components while preserving modality-specific information.
arXiv Detail & Related papers (2026-03-02T07:06:56Z) - From Sparse Decisions to Dense Reasoning: A Multi-attribute Trajectory Paradigm for Multimodal Moderation [59.27094165576015]
We propose a novel learning paradigm (UniMod) that transitions from sparse decision-making to dense reasoning traces. By constructing structured trajectories encompassing evidence grounding, modality assessment, risk mapping, policy decision, and response generation, we reformulate monolithic decision tasks into a multi-dimensional boundary learning process. We introduce specialized optimization strategies to decouple task-specific parameters and rebalance training dynamics, effectively resolving interference between diverse objectives in multi-task learning.
arXiv Detail & Related papers (2026-01-28T09:29:40Z) - Amplifying Prominent Representations in Multimodal Learning via Variational Dirichlet Process [55.91649771370862]
Dirichlet process (DP) mixture model is a powerful non-parametric method that can amplify the most prominent features. We propose a new DP-driven multimodal learning framework that automatically achieves an optimal balance between prominent intra-modal representation learning and cross-modal alignment.
arXiv Detail & Related papers (2025-10-23T16:53:24Z) - UniAlignment: Semantic Alignment for Unified Image Generation, Understanding, Manipulation and Perception [54.53657134205492]
UniAlignment is a unified multimodal generation framework within a single diffusion transformer. It incorporates both intrinsic-modal semantic alignment and cross-modal semantic alignment, thereby enhancing the model's cross-modal consistency and instruction-following robustness. We present SemGen-Bench, a new benchmark specifically designed to evaluate multimodal semantic consistency under complex textual instructions.
arXiv Detail & Related papers (2025-09-28T09:11:30Z) - BTW: A Non-Parametric Variance Stabilization Framework for Multimodal Model Integration [20.600001069987318]
We propose Beyond Two-modality Weighting (BTW) to dynamically adjust modality importance during training. BTW computes per-example KL weights by measuring the divergence between each unimodal and the current multimodal prediction. Our method significantly improves regression performance and multiclass classification accuracy.
arXiv Detail & Related papers (2025-08-25T23:00:38Z) - Principled Multimodal Representation Learning [99.53621521696051]
Multimodal representation learning seeks to create a unified representation space by integrating diverse data modalities. Recent advances have investigated the simultaneous alignment of multiple modalities, yet several challenges remain. We propose Principled Multimodal Representation Learning (PMRL), a novel framework that achieves simultaneous alignment of multiple modalities.
arXiv Detail & Related papers (2025-07-23T09:12:25Z) - FindRec: Stein-Guided Entropic Flow for Multi-Modal Sequential Recommendation [57.577843653775]
We propose FindRec (Flexible unified information disentanglement for multi-modal sequential Recommendation). A Stein kernel-based Integrated Information Coordination Module (IICM) theoretically guarantees distribution consistency between multimodal features and ID streams. A cross-modal expert routing mechanism adaptively filters and combines multimodal features based on their contextual relevance.
arXiv Detail & Related papers (2025-07-07T04:09:45Z) - DecAlign: Hierarchical Cross-Modal Alignment for Decoupled Multimodal Representation Learning [18.066105354135058]
Multimodal representation learning aims to capture both shared and complementary semantic information across multiple modalities. We introduce DecAlign, a novel hierarchical cross-modal alignment framework designed to decouple multimodal representations into modality-unique (heterogeneous) and modality-common (homogeneous) features. Our experiments on four widely used multimodal benchmarks demonstrate that DecAlign consistently outperforms existing state-of-the-art methods.
arXiv Detail & Related papers (2025-03-14T21:47:48Z) - Set-CLIP: Exploring Aligned Semantic From Low-Alignment Multimodal Data Through A Distribution View [35.389116270077324]
Multimodal fusion breaks through the boundaries between diverse modalities and has already achieved notable performances.
In many specialized fields, it is struggling to obtain sufficient alignment data for training.
We propose a new methodology based on CLIP, termed Set-CLIP.
arXiv Detail & Related papers (2024-06-09T12:41:14Z) - Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site. This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.