OptiMAG: Structure-Semantic Alignment via Unbalanced Optimal Transport
- URL: http://arxiv.org/abs/2601.22856v1
- Date: Fri, 30 Jan 2026 11:29:03 GMT
- Title: OptiMAG: Structure-Semantic Alignment via Unbalanced Optimal Transport
- Authors: Yilong Zuo, Xunkai Li, Zhihan Zhang, Qiangqiang Dai, Ronghua Li, Guoren Wang
- Abstract summary: Multimodal Attributed Graphs (MAGs) have been widely adopted for modeling complex systems by integrating multi-modal information, such as text and images, on nodes. We identify a discrepancy between the implicit semantic structure induced by different modality embeddings and the explicit graph structure. Since existing methods typically perform message passing over the fixed explicit graph structure, they inadvertently aggregate dissimilar features. We propose OptiMAG, an Unbalanced Optimal Transport-based regularization framework.
- Score: 37.640303159988015
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal Attributed Graphs (MAGs) have been widely adopted for modeling complex systems by integrating multi-modal information, such as text and images, on nodes. However, we identify a discrepancy between the implicit semantic structure induced by different modality embeddings and the explicit graph structure. For instance, neighbors in the explicit graph structure may be close in one modality but distant in another. Since existing methods typically perform message passing over the fixed explicit graph structure, they inadvertently aggregate dissimilar features, introducing modality-specific noise and impeding effective node representation learning. To address this, we propose OptiMAG, an Unbalanced Optimal Transport-based regularization framework. OptiMAG employs the Fused Gromov-Wasserstein distance to explicitly guide cross-modal structural consistency within local neighborhoods, effectively mitigating structural-semantic conflicts. Moreover, a KL divergence penalty enables adaptive handling of cross-modal inconsistencies. This framework can be seamlessly integrated into existing multimodal graph models, acting as an effective drop-in regularizer. Experiments demonstrate that OptiMAG consistently outperforms baselines across multiple tasks, ranging from graph-centric tasks (e.g., node classification, link prediction) to multimodal-centric generation tasks (e.g., graph2text, graph2image). The source code will be available upon acceptance.
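The abstract's key mechanism, replacing hard marginal constraints with a KL divergence penalty so that mass from cross-modally inconsistent neighbors can be discarded, can be illustrated with a minimal entropic unbalanced-OT sketch. This is an illustrative numpy toy: the function name, hyperparameters, and random embeddings are assumptions for exposition, not the paper's implementation (which additionally uses the Fused Gromov-Wasserstein distance over local neighborhoods).

```python
import numpy as np

def unbalanced_sinkhorn(C, a, b, eps=0.1, rho=1.0, n_iters=500):
    """Entropic unbalanced OT: the hard marginal constraints are
    relaxed into KL penalties of strength rho (Sinkhorn-style updates)."""
    K = np.exp(-C / eps)
    fi = rho / (rho + eps)        # exponent induced by the KL relaxation
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iters):
        u = (a / (K @ v)) ** fi
        v = (b / (K.T @ u)) ** fi
    return u[:, None] * K * v[None, :]  # transport plan

# Toy neighborhood: the same 3 nodes seen through two modalities
# (hypothetical random embeddings standing in for text/image features).
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(3, 4))
img_emb = rng.normal(size=(3, 4))
C = np.linalg.norm(text_emb[:, None] - img_emb[None, :], axis=-1)
a = b = np.full(3, 1.0 / 3)       # uniform node masses
P = unbalanced_sinkhorn(C, a, b)
print(P.sum())  # total transported mass; may be < 1, since the KL
                # penalty lets inconsistent mass be dropped
```

With `rho` small, the plan ignores costly cross-modal matches almost entirely; as `rho` grows, the behavior approaches balanced OT, which is the adaptivity the abstract attributes to the KL term.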
Related papers
- Mario: Multimodal Graph Reasoning with Large Language Models [10.232888977666418]
Mario is a graph-conditioned VLM that refines textual and visual features through fine-grained cross-modal contrastive learning. Mario consistently outperforms state-of-the-art graph models in both supervised and zero-shot scenarios for node classification and link prediction.
arXiv Detail & Related papers (2026-03-05T13:49:41Z)
- Toward Effective Multimodal Graph Foundation Model: A Divide-and-Conquer Based Approach [42.970648490410504]
Multimodal Graph Foundation Models (MGFMs) allow for leveraging the rich multimodal information in Multimodal-Attributed Graphs (MAGs). We propose PLANET, a novel framework employing a Divide-and-Conquer strategy to decouple modality interaction and alignment across distinct granularities. We show that PLANET significantly outperforms state-of-the-art baselines across diverse graph-centric and multimodal generative tasks.
arXiv Detail & Related papers (2026-02-04T01:05:12Z)
- Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models [84.78794648147608]
A persistent geometric anomaly, the Modality Gap, remains. Prior approaches to bridge this gap are largely limited by oversimplified isotropic assumptions. We propose the Fixed-frame Modality Gap Theory, which decomposes the modality gap into stable biases and anisotropic residuals. We then introduce ReAlign, a training-free modality alignment strategy.
arXiv Detail & Related papers (2026-02-02T13:59:39Z)
- LION: A Clifford Neural Paradigm for Multimodal-Attributed Graph Learning [36.90213853456115]
We propose LION to implement alignment-then-fusion in multimodal-attributed graphs. We first construct a modality-aware geometric manifold grounded in Clifford algebra. This geometry-induced high-order graph propagation efficiently achieves modality interaction, facilitating modality alignment.
arXiv Detail & Related papers (2026-01-29T09:30:36Z)
- Decoupling and Damping: Structurally-Regularized Gradient Matching for Multimodal Graph Condensation [3.2987327415317895]
We propose Structurally-Regularized Gradient Matching (SR-GM), a novel condensation framework tailored for multimodal graphs. SR-GM significantly improves accuracy and accelerates convergence compared to baseline methods. This research provides a scalable methodology for multimodal graph-based learning in resource-constrained environments.
arXiv Detail & Related papers (2025-11-25T11:50:34Z)
- Preventing Representational Rank Collapse in MPNNs by Splitting the Computational Graph [9.498398257062641]
We show that operating on multiple directed acyclic graphs always satisfies our condition and propose to obtain these by defining a strict partial ordering of the nodes. We conduct comprehensive experiments that confirm the benefits of operating on multi-relational graphs to achieve more informative node representations.
arXiv Detail & Related papers (2024-09-17T19:16:03Z)
- Distance Weighted Trans Network for Image Completion [52.318730994423106]
We propose a new architecture that relies on Distance-based Weighted Transformer (DWT) to better understand the relationships between an image's components.
CNNs are used to augment the local texture information of coarse priors.
DWT blocks are used to recover certain coarse textures and coherent visual structures.
arXiv Detail & Related papers (2023-10-11T12:46:11Z)
- Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning [53.68371566336254]
We argue that the key to better performance lies in meaningful latent modality structures instead of perfect modality alignment.
Specifically, we design 1) a deep feature separation loss for intra-modality regularization; 2) a Brownian-bridge loss for inter-modality regularization; and 3) a geometric consistency loss for both intra- and inter-modality regularization.
arXiv Detail & Related papers (2023-03-10T14:38:49Z)
- Mitigating Modality Collapse in Multimodal VAEs via Impartial Optimization [7.4262579052708535]
We argue that this effect is a consequence of conflicting gradients during multimodal VAE training.
We show how to detect the sub-graphs in the computational graphs where gradients conflict.
We empirically show that our framework significantly improves the reconstruction performance, conditional generation, and coherence of the latent space across modalities.
arXiv Detail & Related papers (2022-06-09T13:29:25Z)
- A Flexible Framework for Designing Trainable Priors with Adaptive Smoothing and Game Encoding [57.1077544780653]
We introduce a general framework for designing and training neural network layers whose forward passes can be interpreted as solving non-smooth convex optimization problems.
We focus on convex games, solved by local agents represented by the nodes of a graph and interacting through regularization functions.
This approach is appealing for solving imaging problems, as it allows the use of classical image priors within deep models that are trainable end to end.
arXiv Detail & Related papers (2020-06-26T08:34:54Z)
- Graph Optimal Transport for Cross-Domain Alignment [121.80313648519203]
Cross-domain alignment is fundamental to computer vision and natural language processing.
We propose Graph Optimal Transport (GOT), a principled framework that builds on recent advances in Optimal Transport (OT).
Experiments show consistent outperformance of GOT over baselines across a wide range of tasks.
arXiv Detail & Related papers (2020-06-26T01:14:23Z)
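For intuition on how the OT-based cross-domain alignment in the GOT entry above is typically set up, here is a minimal balanced Sinkhorn sketch that aligns two hypothetical sets of embeddings. All names, sizes, and data are illustrative assumptions, not GOT's actual pipeline:

```python
import numpy as np

def sinkhorn(C, eps=0.1, n_iters=500):
    """Balanced entropic OT between uniform marginals."""
    n, m = C.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-C / eps)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

# Hypothetical cross-domain pair: 4 image regions vs. 5 caption tokens,
# each a unit-norm embedding; cosine distance as the transport cost.
rng = np.random.default_rng(1)
img = rng.normal(size=(4, 8))
txt = rng.normal(size=(5, 8))
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt /= np.linalg.norm(txt, axis=1, keepdims=True)
C = 1.0 - img @ txt.T
P = sinkhorn(C)
print(P.sum(axis=1))  # each row sums to 1/4: every region's mass is matched
```

The resulting plan `P` gives soft correspondences between the two domains; note that, unlike the unbalanced variant used by OptiMAG, every unit of mass must be transported here even when no good match exists.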
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.