SheafAlign: A Sheaf-theoretic Framework for Decentralized Multimodal Alignment
- URL: http://arxiv.org/abs/2510.20540v1
- Date: Thu, 23 Oct 2025 13:27:24 GMT
- Title: SheafAlign: A Sheaf-theoretic Framework for Decentralized Multimodal Alignment
- Authors: Abdulmomen Ghalkha, Zhuojun Tian, Chaouki Ben Issaid, Mehdi Bennis
- Abstract summary: SheafAlign is a sheaf-theoretic framework for decentralized multimodal alignment. SheafAlign overcomes the limitations of prior methods by not requiring mutual redundancy among all modalities. Experiments on multimodal sensing datasets show superior zero-shot generalization, cross-modal alignment, and robustness to missing modalities.
- Score: 23.996765202358223
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Conventional multimodal alignment methods assume mutual redundancy across all modalities, an assumption that fails in real-world distributed scenarios. We propose SheafAlign, a sheaf-theoretic framework for decentralized multimodal alignment that replaces single-space alignment with multiple comparison spaces. This approach models pairwise modality relations through sheaf structures and leverages decentralized contrastive learning-based objectives for training. SheafAlign overcomes the limitations of prior methods by not requiring mutual redundancy among all modalities, preserving both shared and unique information. Experiments on multimodal sensing datasets show superior zero-shot generalization, cross-modal alignment, and robustness to missing modalities, with 50% lower communication cost than state-of-the-art baselines.
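For intuition, the minimal sketch below is a hypothetical illustration (not the authors' code): the class name `PairwiseSheafAlign`, the modality names, dimensions, and the InfoNCE loss choice are all assumptions. It shows how pairwise comparison spaces could be realized in practice: each modality pair gets its own comparison space, learned restriction maps project both modalities into it, and a symmetric contrastive loss is applied per pair, so pairs with a missing modality are simply skipped rather than requiring redundancy across all modalities.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairwiseSheafAlign(nn.Module):
    """Toy pairwise (edge-wise) contrastive alignment over a graph of modalities."""
    def __init__(self, feat_dims, edge_dim=128, tau=0.07):
        super().__init__()
        self.tau = tau
        names = list(feat_dims)
        # One "edge" per modality pair, each with its own comparison space.
        self.edges = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
        # Restriction maps: one linear projection per (modality, edge).
        self.restrictions = nn.ModuleDict({
            f"{m}_to_{a}_{b}": nn.Linear(feat_dims[m], edge_dim)
            for (a, b) in self.edges for m in (a, b)
        })

    def edge_loss(self, za, zb):
        # Symmetric InfoNCE between the two restricted views on one edge.
        za, zb = F.normalize(za, dim=-1), F.normalize(zb, dim=-1)
        logits = za @ zb.t() / self.tau
        labels = torch.arange(za.size(0), device=za.device)
        return 0.5 * (F.cross_entropy(logits, labels) +
                      F.cross_entropy(logits.t(), labels))

    def forward(self, feats):
        # feats: {modality_name: tensor of shape [batch, feat_dim]}.
        # Edges whose modalities are absent are skipped, so no mutual
        # redundancy across all modalities is required.
        losses = []
        for a, b in self.edges:
            if a in feats and b in feats:
                za = self.restrictions[f"{a}_to_{a}_{b}"](feats[a])
                zb = self.restrictions[f"{b}_to_{a}_{b}"](feats[b])
                losses.append(self.edge_loss(za, zb))
        return torch.stack(losses).mean()

# Usage with three hypothetical sensing modalities of different dimensions.
dims = {"radar": 256, "camera": 512, "lidar": 384}
model = PairwiseSheafAlign(dims)
batch = {m: torch.randn(8, d) for m, d in dims.items()}
loss = model(batch)  # drop a modality from `batch` to simulate a missing modality
loss.backward()
```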
Related papers
- CLEAR: Null-Space Projection for Cross-Modal De-Redundancy in Multimodal Recommendation [22.71702128773632]
Multimodal recommendation has emerged as an effective paradigm for enhancing collaborative filtering by incorporating heterogeneous content modalities. We propose CLEAR, a cross-modal de-redundancy approach for multimodal recommendation. CLEAR reshapes the representation space by suppressing redundant cross-modal components while preserving modality-specific information.
arXiv Detail & Related papers (2026-03-02T07:06:56Z) - Towards Multimodal Domain Generalization with Few Labels [37.21678123296403]
We introduce and study a new problem, Semi-Supervised Multimodal Domain Generalization (SSMDG). SSMDG aims to learn robust multimodal models from multi-source data with few labeled samples. We propose a unified framework featuring three key components: Consensus-Driven Consistency Regularization, Disagreement-Aware Regularization, and Cross-Modal Prototype Alignment.
arXiv Detail & Related papers (2026-02-26T12:05:56Z) - Event-Triggered Gossip for Distributed Learning [61.70659996356528]
We develop a new event-triggered gossip framework for distributed learning to reduce inter-node communication. The framework reduces communication by 71.61% with only a marginal performance loss, compared with conventional distributed learning methods.
arXiv Detail & Related papers (2026-02-22T10:13:43Z) - Towards Uniformity and Alignment for Multimodal Representation Learning [66.87764574237532]
We identify two conflicts in the multimodal regime, both exacerbated as the number of modalities increases. We propose a principled decoupling of alignment and uniformity for multimodal representations. We then provide a theoretical guarantee that our method acts as an efficient proxy for a global Hölder divergence over multiple modality distributions.
arXiv Detail & Related papers (2026-02-10T08:08:07Z) - BrokenBind: Universal Modality Exploration beyond Dataset Boundaries [112.81381711545043]
We introduce BrokenBind, which focuses on binding modalities drawn from different datasets. Under our framework, any two modalities can be bound together, free from the dataset limitation.
arXiv Detail & Related papers (2026-02-06T07:26:49Z) - Calibrated Multimodal Representation Learning with Missing Modalities [100.55774771852468]
Multimodal representation learning harmonizes distinct modalities by aligning them into a unified latent space. Recent research generalizes traditional cross-modal alignment to produce enhanced multimodal synergy but requires all modalities to be present for a common instance. We provide theoretical insights into this issue from an anchor shift perspective. We propose CalMRL for multimodal representation learning to calibrate incomplete alignments caused by missing modalities.
arXiv Detail & Related papers (2025-11-15T05:01:43Z) - NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching [64.10695425442164]
We introduce NExT-OMNI, an open-source omnimodal foundation model that achieves unified modeling through discrete flow paradigms. Trained on large-scale interleaved text, image, video, and audio data, NExT-OMNI delivers competitive performance on multimodal generation and understanding benchmarks. To advance further research, we release training details and data protocols, and open-source both the code and model checkpoints.
arXiv Detail & Related papers (2025-10-15T16:25:18Z) - Efficient Generalization via Multimodal Co-Training under Data Scarcity and Distribution Shift [0.6331016589903705]
Multimodal co-training is designed to enhance model generalization in situations where labeled data is limited. We examine the theoretical foundations of this framework, deriving conditions under which the use of unlabeled data leads to significant improvements in generalization. We establish a novel generalization bound that, for the first time in a multimodal co-training context, decomposes and quantifies the advantages gained from leveraging unlabeled multimodal data.
arXiv Detail & Related papers (2025-10-08T20:13:17Z) - Principled Multimodal Representation Learning [70.60542106731813]
Multimodal representation learning seeks to create a unified representation space by integrating diverse data modalities. Recent advances have investigated the simultaneous alignment of multiple modalities, yet several challenges remain. We propose Principled Multimodal Representation Learning (PMRL), a novel framework that achieves simultaneous alignment of multiple modalities.
arXiv Detail & Related papers (2025-07-23T09:12:25Z) - DecAlign: Hierarchical Cross-Modal Alignment for Decoupled Multimodal Representation Learning [18.066105354135058]
Multimodal representation learning aims to capture both shared and complementary semantic information across multiple modalities. We introduce DecAlign, a novel hierarchical cross-modal alignment framework designed to decouple multimodal representations into modality-unique (heterogeneous) and modality-common (homogeneous) features. Our experiments on four widely used multimodal benchmarks demonstrate that DecAlign consistently outperforms existing state-of-the-art methods.
arXiv Detail & Related papers (2025-03-14T21:47:48Z) - Distributional Vision-Language Alignment by Cauchy-Schwarz Divergence [83.15764564701706]
We propose a novel framework that performs vision-language alignment by integrating Cauchy-Schwarz divergence with mutual information. We find that the CS divergence seamlessly addresses InfoNCE's alignment-uniformity conflict and serves a complementary role to InfoNCE. Experiments on text-to-image generation and cross-modality retrieval tasks demonstrate the effectiveness of our method on vision-language alignment.
arXiv Detail & Related papers (2025-02-24T10:29:15Z) - Enhancing Unimodal Latent Representations in Multimodal VAEs through Iterative Amortized Inference [20.761803725098005]
Multimodal variational autoencoders (VAEs) aim to capture shared latent representations by integrating information from different data modalities.
A significant challenge is accurately inferring representations from any subset of modalities without training an impractical number of inference networks for all possible modality combinations.
We introduce multimodal iterative amortized inference, an iterative refinement mechanism within the multimodal VAE framework.
arXiv Detail & Related papers (2024-10-15T08:49:38Z) - Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z) - Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning [53.68371566336254]
We argue that the key to better performance lies in meaningful latent modality structures instead of perfect modality alignment.
Specifically, we design 1) a deep feature separation loss for intra-modality regularization; 2) a Brownian-bridge loss for inter-modality regularization; and 3) a geometric consistency loss for both intra- and inter-modality regularization.
arXiv Detail & Related papers (2023-03-10T14:38:49Z)