The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment
- URL: http://arxiv.org/abs/2511.21331v1
- Date: Wed, 26 Nov 2025 12:25:55 GMT
- Title: The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment
- Authors: Stefanos Koutoupis, Michaela Areti Zervou, Konstantinos Kontras, Maarten De Vos, Panagiotis Tsakalides, Grigorios Tsagatakis
- Abstract summary: Contrastive Fusion (ConFu) is a framework that embeds both individual modalities and their fused combinations into a unified representation space. We evaluate ConFu on synthetic and real-world multimodal benchmarks, assessing its ability to exploit cross-modal complementarity, capture higher-order dependencies, and scale with increasing multimodal complexity.
- Score: 9.00329317378599
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning joint representations across multiple modalities remains a central challenge in multimodal machine learning. Prevailing approaches predominantly operate in pairwise settings, aligning two modalities at a time. While some recent methods aim to capture higher-order interactions among multiple modalities, they often overlook or insufficiently preserve pairwise relationships, limiting their effectiveness on single-modality tasks. In this work, we introduce Contrastive Fusion (ConFu), a framework that jointly embeds both individual modalities and their fused combinations into a unified representation space, where modalities and their fused counterparts are aligned. ConFu extends traditional pairwise contrastive objectives with an additional fused-modality contrastive term, encouraging the joint embedding of modality pairs with a third modality. This formulation enables ConFu to capture higher-order dependencies, such as XOR-like relationships, that cannot be recovered through pairwise alignment alone, while still maintaining strong pairwise correspondence. We evaluate ConFu on synthetic and real-world multimodal benchmarks, assessing its ability to exploit cross-modal complementarity, capture higher-order dependencies, and scale with increasing multimodal complexity. Across these settings, ConFu demonstrates competitive performance on retrieval and classification tasks, while supporting unified one-to-one and two-to-one retrieval within a single contrastive framework.
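The abstract describes the objective only at a high level, so the following is a minimal sketch of what a ConFu-style loss for three modalities could look like: standard pairwise InfoNCE terms plus fused-pair-to-third-modality terms. The function names (`info_nce`, `confu_loss`), the `fuse` module that maps a concatenated embedding pair back into the shared space, the temperature, and the equal weighting of all terms are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE between two batches of embeddings, each (N, D)."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature              # (N, N) similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def confu_loss(z1, z2, z3, fuse, temperature=0.07):
    """Pairwise alignment plus fused-pair-to-remaining-modality alignment.

    z1, z2, z3 : batch-aligned embeddings for three modalities, each (N, D).
    fuse       : module mapping a concatenated pair (N, 2D) -> (N, D).
    """
    # Traditional pairwise contrastive terms preserve one-to-one correspondence.
    pairwise = (info_nce(z1, z2, temperature) +
                info_nce(z1, z3, temperature) +
                info_nce(z2, z3, temperature))
    # Fused-modality terms align each fused pair with the third modality.
    fused = (info_nce(fuse(torch.cat([z1, z2], dim=-1)), z3, temperature) +
             info_nce(fuse(torch.cat([z1, z3], dim=-1)), z2, temperature) +
             info_nce(fuse(torch.cat([z2, z3], dim=-1)), z1, temperature))
    return pairwise + fused
```

Under this reading, the fused terms force the joint embedding of a pair to be predictive of the third modality, which is how XOR-like dependencies (invisible to any single pairwise term) can be captured, and two-to-one retrieval amounts to querying the shared space with a fused embedding instead of a unimodal one.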
Related papers
- BrokenBind: Universal Modality Exploration beyond Dataset Boundaries [112.81381711545043]
We introduce BrokenBind, which focuses on binding modalities drawn from different datasets. Under our framework, any two modalities can be bound together, without being limited to a single dataset.
arXiv Detail & Related papers (2026-02-06T07:26:49Z)
- Collaboration of Fusion and Independence: Hypercomplex-driven Robust Multi-Modal Knowledge Graph Completion [16.99012641907491]
Multi-modal knowledge graph completion (MMKGC) aims to discover missing facts in multi-modal knowledge graphs (MMKGs). Existing MMKGC methods follow two multi-modal paradigms: fusion-based and ensemble-based. We propose a novel MMKGC method, M-Hyper, which achieves the coexistence and collaboration of fused and independent modality representations.
arXiv Detail & Related papers (2025-09-28T07:55:01Z)
- Multimodal Representation Learning Conditioned on Semantic Relations [10.999120598129126]
Multimodal representation learning has advanced rapidly with contrastive models such as CLIP. We propose Relation-Conditioned Multimodal Learning (RCML), a framework that learns multimodal representations under natural-language relation descriptions. Our approach constructs many-to-many training pairs linked by semantic relations and introduces a relation-guided cross-attention mechanism.
arXiv Detail & Related papers (2025-08-24T19:36:18Z)
- Principled Multimodal Representation Learning [99.53621521696051]
Multimodal representation learning seeks to create a unified representation space by integrating diverse data modalities. Recent advances have investigated the simultaneous alignment of multiple modalities, yet several challenges remain. We propose Principled Multimodal Representation Learning (PMRL), a novel framework that achieves simultaneous alignment of multiple modalities.
arXiv Detail & Related papers (2025-07-23T09:12:25Z)
- Maximal Matching Matters: Preventing Representation Collapse for Robust Cross-Modal Retrieval [0.5999777817331317]
Cross-modal image-text retrieval is challenging because of the diverse possible associations between content from different modalities. Traditional methods learn a single-vector embedding to represent the semantics of each sample. Set-based approaches, which represent each sample with multiple embeddings, offer a promising alternative (a minimal matching sketch appears after this list).
arXiv Detail & Related papers (2025-06-26T17:55:34Z)
- Co-AttenDWG: Co-Attentive Dimension-Wise Gating and Expert Fusion for Multi-Modal Offensive Content Detection [0.0]
Multi-modal learning has emerged as a crucial research direction. Existing approaches often suffer from insufficient cross-modal interactions and rigid fusion strategies. We propose Co-AttenDWG, which combines co-attention with dimension-wise gating and expert fusion. We show that Co-AttenDWG achieves state-of-the-art performance and superior cross-modal alignment.
arXiv Detail & Related papers (2025-05-25T07:26:00Z)
- Deep Boosting Learning: A Brand-new Cooperative Approach for Image-Text Matching [53.05954114863596]
We propose a brand-new Deep Boosting Learning (DBL) algorithm for image-text matching.
An anchor branch is first trained to provide insights into the data properties.
A target branch is concurrently tasked with more adaptive margin constraints to further enlarge the relative distance between matched and unmatched samples.
arXiv Detail & Related papers (2024-04-28T08:44:28Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- Multimodal Learning Without Labeled Multimodal Data: Guarantees and Applications [90.6849884683226]
We study the challenge of interaction quantification in a semi-supervised setting with only labeled unimodal data.
Using a precise information-theoretic definition of interactions, our key contribution is the derivation of lower and upper bounds.
We show how these theoretical results can be used to estimate multimodal model performance, guide data collection, and select appropriate multimodal models for various tasks.
arXiv Detail & Related papers (2023-06-07T15:44:53Z)
- Rethinking Trajectory Prediction via "Team Game" [118.59480535826094]
We present a novel formulation for multi-agent trajectory prediction, which explicitly introduces the concept of interactive group consensus.
On two multi-agent settings, i.e., team sports and pedestrians, the proposed framework consistently achieves superior performance compared to existing methods.
arXiv Detail & Related papers (2022-10-17T07:16:44Z)
- COBRA: Contrastive Bi-Modal Representation Algorithm [43.33840912256077]
We present a novel framework that aims to train two modalities in a joint fashion inspired by Contrastive Predictive Coding (CPC) and Noise Contrastive Estimation (NCE) paradigms.
We empirically show that this framework reduces the modality gap significantly and generates a robust and task-agnostic joint-embedding space.
We outperform existing work on four diverse downstream tasks spanning seven benchmark cross-modal datasets.
arXiv Detail & Related papers (2020-05-07T18:20:12Z)
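On the set-based retrieval idea from the "Maximal Matching Matters" entry above: the summary does not spell out the objective, but a minimal sketch of scoring a sample pair by an optimal one-to-one (maximal) matching between their embedding sets might look as follows. The use of the Hungarian algorithm and the mean-of-matched-similarities score are illustrative assumptions, not that paper's exact method.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def set_similarity(img_embs, txt_embs):
    """Score two samples, each represented by a set of K embeddings (K, D).

    An optimal one-to-one matching (rather than a plain max over pairs)
    means every embedding in a set must carry useful information, which is
    one way to discourage the set from collapsing onto a single vector.
    """
    # Cosine similarity between all cross-set embedding pairs.
    a = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    b = txt_embs / np.linalg.norm(txt_embs, axis=1, keepdims=True)
    sim = a @ b.T                                   # (K, K)
    # Hungarian algorithm; negate to maximize total matched similarity.
    rows, cols = linear_sum_assignment(-sim)
    return float(sim[rows, cols].mean())
```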
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.