Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal
Data
- URL: http://arxiv.org/abs/2401.08567v1
- Date: Tue, 16 Jan 2024 18:52:27 GMT
- Title: Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal
Data
- Authors: Yuhui Zhang, Elaine Sui, Serena Yeung-Levy
- Abstract summary: Building cross-modal applications is challenging due to limited paired multi-modal data.
Recent works have shown that leveraging a pre-trained multi-modal contrastive representation space enables cross-modal tasks to be learned from uni-modal data.
We introduce a three-step method, $C^3$ (Connect, Collapse, Corrupt), to bridge the modality gap, enhancing the interchangeability of embeddings.
- Score: 10.908771426089512
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Building cross-modal applications is challenging due to limited paired
multi-modal data. Recent works have shown that leveraging a pre-trained
multi-modal contrastive representation space enables cross-modal tasks to be
learned from uni-modal data. This is based on the assumption that contrastive
optimization makes embeddings from different modalities interchangeable.
However, this assumption is under-explored due to the poorly understood
geometry of the multi-modal contrastive space, where a modality gap exists. In
our study, we provide a theoretical explanation of this space's geometry and
introduce a three-step method, $C^3$ (Connect, Collapse, Corrupt), to bridge
the modality gap, enhancing the interchangeability of embeddings. Our $C^3$
method significantly improves cross-modal learning from uni-modal data,
achieving state-of-the-art results on zero-shot image / audio / video
captioning and text-to-image generation.
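For intuition, the following is a minimal sketch of the three $C^3$ steps as described above, assuming a CLIP-style encoder pair, mean-centering for the Collapse step, and Gaussian noise for the Corrupt step; the function names, the noise scale `sigma`, and these specific design choices are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def connect(image_encoder, text_encoder, images, texts):
    """Connect: embed both modalities with a pre-trained contrastive model
    and L2-normalize onto the shared unit sphere."""
    img_emb = F.normalize(image_encoder(images), dim=-1)
    txt_emb = F.normalize(text_encoder(texts), dim=-1)
    return img_emb, txt_emb

def collapse(emb, modality_mean):
    """Collapse: reduce the modality gap by subtracting the modality's mean
    embedding so both modalities share a common center (illustrative choice)."""
    return F.normalize(emb - modality_mean, dim=-1)

def corrupt(emb, sigma=0.1):
    """Corrupt: add Gaussian noise to the uni-modal training embeddings so a
    downstream decoder learns to tolerate any residual gap at test time."""
    return F.normalize(emb + sigma * torch.randn_like(emb), dim=-1)
```

Under this reading, a captioner would be trained only on corrupt(collapse(txt_emb, txt_mean)) and, at inference, fed collapse(img_emb, img_mean) in place of the text embedding; how the per-modality means are estimated is left as an assumption in this sketch.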
Related papers
- Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion [13.696706205837238]
Pre-trained multi-modal Vision-Language Models like CLIP are widely used off-the-shelf for a variety of applications.
We argue that this intra-modal misalignment is inherently due to the CLIP-style inter-modal contrastive loss, which does not enforce any intra-modal constraints.
We empirically show that, in the intra-modal tasks of image-to-image and text-to-text retrieval, approaching these tasks inter-modally significantly improves performance.
arXiv Detail & Related papers (2025-02-06T17:58:59Z) - MINIMA: Modality Invariant Image Matching [52.505282811925454]
We present MINIMA, a unified image matching framework for multiple cross-modal cases.
We scale up the modalities from cheap but rich RGB-only matching data by means of generative models, yielding the synthetic multi-modal dataset MD-syn.
With MD-syn, any advanced matching pipeline can be trained directly on randomly selected modality pairs to obtain cross-modal ability.
arXiv Detail & Related papers (2024-12-27T02:39:50Z) - Gramian Multimodal Representation Learning and Alignment [5.793118803623239]
We present the novel Gramian Representation Alignment Measure (GRAM), which learns and aligns $n$ modalities directly in the higher-dimensional space in which the modality embeddings lie.
The novel GRAM-based contrastive loss function enhances the alignment of multimodal models in this higher-dimensional embedding space (a minimal sketch of the Gramian volume idea follows this list).
arXiv Detail & Related papers (2024-12-16T16:41:51Z) - LLMs Can Evolve Continually on Modality for X-Modal Reasoning [62.2874638875554]
Existing methods rely heavily on modal-specific pretraining and joint-modal tuning, leading to significant computational burdens when expanding to new modalities.
We propose PathWeave, a flexible and scalable framework with modal-Path sWitching and ExpAnsion abilities.
PathWeave performs comparably to state-of-the-art MLLMs while concurrently reducing parameter training burdens by 98.73%.
arXiv Detail & Related papers (2024-10-26T13:19:57Z) - MMPareto: Boosting Multimodal Learning with Innocent Unimodal Assistance [10.580712937465032]
We identify the previously ignored gradient conflict between multimodal and unimodal learning objectives.
We propose the MMPareto algorithm, which ensures that the final gradient has a direction common to all learning objectives.
Our method is also expected to facilitate multi-task cases with a clear discrepancy in task difficulty.
arXiv Detail & Related papers (2024-05-28T01:19:13Z) - Cross-BERT for Point Cloud Pretraining [61.762046503448936]
We propose a new cross-modal BERT-style self-supervised learning paradigm, called Cross-BERT.
To facilitate pretraining for irregular and sparse point clouds, we design two self-supervised tasks to boost cross-modal interaction.
Our work highlights the effectiveness of leveraging cross-modal 2D knowledge to strengthen 3D point cloud representation and the transferable capability of BERT across modalities.
arXiv Detail & Related papers (2023-12-08T08:18:12Z) - Multimodal Representation Learning by Alternating Unimodal Adaptation [73.15829571740866]
We propose MLA (Multimodal Learning with Alternating Unimodal Adaptation) to overcome challenges where some modalities appear more dominant than others during multimodal learning.
MLA reframes the conventional joint multimodal learning process by transforming it into an alternating unimodal learning process.
It captures cross-modal interactions through a shared head, which undergoes continuous optimization across different modalities.
Experiments are conducted on five diverse datasets, encompassing scenarios with complete modalities and scenarios with missing modalities.
arXiv Detail & Related papers (2023-11-17T18:57:40Z) - Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z) - CLIP-Driven Fine-grained Text-Image Person Re-identification [50.94827165464813]
Text-Image person Re-identification (TIReID) aims to retrieve the image corresponding to a given text query from a pool of candidate images.
We propose a CLIP-driven Fine-grained information excavation framework (CFine) to fully utilize the powerful knowledge of CLIP for TIReID.
arXiv Detail & Related papers (2022-10-19T03:43:12Z) - Cross-modal Center Loss [28.509817129759014]
Cross-modal retrieval aims to learn discriminative and modal-invariant features for data from different modalities.
We propose an approach to jointly train the components of a cross-modal retrieval framework with metadata.
The proposed framework significantly outperforms the state-of-the-art methods on the ModelNet40 dataset.
arXiv Detail & Related papers (2020-08-08T17:26:35Z)
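As a brief illustration of the Gramian measure described in the GRAM entry above, the sketch below scores the alignment of $n$ modality embeddings by the volume of the parallelotope they span, computed from the Gram matrix determinant; this follows the summary's description and standard linear algebra, but the paper's exact loss formulation may differ.

```python
import torch
import torch.nn.functional as F

def gram_volume(embeddings: torch.Tensor) -> torch.Tensor:
    """embeddings: (n_modalities, d) tensor, one row per modality.

    Returns the volume of the parallelotope spanned by the L2-normalized
    modality vectors, i.e. sqrt(det(A A^T)). A smaller volume means the
    vectors are closer to collinear, i.e. better aligned."""
    A = F.normalize(embeddings, dim=-1)   # (n, d), rows on the unit sphere
    gram = A @ A.T                        # (n, n) Gram matrix of inner products
    return torch.sqrt(torch.clamp(torch.det(gram), min=0.0))
```

A GRAM-style contrastive loss would then favor small volumes for matching tuples and larger volumes for mismatched ones (an assumption based on the summary above).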