Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data
- URL: http://arxiv.org/abs/2401.08567v1
- Date: Tue, 16 Jan 2024 18:52:27 GMT
- Title: Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data
- Authors: Yuhui Zhang, Elaine Sui, Serena Yeung-Levy
- Abstract summary: Building cross-modal applications is challenging due to limited paired multi-modal data.
Recent works have shown that leveraging a pre-trained multi-modal contrastive representation space enables cross-modal tasks to be learned from uni-modal data.
We introduce a three-step method, $C^3$ (Connect, Collapse, Corrupt), to bridge the modality gap, enhancing the interchangeability of embeddings.
- Score: 10.908771426089512
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Building cross-modal applications is challenging due to limited paired
multi-modal data. Recent works have shown that leveraging a pre-trained
multi-modal contrastive representation space enables cross-modal tasks to be
learned from uni-modal data. This is based on the assumption that contrastive
optimization makes embeddings from different modalities interchangeable.
However, this assumption is under-explored due to the poorly understood
geometry of the multi-modal contrastive space, where a modality gap exists. In
our study, we provide a theoretical explanation of this space's geometry and
introduce a three-step method, $C^3$ (Connect, Collapse, Corrupt), to bridge
the modality gap, enhancing the interchangeability of embeddings. Our $C^3$
method significantly improves cross-modal learning from uni-modal data,
achieving state-of-the-art results on zero-shot image / audio / video
captioning and text-to-image generation.
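The abstract does not spell out the three steps, so the sketch below is only an illustrative PyTorch rendering of the general recipe the title suggests, under assumptions: Connect reuses a pretrained contrastive encoder pair (e.g., CLIP), Collapse removes each modality's mean embedding so the two embedding clouds share a center, and Corrupt adds Gaussian noise to the training-time embeddings. The paper's exact procedure may differ.
```python
import torch
import torch.nn.functional as F

def collapse(embs: torch.Tensor) -> torch.Tensor:
    """Remove the modality's mean embedding so both clouds share a center."""
    centered = embs - embs.mean(dim=0, keepdim=True)
    return F.normalize(centered, dim=-1)

def corrupt(embs: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    """Add isotropic Gaussian noise so a decoder tolerates the residual gap."""
    return embs + sigma * torch.randn_like(embs)

# Connect: text and image embeddings come from a pretrained contrastive model.
# Random placeholders stand in for encoder outputs here.
text_embs = F.normalize(torch.randn(512, 768), dim=-1)
image_embs = F.normalize(torch.randn(512, 768), dim=-1)

# Train a decoder (e.g., a captioner) on collapsed + corrupted *text* embeddings only...
train_inputs = corrupt(collapse(text_embs))
# ...then at test time feed collapsed *image* embeddings to the same decoder.
test_inputs = collapse(image_embs)
```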
Related papers
- LLMs Can Evolve Continually on Modality for X-Modal Reasoning [62.2874638875554]
Existing methods rely heavily on modal-specific pretraining and joint-modal tuning, leading to significant computational burdens when expanding to new modalities.
We propose PathWeave, a flexible and scalable framework with modal-Path sWitching and ExpAnsion abilities.
PathWeave performs comparably to state-of-the-art MLLMs while concurrently reducing parameter training burdens by 98.73%.
arXiv Detail & Related papers (2024-10-26T13:19:57Z)
- MMPareto: Boosting Multimodal Learning with Innocent Unimodal Assistance [10.580712937465032]
We identify the previously ignored gradient conflict between multimodal and unimodal learning objectives.
We propose MMPareto algorithm, which could ensure a final gradient with direction common to all learning objectives.
Our method is also expected to facilitate multi-task cases with a clear discrepancy in task difficulty.
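As a generic illustration of the gradient-conflict idea mentioned above (not MMPareto's Pareto-based update), the sketch below detects a negative dot product between the multimodal and unimodal gradients and removes the conflicting component before summing.
```python
import torch

def combine_gradients(g_multi: torch.Tensor, g_uni: torch.Tensor) -> torch.Tensor:
    """Toy conflict-aware combination: if the two task gradients oppose each
    other, project out the conflicting component of the unimodal gradient.
    Illustrative only; MMPareto's actual rule differs."""
    if torch.dot(g_multi, g_uni) < 0:
        g_uni = g_uni - torch.dot(g_uni, g_multi) / g_multi.norm().pow(2) * g_multi
    return g_multi + g_uni

g_m = torch.tensor([1.0, 0.0])
g_u = torch.tensor([-0.5, 1.0])      # conflicts with g_m along the first axis
print(combine_gradients(g_m, g_u))   # tensor([1., 1.]) -- conflict removed
```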
arXiv Detail & Related papers (2024-05-28T01:19:13Z) - Cross-BERT for Point Cloud Pretraining [61.762046503448936]
We propose a new cross-modal BERT-style self-supervised learning paradigm, called Cross-BERT.
To facilitate pretraining for irregular and sparse point clouds, we design two self-supervised tasks to boost cross-modal interaction.
Our work highlights the effectiveness of leveraging cross-modal 2D knowledge to strengthen 3D point cloud representation and the transferable capability of BERT across modalities.
arXiv Detail & Related papers (2023-12-08T08:18:12Z)
- Multimodal Representation Learning by Alternating Unimodal Adaptation [73.15829571740866]
We propose MLA (Multimodal Learning with Alternating Unimodal Adaptation) to overcome challenges where some modalities appear more dominant than others during multimodal learning.
MLA reframes the conventional joint multimodal learning process by transforming it into an alternating unimodal learning process.
It captures cross-modal interactions through a shared head, which undergoes continuous optimization across different modalities.
Experiments are conducted on five diverse datasets, encompassing scenarios with complete modalities and scenarios with missing modalities.
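A minimal sketch of the alternating training pattern described above is given below, with per-modality encoders and a single shared head; the encoders, dimensions, and schedule are illustrative placeholders rather than MLA's actual design.
```python
import torch
import torch.nn as nn

# Illustrative: one encoder per modality, one shared classification head.
enc = {"image": nn.Linear(512, 128), "audio": nn.Linear(128, 128)}
shared_head = nn.Linear(128, 10)
opt = torch.optim.SGD([p for m in enc.values() for p in m.parameters()]
                      + list(shared_head.parameters()), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

def fake_batch(dim):  # stand-in for a real unimodal dataloader
    return torch.randn(8, dim), torch.randint(0, 10, (8,))

for step in range(4):
    # Alternate: each step updates one modality's encoder plus the shared head.
    modality = ["image", "audio"][step % 2]
    x, y = fake_batch(enc[modality].in_features)
    loss = loss_fn(shared_head(enc[modality](x)), y)
    opt.zero_grad(); loss.backward(); opt.step()
```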
arXiv Detail & Related papers (2023-11-17T18:57:40Z)
- Deep Metric Loss for Multimodal Learning [3.8979646385036175]
We introduce a novel MultiModal loss paradigm for multimodal learning.
The MultiModal loss can prevent inefficient learning caused by overfitting and efficiently optimize multimodal models.
Our loss is empirically shown to improve the performance of recent models.
arXiv Detail & Related papers (2023-08-21T06:04:30Z)
- Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
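As a rough illustration of such a projection module, the sketch below maps features of different dimensionalities into one common space and fuses them by averaging; the dimensions and the fusion rule are assumptions, not the paper's design.
```python
import torch
import torch.nn as nn

class CommonSpaceProjector(nn.Module):
    """Map features of different sizes from each modality into one shared space."""
    def __init__(self, input_dims: dict, common_dim: int = 256):
        super().__init__()
        self.proj = nn.ModuleDict({m: nn.Linear(d, common_dim)
                                   for m, d in input_dims.items()})

    def forward(self, feats: dict) -> torch.Tensor:
        # Project every available modality, then fuse by averaging so unseen
        # modality combinations at test time still yield a usable vector.
        projected = [self.proj[m](x) for m, x in feats.items()]
        return torch.stack(projected).mean(dim=0)

proj = CommonSpaceProjector({"video": 1024, "audio": 128})
fused = proj({"video": torch.randn(4, 1024), "audio": torch.randn(4, 128)})
print(fused.shape)  # torch.Size([4, 256])
```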
arXiv Detail & Related papers (2023-06-22T10:53:10Z)
- CLIP-Driven Fine-grained Text-Image Person Re-identification [50.94827165464813]
TIReID aims to retrieve the image corresponding to the given text query from a pool of candidate images.
We propose a CLIP-driven Fine-grained information excavation framework (CFine) to fully utilize the powerful knowledge of CLIP for TIReID.
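For context, the sketch below shows a generic CLIP-style retrieval step, ranking candidate images by cosine similarity to the text query; CFine's fine-grained information excavation is not reproduced, and the embeddings are placeholders.
```python
import torch
import torch.nn.functional as F

# Placeholder embeddings standing in for CLIP text/image encoder outputs.
query = F.normalize(torch.randn(1, 512), dim=-1)      # one text query
gallery = F.normalize(torch.randn(100, 512), dim=-1)  # candidate person images

# Rank candidates by cosine similarity to the text query.
scores = query @ gallery.T                 # shape (1, 100)
top5 = scores.topk(k=5, dim=-1).indices    # indices of the best-matching images
print(top5)
```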
arXiv Detail & Related papers (2022-10-19T03:43:12Z)
- Multi-Modal Mutual Information Maximization: A Novel Approach for Unsupervised Deep Cross-Modal Hashing [73.29587731448345]
We propose a novel method, dubbed Cross-Modal Info-Max Hashing (CMIMH).
We learn informative representations that can preserve both intra- and inter-modal similarities.
The proposed method consistently outperforms other state-of-the-art cross-modal retrieval methods.
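The summary does not detail the objective; as an illustration of the mutual-information-maximization idea only, the sketch below uses a generic InfoNCE loss, whose minimization maximizes a lower bound on the mutual information between paired modalities.
```python
import torch
import torch.nn.functional as F

def info_nce(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.07):
    """Generic InfoNCE loss between paired embeddings from two modalities.
    Minimizing it maximizes a lower bound on their mutual information."""
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.T / temperature        # (N, N) cross-modal similarities
    targets = torch.arange(z_a.size(0))       # i-th sample in A pairs with i-th in B
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(32, 64), torch.randn(32, 64))
```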
arXiv Detail & Related papers (2021-12-13T08:58:03Z)
- UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning [28.89401350391015]
We propose a unified-modal pre-training architecture, namely UNIMO, which can adapt to both single-modal and multi-modal understanding and generation tasks.
As non-paired single-modal data is very rich, our model can utilize a much larger scale of data to learn more generalizable representations.
arXiv Detail & Related papers (2020-12-31T02:46:47Z)
- Cross-modal Center Loss [28.509817129759014]
Cross-modal retrieval aims to learn discriminative and modal-invariant features for data from different modalities.
We propose an approach to jointly train the components of a cross-modal retrieval framework with metadata.
The proposed framework significantly outperforms the state-of-the-art methods on the ModelNet40 dataset.
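The summary only names the loss, so the sketch below shows a generic cross-modal center loss in the classic center-loss style: features from any modality of a class are pulled toward one shared learnable class center. It is an illustration, not necessarily the paper's formulation.
```python
import torch
import torch.nn as nn

class CrossModalCenterLoss(nn.Module):
    """Pull features from every modality toward one learnable center per class."""
    def __init__(self, num_classes: int, feat_dim: int):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # feats may come from any modality (image, point cloud, mesh, ...);
        # the same class center is shared across modalities.
        return ((feats - self.centers[labels]) ** 2).sum(dim=1).mean()

loss_fn = CrossModalCenterLoss(num_classes=40, feat_dim=128)   # e.g. ModelNet40 classes
loss = loss_fn(torch.randn(16, 128), torch.randint(0, 40, (16,)))
```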
arXiv Detail & Related papers (2020-08-08T17:26:35Z)