Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal
Data
- URL: http://arxiv.org/abs/2401.08567v1
- Date: Tue, 16 Jan 2024 18:52:27 GMT
- Title: Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal
Data
- Authors: Yuhui Zhang, Elaine Sui, Serena Yeung-Levy
- Abstract summary: Building cross-modal applications is challenging due to limited paired multi-modal data.
Recent works have shown that leveraging a pre-trained multi-modal contrastive representation space enables cross-modal tasks to be learned from uni-modal data.
We introduce a three-step method, $C^3$ (Connect, Collapse, Corrupt), to bridge the modality gap, enhancing the interchangeability of embeddings.
- Score: 10.908771426089512
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Building cross-modal applications is challenging due to limited paired
multi-modal data. Recent works have shown that leveraging a pre-trained
multi-modal contrastive representation space enables cross-modal tasks to be
learned from uni-modal data. This is based on the assumption that contrastive
optimization makes embeddings from different modalities interchangeable.
However, this assumption is under-explored due to the poorly understood
geometry of the multi-modal contrastive space, where a modality gap exists. In
our study, we provide a theoretical explanation of this space's geometry and
introduce a three-step method, $C^3$ (Connect, Collapse, Corrupt), to bridge
the modality gap, enhancing the interchangeability of embeddings. Our $C^3$
method significantly improves cross-modal learning from uni-modal data,
achieving state-of-the-art results on zero-shot image / audio / video
captioning and text-to-image generation.
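For intuition, the following is a minimal sketch of the three $C^3$ steps as described above, assuming a CLIP-style encoder pair, mean-centering for the Collapse step, and Gaussian noise for the Corrupt step; the function names, the noise scale `sigma`, and these specific design choices are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def connect(image_encoder, text_encoder, images, texts):
    """Connect: embed both modalities with a pre-trained contrastive model
    and L2-normalize onto the shared unit sphere."""
    img_emb = F.normalize(image_encoder(images), dim=-1)
    txt_emb = F.normalize(text_encoder(texts), dim=-1)
    return img_emb, txt_emb

def collapse(emb, modality_mean):
    """Collapse: reduce the modality gap by subtracting the modality's mean
    embedding so both modalities share a common center (illustrative choice)."""
    return F.normalize(emb - modality_mean, dim=-1)

def corrupt(emb, sigma=0.1):
    """Corrupt: add Gaussian noise to the uni-modal training embeddings so a
    downstream decoder learns to tolerate any residual gap at test time."""
    return F.normalize(emb + sigma * torch.randn_like(emb), dim=-1)
```

Under this reading, a captioner would be trained only on corrupt(collapse(txt_emb, txt_mean)) and, at inference, fed collapse(img_emb, img_mean) in place of the text embedding; how the per-modality means are estimated is left as an assumption in this sketch.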
Related papers
- Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion [13.696706205837238]
Pre-trained multi-modal Vision-Language Models like CLIP are widely used off-the-shelf for a variety of applications.
We argue that this intra-modal misalignment is inherently due to the CLIP-style inter-modal contrastive loss, which does not enforce any intra-modal constraints.
We empirically show that, in the intra-modal tasks of image-to-image and text-to-text retrieval, approaching these tasks inter-modally significantly improves performance.
arXiv Detail & Related papers (2025-02-06T17:58:59Z) - MINIMA: Modality Invariant Image Matching [52.505282811925454]
We present MINIMA, a unified image matching framework for multiple cross-modal cases.
We scale up the modalities from cheap but rich RGB-only matching data by means of generative models, yielding the synthetic multi-modal dataset MD-syn.
With MD-syn, any advanced matching pipeline can be trained directly on randomly selected modality pairs to obtain cross-modal ability.
arXiv Detail & Related papers (2024-12-27T02:39:50Z) - Gramian Multimodal Representation Learning and Alignment [5.793118803623239]
We present the novel Gramian Representation Alignment Measure (GRAM), which learns and aligns $n$ modalities directly in the higher-dimensional space in which the modality embeddings lie.
The novel GRAM-based contrastive loss function enhances the alignment of multimodal models in this higher-dimensional embedding space (a minimal sketch of the Gramian volume idea follows this list).
arXiv Detail & Related papers (2024-12-16T16:41:51Z) - LLMs Can Evolve Continually on Modality for X-Modal Reasoning [62.2874638875554]
Existing methods rely heavily on modal-specific pretraining and joint-modal tuning, leading to significant computational burdens when expanding to new modalities.
We propose PathWeave, a flexible and scalable framework with modal-Path sWitching and ExpAnsion abilities.
PathWeave performs comparably to state-of-the-art MLLMs while concurrently reducing parameter training burdens by 98.73%.
arXiv Detail & Related papers (2024-10-26T13:19:57Z) - MMPareto: Boosting Multimodal Learning with Innocent Unimodal Assistance [10.580712937465032]
We identify the previously ignored gradient conflict between multimodal and unimodal learning objectives.
We propose the MMPareto algorithm, which ensures that the final gradient has a direction common to all learning objectives.
Our method is also expected to facilitate multi-task cases with a clear discrepancy in task difficulty.
arXiv Detail & Related papers (2024-05-28T01:19:13Z) - Cross-BERT for Point Cloud Pretraining [61.762046503448936]
We propose a new cross-modal BERT-style self-supervised learning paradigm, called Cross-BERT.
To facilitate pretraining for irregular and sparse point clouds, we design two self-supervised tasks to boost cross-modal interaction.
Our work highlights the effectiveness of leveraging cross-modal 2D knowledge to strengthen 3D point cloud representation and the transferable capability of BERT across modalities.
arXiv Detail & Related papers (2023-12-08T08:18:12Z) - Multimodal Representation Learning by Alternating Unimodal Adaptation [73.15829571740866]
We propose MLA (Multimodal Learning with Alternating Unimodal Adaptation) to overcome challenges where some modalities appear more dominant than others during multimodal learning.
MLA reframes the conventional joint multimodal learning process by transforming it into an alternating unimodal learning process.
It captures cross-modal interactions through a shared head, which undergoes continuous optimization across different modalities.
Experiments are conducted on five diverse datasets, encompassing scenarios with complete modalities and scenarios with missing modalities.
arXiv Detail & Related papers (2023-11-17T18:57:40Z) - Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z) - CLIP-Driven Fine-grained Text-Image Person Re-identification [50.94827165464813]
Text-Image person Re-identification (TIReID) aims to retrieve the image corresponding to a given text query from a pool of candidate images.
We propose a CLIP-driven Fine-grained information excavation framework (CFine) to fully utilize the powerful knowledge of CLIP for TIReID.
arXiv Detail & Related papers (2022-10-19T03:43:12Z) - Cross-modal Center Loss [28.509817129759014]
Cross-modal retrieval aims to learn discriminative and modal-invariant features for data from different modalities.
We propose an approach to jointly train the components of a cross-modal retrieval framework with metadata.
The proposed framework significantly outperforms the state-of-the-art methods on the ModelNet40 dataset.
arXiv Detail & Related papers (2020-08-08T17:26:35Z)
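As a brief illustration of the Gramian measure described in the GRAM entry above, the sketch below scores the alignment of $n$ modality embeddings by the volume of the parallelotope they span, computed from the Gram matrix determinant; this follows the summary's description and standard linear algebra, but the paper's exact loss formulation may differ.

```python
import torch
import torch.nn.functional as F

def gram_volume(embeddings: torch.Tensor) -> torch.Tensor:
    """embeddings: (n_modalities, d) tensor, one row per modality.

    Returns the volume of the parallelotope spanned by the L2-normalized
    modality vectors, i.e. sqrt(det(A A^T)). A smaller volume means the
    vectors are closer to collinear, i.e. better aligned."""
    A = F.normalize(embeddings, dim=-1)   # (n, d), rows on the unit sphere
    gram = A @ A.T                        # (n, n) Gram matrix of inner products
    return torch.sqrt(torch.clamp(torch.det(gram), min=0.0))
```

A GRAM-style contrastive loss would then favor small volumes for matching tuples and larger volumes for mismatched ones (an assumption based on the summary above).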