Extending Multi-modal Contrastive Representations
- URL: http://arxiv.org/abs/2310.08884v1
- Date: Fri, 13 Oct 2023 06:34:23 GMT
- Title: Extending Multi-modal Contrastive Representations
- Authors: Zehan Wang, Ziang Zhang, Luping Liu, Yang Zhao, Haifeng Huang, Tao
Jin, Zhou Zhao
- Abstract summary: Multimodal contrastive representation (MCR) of more than three modalities is critical in multi-modal learning.
Inspired by recent C-MCR, this paper proposes Extending Multimodal Contrastive Representation (Ex-MCR)
Ex-MCR is a training-efficient and paired-data-free method that flexibly learns a unified contrastive representation space for more than three modalities.
- Score: 53.923340739349314
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-modal contrastive representation (MCR) of more than three modalities is
critical in multi-modal learning. Although recent methods showcase impressive
achievements, the high dependence on large-scale, high-quality paired data and
the expensive training costs limit their further development. Inspired by
recent C-MCR, this paper proposes Extending Multimodal Contrastive
Representation (Ex-MCR), a training-efficient and paired-data-free method to
flexibly learn a unified contrastive representation space for more than three
modalities by integrating the knowledge of existing MCR spaces. Specifically,
Ex-MCR aligns multiple existing MCRs into the same base MCR, which can
effectively preserve the original semantic alignment of the base MCR. Besides,
we comprehensively enhance the entire learning pipeline for aligning MCR spaces
from the perspectives of training data, architecture, and learning objectives.
With the preserved original modality alignment and the enhanced space
alignment, Ex-MCR shows superior representation learning performance and
excellent modality extensibility. To demonstrate the effectiveness of Ex-MCR,
we align the MCR spaces of CLAP (audio-text) and ULIP (3D-vision) into the CLIP
(vision-text), leveraging the overlapping text and image modality,
respectively. Remarkably, without using any paired data, Ex-MCR learns a
3D-image-text-audio unified contrastive representation, and it achieves
state-of-the-art performance on audio-visual, 3D-image, audio-text, visual-text
retrieval, and 3D object classification tasks. More importantly, extensive
qualitative results further demonstrate the emergent semantic alignment between
the extended modalities (e.g., audio and 3D), which highlights the great
potential of modality extensibility.
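The core idea above, aligning CLAP (audio-text) and ULIP (3D-vision) into CLIP through the overlapping text and image modalities, can be illustrated with a minimal sketch. This is not the authors' implementation: the projector architecture, the InfoNCE objective, the embedding dimensions, and the placeholder tensors standing in for frozen-encoder outputs are all assumptions, and the actual Ex-MCR pipeline further enhances the training data, architecture, and learning objectives as the abstract notes.

```python
# Minimal sketch (assumptions throughout): align a CLAP-style audio-text space
# onto a CLIP-style vision-text space using ONLY the overlapping text modality.
# The tensors below are placeholders for embeddings produced offline by the
# frozen CLAP and CLIP text encoders on the same unpaired text corpus.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Projector(nn.Module):
    """Small MLP that maps CLAP-space vectors into CLIP space (illustrative sizes)."""

    def __init__(self, d_in: int = 512, d_out: int = 768, d_hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_out)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)


def info_nce(a: torch.Tensor, b: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of index-matched, unit-norm embeddings."""
    logits = (a @ b.t()) / tau
    labels = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


# Stand-ins for frozen-encoder outputs on the SAME text corpus (no paired data needed).
clap_text_emb = F.normalize(torch.randn(256, 512), dim=-1)  # CLAP text tower
clip_text_emb = F.normalize(torch.randn(256, 768), dim=-1)  # CLIP text tower

proj = Projector()
opt = torch.optim.AdamW(proj.parameters(), lr=1e-4)
for _ in range(100):  # tiny training loop, purely for illustration
    loss = info_nce(proj(clap_text_emb), clip_text_emb)
    opt.zero_grad()
    loss.backward()
    opt.step()

# At inference, audio goes through the SAME projector, so CLAP audio embeddings
# land in CLIP space and can be compared against CLIP image/text embeddings.
clap_audio_emb = F.normalize(torch.randn(8, 512), dim=-1)
audio_in_clip_space = proj(clap_audio_emb)  # shape (8, 768)
```

The same overlap-and-project recipe applied to ULIP's image modality is what, per the abstract, yields the unified 3D-image-text-audio space without any paired data across the extended modalities.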
Related papers
- Multi-modal Relation Distillation for Unified 3D Representation Learning [30.942281325891226]
Multi-modal Relation Distillation (MRD) is a tri-modal pre-training framework designed to distill reputable large Vision-Language Models (VLMs) into 3D backbones.
MRD aims to capture both intra-relations within each modality and cross-relations between different modalities, producing more discriminative 3D shape representations.
arXiv Detail & Related papers (2024-07-19T03:43:48Z)
- Dense Multimodal Alignment for Open-Vocabulary 3D Scene Understanding [39.55810156545949]
We propose a Dense Multimodal Alignment (DMA) framework to densely co-embed different modalities into a common space.
Our DMA method produces highly competitive open-vocabulary segmentation performance on various indoor and outdoor tasks.
arXiv Detail & Related papers (2024-07-13T05:39:17Z)
- Learning to Rank Onset-Occurring-Offset Representations for Micro-Expression Recognition [24.75382410411772]
This paper focuses on micro-expression recognition (MER).
It proposes a flexible and reliable deep learning method called learning to rank onset-occurring-offset representations (LTR3O).
arXiv Detail & Related papers (2023-10-07T03:09:53Z)
- Planting a SEED of Vision in Large Language Model [73.17530130368053]
We present SEED, an elaborate image tokenizer that empowers Large Language Models (LLMs) with the ability to SEE and Draw at the same time.
This version of SEED was trained in 5.7 days using only 64 V100 GPUs and 5M publicly available image-text pairs.
arXiv Detail & Related papers (2023-07-16T13:41:39Z)
- Language-free Compositional Action Generation via Decoupling Refinement [67.50452446686725]
We introduce a novel framework to generate compositional actions without reliance on language auxiliaries.
Our approach consists of three main components: Action Coupling, Conditional Action Generation, and Decoupling Refinement.
arXiv Detail & Related papers (2023-07-07T12:00:38Z)
- Connecting Multi-modal Contrastive Representations [50.26161419616139]
Multi-modal Contrastive Representation learning aims to encode different modalities into a semantically shared space.
This paper proposes Connecting Multi-modal Contrastive Representations (C-MCR), a novel training-efficient method for learning MCR without paired data.
C-MCR achieves state-of-the-art performance on audio-image retrieval, audio-visual source localization, and counterfactual audio-image recognition tasks; a minimal retrieval sketch in this style appears after this list.
arXiv Detail & Related papers (2023-05-22T09:44:39Z)
- Unify, Align and Refine: Multi-Level Semantic Alignment for Radiology Report Generation [48.723504098917324]
We propose an Unify, Align and then Refine (UAR) approach to learn multi-level cross-modal alignments.
We introduce three novel modules: Latent Space Unifier, Cross-modal Representation Aligner and Text-to-Image Refiner.
Experiments and analyses on the IU-Xray and MIMIC-CXR benchmark datasets demonstrate the superiority of UAR over various state-of-the-art methods.
arXiv Detail & Related papers (2023-03-28T12:42:12Z)
- SelfCoLearn: Self-supervised collaborative learning for accelerating dynamic MR imaging [15.575332712603172]
This paper proposes a self-supervised collaborative learning framework (SelfCoLearn) for accurate dynamic MR image reconstruction from undersampled k-space data.
The proposed framework is equipped with three important components: dual-network collaborative learning, re-undersampling data augmentation, and a specially designed co-training loss.
Results show that our method captures essential and inherent representations for direct reconstruction from undersampled k-space data.
arXiv Detail & Related papers (2022-08-08T04:01:26Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
The proposed framework significantly outperforms state-of-the-art methods without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
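The retrieval results reported by the main paper and by the C-MCR entry above all reduce, in a unified space, to cosine-similarity ranking between embeddings from different modalities. The sketch below is not the papers' evaluation code: the random placeholder embeddings, the 768-dimensional space, and Recall@K with index-matched ground truth are assumptions.

```python
# Minimal sketch of zero-shot cross-modal retrieval in a shared embedding space.
# Shapes, the metric (Recall@K), and the index-matched pairing are illustrative.
import torch
import torch.nn.functional as F


def recall_at_k(query_emb: torch.Tensor, gallery_emb: torch.Tensor, k: int = 5) -> float:
    """Fraction of queries whose ground-truth gallery item (same index) is in the
    top-k by cosine similarity."""
    q = F.normalize(query_emb, dim=-1)
    g = F.normalize(gallery_emb, dim=-1)
    sims = q @ g.t()                         # (num_queries, num_gallery) cosine similarities
    topk = sims.topk(k, dim=-1).indices      # top-k gallery indices per query
    targets = torch.arange(q.size(0)).unsqueeze(-1)
    return (topk == targets).any(dim=-1).float().mean().item()


# Example: audio queries against an image gallery, both already mapped into the
# shared (e.g., CLIP-anchored) representation space by their respective projectors.
audio_emb = torch.randn(100, 768)
image_emb = torch.randn(100, 768)
print(f"audio->image R@5: {recall_at_k(audio_emb, image_emb, k=5):.3f}")
```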