Extending Multi-modal Contrastive Representations
- URL: http://arxiv.org/abs/2310.08884v1
- Date: Fri, 13 Oct 2023 06:34:23 GMT
- Title: Extending Multi-modal Contrastive Representations
- Authors: Zehan Wang, Ziang Zhang, Luping Liu, Yang Zhao, Haifeng Huang, Tao
Jin, Zhou Zhao
- Abstract summary: Multimodal contrastive representation (MCR) of more than three modalities is critical in multi-modal learning.
Inspired by recent C-MCR, this paper proposes Extending Multimodal Contrastive Representation (Ex-MCR)
Ex-MCR is a training-efficient and paired-data-free method that flexibly learns a unified contrastive representation space for more than three modalities.
- Score: 53.923340739349314
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-modal contrastive representation (MCR) of more than three modalities is
critical in multi-modal learning. Although recent methods showcase impressive
achievements, the high dependence on large-scale, high-quality paired data and
the expensive training costs limit their further development. Inspired by
recent C-MCR, this paper proposes Extending Multimodal Contrastive
Representation (Ex-MCR), a training-efficient and paired-data-free method to
flexibly learn a unified contrastive representation space for more than three
modalities by integrating the knowledge of existing MCR spaces. Specifically,
Ex-MCR aligns multiple existing MCRs into the same base MCR, which can
effectively preserve the original semantic alignment of the base MCR. Besides,
we comprehensively enhance the entire learning pipeline for aligning MCR spaces
from the perspectives of training data, architecture, and learning objectives.
With the preserved original modality alignment and the enhanced space
alignment, Ex-MCR shows superior representation learning performance and
excellent modality extensibility. To demonstrate the effectiveness of Ex-MCR,
we align the MCR spaces of CLAP (audio-text) and ULIP (3D-vision) into the CLIP
(vision-text) space, leveraging the overlapping text and image modalities,
respectively. Remarkably, without using any paired data, Ex-MCR learns a
3D-image-text-audio unified contrastive representation, and it achieves
state-of-the-art performance on audio-visual, 3D-image, audio-text, visual-text
retrieval, and 3D object classification tasks. More importantly, extensive
qualitative results further demonstrate the emergent semantic alignment between
the extended modalities (e.g., audio and 3D), which highlights the great
potential of modality extensibility.
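To make the core alignment idea concrete, below is a minimal, hypothetical sketch of how one existing space (e.g. CLAP audio-text) could be projected into the base CLIP space through the overlapping text modality, without any audio-image paired data. The projector architecture, embedding dimensions, and the plain InfoNCE-plus-MSE objective are illustrative assumptions, not the paper's exact design.

```python
# Hypothetical sketch of aligning a CLAP-style space into a CLIP-style base
# space via their shared text modality (no audio-image pairs needed).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Projector(nn.Module):
    """Small MLP that maps CLAP embeddings into the CLIP space (assumed sizes)."""
    def __init__(self, dim_in=512, dim_out=512, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_in, hidden), nn.GELU(), nn.Linear(hidden, dim_out)
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def alignment_loss(clap_text, clip_text, temperature=0.07):
    """Pull projected CLAP text embeddings toward CLIP text embeddings of the
    same sentences; the overlapping text modality supplies the supervision."""
    logits = clap_text @ clip_text.t() / temperature
    labels = torch.arange(len(clap_text), device=clap_text.device)
    # symmetric InfoNCE plus a direct regression term (illustrative choice)
    nce = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
    return nce + F.mse_loss(clap_text, clip_text)

# Toy usage: in practice the embeddings would come from frozen CLAP / CLIP
# text encoders run over the same text corpus.
projector = Projector()
clap_text_emb = F.normalize(torch.randn(32, 512), dim=-1)  # stand-in CLAP text features
clip_text_emb = F.normalize(torch.randn(32, 512), dim=-1)  # stand-in CLIP text features
loss = alignment_loss(projector(clap_text_emb), clip_text_emb)
loss.backward()

# At inference, the same projector is applied to CLAP *audio* embeddings,
# placing audio in the CLIP space so it can be compared with image embeddings.
```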
Related papers
- LargeAD: Large-Scale Cross-Sensor Data Pretraining for Autonomous Driving [52.83707400688378]
LargeAD is a versatile and scalable framework designed for large-scale 3D pretraining across diverse real-world driving datasets.
Our framework leverages VFMs to extract semantically rich superpixels from 2D images, which are aligned with LiDAR point clouds to generate high-quality contrastive samples.
Our approach delivers significant performance improvements over state-of-the-art methods in both linear probing and fine-tuning for LiDAR-based segmentation and object detection.
arXiv Detail & Related papers (2025-01-07T18:59:59Z) - LiMoE: Mixture of LiDAR Representation Learners from Automotive Scenes [55.33167217384738]
LiMoE is a framework that integrates the Mixture of Experts (MoE) paradigm into LiDAR data representation learning.
Our approach consists of three stages: Image-to-LiDAR Pretraining, Contrastive Mixture Learning (CML), and Semantic Mixture Supervision (SMS).
arXiv Detail & Related papers (2025-01-07T18:59:58Z) - Is Contrastive Distillation Enough for Learning Comprehensive 3D Representations? [55.99654128127689]
Cross-modal contrastive distillation has recently been explored for learning effective 3D representations.
Existing methods focus primarily on modality-shared features, neglecting the modality-specific features during the pre-training process.
We propose a new framework, namely CMCR, to address these shortcomings.
arXiv Detail & Related papers (2024-12-12T06:09:49Z) - Multi-modal Relation Distillation for Unified 3D Representation Learning [30.942281325891226]
Multi-modal Relation Distillation (MRD) is a tri-modal pre-training framework designed to distill reputable large Vision-Language Models (VLMs) into 3D backbones.
MRD aims to capture both intra-relations within each modality as well as cross-relations between different modalities and produce more discriminative 3D shape representations.
arXiv Detail & Related papers (2024-07-19T03:43:48Z) - Dense Multimodal Alignment for Open-Vocabulary 3D Scene Understanding [39.55810156545949]
We propose a Dense Multimodal Alignment (DMA) framework to densely co-embed different modalities into a common space.
Our DMA method produces highly competitive open-vocabulary segmentation performance on various indoor and outdoor tasks.
arXiv Detail & Related papers (2024-07-13T05:39:17Z) - Multi-View Large Reconstruction Model via Geometry-Aware Positional Encoding and Attention [54.66152436050373]
We propose a Multi-view Large Reconstruction Model (M-LRM) to reconstruct high-quality 3D shapes from multi-view images in a 3D-aware manner.
Specifically, we introduce a multi-view consistent cross-attention scheme to enable M-LRM to accurately query information from the input images.
Compared to previous methods, the proposed M-LRM can generate 3D shapes of high fidelity.
arXiv Detail & Related papers (2024-06-11T18:29:13Z) - Learning to Rank Onset-Occurring-Offset Representations for Micro-Expression Recognition [24.75382410411772]
This paper focuses on micro-expression recognition (MER).
It proposes a flexible and reliable deep learning method called learning to rank onset-occurring-offset representations (LTR3O).
arXiv Detail & Related papers (2023-10-07T03:09:53Z) - Connecting Multi-modal Contrastive Representations [50.26161419616139]
Multi-modal Contrastive Representation learning aims to encode different modalities into a semantically shared space.
This paper proposes a novel training-efficient method for learning MCR without paired data, called Connecting Multi-modal Contrastive Representations (C-MCR).
C-MCR achieves audio-visual state-of-the-art performance on audio-image retrieval, audio-visual source localization, and counterfactual audio-image recognition tasks.
arXiv Detail & Related papers (2023-05-22T09:44:39Z) - SelfCoLearn: Self-supervised collaborative learning for accelerating dynamic MR imaging [15.575332712603172]
This paper proposes a self-supervised collaborative learning framework (SelfCoLearn) for accurate dynamic MR image reconstruction from undersampled k-space data.
The proposed framework is equipped with three important components, namely dual-network collaborative learning, reundersampling data augmentation, and a specially designed co-training loss.
Results show that our method possesses strong capabilities in capturing essential and inherent representations for direct reconstructions from the undersampled k-space data.
arXiv Detail & Related papers (2022-08-08T04:01:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.