Connecting Multi-modal Contrastive Representations
- URL: http://arxiv.org/abs/2305.14381v2
- Date: Thu, 19 Oct 2023 02:55:13 GMT
- Title: Connecting Multi-modal Contrastive Representations
- Authors: Zehan Wang, Yang Zhao, Xize Cheng, Haifeng Huang, Jiageng Liu, Li
Tang, Linjun Li, Yongqi Wang, Aoxiong Yin, Ziang Zhang, Zhou Zhao
- Abstract summary: Multi-modal Contrastive Representation learning aims to encode different modalities into a semantically shared space.
This paper proposes Connecting Multi-modal Contrastive Representations (C-MCR), a novel training-efficient method for learning MCR without paired data.
C-MCR achieves state-of-the-art audio-visual performance on audio-image retrieval, audio-visual source localization, and counterfactual audio-image recognition tasks.
- Score: 50.26161419616139
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-modal Contrastive Representation learning aims to encode different
modalities into a semantically aligned shared space. This paradigm shows
remarkable generalization ability on numerous downstream tasks across various
modalities. However, the reliance on massive high-quality data pairs limits its
further development on more modalities. This paper proposes a novel
training-efficient method for learning MCR without paired data called
Connecting Multi-modal Contrastive Representations (C-MCR). Specifically, given
two existing MCRs pre-trained on (A, B) and (B, C) modality pairs, we project
them to a new space and use the data from the overlapping modality B to align
the two MCRs in the new space. Meanwhile, since the modality pairs (A, B) and
(B, C) are already aligned within each MCR, the connection learned by the
overlapping modality can also be transferred to the non-overlapping modality pair
(A, C). To unleash the potential of C-MCR, we further introduce a
semantic-enhanced inter- and intra-MCR connection method. We first enhance the
semantic consistency and completion of embeddings across different modalities
for more robust alignment. Then we utilize the inter-MCR alignment to establish
the connection, and employ the intra-MCR alignment to better maintain the
connection for inputs from non-overlapping modalities. To demonstrate the
effectiveness of C-MCR, we connect CLIP and CLAP via texts to derive
audio-visual representations, and integrate CLIP and ULIP via images for
3D-language representations. Remarkably, without using any paired data, C-MCR
for audio-visual achieves state-of-the-art performance on audio-image
retrieval, audio-visual source localization, and counterfactual audio-image
recognition tasks. Furthermore, C-MCR for 3D-language also attains advanced
zero-shot 3D point cloud classification accuracy on ModelNet40.
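To make the connection step concrete, below is a minimal PyTorch sketch of the core inter-MCR alignment, written as an illustration rather than the authors' implementation. It assumes precomputed text embeddings from CLIP and CLAP for the same batch of captions (both taken to be 512-dimensional here), learns two small projectors into a shared space, and applies a symmetric InfoNCE loss so that the two views of each caption coincide; the paper's semantic-enhanced intra-MCR alignment is only crudely approximated by Gaussian noise on the inputs.
```python
# Minimal sketch of the C-MCR connection idea (illustration only, not the released code).
# Assumes two tensors of precomputed text embeddings for the same captions:
#   clip_txt: [N, 512] from CLIP's text encoder
#   clap_txt: [N, 512] from CLAP's text encoder
# Text is the overlapping modality B; after training, CLIP image embeddings and
# CLAP audio embeddings are sent through the same projectors to reach one space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Projector(nn.Module):
    """Small MLP mapping one MCR space into the shared connected space."""
    def __init__(self, dim_in, dim_out=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_in, dim_out), nn.GELU(),
                                 nn.Linear(dim_out, dim_out))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def info_nce(a, b, temperature=0.05):
    """Symmetric InfoNCE between two batches of unit-norm embeddings."""
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def connect_step(clip_txt, clap_txt, proj_clip, proj_clap, noise_std=0.01):
    # Gaussian noise is a stand-in for the paper's richer semantic enhancement.
    za = proj_clip(clip_txt + noise_std * torch.randn_like(clip_txt))
    zb = proj_clap(clap_txt + noise_std * torch.randn_like(clap_txt))
    # Inter-MCR alignment: the same caption, seen through CLIP and CLAP,
    # should land at the same point in the connected space.
    return info_nce(za, zb)

# Toy usage with stand-in batches (real training would use caption embeddings).
proj_clip, proj_clap = Projector(512), Projector(512)
opt = torch.optim.AdamW(list(proj_clip.parameters()) +
                        list(proj_clap.parameters()), lr=1e-4)
clip_txt, clap_txt = torch.randn(32, 512), torch.randn(32, 512)
loss = connect_step(clip_txt, clap_txt, proj_clip, proj_clap)
loss.backward(); opt.step()
```
Once the projectors are trained, CLIP image embeddings pass through the CLIP-side projector and CLAP audio embeddings through the CLAP-side projector, so zero-shot audio-image retrieval reduces to cosine similarity in the connected space, which is exactly the (A, C) transfer argument above.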
Related papers
- Multi-modal Relation Distillation for Unified 3D Representation Learning [30.942281325891226]
Multi-modal Relation Distillation (MRD) is a tri-modal pre-training framework designed to distill reputable large Vision-Language Models (VLM) into 3D backbones.
MRD aims to capture both intra-relations within each modality as well as cross-relations between different modalities and produce more discriminative 3D shape representations.
arXiv Detail & Related papers (2024-07-19T03:43:48Z)
- Multimodal Cross-Document Event Coreference Resolution Using Linear Semantic Transfer and Mixed-Modality Ensembles [8.233126457964834]
Event coreference resolution (ECR) is the task of determining whether distinct mentions of events are actually linked to the same underlying occurrence.
Here, we propose a multimodal cross-document event coreference resolution method that integrates visual and textual cues with a simple linear map between vision and language models (a minimal sketch of such a map appears after this list).
Our results demonstrate the utility of multimodal information in ECR for certain challenging coreference problems.
arXiv Detail & Related papers (2024-04-13T10:01:58Z)
- Unleashing Network Potentials for Semantic Scene Completion [50.95486458217653]
This paper proposes a novel SSC framework, the Adversarial Modality Modulation Network (AMMNet).
AMMNet introduces two core modules: a cross-modal modulation enabling the interdependence of gradient flows between modalities, and a customized adversarial training scheme leveraging dynamic gradient competition.
Extensive experimental results demonstrate that AMMNet outperforms state-of-the-art SSC methods by a large margin.
arXiv Detail & Related papers (2024-03-12T11:48:49Z)
- Extending Multi-modal Contrastive Representations [53.923340739349314]
Multimodal contrastive representation (MCR) of more than three modalities is critical in multi-modal learning.
Inspired by the recent C-MCR, this paper proposes Extending Multimodal Contrastive Representation (Ex-MCR).
Ex-MCR is a training-efficient and paired-data-free method that flexibly learns a unified contrastive representation space for more than three modalities.
arXiv Detail & Related papers (2023-10-13T06:34:23Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Plug-and-Play Regulators for Image-Text Matching [76.28522712930668]
Exploiting fine-grained correspondence and visual-semantic alignments has shown great potential in image-text matching.
We develop two simple but quite effective regulators which efficiently encode the message output to automatically contextualize and aggregate cross-modal representations.
Experiments on MSCOCO and Flickr30K datasets validate that they can bring an impressive and consistent R@1 gain on multiple models.
arXiv Detail & Related papers (2023-03-23T15:42:05Z)
- LMR-CBT: Learning Modality-fused Representations with CB-Transformer for Multimodal Emotion Recognition from Unaligned Multimodal Sequences [5.570499497432848]
We propose an efficient neural network to learn modality-fused representations with CB-Transformer (LMR-CBT) for multimodal emotion recognition.
We conduct word-aligned and unaligned experiments on three challenging datasets.
arXiv Detail & Related papers (2021-12-03T03:43:18Z)
- Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for Gesture Recognition [89.0152015268929]
We propose the first neural architecture search (NAS)-based method for RGB-D gesture recognition.
The proposed method includes two key components: 1) enhanced temporal representation via the 3D Central Difference Convolution (3D-CDC) family, and 2) optimized backbones for multi-modal-rate branches and lateral connections.
The resultant multi-rate network provides a new perspective to understand the relationship between RGB and depth modalities and their temporal dynamics.
arXiv Detail & Related papers (2020-08-21T10:45:09Z)
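The "simple linear map between vision and language models" in the event coreference entry above can be pictured with the short sketch below. It is only an assumption about how such a linear semantic transfer might be fit (ridge regression on paired vision/language embeddings); the dimensions and data are placeholders, not that paper's code.
```python
# Hypothetical sketch: fit a linear map W carrying vision-model embeddings into a
# language-model embedding space via ridge regression on paired examples.
import numpy as np

def fit_linear_transfer(X_vision, Y_language, lam=1e-2):
    """Closed-form solution of argmin_W ||X W - Y||^2 + lam * ||W||^2."""
    d = X_vision.shape[1]
    A = X_vision.T @ X_vision + lam * np.eye(d)
    B = X_vision.T @ Y_language
    return np.linalg.solve(A, B)  # shape: [d_vision, d_language]

# Stand-in paired embeddings (e.g., image regions and their textual mentions).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 768))   # vision embeddings
Y = rng.normal(size=(1000, 1024))  # language embeddings
W = fit_linear_transfer(X, Y)
projected = X[:5] @ W              # vision points expressed in the language space
```
Because the map is linear, it can be fit in closed form from modest amounts of paired data and then applied to any new vision embedding.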
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.