Multimodal Clustering Networks for Self-supervised Learning from
Unlabeled Videos
- URL: http://arxiv.org/abs/2104.12671v1
- Date: Mon, 26 Apr 2021 15:55:01 GMT
- Title: Multimodal Clustering Networks for Self-supervised Learning from
Unlabeled Videos
- Authors: Brian Chen, Andrew Rouditchenko, Kevin Duarte, Hilde Kuehne, Samuel
Thomas, Angie Boggust, Rameswar Panda, Brian Kingsbury, Rogerio Feris, David
Harwath, James Glass, Michael Picheny, Shih-Fu Chang
- Abstract summary: This paper proposes a self-supervised training framework that learns a common multimodal embedding space.
We extend the concept of instance-level contrastive learning with a multimodal clustering step to capture semantic similarities across modalities.
The resulting embedding space enables retrieval of samples across all modalities, even from unseen datasets and different domains.
- Score: 69.61522804742427
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal self-supervised learning is attracting increasing
attention, as it makes it possible not only to train large networks without
human supervision but also to search for and retrieve data across various
modalities. In this context, this paper
proposes a self-supervised training framework that learns a common multimodal
embedding space that, in addition to sharing representations across different
modalities, enforces a grouping of semantically similar instances. To this end,
we extend the concept of instance-level contrastive learning with a multimodal
clustering step in the training pipeline to capture semantic similarities
across modalities. The resulting embedding space enables retrieval of samples
across all modalities, even from unseen datasets and different domains. To
evaluate our approach, we train our model on the HowTo100M dataset and evaluate
its zero-shot retrieval capabilities in two challenging domains, namely
text-to-video retrieval and temporal action localization, showing
state-of-the-art results on four different datasets.
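As a rough illustration of the approach described above, pairing instance-level cross-modal contrastive learning with a clustering step over the joint embedding space might look like the minimal sketch below. The encoders are omitted, and the loss weights, temperature values, and k-means-style centroids are illustrative assumptions rather than the paper's exact implementation.

    # Minimal sketch (not the authors' exact implementation): instance-level
    # cross-modal contrastive learning combined with a clustering term over
    # the joint embedding space.
    import torch
    import torch.nn.functional as F


    def contrastive_loss(a, b, temperature=0.07):
        """Symmetric InfoNCE between two batches of paired modality embeddings."""
        a, b = F.normalize(a, dim=1), F.normalize(b, dim=1)
        logits = a @ b.t() / temperature                    # (B, B) cross-modal similarities
        targets = torch.arange(a.size(0), device=a.device)  # matched pairs lie on the diagonal
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))


    def clustering_loss(features, centroids, temperature=0.1):
        """Pull each embedding toward its nearest centroid (hard pseudo-label)."""
        features = F.normalize(features, dim=1)
        centroids = F.normalize(centroids, dim=1)
        logits = features @ centroids.t() / temperature     # (N, K) feature-to-centroid similarities
        pseudo_labels = logits.argmax(dim=1).detach()
        return F.cross_entropy(logits, pseudo_labels)


    def joint_loss(video, audio, text, centroids, lam=1.0):
        """Instance-level contrast over all modality pairs plus a clustering
        term on the concatenated embeddings, so that semantically similar
        instances are grouped together across modalities."""
        l_instance = (contrastive_loss(video, text) +
                      contrastive_loss(video, audio) +
                      contrastive_loss(audio, text))
        joint = torch.cat([video, audio, text], dim=0)
        return l_instance + lam * clustering_loss(joint, centroids)


    if __name__ == "__main__":
        B, D, K = 32, 256, 64                               # batch size, embedding dim, clusters
        video, audio, text = torch.randn(B, D), torch.randn(B, D), torch.randn(B, D)
        centroids = torch.randn(K, D)                       # e.g., refreshed periodically via k-means
        print(joint_loss(video, audio, text, centroids))

In such a setup the centroids would be re-estimated periodically (for example by running k-means over the joint embeddings), so that the clustering term enforces the grouping of semantically similar instances that the abstract describes.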
Related papers
- Reinforcement Learning Based Multi-modal Feature Fusion Network for
Novel Class Discovery [47.28191501836041]
In this paper, we employ a Reinforcement Learning framework to simulate the cognitive processes of humans.
We also deploy a Member-to-Leader Multi-Agent framework to extract and fuse features from multi-modal information.
We demonstrate the performance of our approach in both the 3D and 2D domains by employing the OS-MN40, OS-MN40-Miss, and Cifar10 datasets.
arXiv Detail & Related papers (2023-08-26T07:55:32Z)
- Preserving Modality Structure Improves Multi-Modal Learning [64.10085674834252]
Self-supervised learning on large-scale multi-modal datasets allows learning semantically meaningful embeddings without relying on human annotations.
However, these methods often struggle to generalize to out-of-domain data because they ignore the semantic structure present in modality-specific embeddings.
We propose a novel Semantic-Structure-Preserving Consistency approach to improve generalizability by preserving the modality-specific relationships in the joint embedding space.
arXiv Detail & Related papers (2023-08-24T20:46:48Z)
- Multi-network Contrastive Learning Based on Global and Local Representations [4.190134425277768]
This paper proposes a multi-network contrastive learning framework based on global and local representations.
We introduce global and local feature information for self-supervised contrastive learning through multiple networks.
The framework also expands the number of samples used for contrast and improves the training efficiency of the model.
arXiv Detail & Related papers (2023-06-28T05:30:57Z)
- Semi-supervised Multimodal Representation Learning through a Global Workspace [2.8948274245812335]
"Global Workspace" is a shared representation for two input modalities.
This architecture is amenable to self-supervised training via cycle-consistency.
We show that such an architecture can be trained to align and translate between two modalities with very little need for matched data.
arXiv Detail & Related papers (2023-06-27T12:41:36Z)
- Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z)
- Leveraging Ensembles and Self-Supervised Learning for Fully-Unsupervised Person Re-Identification and Text Authorship Attribution [77.85461690214551]
Learning from fully-unlabeled data is challenging in Multimedia Forensics problems, such as Person Re-Identification and Text Authorship Attribution.
Recent self-supervised learning methods have been shown to be effective when dealing with fully-unlabeled data in cases where the underlying classes have significant semantic differences.
We propose a strategy to tackle Person Re-Identification and Text Authorship Attribution by enabling learning from unlabeled data even when samples from different classes are not prominently diverse.
arXiv Detail & Related papers (2022-02-07T13:08:11Z)
- Semi-supervised Multi-task Learning for Semantics and Depth [88.77716991603252]
Multi-Task Learning (MTL) aims to improve model generalization by sharing representations between related tasks.
We propose a semi-supervised MTL method to leverage the available supervisory signals from different datasets.
We present a domain-aware discriminator structure with various alignment formulations to mitigate the domain discrepancy issue among datasets.
arXiv Detail & Related papers (2021-10-14T07:43:39Z)
- Generalized Zero-Shot Learning using Multimodal Variational Auto-Encoder with Semantic Concepts [0.9054540533394924]
Recent techniques try to learn a cross-modal mapping between the semantic space and the image space.
We propose a Multimodal Variational Auto-Encoder (M-VAE) which can learn the shared latent space of image features and the semantic space.
Our results show that our proposed model outperforms the current state-of-the-art approaches for generalized zero-shot learning.
arXiv Detail & Related papers (2021-06-26T20:08:37Z)
- DiVA: Diverse Visual Feature Aggregation for Deep Metric Learning [83.48587570246231]
Visual Similarity plays an important role in many computer vision applications.
Deep metric learning (DML) is a powerful framework for learning such similarities.
We propose and study multiple complementary learning tasks, targeting conceptually different data relationships.
We learn a single model to aggregate their training signals, resulting in strong generalization and state-of-the-art performance.
arXiv Detail & Related papers (2020-04-28T12:26:50Z)
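The aggregation idea in the DiVA entry above, i.e. training a single embedding model on several complementary signals, can be illustrated with a generic multi-task deep metric learning objective such as the sketch below; the particular auxiliary task, loss forms, and weights are assumptions made for illustration, not DiVA's actual task definitions.

    # Illustrative sketch only: one embedding model trained on complementary
    # signals (a supervised metric term plus a self-supervised view-matching
    # term). The specific objectives and weights are assumptions, not the
    # tasks proposed in DiVA.
    import torch
    import torch.nn.functional as F


    def hard_triplet_loss(emb, labels, margin=0.2):
        """Supervised term: pull same-class pairs together, push other classes apart."""
        emb = F.normalize(emb, dim=1)
        dist = torch.cdist(emb, emb)                                  # pairwise distances
        same = labels.unsqueeze(0) == labels.unsqueeze(1)
        hardest_pos = (dist * same.float()).max(dim=1).values         # farthest same-class sample
        hardest_neg = (dist + same.float() * 1e6).min(dim=1).values   # closest other-class sample
        return F.relu(hardest_pos - hardest_neg + margin).mean()


    def view_matching_loss(emb_a, emb_b, temperature=0.07):
        """Self-supervised term: two augmented views of the same image should align."""
        a, b = F.normalize(emb_a, dim=1), F.normalize(emb_b, dim=1)
        logits = a @ b.t() / temperature
        targets = torch.arange(a.size(0), device=a.device)
        return F.cross_entropy(logits, targets)


    def aggregated_loss(emb, emb_aug, labels, w_sup=1.0, w_ssl=0.5):
        """Sum of complementary training signals for a single embedding model."""
        return w_sup * hard_triplet_loss(emb, labels) + w_ssl * view_matching_loss(emb, emb_aug)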
This list is automatically generated from the titles and abstracts of the papers on this site.