Unsupervised Voice-Face Representation Learning by Cross-Modal Prototype
Contrast
- URL: http://arxiv.org/abs/2204.14057v2
- Date: Mon, 2 May 2022 01:58:12 GMT
- Title: Unsupervised Voice-Face Representation Learning by Cross-Modal Prototype
Contrast
- Authors: Boqing Zhu, Kele Xu, Changjian Wang, Zheng Qin, Tao Sun, Huaimin Wang,
Yuxing Peng
- Abstract summary: We present an approach to learn voice-face representations from talking-face videos, without any identity labels.
Previous works employ cross-modal instance discrimination tasks to establish the correlation between voice and face.
We propose cross-modal prototype contrastive learning (CMPC), which retains the advantages of contrastive methods while resisting the adverse effects of false negatives and deviated positives.
- Score: 34.58856143210749
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present an approach to learn voice-face representations from
talking-face videos, without any identity labels. Previous works employ
cross-modal instance discrimination tasks to establish the correlation
between voice and face. These methods neglect the semantic content of
different videos, introducing false-negative pairs as training noise.
Furthermore, the positive pairs are constructed from the natural correlation
between audio clips and visual frames. However, this correlation can be weak
or inaccurate in a large amount of real-world data, which introduces
deviated positives into the contrastive paradigm. To address these issues,
we propose cross-modal prototype contrastive learning (CMPC), which retains
the advantages of contrastive methods while resisting the adverse effects of
false negatives and deviated positives. On the one hand, CMPC learns
intra-class invariance by constructing semantic-wise positives via
unsupervised clustering in each modality. On the other hand, by comparing
the similarities of cross-modal instances with those of cross-modal
prototypes, we dynamically recalibrate the contribution of unlearnable
instances to the overall loss. Experiments show that the proposed approach
outperforms state-of-the-art unsupervised methods on various voice-face
association evaluation protocols. In the low-shot supervision setting, our
method also improves significantly over previous instance-wise contrastive
learning.
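The two core ideas of the abstract, semantic-wise positives built from per-modality clustering and a prototype-guided down-weighting of unreliable pairs, can be sketched as follows. This is a minimal numpy illustration, not the authors' implementation: the cluster assignments are assumed to come from an off-the-shelf clustering step (e.g. k-means), and the sigmoid gate in `instance_weights` is a hypothetical heuristic standing in for the paper's exact recalibration rule.

```python
import numpy as np

def l2norm(x):
    # project embeddings onto the unit sphere (cosine-similarity geometry)
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def prototypes(emb, assign, k):
    # per-cluster mean embedding, re-normalized to the unit sphere
    return l2norm(np.stack([emb[assign == c].mean(axis=0) for c in range(k)]))

def cmpc_loss(voice, face, assign, k, tau=0.1):
    # Each voice embedding is contrasted against the *face prototypes*:
    # the positive is the face prototype of the voice's own cluster, so
    # instances sharing semantics are no longer treated as negatives.
    voice, face = l2norm(voice), l2norm(face)
    face_proto = prototypes(face, assign, k)            # (k, d)
    logits = voice @ face_proto.T / tau                 # (n, k)
    logits -= logits.max(axis=1, keepdims=True)         # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(voice)), assign].mean()

def instance_weights(voice, face, assign, k, tau=0.1):
    # Illustrative recalibration: a pair whose instance-level cross-modal
    # similarity falls well below its cluster-prototype similarity is
    # treated as a likely deviated positive and contributes less.
    voice, face = l2norm(voice), l2norm(face)
    vp = prototypes(voice, assign, k)
    fp = prototypes(face, assign, k)
    inst_sim = (voice * face).sum(axis=1)               # per-pair similarity
    proto_sim = (vp * fp).sum(axis=1)[assign]           # per-cluster reference
    return 1.0 / (1.0 + np.exp(-(inst_sim - proto_sim) / tau))
```

In a full pipeline the weights would multiply each pair's term in the instance-level contrastive loss, so clean pairs dominate training while weakly correlated audio-visual pairs are softly discounted rather than discarded.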
Related papers
- KDMCSE: Knowledge Distillation Multimodal Sentence Embeddings with Adaptive Angular margin Contrastive Learning [31.139620652818838]
We propose KDMCSE, a novel approach that enhances the discrimination and generalizability of multimodal representation.
We also introduce a new contrastive objective, AdapACSE, that enhances the discriminative representation by strengthening the margin within the angular space.
arXiv Detail & Related papers (2024-03-26T08:32:39Z)
- DenoSent: A Denoising Objective for Self-Supervised Sentence Representation Learning [59.4644086610381]
We propose a novel denoising objective that inherits from another perspective, i.e., the intra-sentence perspective.
By introducing both discrete and continuous noise, we generate noisy sentences and then train our model to restore them to their original form.
Our empirical evaluations demonstrate that this approach delivers competitive results on both semantic textual similarity (STS) and a wide range of transfer tasks.
arXiv Detail & Related papers (2024-01-24T17:48:45Z)
- MarginNCE: Robust Sound Localization with a Negative Margin [23.908770938403503]
The goal of this work is to localize sound sources in visual scenes with a self-supervised approach.
We show that using a less strict decision boundary in contrastive learning can alleviate the effect of noisy correspondences in sound source localization.
arXiv Detail & Related papers (2022-11-03T16:44:14Z)
- Extending Momentum Contrast with Cross Similarity Consistency Regularization [5.085461418671174]
We present Extended Momentum Contrast, a self-supervised representation learning method built upon the momentum encoder proposed in the MoCo family of methods.
Under the cross consistency regularization rule, we argue that semantic representations associated with any pair of images (positive or negative) should preserve their cross-similarity.
We report a competitive performance on the standard Imagenet-1K linear head classification benchmark.
arXiv Detail & Related papers (2022-06-07T20:06:56Z)
- Robust Contrastive Learning against Noisy Views [79.71880076439297]
We propose a new contrastive loss function that is robust against noisy views.
We show that our approach provides consistent improvements over the state-of-the-art image, video, and graph contrastive learning benchmarks.
arXiv Detail & Related papers (2022-01-12T05:24:29Z)
- Similarity Contrastive Estimation for Self-Supervised Soft Contrastive Learning [0.41998444721319206]
We argue that a good data representation contains the relations, or semantic similarity, between the instances.
We propose a novel formulation of contrastive learning using semantic similarity between instances, called Similarity Contrastive Estimation (SCE).
Our training objective can be considered as soft contrastive learning.
arXiv Detail & Related papers (2021-11-29T15:19:15Z)
- Contrastive Learning for Fair Representations [50.95604482330149]
Trained classification models can unintentionally lead to biased representations and predictions.
Existing debiasing methods for classification models, such as adversarial training, are often expensive to train and difficult to optimise.
We propose a method for mitigating bias by incorporating contrastive learning, in which instances sharing the same class label are encouraged to have similar representations.
arXiv Detail & Related papers (2021-09-22T10:47:51Z)
- Incremental False Negative Detection for Contrastive Learning [95.68120675114878]
We introduce a novel incremental false negative detection for self-supervised contrastive learning.
During contrastive learning, we discuss two strategies to explicitly remove the detected false negatives.
Our proposed method outperforms other self-supervised contrastive learning frameworks on multiple benchmarks with limited compute.
arXiv Detail & Related papers (2021-06-07T15:29:14Z)
- Robust Audio-Visual Instance Discrimination [79.74625434659443]
We present a self-supervised learning method to learn audio and video representations.
We address the problems of audio-visual instance discrimination and improve transfer learning performance.
arXiv Detail & Related papers (2021-03-29T19:52:29Z)
- Audio-Visual Instance Discrimination with Cross-Modal Agreement [90.95132499006498]
We present a self-supervised learning approach to learn audio-visual representations from video and audio.
We show that optimizing for cross-modal discrimination, rather than within-modal discrimination, is important to learn good representations from video and audio.
arXiv Detail & Related papers (2020-04-27T16:59:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.