Seeing voices and hearing voices: learning discriminative embeddings
using cross-modal self-supervision
- URL: http://arxiv.org/abs/2004.14326v2
- Date: Wed, 6 May 2020 14:56:36 GMT
- Title: Seeing voices and hearing voices: learning discriminative embeddings
using cross-modal self-supervision
- Authors: Soo-Whan Chung, Hong Goo Kang, Joon Son Chung
- Abstract summary: We build on earlier work to train embeddings that are more discriminative for uni-modal downstream tasks.
We propose a novel training strategy that not only optimises metrics across modalities, but also enforces intra-class feature separation within each of the modalities.
The effectiveness of the method is demonstrated on two downstream tasks: lip reading using the features trained on audio-visual synchronisation, and speaker recognition using the features trained for cross-modal biometric matching.
- Score: 44.88044155505332
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of this work is to train discriminative cross-modal embeddings
without access to manually annotated data. Recent advances in self-supervised
learning have shown that effective representations can be learnt from natural
cross-modal synchrony. We build on earlier work to train embeddings that are
more discriminative for uni-modal downstream tasks. To this end, we propose a
novel training strategy that not only optimises metrics across modalities, but
also enforces intra-class feature separation within each of the modalities. The
effectiveness of the method is demonstrated on two downstream tasks: lip
reading using the features trained on audio-visual synchronisation, and speaker
recognition using the features trained for cross-modal biometric matching. The
proposed method outperforms state-of-the-art self-supervised baselines by a
significant margin.
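To make the described training strategy more concrete, below is a minimal illustrative sketch (in PyTorch) of a loss of this kind: a cross-modal contrastive term over synchronised audio-video pairs combined with an intra-modal separation term so that features remain discriminative for uni-modal downstream tasks. This is a sketch under our own assumptions, not the authors' exact objective; the temperature, margin, weighting, and names such as cross_modal_with_intra_separation are hypothetical choices for the example.
```python
# Illustrative sketch only: a cross-modal matching loss plus an intra-modal
# separation term, in the spirit of the abstract. Hyperparameters are hypothetical.
import torch
import torch.nn.functional as F


def cross_modal_with_intra_separation(audio_emb, video_emb, margin=0.6, alpha=1.0):
    """audio_emb, video_emb: (N, D) embeddings of N paired audio/video clips.

    Returns cross-modal InfoNCE + alpha * intra-modal separation (scalar).
    """
    a = F.normalize(audio_emb, dim=1)
    v = F.normalize(video_emb, dim=1)

    # Cross-modal term: each audio clip should match its own video clip
    # against all other clips in the batch (and vice versa).
    logits = a @ v.t() / 0.07                      # (N, N) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    cross = 0.5 * (F.cross_entropy(logits, targets) +
                   F.cross_entropy(logits.t(), targets))

    # Intra-modal term: penalise pairs of *different* clips within the same
    # modality whose similarity exceeds a margin, pushing them apart.
    def separation(x):
        sim = x @ x.t()
        off_diag = sim - torch.diag_embed(torch.diagonal(sim))  # zero the diagonal
        return F.relu(off_diag - margin).mean()

    intra = separation(a) + separation(v)
    return cross + alpha * intra
```
In practice the two embeddings would come from the audio and visual streams of a two-stream network trained on unlabelled video, with the cross-modal term exploiting natural audio-visual synchrony as the supervisory signal.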
Related papers
- DenoSent: A Denoising Objective for Self-Supervised Sentence
Representation Learning [59.4644086610381]
We propose a novel denoising objective that approaches the problem from another perspective, namely the intra-sentence perspective.
By introducing both discrete and continuous noise, we generate noisy sentences and then train our model to restore them to their original form.
Our empirical evaluations demonstrate that this approach delivers competitive results on both semantic textual similarity (STS) and a wide range of transfer tasks.
arXiv Detail & Related papers (2024-01-24T17:48:45Z)
- Cross-modal Audio-visual Co-learning for Text-independent Speaker Verification [55.624946113550195]
This paper proposes a cross-modal speech co-learning paradigm.
Two cross-modal boosters are introduced based on an audio-visual pseudo-siamese structure to learn the modality-transformed correlation.
Experimental results on the LRSLip3, GridLip, LomGridLip, and VoxLip datasets demonstrate that our proposed method achieves average relative performance improvements of 60% and 20%.
arXiv Detail & Related papers (2023-02-22T10:06:37Z)
- Robust Task Representations for Offline Meta-Reinforcement Learning via Contrastive Learning [21.59254848913971]
Offline meta-reinforcement learning is a reinforcement learning paradigm that learns from offline data to adapt to new tasks.
We propose a contrastive learning framework for task representations that are robust to the distribution of behavior policies in training and test.
Experiments on a variety of offline meta-reinforcement learning benchmarks demonstrate the advantages of our method over prior methods.
arXiv Detail & Related papers (2022-06-21T14:46:47Z)
- Pretext Tasks selection for multitask self-supervised speech representation learning [23.39079406674442]
This paper introduces a method to select a group of pretext tasks among a set of candidates.
Experiments conducted on speaker recognition and automatic speech recognition validate our approach.
arXiv Detail & Related papers (2021-07-01T16:36:29Z)
- LiRA: Learning Visual Speech Representations from Audio through Self-supervision [53.18768477520411]
We propose Learning visual speech Representations from Audio via self-supervision (LiRA).
Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech.
We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild dataset.
arXiv Detail & Related papers (2021-06-16T23:20:06Z)
- Self-Supervised Relational Reasoning for Representation Learning [5.076419064097733]
In self-supervised learning, a system is tasked with achieving a surrogate objective by defining alternative targets on unlabeled data.
We propose a novel self-supervised formulation of relational reasoning that allows a learner to bootstrap a signal from information implicit in unlabeled data.
We evaluate the proposed method following a rigorous experimental procedure, using standard datasets, protocols, and backbones.
arXiv Detail & Related papers (2020-06-10T14:24:25Z)
- Audio-Visual Instance Discrimination with Cross-Modal Agreement [90.95132499006498]
We present a self-supervised learning approach to learn audio-visual representations from video and audio.
We show that optimizing for cross-modal discrimination, rather than within-modal discrimination, is important to learn good representations from video and audio.
arXiv Detail & Related papers (2020-04-27T16:59:49Z)
- Disentangled Speech Embeddings using Cross-modal Self-supervision [119.94362407747437]
We develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video.
We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors.
arXiv Detail & Related papers (2020-02-20T14:13:12Z)