Seeing voices and hearing voices: learning discriminative embeddings
using cross-modal self-supervision
- URL: http://arxiv.org/abs/2004.14326v2
- Date: Wed, 6 May 2020 14:56:36 GMT
- Title: Seeing voices and hearing voices: learning discriminative embeddings
using cross-modal self-supervision
- Authors: Soo-Whan Chung, Hong Goo Kang, Joon Son Chung
- Abstract summary: We build on earlier work to train embeddings that are more discriminative for uni-modal downstream tasks.
We propose a novel training strategy that not only optimises metrics across modalities, but also enforces intra-class feature separation within each of the modalities.
The effectiveness of the method is demonstrated on two downstream tasks: lip reading using the features trained on audio-visual synchronisation, and speaker recognition using the features trained for cross-modal biometric matching.
- Score: 44.88044155505332
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of this work is to train discriminative cross-modal embeddings
without access to manually annotated data. Recent advances in self-supervised
learning have shown that effective representations can be learnt from natural
cross-modal synchrony. We build on earlier work to train embeddings that are
more discriminative for uni-modal downstream tasks. To this end, we propose a
novel training strategy that not only optimises metrics across modalities, but
also enforces intra-class feature separation within each of the modalities. The
effectiveness of the method is demonstrated on two downstream tasks: lip
reading using the features trained on audio-visual synchronisation, and speaker
recognition using the features trained for cross-modal biometric matching. The
proposed method outperforms state-of-the-art self-supervised baselines by a
significant margin.
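To make the described training strategy more concrete, below is a minimal illustrative sketch (in PyTorch) of a loss of this kind: a cross-modal contrastive term over synchronised audio-video pairs combined with an intra-modal separation term so that features remain discriminative for uni-modal downstream tasks. This is a sketch under our own assumptions, not the authors' exact objective; the temperature, margin, weighting, and names such as cross_modal_with_intra_separation are hypothetical choices for the example.
```python
# Illustrative sketch only: a cross-modal matching loss plus an intra-modal
# separation term, in the spirit of the abstract. Hyperparameters are hypothetical.
import torch
import torch.nn.functional as F


def cross_modal_with_intra_separation(audio_emb, video_emb, margin=0.6, alpha=1.0):
    """audio_emb, video_emb: (N, D) embeddings of N paired audio/video clips.

    Returns cross-modal InfoNCE + alpha * intra-modal separation (scalar).
    """
    a = F.normalize(audio_emb, dim=1)
    v = F.normalize(video_emb, dim=1)

    # Cross-modal term: each audio clip should match its own video clip
    # against all other clips in the batch (and vice versa).
    logits = a @ v.t() / 0.07                      # (N, N) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    cross = 0.5 * (F.cross_entropy(logits, targets) +
                   F.cross_entropy(logits.t(), targets))

    # Intra-modal term: penalise pairs of *different* clips within the same
    # modality whose similarity exceeds a margin, pushing them apart.
    def separation(x):
        sim = x @ x.t()
        off_diag = sim - torch.diag_embed(torch.diagonal(sim))  # zero the diagonal
        return F.relu(off_diag - margin).mean()

    intra = separation(a) + separation(v)
    return cross + alpha * intra
```
In practice the two embeddings would come from the audio and visual streams of a two-stream network trained on unlabelled video, with the cross-modal term exploiting natural audio-visual synchrony as the supervisory signal.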
Related papers
- DenoSent: A Denoising Objective for Self-Supervised Sentence
Representation Learning [59.4644086610381]
We propose a novel denoising objective that approaches the problem from another perspective, namely the intra-sentence perspective.
By introducing both discrete and continuous noise, we generate noisy sentences and then train our model to restore them to their original form.
Our empirical evaluations demonstrate that this approach delivers competitive results on both semantic textual similarity (STS) and a wide range of transfer tasks.
arXiv Detail & Related papers (2024-01-24T17:48:45Z)
- Cross-modal Audio-visual Co-learning for Text-independent Speaker Verification [55.624946113550195]
This paper proposes a cross-modal speech co-learning paradigm.
Two cross-modal boosters are introduced based on an audio-visual pseudo-siamese structure to learn the modality-transformed correlation.
Experimental results on the LRSLip3, GridLip, LomGridLip, and VoxLip datasets demonstrate that our proposed method achieves average relative performance improvements of 60% and 20%.
arXiv Detail & Related papers (2023-02-22T10:06:37Z)
- Robust Task Representations for Offline Meta-Reinforcement Learning via Contrastive Learning [21.59254848913971]
Offline meta-reinforcement learning is a reinforcement learning paradigm that learns from offline data to adapt to new tasks.
We propose a contrastive learning framework for task representations that are robust to the distribution of behavior policies in training and test.
Experiments on a variety of offline meta-reinforcement learning benchmarks demonstrate the advantages of our method over prior methods.
arXiv Detail & Related papers (2022-06-21T14:46:47Z)
- Pretext Tasks selection for multitask self-supervised speech representation learning [23.39079406674442]
This paper introduces a method to select a group of pretext tasks among a set of candidates.
Experiments conducted on speaker recognition and automatic speech recognition validate our approach.
arXiv Detail & Related papers (2021-07-01T16:36:29Z)
- LiRA: Learning Visual Speech Representations from Audio through Self-supervision [53.18768477520411]
We propose Learning visual speech Representations from Audio via self-supervision (LiRA).
Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech.
We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild dataset.
arXiv Detail & Related papers (2021-06-16T23:20:06Z)
- Self-Supervised Relational Reasoning for Representation Learning [5.076419064097733]
In self-supervised learning, a system is tasked with achieving a surrogate objective by defining alternative targets on unlabeled data.
We propose a novel self-supervised formulation of relational reasoning that allows a learner to bootstrap a signal from information implicit in unlabeled data.
We evaluate the proposed method following a rigorous experimental procedure, using standard datasets, protocols, and backbones.
arXiv Detail & Related papers (2020-06-10T14:24:25Z)
- Audio-Visual Instance Discrimination with Cross-Modal Agreement [90.95132499006498]
We present a self-supervised learning approach to learn audio-visual representations from video and audio.
We show that optimizing for cross-modal discrimination, rather than within-modal discrimination, is important to learn good representations from video and audio.
arXiv Detail & Related papers (2020-04-27T16:59:49Z)
- Disentangled Speech Embeddings using Cross-modal Self-supervision [119.94362407747437]
We develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video.
We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors.
arXiv Detail & Related papers (2020-02-20T14:13:12Z)