Disentangled Speech Embeddings using Cross-modal Self-supervision
- URL: http://arxiv.org/abs/2002.08742v2
- Date: Mon, 4 May 2020 15:01:49 GMT
- Title: Disentangled Speech Embeddings using Cross-modal Self-supervision
- Authors: Arsha Nagrani, Joon Son Chung, Samuel Albanie, Andrew Zisserman
- Abstract summary: We develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video.
We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors.
- Score: 119.94362407747437
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The objective of this paper is to learn representations of speaker identity
without access to manually annotated data. To do so, we develop a
self-supervised learning objective that exploits the natural cross-modal
synchrony between faces and audio in video. The key idea behind our approach is
to tease apart--without annotation--the representations of linguistic content
and speaker identity. We construct a two-stream architecture which: (1) shares
low-level features common to both representations; and (2) provides a natural
mechanism for explicitly disentangling these factors, offering the potential
for greater generalisation to novel combinations of content and identity and
ultimately producing speaker identity representations that are more robust. We
train our method on a large-scale audio-visual dataset of talking heads `in the
wild', and demonstrate its efficacy by evaluating the learned speaker
representations for standard speaker recognition performance.
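As a rough illustration of the two-stream idea described in the abstract, the sketch below pairs a shared low-level audio trunk with separate content and identity heads and trains them against face features using two contrastive objectives: a fine-grained synchronisation loss for content and a clip-level matching loss for identity. This is a minimal sketch under assumed choices (PyTorch, log-mel spectrogram inputs, an InfoNCE-style loss, and randomly generated tensors standing in for a face encoder); layer sizes and loss details are illustrative, not the authors' implementation.

```python
# Minimal sketch of a two-stream disentanglement setup (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioTwoStream(nn.Module):
    """Shared low-level audio trunk followed by content and identity heads."""

    def __init__(self, n_mels=40, dim=256):
        super().__init__()
        # Shared trunk over log-mel spectrograms of shape (B, 1, n_mels, T).
        self.trunk = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, stride=(2, 1), padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=(2, 1), padding=1), nn.ReLU(),
        )
        feat = 128 * (n_mels // 4)
        # Content head keeps the time axis (one vector per audio frame).
        self.content_head = nn.Conv1d(feat, dim, kernel_size=1)
        # Identity head pools over time (one vector per utterance).
        self.identity_head = nn.Linear(feat, dim)

    def forward(self, spec):
        h = self.trunk(spec)                       # (B, C, F', T)
        h = h.flatten(1, 2)                        # (B, C*F', T)
        content = self.content_head(h)             # (B, dim, T)
        identity = self.identity_head(h.mean(-1))  # (B, dim)
        return F.normalize(content, dim=1), F.normalize(identity, dim=1)


def info_nce(query, keys, temperature=0.07):
    """Contrastive loss: each query should match the key at the same batch index."""
    logits = query @ keys.t() / temperature
    targets = torch.arange(query.size(0), device=query.device)
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    B, T, dim = 8, 100, 256
    audio_net = AudioTwoStream(dim=dim)
    spec = torch.randn(B, 1, 40, T)                         # batch of log-mel spectrograms
    face_frame = F.normalize(torch.randn(B, dim), dim=1)    # stand-in for per-frame face features
    face_identity = F.normalize(torch.randn(B, dim), dim=1) # stand-in for clip-level face identity

    content, identity = audio_net(spec)
    # Content: synchronise one audio time step with the co-occurring face frame.
    sync_loss = info_nce(content[:, :, T // 2], face_frame)
    # Identity: match the utterance-level embedding to the clip's face identity.
    id_loss = info_nce(identity, face_identity)
    (sync_loss + id_loss).backward()
    print(float(sync_loss), float(id_loss))
```

In this kind of setup, the trunk learns features useful to both heads, while the two losses pull the content and identity embeddings toward different, complementary factors of the signal.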
Related papers
- Towards the Next Frontier in Speech Representation Learning Using Disentanglement [34.21745744502759]
We propose a framework for learning disentangled self-supervised representations of speech, termed Learn2Diss, which consists of a frame-level and an utterance-level encoder module.
We show that Learn2Diss achieves state-of-the-art results on a variety of tasks, with the frame-level encoder representations improving semantic tasks and the utterance-level representations improving non-semantic tasks.
arXiv Detail & Related papers (2024-07-02T07:13:35Z) - Self-Supervised Disentangled Representation Learning for Robust Target Speech Extraction [17.05599594354308]
Speech signals are inherently complex as they encompass both global acoustic characteristics and local semantic information.
In the task of target speech extraction, certain elements of global and local semantic information in the reference speech can lead to speaker confusion.
We propose a self-supervised disentangled representation learning method to overcome this challenge.
arXiv Detail & Related papers (2023-12-16T03:48:24Z) - Disentangling Voice and Content with Self-Supervision for Speaker
Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
The framework is validated with experiments on the VoxCeleb and SITW datasets, yielding average reductions of 9.56% in EER and 8.24% in minDCF.
arXiv Detail & Related papers (2023-10-02T12:02:07Z) - Improving Speaker Diarization using Semantic Information: Joint Pairwise
Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information.
We present a novel framework to integrate these constraints into the speaker diarization pipeline.
arXiv Detail & Related papers (2023-09-19T09:13:30Z) - An analysis on the effects of speaker embedding choice in non
auto-regressive TTS [4.619541348328938]
We present a first attempt at understanding how a non-autoregressive factorised multi-speaker speech synthesis architecture exploits the information present in different speaker embedding sets.
We show that, regardless of the embedding set and learning strategy used, the network handles various speaker identities equally well.
arXiv Detail & Related papers (2023-07-19T10:57:54Z) - Disentangling Prosody Representations with Unsupervised Speech
Reconstruction [22.873286925385543]
This paper addresses the disentanglement of emotional prosody from speech using unsupervised reconstruction.
Specifically, we identify, design, implement and integrate three crucial components in our proposed speech reconstruction model Prosody2Vec.
We first pretrain the Prosody2Vec representations on unlabelled emotional speech corpora, then fine-tune the model on specific datasets to perform Speech Emotion Recognition (SER) and Emotional Voice Conversion (EVC) tasks.
arXiv Detail & Related papers (2022-12-14T01:37:35Z) - data2vec: A General Framework for Self-supervised Learning in Speech,
Vision and Language [85.9019051663368]
data2vec is a framework that uses the same learning method for speech, NLP, and computer vision.
The core idea is to predict latent representations of the full input data based on a masked view of the input in a self-distillation setup.
Experiments on the major benchmarks of speech recognition, image classification, and natural language understanding demonstrate a new state of the art or competitive performance.
arXiv Detail & Related papers (2022-02-07T22:52:11Z) - Speaker Normalization for Self-supervised Speech Emotion Recognition [16.044405846513495]
We propose a gradient-based adversary learning framework that learns a speech emotion recognition task while normalizing speaker characteristics from the feature representation.
We demonstrate the efficacy of our method on both speaker-independent and speaker-dependent settings and obtain new state-of-the-art results on the challenging IEMOCAP dataset.
arXiv Detail & Related papers (2022-02-02T19:30:47Z) - LiRA: Learning Visual Speech Representations from Audio through
Self-supervision [53.18768477520411]
We propose Learning visual speech Representations from Audio via self-supervision (LiRA).
Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech.
We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild dataset.
arXiv Detail & Related papers (2021-06-16T23:20:06Z) - Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features.
We extract a speaker representation used for adaptation directly from the test utterance.
arXiv Detail & Related papers (2020-02-14T05:05:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above (including all listed papers) and is not responsible for any consequences arising from its use.