Self-supervised Text-independent Speaker Verification using Prototypical
Momentum Contrastive Learning
- URL: http://arxiv.org/abs/2012.07178v2
- Date: Sun, 14 Feb 2021 05:46:21 GMT
- Title: Self-supervised Text-independent Speaker Verification using Prototypical
Momentum Contrastive Learning
- Authors: Wei Xia, Chunlei Zhang, Chao Weng, Meng Yu, Dong Yu
- Abstract summary: We show that better speaker embeddings can be learned by momentum contrastive learning.
We generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled.
- Score: 58.14807331265752
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this study, we investigate self-supervised representation learning for
speaker verification (SV). First, we examine a simple contrastive learning
approach (SimCLR) with a momentum contrastive (MoCo) learning framework, where
the MoCo speaker embedding system utilizes a queue to maintain a large set of
negative examples. We show that better speaker embeddings can be learned by
momentum contrastive learning. Next, alternative augmentation strategies are
explored to normalize extrinsic speaker variabilities of two random segments
from the same speech utterance. Specifically, augmentation in the waveform
largely improves the speaker representations for SV tasks. The proposed MoCo
speaker embedding is further improved when a prototypical memory bank is
introduced, which encourages the speaker embeddings to be closer to their
assigned prototypes with an intermediate clustering step. In addition, we
generalize the self-supervised framework to a semi-supervised scenario where
only a small portion of the data is labeled. Comprehensive experiments on the
VoxCeleb dataset demonstrate that our proposed self-supervised approach
achieves competitive performance compared with existing techniques, and can
approach fully supervised results with partially labeled data.
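The MoCo-style training loop the abstract describes can be sketched minimally: a query embedding is scored against one positive key and a large queue of negatives via an InfoNCE loss, the key encoder is updated as a momentum (EMA) copy of the query encoder, and the queue is maintained first-in-first-out. The function names and shapes below are illustrative assumptions, not the authors' code; real systems would operate on encoder outputs from augmented speech segments.

```python
import numpy as np

def moco_infonce_loss(query, key_pos, queue, temperature=0.07):
    """InfoNCE loss for one query embedding against a positive key and a
    queue of K negative embeddings. All vectors are assumed L2-normalized:
    query and key_pos have shape (D,), the queue has shape (K, D)."""
    l_pos = query @ key_pos                           # similarity to positive
    l_neg = queue @ query                             # (K,) similarities to negatives
    logits = np.concatenate(([l_pos], l_neg)) / temperature
    logits -= logits.max()                            # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                          # positive sits at index 0

def momentum_update(theta_q, theta_k, m=0.999):
    """Exponential moving-average update of the key encoder's parameters
    from the query encoder's parameters (MoCo's momentum update)."""
    return m * theta_k + (1 - m) * theta_q

def enqueue_dequeue(queue, new_keys):
    """FIFO queue maintenance: append the newest key embeddings and drop
    an equal number of the oldest, keeping the queue size fixed."""
    return np.concatenate([queue, new_keys])[len(new_keys):]
```

A well-aligned positive (e.g. another segment of the same utterance after augmentation) drives the loss toward zero, while a mismatched key yields a higher loss; the queue supplies many negatives without requiring a huge batch.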
Related papers
- Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering [78.2927924732142]
We propose speaker-invariant clustering (Spin) as a novel self-supervised learning method.
Spin disentangles speaker information and preserves content representations with just 45 minutes of fine-tuning on a single GPU.
arXiv Detail & Related papers (2023-05-18T15:59:36Z)
- Quantitative Evidence on Overlooked Aspects of Enrollment Speaker Embeddings for Target Speaker Separation [14.013049471563141]
Single channel target speaker separation aims at extracting a speaker's voice from a mixture of multiple talkers given an enrollment utterance of that speaker.
A typical deep learning TSS framework consists of an upstream model that obtains enrollment speaker embeddings and a downstream model that performs the separation conditioned on the embeddings.
arXiv Detail & Related papers (2022-10-23T07:08:46Z)
- Label-Efficient Self-Supervised Speaker Verification With Information Maximization and Contrastive Learning [0.0]
We explore self-supervised learning for speaker verification by learning representations directly from raw audio.
Our approach is based on recent information learning frameworks and an intensive data pre-processing step.
arXiv Detail & Related papers (2022-07-12T13:01:55Z)
- Self-supervised Speaker Diarization [19.111219197011355]
This study proposes an entirely unsupervised deep-learning model for speaker diarization.
Speaker embeddings are represented by an encoder trained in a self-supervised fashion using pairs of adjacent segments assumed to be of the same speaker.
arXiv Detail & Related papers (2022-04-08T16:27:14Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Bootstrap Equilibrium and Probabilistic Speaker Representation Learning for Self-supervised Speaker Verification [15.652180150706002]
We propose self-supervised speaker representation learning strategies.
In the front-end, we learn the speaker representations via the bootstrap training scheme with the uniformity regularization term.
In the back-end, the probabilistic speaker embeddings are estimated by maximizing the mutual likelihood score between the speech samples belonging to the same speaker.
arXiv Detail & Related papers (2021-12-16T14:55:44Z)
- UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training [72.004873454347]
Two methods are introduced for enhancing the unsupervised speaker information extraction.
Experiment results on SUPERB benchmark show that the proposed system achieves state-of-the-art performance.
We scale up training dataset to 94 thousand hours public audio data and achieve further performance improvement.
arXiv Detail & Related papers (2021-10-12T05:43:30Z)
- Contrastive Separative Coding for Self-supervised Representation Learning [37.697375719184926]
We propose a self-supervised learning approach, namely Contrastive Separative Coding (CSC).
First, a multi-task separative encoder is built to extract a shared, separable, and discriminative embedding.
Second, we propose a powerful cross-attention mechanism performed over speaker representations across various interfering conditions.
arXiv Detail & Related papers (2021-03-01T07:32:00Z)
- Augmentation adversarial training for self-supervised speaker recognition [49.47756927090593]
We train robust speaker recognition models without speaker labels.
Experiments on VoxCeleb and VOiCES datasets show significant improvements over previous works using self-supervision.
arXiv Detail & Related papers (2020-07-23T15:49:52Z)
- Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario [51.50631198081903]
We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach.
TS-VAD directly predicts the activity of each speaker on each time frame.
Experiments on the CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results.
arXiv Detail & Related papers (2020-05-14T21:24:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed here and is not responsible for any consequences of its use.