Self-supervised Speaker Diarization
- URL: http://arxiv.org/abs/2204.04166v1
- Date: Fri, 8 Apr 2022 16:27:14 GMT
- Title: Self-supervised Speaker Diarization
- Authors: Yehoshua Dissen, Felix Kreuk and Joseph Keshet
- Abstract summary: This study proposes an entirely unsupervised deep-learning model for speaker diarization.
Speaker embeddings are represented by an encoder trained in a self-supervised fashion using pairs of adjacent segments assumed to be of the same speaker.
- Score: 19.111219197011355
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Over the last few years, deep learning has grown in popularity for speaker
verification, identification, and diarization. Inarguably, a significant part
of this success is due to the demonstrated effectiveness of their speaker
representations. These, however, are heavily dependent on large amounts of
annotated data and can be sensitive to new domains. This study proposes an
entirely unsupervised deep-learning model for speaker diarization.
Specifically, the study focuses on generating high-quality neural speaker
representations without any annotated data, as well as on estimating secondary
hyperparameters of the model without annotations.
The speaker embeddings are represented by an encoder trained in a
self-supervised fashion using pairs of adjacent segments assumed to be of the
same speaker. The trained encoder model is then used to self-generate
pseudo-labels to subsequently train a similarity score between different
segments of the same call using probabilistic linear discriminant analysis
(PLDA) and further to learn a clustering stopping threshold. We compared our
model to state-of-the-art unsupervised as well as supervised baselines on the
CallHome benchmarks. According to empirical results, our approach outperforms
unsupervised methods when only two speakers are present in the call, and is
only slightly worse than recent supervised models.
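The pipeline the abstract describes can be made concrete with a short sketch. Below is a minimal, illustrative rendering of the self-supervised encoder stage: pairs of adjacent segments are treated as same-speaker positives. The encoder architecture, the InfoNCE-style loss, and every hyperparameter are assumptions for illustration, not the authors' exact recipe.

```python
# Minimal sketch of self-supervised encoder training on adjacent-segment
# pairs, as the abstract describes. Architecture, loss, and hyperparameters
# are illustrative assumptions, not the paper's exact recipe.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentEncoder(nn.Module):
    """Maps a (batch, frames, n_mels) segment to a unit-norm embedding."""
    def __init__(self, n_mels=40, emb_dim=128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, 256, batch_first=True)
        self.proj = nn.Linear(256, emb_dim)

    def forward(self, x):
        _, h = self.rnn(x)                    # h: (1, batch, 256)
        return F.normalize(self.proj(h[-1]), dim=-1)

def adjacent_pair_loss(emb_a, emb_b, temperature=0.1):
    """InfoNCE-style loss: segment i's adjacent neighbor is its positive,
    and all other segments in the batch act as negatives (an assumption;
    the paper may use a different self-supervised objective)."""
    logits = emb_a @ emb_b.t() / temperature  # (batch, batch)
    return F.cross_entropy(logits, torch.arange(len(emb_a)))

encoder = SegmentEncoder()
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)

seg_a = torch.randn(32, 100, 40)              # toy stand-ins for adjacent
seg_b = torch.randn(32, 100, 40)              # mel-spectrogram segments
loss = adjacent_pair_loss(encoder(seg_a), encoder(seg_b))
loss.backward()
opt.step()
```

On the inference side, the abstract says segments are clustered until a learned stopping threshold is reached. The sketch below uses agglomerative clustering, with plain cosine distance standing in for the paper's PLDA scoring (scikit-learn ships no PLDA); the threshold value is a placeholder for the one the paper estimates without annotations.

```python
# Minimal sketch of threshold-stopped agglomerative clustering of segment
# embeddings. Cosine distance is a stand-in for the paper's PLDA scores.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def diarize(embeddings, stop_threshold):
    """Assign a speaker label to each segment embedding.

    embeddings: (n_segments, emb_dim) array, L2-normalized.
    stop_threshold: cosine-distance value at which merging stops; the
    paper learns this from self-generated pseudo-labels.
    """
    clusterer = AgglomerativeClustering(
        n_clusters=None,                      # let the threshold decide
        distance_threshold=stop_threshold,
        metric="cosine",                      # named 'affinity' in sklearn < 1.2
        linkage="average",
    )
    return clusterer.fit_predict(embeddings)

emb = np.random.randn(10, 128)                # toy embeddings, 10 segments
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
print(diarize(emb, stop_threshold=0.5))       # 0.5 is a placeholder value
```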
Related papers
- Robustness of Speech Separation Models for Similar-pitch Speakers [14.941946672578863]
Single-channel speech separation is a crucial task for enhancing speech recognition systems in multi-speaker environments.
This paper investigates the robustness of state-of-the-art Neural Network models in scenarios where the pitch differences between speakers are minimal.
arXiv Detail & Related papers (2024-07-22T15:55:08Z)
- Label-Efficient Self-Supervised Speaker Verification With Information Maximization and Contrastive Learning [0.0]
We explore self-supervised learning for speaker verification by learning representations directly from raw audio.
Our approach is based on recent information maximization learning frameworks and an intensive data pre-processing step.
arXiv Detail & Related papers (2022-07-12T13:01:55Z)
- Automatic Evaluation of Speaker Similarity [0.0]
We introduce a new automatic evaluation method for speaker similarity assessment, consistent with human perceptual scores.
Our experiments show that we can train a model to predict speaker similarity MUSHRA scores from speaker embeddings with 0.96 accuracy and a significant correlation of up to 0.78 (Pearson) at the utterance level.
arXiv Detail & Related papers (2022-07-01T11:23:16Z)
- Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
These results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in the human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z)
- Raw waveform speaker verification for supervised and self-supervised learning [30.08242210230669]
This paper proposes a new raw waveform speaker verification model that incorporates techniques proven effective for speaker verification.
Under the best performing configuration, the model shows an equal error rate of 0.89%, competitive with state-of-the-art models.
We also explore the proposed model with a self-supervised learning framework and show state-of-the-art performance in this line of research.
arXiv Detail & Related papers (2022-03-16T09:28:03Z)
- Self-supervised Text-independent Speaker Verification using Prototypical Momentum Contrastive Learning [58.14807331265752]
We show that better speaker embeddings can be learned by momentum contrastive learning (a minimal sketch appears after this list).
We generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled.
arXiv Detail & Related papers (2020-12-13T23:23:39Z)
- Learning and Evaluating Representations for Deep One-class Classification [59.095144932794646]
We present a two-stage framework for deep one-class classification.
We first learn self-supervised representations from one-class data, and then build one-class classifiers on the learned representations (see the sketch after this list).
In experiments, we demonstrate state-of-the-art performance on visual domain one-class classification benchmarks.
arXiv Detail & Related papers (2020-11-04T23:33:41Z)
- Open-set Short Utterance Forensic Speaker Verification using Teacher-Student Network with Explicit Inductive Bias [59.788358876316295]
We propose a pipeline solution to improve speaker verification on a small real-world forensic field dataset.
By leveraging large-scale out-of-domain datasets, a knowledge distillation based objective function is proposed for teacher-student learning.
We show that the proposed objective function can efficiently improve the performance of teacher-student learning on short utterances.
arXiv Detail & Related papers (2020-09-21T00:58:40Z)
- Augmentation adversarial training for self-supervised speaker recognition [49.47756927090593]
We train robust speaker recognition models without speaker labels.
Experiments on VoxCeleb and VOiCES datasets show significant improvements over previous works using self-supervision.
arXiv Detail & Related papers (2020-07-23T15:49:52Z)
- Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation [51.37980448183019]
We propose Audio ALBERT, a lite version of the self-supervised speech representation model.
We show that Audio ALBERT achieves performance competitive with much larger pre-trained models in the downstream tasks.
In probing experiments, we find that the latent representations encode richer phoneme and speaker information than the last layer does.
arXiv Detail & Related papers (2020-05-18T10:42:44Z)
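Two of the entries above name their techniques concretely enough to sketch. First, the momentum contrastive learning referenced in the "Prototypical Momentum Contrastive Learning" entry: a momentum (EMA) copy of the encoder produces stable keys for a contrastive loss. The encoder shape, the 0.999 momentum, and the 0.07 temperature below are illustrative assumptions, not that paper's configuration.

```python
# Minimal sketch of momentum contrastive learning: an EMA copy of the
# encoder produces stable keys; matching pairs sit on the diagonal.
import copy
import torch
import torch.nn.functional as F

encoder = torch.nn.Linear(40, 128)            # stand-in speaker encoder
momentum_encoder = copy.deepcopy(encoder)     # EMA copy, never backpropped
for p in momentum_encoder.parameters():
    p.requires_grad = False

def ema_update(m=0.999):
    """Momentum update: the key encoder slowly trails the query encoder."""
    for q_p, k_p in zip(encoder.parameters(), momentum_encoder.parameters()):
        k_p.data = m * k_p.data + (1 - m) * q_p.data

x_query = torch.randn(16, 40)                 # two augmented "views" of a batch
x_key = torch.randn(16, 40)
q = F.normalize(encoder(x_query), dim=-1)
k = F.normalize(momentum_encoder(x_key), dim=-1)
logits = q @ k.t() / 0.07                     # temperature-scaled similarities
loss = F.cross_entropy(logits, torch.arange(16))
loss.backward()
ema_update()
```

Second, the two-stage recipe from the "Learning and Evaluating Representations for Deep One-class Classification" entry: learn representations first, then fit a classical one-class model on them. Random features stand in for a self-supervised encoder here, and OneClassSVM is our stand-in for the second-stage classifier.

```python
# Minimal sketch of two-stage one-class classification.
import numpy as np
from sklearn.svm import OneClassSVM

# Stage 1 (placeholder): representations of in-class training data, which
# a real system would obtain from a self-supervised encoder.
train_repr = np.random.randn(200, 64)

# Stage 2: fit a one-class classifier on the learned representations.
oc = OneClassSVM(nu=0.1, kernel="rbf").fit(train_repr)

# Score new samples: +1 = in-class, -1 = outlier.
test_repr = np.random.randn(5, 64)
print(oc.predict(test_repr))
```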
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.