From Modular to End-to-End Speaker Diarization
- URL: http://arxiv.org/abs/2407.08752v1
- Date: Thu, 27 Jun 2024 15:09:39 GMT
- Title: From Modular to End-to-End Speaker Diarization
- Authors: Federico Landini,
- Abstract summary: We describe a system based on a Bayesian hidden Markov model used to cluster x-vectors (speaker embeddings obtained with a neural network), known as VBx.
We describe an approach for generating synthetic data which resembles real conversations in terms of speaker turns and overlaps.
We show how this method generating simulated conversations'' allows for better performance than using a previously proposed method for creating simulated mixtures'' when training the popular EEND.
- Score: 3.079020586262228
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Speaker diarization is usually referred to as the task that determines ``who spoke when'' in a recording. Until a few years ago, all competitive approaches were modular. Systems based on this framework reached state-of-the-art performance in most scenarios but had major difficulties dealing with overlapped speech. More recently, the advent of end-to-end models, capable of dealing with all aspects of speaker diarization with a single model and better performing regarding overlapped speech, has brought high levels of attention. This thesis is framed during a period of co-existence of these two trends. We describe a system based on a Bayesian hidden Markov model used to cluster x-vectors (speaker embeddings obtained with a neural network), known as VBx, which has shown remarkable performance on different datasets and challenges. We comment on its advantages and limitations and evaluate results on different relevant corpora. Then, we move towards end-to-end neural diarization (EEND) methods. Due to the need for large training sets for training these models and the lack of manually annotated diarization data in sufficient quantities, the compromise solution consists in generating training data artificially. We describe an approach for generating synthetic data which resembles real conversations in terms of speaker turns and overlaps. We show how this method generating ``simulated conversations'' allows for better performance than using a previously proposed method for creating ``simulated mixtures'' when training the popular EEND with encoder-decoder attractors (EEND-EDA). We also propose a new EEND-based model, which we call DiaPer, and show that it can perform better than EEND-EDA, especially when dealing with many speakers and handling overlapped speech. Finally, we compare both VBx-based and DiaPer systems on a wide variety of corpora and comment on the advantages of each technique.
Related papers
- Multi-modal Adversarial Training for Zero-Shot Voice Cloning [9.823246184635103]
We propose a Transformer encoder-decoder architecture to conditionally discriminate between real and generated speech features.
We introduce our novel adversarial training technique by applying it to a FastSpeech2 acoustic model and training on Libriheavy, a large multi-speaker dataset.
Our model achieves improvements over the baseline in terms of speech quality and speaker similarity.
arXiv Detail & Related papers (2024-08-28T16:30:41Z) - Pre-training Multi-party Dialogue Models with Latent Discourse Inference [85.9683181507206]
We pre-train a model that understands the discourse structure of multi-party dialogues, namely, to whom each utterance is replying.
To fully utilize the unlabeled data, we propose to treat the discourse structures as latent variables, then jointly infer them and pre-train the discourse-aware model.
arXiv Detail & Related papers (2023-05-24T14:06:27Z) - Speaker Embedding-aware Neural Diarization: a Novel Framework for
Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z) - LDNet: Unified Listener Dependent Modeling in MOS Prediction for
Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z) - An Exploration of Self-Supervised Pretrained Representations for
End-to-End Speech Recognition [98.70304981174748]
We focus on the general applications of pretrained speech representations, on advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
arXiv Detail & Related papers (2021-10-09T15:06:09Z) - Unified Autoregressive Modeling for Joint End-to-End Multi-Talker
Overlapped Speech Recognition and Speaker Attribute Estimation [26.911867847630187]
We present a novel modeling method for single-channel multi-talker overlapped automatic speech recognition (ASR) systems.
We propose unified autoregressive modeling for joint end-to-end multi-talker overlapped ASR and speaker attribute estimation.
arXiv Detail & Related papers (2021-07-04T05:47:18Z) - End-to-End Diarization for Variable Number of Speakers with Local-Global
Networks and Discriminative Speaker Embeddings [66.50782702086575]
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings.
The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions.
arXiv Detail & Related papers (2021-05-05T14:55:29Z) - Streaming end-to-end multi-talker speech recognition [34.76106500736099]
We propose the Streaming Unmixing and Recognition Transducer (SURT) for end-to-end multi-talker speech recognition.
Our model employs the Recurrent Neural Network Transducer (RNN-T) as the backbone that can meet various latency constraints.
Based on experiments on the publicly available LibriSpeechMix dataset, we show that HEAT can achieve better accuracy compared with PIT.
arXiv Detail & Related papers (2020-11-26T06:28:04Z) - Integrating end-to-end neural and clustering-based diarization: Getting
the best of both worlds [71.36164750147827]
Clustering-based approaches assign speaker labels to speech regions by clustering speaker embeddings such as x-vectors.
End-to-end neural diarization (EEND) directly predicts diarization labels using a neural network.
We propose a simple but effective hybrid diarization framework that works with overlapped speech and for long recordings containing an arbitrary number of speakers.
arXiv Detail & Related papers (2020-10-26T06:33:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.