Related papers: Leveraging Speaker Embeddings in End-to-End Neural Diarization for Two-Speaker Scenarios

URL: http://arxiv.org/abs/2407.01317v1
Date: Mon, 1 Jul 2024 14:26:28 GMT
Title: Leveraging Speaker Embeddings in End-to-End Neural Diarization for Two-Speaker Scenarios
Authors: Juan Ignacio Alvarez-Trejos, Beltrán Labrador, Alicia Lozano-Diez,
Abstract summary: End-to-end neural speaker diarization systems are able to address the speaker diarization task while effectively handling speech overlap. This work explores the incorporation of speaker information embeddings into the end-to-end systems to enhance the speaker discriminative capabilities.
Score: 0.9094127664014627
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: End-to-end neural speaker diarization systems are able to address the speaker diarization task while effectively handling speech overlap. This work explores the incorporation of speaker information embeddings into the end-to-end systems to enhance the speaker discriminative capabilities, while maintaining their overlap handling strengths. To achieve this, we propose several methods for incorporating these embeddings alongside the acoustic features. Furthermore, we delve into an analysis of the correct handling of silence frames, the window length for extracting speaker embeddings, and the transformer encoder size. The effectiveness of our proposed approach is thoroughly evaluated on the CallHome dataset for the two-speaker diarization task, with results that demonstrate a significant reduction in diarization error rates, achieving a relative improvement of 10.78% compared to the baseline end-to-end model.
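
For readers unfamiliar with how a relative improvement in diarization error rate (DER) is usually computed, the following is a minimal sketch. The function name and the DER values are hypothetical and chosen only to illustrate the arithmetic; they are not taken from the paper, which reports only the relative figure in this abstract.

```python
def relative_improvement(baseline_der: float, system_der: float) -> float:
    """Return the relative DER reduction, in percent, of a system over a baseline.

    Both arguments are error rates in the same unit (for example, percent).
    A positive result means the system makes fewer errors than the baseline.
    """
    if baseline_der <= 0:
        raise ValueError("baseline_der must be positive")
    return (baseline_der - system_der) / baseline_der * 100.0


# Hypothetical values for illustration only (not results from the paper).
baseline = 10.0   # assumed baseline DER, in percent
improved = 8.922  # assumed DER of the improved system, in percent
print(f"{relative_improvement(baseline, improved):.2f}% relative improvement")
```

Note that a relative improvement is measured against the baseline error rate, so it is larger than the absolute difference between the two DER values whenever the baseline DER is below 100%.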

Related papers

Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems. We introduce spoken language understanding modules to extract speaker-related semantic information. We present a novel framework to integrate these constraints into the speaker diarization pipeline.
arXiv Detail & Related papers (2023-09-19T09:13:30Z)
In search of strong embedding extractors for speaker diarisation [49.7017388682077]
We tackle two key problems when adopting embedding extractors (EEs) for speaker diarisation. First, the evaluation is not straightforward because the features required for better performance differ between speaker verification and diarisation. We show that better performance on widely adopted speaker verification evaluation protocols does not lead to better diarisation performance. We propose two data augmentation techniques to alleviate the second problem, making embedding extractors aware of overlapped speech or speaker change input.
arXiv Detail & Related papers (2022-10-26T13:00:29Z)
Multi-scale Speaker Diarization with Dynamic Scale Weighting [14.473173007997751]
We propose a more advanced multi-scale diarization system based on a multi-scale diarization decoder. Our proposed system achieves state-of-the-art performance on the CALLHOME and AMI MixHeadset datasets, with 3.92% and 1.05% diarization error rates, respectively.
arXiv Detail & Related papers (2022-03-30T01:26:31Z)
Disentangled dimensionality reduction for noise-robust speaker diarisation [30.383712356205084]
Speaker embeddings play a crucial role in the performance of diarisation systems. Speaker embeddings often capture spurious information such as noise and reverberation, adversely affecting performance. We propose a novel dimensionality reduction framework that can disentangle spurious information from the speaker embeddings. We also propose the use of a speech/non-speech indicator to prevent the speaker code from learning from the background noise.
arXiv Detail & Related papers (2021-10-07T12:19:09Z)
PL-EESR: Perceptual Loss Based END-TO-END Robust Speaker Representation Extraction [90.55375210094995]
Speech enhancement aims to improve the perceptual quality of the speech signal by suppression of the background noise. We propose an end-to-end deep learning framework, dubbed PL-EESR, for robust speaker representation extraction.
arXiv Detail & Related papers (2021-10-03T07:05:29Z)
Adapting Speaker Embeddings for Speaker Diarisation [30.383712356205084]
The goal of this paper is to adapt speaker embeddings for solving the problem of speaker diarisation. We propose three techniques that can be used to better adapt the speaker embeddings for diarisation: dimensionality reduction, attention-based embedding aggregation, and non-speech clustering. The results demonstrate that all three techniques contribute positively to the performance of the diarisation system, achieving an average relative improvement of 25.07% in terms of diarisation error rate over the baseline.
arXiv Detail & Related papers (2021-04-07T03:04:47Z)
End-to-End Speaker Diarization as Post-Processing [64.12519350944572]
Clustering-based diarization methods partition frames into as many clusters as there are speakers. Some end-to-end diarization methods can handle overlapping speech by treating the problem as multi-label classification. We propose to use a two-speaker end-to-end diarization method as post-processing of the results obtained by a clustering-based method.
arXiv Detail & Related papers (2020-12-18T05:31:07Z)
Identify Speakers in Cocktail Parties with End-to-End Attention [48.96655134462949]
This paper presents an end-to-end system that integrates speech source extraction and speaker identification. We propose a new way to jointly optimize these two parts by max-pooling the speaker predictions along the channel dimension. End-to-end training results in a system that recognizes one speaker in a two-speaker broadcast speech mixture with 99.9% accuracy and both speakers with 93.9% accuracy.
arXiv Detail & Related papers (2020-05-22T22:15:16Z)
Speaker Diarization with Lexical Information [59.983797884955]
This work presents a novel approach to speaker diarization that leverages lexical information provided by automatic speech recognition. We propose a speaker diarization system that can incorporate word-level speaker turn probabilities with speaker embeddings into a speaker clustering process to improve the overall diarization accuracy.
arXiv Detail & Related papers (2020-04-13T17:16:56Z)

This list is automatically generated from the titles and abstracts of the papers on this site.