Assessing the Robustness of Spectral Clustering for Deep Speaker Diarization
- URL: http://arxiv.org/abs/2403.14286v1
- Date: Thu, 21 Mar 2024 10:49:54 GMT
- Title: Assessing the Robustness of Spectral Clustering for Deep Speaker Diarization
- Authors: Nikhil Raghav, Md Sahidullah
- Abstract summary: This study thoroughly examines spectral clustering for both same-domain and cross-domain speaker diarization.
We observe that the performance difference between two different domain conditions can be attributed to the role of spectral clustering.
- Score: 7.052822052763606
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Clustering speaker embeddings is crucial in speaker diarization, but it has not received as much focus as other components. Moreover, the robustness of speaker diarization across various datasets has not been explored when the development and evaluation data come from different domains. To bridge this gap, this study thoroughly examines spectral clustering for both same-domain and cross-domain speaker diarization. Our extensive experiments on two widely used corpora, AMI and DIHARD, reveal the performance trend of speaker diarization in the presence of domain mismatch. We observe that the performance difference between the two domain conditions can be attributed to the role of spectral clustering. In particular, keeping all other modules unchanged, we show that differences in optimal tuning parameters as well as in speaker count estimation originate from the mismatch. This study opens several future directions for speaker diarization research.
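A minimal sketch of the spectral clustering stage discussed in the abstract, assuming cosine affinities, percentile-based affinity refinement, and eigengap-based speaker counting. These are common choices in diarization pipelines; the specific parameters and refinement rule here are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_cluster(embeddings, p_percentile=0.95, max_speakers=10):
    """Cluster speaker embeddings with spectral clustering.

    The percentile thresholding of the affinity matrix and the
    eigengap-based speaker counting are typical tuning knobs in
    diarization pipelines (the kind of parameters the abstract notes
    are sensitive to domain mismatch); the values here are illustrative.
    """
    # Cosine affinity between L2-normalised embeddings, clipped to [0, 1].
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    A = np.clip(X @ X.T, 0.0, None)

    # Row-wise refinement: keep strong affinities, damp the rest.
    thresh = np.percentile(A, p_percentile * 100, axis=1, keepdims=True)
    A = np.where(A >= thresh, A, A * 0.01)
    A = (A + A.T) / 2  # re-symmetrise after row-wise thresholding

    # Unnormalised graph Laplacian L = D - A.
    D = np.diag(A.sum(axis=1))
    L = D - A
    vals, vecs = eigh(L)  # eigenvalues in ascending order

    # Estimate the speaker count from the largest eigengap among the
    # smallest eigenvalues: k near-zero eigenvalues ~ k clusters.
    gaps = np.diff(vals[:max_speakers + 1])
    k = int(np.argmax(gaps)) + 1

    # K-means on the first k eigenvectors yields the speaker labels.
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vecs[:, :k])
    return labels, k
```

Under domain mismatch, the optimal `p_percentile` and the reliability of the eigengap-based count can both shift, which is consistent with the abstract's observation that tuning parameters and speaker count estimation are where the mismatch shows up.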
Related papers
- MK-SGC-SC: Multiple Kernel Guided Sparse Graph Construction in Spectral Clustering for Unsupervised Speaker Diarization [25.78243411853038]
Speaker diarization aims to segment audio recordings into regions corresponding to individual speakers.
In this work, we share the notable observation that measuring multiple kernel similarities of speaker embeddings is sufficient to craft a sparse graph for spectral clustering.
Experiments show the proposed approach excels in unsupervised speaker diarization over a variety of challenging environments in the DIHARD-III, AMI, and VoxConverse corpora.
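The sparse-graph idea in this entry can be sketched roughly as follows. The kernel bank (RBF kernels at several bandwidths), the averaging rule, and the top-k sparsification are illustrative assumptions for this sketch, not the paper's exact method:

```python
import numpy as np

def multi_kernel_sparse_affinity(embeddings, gammas=(0.5, 1.0, 2.0), keep_top=5):
    """Illustrative sketch: fuse several RBF kernels over speaker embeddings
    and keep only each row's strongest entries to obtain a sparse affinity
    graph suitable for spectral clustering."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Pairwise squared Euclidean distances between normalised embeddings.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Average multiple RBF kernels with different bandwidths.
    A = np.mean([np.exp(-g * sq_dists) for g in gammas], axis=0)
    np.fill_diagonal(A, 0.0)
    # Sparsify: keep each row's keep_top strongest affinities.
    mask = np.zeros_like(A, dtype=bool)
    idx = np.argsort(A, axis=1)[:, -keep_top:]
    np.put_along_axis(mask, idx, True, axis=1)
    # Symmetric sparsification: keep an edge if either endpoint selected it.
    return np.where(mask | mask.T, A, 0.0)
```

The resulting sparse symmetric matrix can be fed to any standard spectral clustering routine in place of a dense affinity matrix.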
arXiv Detail & Related papers (2026-01-24T09:51:08Z) - Robustness of Speech Separation Models for Similar-pitch Speakers [14.941946672578863]
Single-channel speech separation is a crucial task for enhancing speech recognition systems in multi-speaker environments.
This paper investigates the robustness of state-of-the-art Neural Network models in scenarios where the pitch differences between speakers are minimal.
arXiv Detail & Related papers (2024-07-22T15:55:08Z) - In search of strong embedding extractors for speaker diarisation [49.7017388682077]
We tackle two key problems when adopting embedding extractors (EEs) for speaker diarisation.
First, the evaluation is not straightforward because the features required for better performance differ between speaker verification and diarisation.
We show that better performance on widely adopted speaker verification evaluation protocols does not lead to better diarisation performance.
To alleviate the second problem, we propose two data augmentation techniques that make embedding extractors aware of overlapped speech and speaker-change inputs.
arXiv Detail & Related papers (2022-10-26T13:00:29Z) - Self-supervised Speaker Diarization [19.111219197011355]
This study proposes an entirely unsupervised deep-learning model for speaker diarization.
Speaker embeddings are represented by an encoder trained in a self-supervised fashion using pairs of adjacent segments assumed to be of the same speaker.
arXiv Detail & Related papers (2022-04-08T16:27:14Z) - Disentangled dimensionality reduction for noise-robust speaker diarisation [30.383712356205084]
Speaker embeddings play a crucial role in the performance of diarisation systems.
Speaker embeddings often capture spurious information such as noise and reverberation, adversely affecting performance.
We propose a novel dimensionality reduction framework that can disentangle spurious information from the speaker embeddings.
We also propose the use of a speech/non-speech indicator to prevent the speaker code from learning from the background noise.
arXiv Detail & Related papers (2021-10-07T12:19:09Z) - End-to-End Speaker Diarization as Post-Processing [64.12519350944572]
Clustering-based diarization methods partition frames into as many clusters as there are speakers.
Some end-to-end diarization methods can handle overlapping speech by treating the problem as multi-label classification.
We propose to use a two-speaker end-to-end diarization method as post-processing of the results obtained by a clustering-based method.
arXiv Detail & Related papers (2020-12-18T05:31:07Z) - Self-supervised Text-independent Speaker Verification using Prototypical Momentum Contrastive Learning [58.14807331265752]
We show that better speaker embeddings can be learned by momentum contrastive learning.
We generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled.
arXiv Detail & Related papers (2020-12-13T23:23:39Z) - DEAAN: Disentangled Embedding and Adversarial Adaptation Network for Robust Speaker Representation Learning [69.70594547377283]
We propose a novel framework to disentangle speaker-related and domain-specific features.
Our framework can effectively generate more speaker-discriminative and domain-invariant speaker representations.
arXiv Detail & Related papers (2020-12-12T19:46:56Z) - Single channel voice separation for unknown number of speakers under reverberant and noisy settings [106.48335929548875]
We present a unified network for voice separation of an unknown number of speakers.
The proposed approach is composed of several separation heads optimized together with a speaker classification branch.
We present a new noisy and reverberant dataset of up to five different speakers speaking simultaneously.
arXiv Detail & Related papers (2020-11-04T14:59:14Z) - DNN Speaker Tracking with Embeddings [0.0]
We propose a novel embedding-based speaker tracking method.
Our design is based on a convolutional neural network that mimics a typical speaker verification PLDA.
To make the baseline system similar to speaker tracking, non-target speakers were added to the recordings.
arXiv Detail & Related papers (2020-07-13T18:40:14Z) - Improving speaker discrimination of target speech extraction with time-domain SpeakerBeam [100.95498268200777]
SpeakerBeam exploits an adaptation utterance of the target speaker to extract his/her voice characteristics.
SpeakerBeam sometimes fails when speakers have similar voice characteristics, such as in same-gender mixtures.
We show experimentally that these strategies greatly improve speech extraction performance, especially for same-gender mixtures.
arXiv Detail & Related papers (2020-01-23T05:36:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality or accuracy of the listed information and is not responsible for any consequences of its use.