Investigating Confidence Estimation Measures for Speaker Diarization
- URL: http://arxiv.org/abs/2406.17124v1
- Date: Mon, 24 Jun 2024 20:21:38 GMT
- Title: Investigating Confidence Estimation Measures for Speaker Diarization
- Authors: Anurag Chowdhury, Abhinav Misra, Mark C. Fuhs, Monika Woszczyna,
- Abstract summary: Speaker diarization systems segment a conversation recording based on the speakers' identity.
Speaker diarization errors propagate to, and can adversely affect, downstream systems that rely on the speaker's identity.
One way to mitigate these errors is to provide segment-level diarization confidence scores to downstream systems.
- Score: 4.679826697518427
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speaker diarization systems segment a conversation recording based on the speakers' identity. Such systems can misclassify the speaker of a portion of audio due to a variety of factors, such as speech pattern variation, background noise, and overlapping speech. These errors propagate to, and can adversely affect, downstream systems that rely on the speaker's identity, such as speaker-adapted speech recognition. One of the ways to mitigate these errors is to provide segment-level diarization confidence scores to downstream systems. In this work, we investigate multiple methods for generating diarization confidence scores, including those derived from the original diarization system and those derived from an external model. Our experiments across multiple datasets and diarization systems demonstrate that the most competitive confidence score methods can isolate ~30% of the diarization errors within segments with the lowest ~10% of confidence scores.
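The headline result (about 30% of diarization errors falling within the lowest-confidence 10% of segments) can be read as an error-capture rate. The sketch below shows one plausible way to compute such a figure. It is not the authors' evaluation code: the function name `error_capture_rate`, the per-segment boolean error labels, and the unweighted segment counting (no weighting by duration) are all assumptions made for illustration.

```python
def error_capture_rate(confidences, is_error, fraction=0.10):
    """Share of all erroneous segments found among the lowest-confidence
    `fraction` of segments. Hypothetical helper: segments are counted
    equally, which may differ from the paper's own metric."""
    if len(confidences) != len(is_error):
        raise ValueError("confidences and is_error must have equal length")
    total_errors = sum(1 for e in is_error if e)
    if total_errors == 0:
        return 0.0  # nothing to capture
    # Number of segments in the low-confidence bucket (at least one).
    k = max(1, round(fraction * len(confidences)))
    # Segment indices ordered from least to most confident.
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    captured = sum(1 for i in order[:k] if is_error[i])
    return captured / total_errors


# Toy data (made up): 10 segments, 3 of which are diarization errors.
conf = [0.05, 0.90, 0.80, 0.70, 0.95, 0.60, 0.85, 0.75, 0.65, 0.20]
errs = [True, False, False, True, False, False, False, False, False, True]
print(f"{error_capture_rate(conf, errs, 0.10):.2f}")  # prints 0.33
```

A downstream system could use the same ranking to decide which segments to route for re-checking or to exclude from speaker adaptation.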
Related papers
- Speaker Tagging Correction With Non-Autoregressive Language Models [0.0]
We propose a speaker tagging correction system based on a non-autoregressive language model.
We show that the employed error correction approach leads to reductions in word diarization error rate (WDER) on two datasets.
arXiv Detail & Related papers (2024-08-30T11:02:17Z)
- Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering [78.2927924732142]
We propose speaker-invariant clustering (Spin) as a novel self-supervised learning method.
Spin disentangles speaker information and preserves content representations with just 45 minutes of fine-tuning on a single GPU.
arXiv Detail & Related papers (2023-05-18T15:59:36Z)
- A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units [94.64927912924087]
Existing systems ignore the correlation between prosody and language content, leading to degradation of naturalness in converted speech.
We devise a cascaded modular system leveraging self-supervised discrete speech units as language representation.
Experiments show that our system outperforms previous approaches in naturalness, intelligibility, speaker transferability, and prosody transferability.
arXiv Detail & Related papers (2022-11-12T00:54:09Z)
- Automatic Evaluation of Speaker Similarity [0.0]
We introduce a new automatic evaluation method for speaker similarity assessment, consistent with human perceptual scores.
Our experiments show that we can train a model to predict speaker similarity MUSHRA scores from speaker embeddings with 0.96 accuracy and significant correlation up to 0.78 Pearson score at the utterance level.
arXiv Detail & Related papers (2022-07-01T11:23:16Z)
- Self-supervised Speaker Recognition Training Using Human-Machine Dialogues [22.262550043863445]
We investigate how to pretrain speaker recognition models by leveraging dialogues between customers and smart-speaker devices.
We propose an effective rejection mechanism that selectively learns from dialogues based on their acoustic homogeneity.
Experiments demonstrate that the proposed method provides significant performance improvements, superior to earlier work.
arXiv Detail & Related papers (2022-02-07T19:44:54Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
- End-to-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings [66.50782702086575]
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings.
The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions.
arXiv Detail & Related papers (2021-05-05T14:55:29Z)
- Improving on-device speaker verification using federated learning with privacy [5.321241042620525]
Information on speaker characteristics can be useful as side information in improving speaker recognition accuracy.
This paper investigates how privacy-preserving learning can improve a speaker verification system.
arXiv Detail & Related papers (2020-08-06T13:37:14Z)
- Segment Aggregation for short utterances speaker verification using raw waveforms [47.41124427552161]
We propose a method that compensates for the performance degradation of speaker verification for short utterances.
The proposed method adopts an ensemble-based design to improve the stability and accuracy of speaker verification systems.
arXiv Detail & Related papers (2020-05-07T08:57:22Z)
- Speaker Diarization with Lexical Information [59.983797884955]
This work presents a novel approach for speaker diarization to leverage lexical information provided by automatic speech recognition.
We propose a speaker diarization system that can incorporate word-level speaker turn probabilities with speaker embeddings into a speaker clustering process to improve the overall diarization accuracy.
arXiv Detail & Related papers (2020-04-13T17:16:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.