BER: Balanced Error Rate For Speaker Diarization
- URL: http://arxiv.org/abs/2211.04304v1
- Date: Tue, 8 Nov 2022 15:17:39 GMT
- Title: BER: Balanced Error Rate For Speaker Diarization
- Authors: Tao Liu and Kai Yu
- Abstract summary: We propose a Balanced Error Rate (BER) to evaluate speaker diarization.
First, we propose a segment-level error rate (SER) that uses connected sub-graphs and an adaptive IoU threshold to obtain accurate segment matching.
Second, to evaluate diarization in a unified way, we take a speaker-specific harmonic mean of the duration and segment errors, followed by a speaker-weighted average.
- Score: 22.06525303590671
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: DER is the primary metric for evaluating diarization performance, but it faces a dilemma: errors in short utterances or segments tend to be overwhelmed by those in longer ones, even though short segments, e.g., 'yes' or 'no', still carry semantic information. In addition, DER overlooks errors on speakers who talk less. Although JER balances errors across speakers, it suffers from the same dilemma. Considering that duration error, segment error, and speaker-weighted error together constitute a complete diarization evaluation, we propose a Balanced Error Rate (BER) for evaluating speaker diarization. First, we propose a segment-level error rate (SER) that uses connected sub-graphs and an adaptive IoU threshold to obtain accurate segment matching. Second, to evaluate diarization in a unified way, we take a speaker-specific harmonic mean of the duration and segment errors, followed by a speaker-weighted average. Third, we analyze the metric on real datasets with a modularized system, EEND, and a multi-modal method. SER and BER are publicly available at https://github.com/X-LANCE/BER.
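To make the construction above concrete, the sketch below is a minimal illustration, not the official implementation in the linked repository: it assumes simplified per-speaker duration and segment errors, a fixed IoU threshold with greedy matching in place of the paper's connected sub-graph matching with an adaptive threshold, a plain average over speakers in place of the speaker-weighted average, and pre-aligned speaker labels; all function names are hypothetical.

```python
# Minimal sketch of the idea behind a balanced, speaker-weighted error rate.
# NOT the official BER/SER implementation (see https://github.com/X-LANCE/BER):
# it uses a fixed IoU threshold with greedy matching instead of connected
# sub-graphs + adaptive IoU threshold, simplified per-speaker error
# definitions, and assumes reference/hypothesis speaker labels are aligned.

def iou(ref_seg, hyp_seg):
    """Temporal intersection-over-union of two (start, end) segments."""
    inter = max(0.0, min(ref_seg[1], hyp_seg[1]) - max(ref_seg[0], hyp_seg[0]))
    union = (ref_seg[1] - ref_seg[0]) + (hyp_seg[1] - hyp_seg[0]) - inter
    return inter / union if union > 0 else 0.0

def segment_error(ref_segs, hyp_segs, iou_thr=0.5):
    """Fraction of reference segments left unmatched (greedy one-to-one matching)."""
    unmatched, used = 0, set()
    for r in ref_segs:
        best, best_j = 0.0, None
        for j, h in enumerate(hyp_segs):
            if j in used:
                continue
            score = iou(r, h)
            if score > best:
                best, best_j = score, j
        if best >= iou_thr and best_j is not None:
            used.add(best_j)
        else:
            unmatched += 1
    return unmatched / len(ref_segs) if ref_segs else 0.0

def duration_error(ref_segs, hyp_segs):
    """Missed reference speech time as a fraction of total reference time."""
    ref_dur = sum(e - s for s, e in ref_segs)
    missed = 0.0
    for rs, re in ref_segs:
        covered = sum(max(0.0, min(re, he) - max(rs, hs)) for hs, he in hyp_segs)
        missed += max(0.0, (re - rs) - covered)
    return missed / ref_dur if ref_dur > 0 else 0.0

def balanced_error_rate(reference, hypothesis, iou_thr=0.5):
    """Per speaker: harmonic mean of duration and segment errors,
    then a plain average over reference speakers (each speaker counts equally)."""
    scores = []
    for spk, ref_segs in reference.items():
        hyp_segs = hypothesis.get(spk, [])
        d = duration_error(ref_segs, hyp_segs)
        s = segment_error(ref_segs, hyp_segs, iou_thr)
        scores.append(0.0 if d + s == 0 else 2 * d * s / (d + s))
    return sum(scores) / len(scores) if scores else 0.0

# Toy example: {speaker: [(start, end), ...]} in seconds.
ref = {"A": [(0.0, 5.0), (10.0, 10.6)], "B": [(5.0, 10.0)]}
hyp = {"A": [(0.0, 5.2)], "B": [(5.2, 9.8)]}
print(round(balanced_error_rate(ref, hyp), 3))
```

Averaging over speakers rather than over total speech duration is what keeps errors on less-talkative speakers from being washed out, which is the balancing behavior the abstract motivates.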
Related papers
- Speaker Tagging Correction With Non-Autoregressive Language Models [0.0]
We propose a speaker tagging correction system based on a non-autoregressive language model.
We show that the employed error correction approach leads to reductions in word diarization error rate (WDER) on two datasets.
arXiv Detail & Related papers (2024-08-30T11:02:17Z)
- Lexical Speaker Error Correction: Leveraging Language Models for Speaker Diarization Error Correction [4.409889336732851]
Speaker diarization (SD) is typically used with an automatic speech recognition (ASR) system to ascribe speaker labels to recognized words.
This approach can lead to speaker errors, especially around speaker turns and regions of speaker overlap.
We propose a novel second-pass speaker error correction system using lexical information.
arXiv Detail & Related papers (2023-06-15T17:47:41Z)
- ASR Error Detection via Audio-Transcript entailment [1.3750624267664155]
We propose an end-to-end approach for ASR error detection using audio-transcript entailment.
The proposed model utilizes an acoustic encoder and a linguistic encoder to model the speech and transcript respectively.
Our proposed model achieves classification error rates (CER) of 26.2% on all transcription errors and 23% on medical errors specifically, improving on a strong baseline by 12% and 15.4%, respectively.
arXiv Detail & Related papers (2022-07-22T02:47:15Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- End-to-end contextual asr based on posterior distribution adaptation for hybrid ctc/attention system [61.148549738631814]
End-to-end (E2E) speech recognition architectures assemble all components of a traditional speech recognition system into a single model.
Although this simplifies the ASR system, it introduces a drawback for contextual ASR: the E2E model performs worse on utterances containing infrequent proper nouns.
We propose adding a contextual bias attention (CBA) module to an attention-based encoder-decoder (AED) model to improve its ability to recognize contextual phrases.
arXiv Detail & Related papers (2022-02-18T03:26:02Z)
- Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information [55.75018546938499]
We propose the speaker embedding-aware neural diarization (SEND) method, which predicts power-set encoded labels (see the label-encoding sketch after this list).
Our method achieves a lower diarization error rate than target-speaker voice activity detection.
arXiv Detail & Related papers (2021-11-28T12:51:04Z)
- Constructing interval variables via faceted Rasch measurement and multitask deep learning: a hate speech application [63.10266319378212]
We propose a method for measuring complex variables on a continuous, interval spectrum by combining supervised deep learning with the Constructing Measures approach to faceted Rasch item response theory (IRT).
We demonstrate this new method on a dataset of 50,000 social media comments sourced from YouTube, Twitter, and Reddit and labeled by 11,000 U.S.-based Amazon Mechanical Turk workers.
arXiv Detail & Related papers (2020-09-22T02:15:05Z)
- Meta-Learning for Short Utterance Speaker Recognition with Imbalance Length Pairs [65.28795726837386]
We introduce a meta-learning framework for imbalance length pairs.
We train it with a support set of long utterances and a query set of short utterances of varying lengths.
By combining these two learning schemes, our model outperforms existing state-of-the-art speaker verification models.
arXiv Detail & Related papers (2020-04-06T17:53:14Z)
- Statistical Context-Dependent Units Boundary Correction for Corpus-based Unit-Selection Text-to-Speech [1.4337588659482519]
We present a speaker adaptation technique that improves segmentation accuracy for unit-selection Text-To-Speech (TTS) systems.
Unlike conventional techniques for speaker adaptation, we aim to use only context-dependent characteristics extrapolated with linguistic analysis techniques.
arXiv Detail & Related papers (2020-03-05T12:42:13Z)
- End-to-End Neural Diarization: Reformulating Speaker Diarization as Simple Multi-label Classification [45.38809571153867]
We propose End-to-End Neural Diarization (EEND), in which a neural network directly outputs speaker diarization results.
When fed multi-speaker recordings with corresponding speaker segment labels, the model can be easily adapted to real conversations.
arXiv Detail & Related papers (2020-02-24T14:53:32Z)
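The SEND and EEND entries above differ mainly in how frame-level speaker activity is encoded as a training target: per-speaker binary labels (multi-label) versus one class per speaker subset (power-set, single-label). The toy sketch below is purely illustrative, with a made-up two-speaker frame sequence, and is not code from either paper.

```python
# Illustrative only: contrast multi-label frame targets (EEND-style) with
# power-set single-label targets (SEND-style). Speakers and frames are made up.
from itertools import combinations

speakers = ["A", "B"]  # assumed fixed speaker inventory for this toy example

# Multi-label view: one binary activity value per speaker per frame.
multi_label = [
    {"A": 1, "B": 0},  # only A speaks
    {"A": 1, "B": 1},  # overlapped speech
    {"A": 0, "B": 0},  # silence
]

# Power-set view: every subset of speakers becomes one class, so each frame
# carries exactly one label and diarization becomes single-label prediction.
subsets = [frozenset()] + [
    frozenset(c)
    for k in range(1, len(speakers) + 1)
    for c in combinations(speakers, k)
]
class_index = {subset: i for i, subset in enumerate(subsets)}

single_label = [
    class_index[frozenset(s for s, active in frame.items() if active)]
    for frame in multi_label
]
print(single_label)  # [1, 3, 0] -> classes {A}, {A, B}, {} (empty set = silence)
```

With N speakers the power-set view has 2^N classes, so it only stays compact when the number of simultaneous speakers is small.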
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.