Minimum Bayes Risk Training for End-to-End Speaker-Attributed ASR
- URL: http://arxiv.org/abs/2011.02921v1
- Date: Tue, 3 Nov 2020 22:28:57 GMT
- Title: Minimum Bayes Risk Training for End-to-End Speaker-Attributed ASR
- Authors: Naoyuki Kanda, Zhong Meng, Liang Lu, Yashesh Gaur, Xiaofei Wang, Zhuo
Chen, Takuya Yoshioka
- Abstract summary: We propose a speaker-attributed minimum Bayes risk (SA-MBR) training method to minimize the speaker-attributed word error rate (SA-WER) over the training data.
Experiments using the LibriSpeech corpus show that the proposed SA-MBR training reduces the SA-WER by 9.0% relative compared with the SA-MMI-trained model.
- Score: 39.36608236418025
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, an end-to-end speaker-attributed automatic speech recognition (E2E
SA-ASR) model was proposed as a joint model of speaker counting, speech
recognition and speaker identification for monaural overlapped speech. In the
previous study, the model parameters were trained based on the
speaker-attributed maximum mutual information (SA-MMI) criterion, with which
the joint posterior probability for multi-talker transcription and speaker
identification is maximized over the training data. Although SA-MMI training
showed promising results for overlapped speech consisting of various numbers of
speakers, the training criterion was not directly linked to the final
evaluation metric, i.e., speaker-attributed word error rate (SA-WER). In this
paper, we propose a speaker-attributed minimum Bayes risk (SA-MBR) training
method where the parameters are trained to directly minimize the expected
SA-WER over the training data. Experiments using the LibriSpeech corpus show
that the proposed SA-MBR training reduces the SA-WER by 9.0% relative compared
with the SA-MMI-trained model.
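The core idea of MBR-style training is to weight an error metric (here SA-WER) by the model's posterior over an n-best list of hypotheses and minimize that expectation. The following is a minimal sketch of such an expected-risk computation, not the authors' implementation; the function names and the simple exact-match error metric in the test are illustrative assumptions.

```python
import math

def mbr_loss(hypotheses, reference, sa_wer):
    """Expected SA-WER (Bayes risk) over an n-best list.

    hypotheses: list of (log_score, hypothesis) pairs from the model
    reference:  the speaker-attributed reference transcription
    sa_wer:     function(hypothesis, reference) -> error rate
    """
    # Normalize the model's log scores into a posterior over the n-best list
    # (softmax with the max subtracted for numerical stability).
    log_scores = [s for s, _ in hypotheses]
    m = max(log_scores)
    weights = [math.exp(s - m) for s in log_scores]
    z = sum(weights)
    posteriors = [w / z for w in weights]
    # Bayes risk: posterior-weighted average of the error metric.
    return sum(p * sa_wer(h, reference)
               for p, (_, h) in zip(posteriors, hypotheses))
```

In an actual training loop this quantity would be differentiated with respect to the model parameters through the posterior term, so that hypotheses with lower SA-WER receive higher probability mass.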
Related papers
- Improving Speaker Assignment in Speaker-Attributed ASR for Real Meeting Applications [18.151884620928936]
We present a novel study aiming to optimize the use of a Speaker-Attributed ASR (SA-ASR) system in real-life scenarios.
We propose a pipeline tailored to real-life applications involving Voice Activity Detection (VAD), Speaker Diarization (SD), and SA-ASR.
arXiv Detail & Related papers (2024-03-11T10:11:29Z)
- Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings [53.11450530896623]
This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what".
Our model is based on token-level serialized output training (t-SOT) which was recently proposed to transcribe multi-talker speech in a streaming fashion.
The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model.
arXiv Detail & Related papers (2022-03-30T21:42:00Z)
- WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM is built based on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
arXiv Detail & Related papers (2021-10-26T17:55:19Z)
- UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training [72.004873454347]
Two methods are introduced for enhancing the unsupervised speaker information extraction.
Experiment results on SUPERB benchmark show that the proposed system achieves state-of-the-art performance.
We scale up training dataset to 94 thousand hours public audio data and achieve further performance improvement.
arXiv Detail & Related papers (2021-10-12T05:43:30Z)
- Transcribe-to-Diarize: Neural Speaker Diarization for Unlimited Number of Speakers using End-to-End Speaker-Attributed ASR [44.181755224118696]
Transcribe-to-Diarize is a new approach for neural speaker diarization that uses an end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR) model.
The proposed method achieves significantly better diarization error rate than various existing speaker diarization methods when the number of speakers is unknown.
arXiv Detail & Related papers (2021-10-07T02:48:49Z)
- A Comparative Study of Modular and Joint Approaches for Speaker-Attributed ASR on Monaural Long-Form Audio [45.04646762560459]
Speaker-attributed automatic speech recognition (SA-ASR) is a task to recognize "who spoke what" from multi-talker recordings.
Considering the joint optimization, an end-to-end (E2E) SA-ASR model has recently been proposed with promising results on simulation data.
We present our recent study on the comparison of such modular and joint approaches towards SA-ASR on real monaural recordings.
arXiv Detail & Related papers (2021-07-06T19:36:48Z)
- Hypothesis Stitcher for End-to-End Speaker-attributed ASR on Long-form Multi-talker Recordings [42.17790794610591]
An end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR) model was proposed recently to jointly perform speaker counting, speech recognition and speaker identification.
The model achieved a low speaker-attributed word error rate (SA-WER) for monaural overlapped speech comprising an unknown number of speakers.
It has yet to be investigated whether the E2E SA-ASR model works well for recordings that are much longer than samples seen during training.
arXiv Detail & Related papers (2021-01-06T03:36:09Z)
- Self-supervised Text-independent Speaker Verification using Prototypical Momentum Contrastive Learning [58.14807331265752]
We show that better speaker embeddings can be learned by momentum contrastive learning.
We generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled.
arXiv Detail & Related papers (2020-12-13T23:23:39Z)
- Investigation of End-To-End Speaker-Attributed ASR for Continuous Multi-Talker Recordings [40.99930744000231]
We extend the prior work by addressing the case where no speaker profile is available.
We perform speaker counting and clustering by using the internal speaker representations of the E2E SA-ASR model.
We also propose a simple modification to the reference labels of the E2E SA-ASR training which helps handle continuous multi-talker recordings well.
arXiv Detail & Related papers (2020-08-11T06:41:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.