Investigation of End-To-End Speaker-Attributed ASR for Continuous
Multi-Talker Recordings
- URL: http://arxiv.org/abs/2008.04546v1
- Date: Tue, 11 Aug 2020 06:41:55 GMT
- Title: Investigation of End-To-End Speaker-Attributed ASR for Continuous
Multi-Talker Recordings
- Authors: Naoyuki Kanda, Xuankai Chang, Yashesh Gaur, Xiaofei Wang, Zhong Meng,
Zhuo Chen, Takuya Yoshioka
- Abstract summary: We extend the prior work by addressing the case where no speaker profile is available.
We perform speaker counting and clustering by using the internal speaker representations of the E2E SA-ASR model.
We also propose a simple modification to the reference labels of the E2E SA-ASR training which helps handle continuous multi-talker recordings well.
- Score: 40.99930744000231
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, an end-to-end (E2E) speaker-attributed automatic speech recognition
(SA-ASR) model was proposed as a joint model of speaker counting, speech
recognition and speaker identification for monaural overlapped speech. It
showed promising results for simulated speech mixtures consisting of various
numbers of speakers. However, the model required prior knowledge of speaker
profiles to perform speaker identification, which significantly limited the
application of the model. In this paper, we extend the prior work by addressing
the case where no speaker profile is available. Specifically, we perform
speaker counting and clustering by using the internal speaker representations
of the E2E SA-ASR model to diarize the utterances of the speakers whose
profiles are missing from the speaker inventory. We also propose a simple
modification to the reference labels of the E2E SA-ASR training which helps
handle continuous multi-talker recordings well. We conduct a comprehensive
investigation of the original E2E SA-ASR and the proposed method on the
monaural LibriCSS dataset. Compared to the original E2E SA-ASR with relevant
speaker profiles, the proposed method achieves a close performance without any
prior speaker knowledge. We also show that the source-target attention in the
E2E SA-ASR model provides information about the start and end times of the
hypotheses.
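The speaker counting and clustering step described in the abstract can be sketched as follows. This is an illustrative outline only, not the authors' implementation: the utterance-level speaker embeddings, the cosine-distance metric, and the threshold value are all assumptions made for illustration.

```python
# Sketch: estimate the number of speakers by clustering utterance-level
# speaker embeddings (e.g. internal representations of an SA-ASR model).
# NOT the paper's implementation; metric and threshold are assumptions.
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def cluster_speakers(embeddings, threshold=0.4):
    """Single-linkage agglomerative clustering with a distance cutoff.
    Returns one cluster label per utterance; the number of distinct
    labels is the estimated speaker count."""
    labels = list(range(len(embeddings)))  # each utterance starts alone
    merged = True
    while merged:
        merged = False
        for i in range(len(embeddings)):
            for j in range(i + 1, len(embeddings)):
                if labels[i] != labels[j] and \
                        cosine_distance(embeddings[i], embeddings[j]) < threshold:
                    old, new = labels[j], labels[i]
                    labels = [new if lab == old else lab for lab in labels]
                    merged = True
    # renumber labels to 0..K-1 in order of first appearance
    remap = {}
    return [remap.setdefault(lab, len(remap)) for lab in labels]

# Toy example: two utterances near (1, 0), one near (0, 1)
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(cluster_speakers(emb))  # prints [0, 0, 1] -> 2 speakers
```

In a real system the embeddings would come from the E2E SA-ASR decoder and utterances whose embeddings match an entry in the speaker inventory would keep that speaker's identity, with clustering applied only to the remaining, profile-less utterances.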
Related papers
- Improving Speaker Assignment in Speaker-Attributed ASR for Real Meeting Applications [18.151884620928936]
We present a novel study aiming to optimize the use of a Speaker-Attributed ASR (SA-ASR) system in real-life scenarios.
We propose a pipeline tailored to real-life applications involving Voice Activity Detection (VAD), Speaker Diarization (SD), and SA-ASR.
arXiv Detail & Related papers (2024-03-11T10:11:29Z)
- One model to rule them all? Towards End-to-End Joint Speaker Diarization and Speech Recognition [50.055765860343286]
This paper presents a novel framework for joint speaker diarization and automatic speech recognition.
The framework, named SLIDAR, can process arbitrary length inputs and can handle any number of speakers.
Experiments performed on monaural recordings from the AMI corpus confirm the effectiveness of the method in both close-talk and far-field speech scenarios.
arXiv Detail & Related papers (2023-10-02T23:03:30Z)
- Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings [53.11450530896623]
This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what".
Our model is based on token-level serialized output training (t-SOT) which was recently proposed to transcribe multi-talker speech in a streaming fashion.
The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model.
arXiv Detail & Related papers (2022-03-30T21:42:00Z)
- Transcribe-to-Diarize: Neural Speaker Diarization for Unlimited Number of Speakers using End-to-End Speaker-Attributed ASR [44.181755224118696]
Transcribe-to-Diarize is a new approach for neural speaker diarization that uses an end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR) model.
The proposed method achieves significantly better diarization error rate than various existing speaker diarization methods when the number of speakers is unknown.
arXiv Detail & Related papers (2021-10-07T02:48:49Z)
- Hypothesis Stitcher for End-to-End Speaker-attributed ASR on Long-form Multi-talker Recordings [42.17790794610591]
An end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR) model was proposed recently to jointly perform speaker counting, speech recognition and speaker identification.
The model achieved a low speaker-attributed word error rate (SA-WER) for monaural overlapped speech comprising an unknown number of speakers.
It has yet to be investigated whether the E2E SA-ASR model works well for recordings that are much longer than samples seen during training.
arXiv Detail & Related papers (2021-01-06T03:36:09Z)
- Minimum Bayes Risk Training for End-to-End Speaker-Attributed ASR [39.36608236418025]
We propose a speaker-attributed minimum Bayes risk (SA-MBR) training method to minimize the speaker-attributed word error rate (SA-WER) over the training data.
Experiments using the LibriSpeech corpus show that the proposed SA-MBR training reduces the SA-WER by 9.0% relative compared with the SA-MMI-trained model.
arXiv Detail & Related papers (2020-11-03T22:28:57Z)
- Speaker Separation Using Speaker Inventories and Estimated Speech [78.57067876891253]
We propose speaker separation using speaker inventories (SSUSI) and speaker separation using estimated speech (SSUES).
By combining the advantages of permutation invariant training (PIT) and speech extraction, SSUSI significantly outperforms conventional approaches.
arXiv Detail & Related papers (2020-10-20T18:15:45Z)
- Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation [51.37980448183019]
We propose Audio ALBERT, a lite version of the self-supervised speech representation model.
We show that Audio ALBERT is capable of achieving competitive performance with those huge models in the downstream tasks.
In probing experiments, we find that the latent representations encode richer phoneme and speaker information than the representations from the last layer.
arXiv Detail & Related papers (2020-05-18T10:42:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.