A Comparative Study of Modular and Joint Approaches for
Speaker-Attributed ASR on Monaural Long-Form Audio
- URL: http://arxiv.org/abs/2107.02852v1
- Date: Tue, 6 Jul 2021 19:36:48 GMT
- Title: A Comparative Study of Modular and Joint Approaches for
Speaker-Attributed ASR on Monaural Long-Form Audio
- Authors: Naoyuki Kanda, Xiong Xiao, Jian Wu, Tianyan Zhou, Yashesh Gaur,
Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka
- Abstract summary: Speaker-attributed automatic speech recognition (SA-ASR) is a task to recognize "who spoke what" from multi-talker recordings.
Considering the joint optimization, an end-to-end (E2E) SA-ASR model has recently been proposed with promising results on simulation data.
We present our recent study on the comparison of such modular and joint approaches towards SA-ASR on real monaural recordings.
- Score: 45.04646762560459
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speaker-attributed automatic speech recognition (SA-ASR) is a task to
recognize "who spoke what" from multi-talker recordings. An SA-ASR system
usually consists of multiple modules such as speech separation, speaker
diarization and ASR. On the other hand, considering the joint optimization, an
end-to-end (E2E) SA-ASR model has recently been proposed with promising results
on simulation data. In this paper, we present our recent study on the
comparison of such modular and joint approaches towards SA-ASR on real monaural
recordings. We develop state-of-the-art SA-ASR systems for both modular and
joint approaches by leveraging large-scale training data, including 75 thousand
hours of ASR training data and the VoxCeleb corpus for speaker representation
learning. We also propose a new pipeline that performs the E2E SA-ASR model
after speaker clustering. Our evaluation on the AMI meeting corpus reveals that
after fine-tuning with a small amount of real data, the joint system performs
9.2--29.4% better in accuracy than the best modular system, while the modular system
performs better before such fine-tuning. We also conduct various error analyses
to show the remaining issues for the monaural SA-ASR.
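The abstract contrasts a modular pipeline (separation, then diarization, then ASR) with a joint E2E SA-ASR model, and proposes running the E2E model after speaker clustering. The control flow of the two approaches can be sketched as below; every module here is a placeholder stub with a hypothetical name, not the authors' actual models:

```python
# Sketch of the two SA-ASR pipelines compared in the paper.
# All modules are placeholder stubs for illustration only.

def separate(audio):
    """Speech separation stub: returns separated streams."""
    return [audio]

def diarize(streams):
    """Speaker diarization stub: tags each stream with a speaker label."""
    return [(s, f"spk{i}") for i, s in enumerate(streams)]

def asr(stream):
    """ASR stub: transcribes one stream."""
    return "hello world"

def modular_sa_asr(audio):
    """Modular approach: separation -> diarization -> per-stream ASR."""
    return [(spk, asr(stream)) for stream, spk in diarize(separate(audio))]

def cluster_speakers(audio):
    """Speaker clustering stub, run before the joint model in the proposed pipeline."""
    return [audio]

def e2e_sa_asr(segment):
    """Joint E2E SA-ASR stub: emits (speaker, word) pairs directly."""
    return [("spk0", "hello"), ("spk0", "world")]

def clustered_joint_sa_asr(audio):
    """Proposed pipeline: speaker clustering first, then E2E SA-ASR per cluster."""
    out = []
    for segment in cluster_speakers(audio):
        out.extend(e2e_sa_asr(segment))
    return out
```

The structural difference is where speaker attribution happens: the modular chain assigns speakers in a separate diarization stage, while the joint model emits speaker-attributed tokens in one pass.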
Related papers
- Joint Beamforming and Speaker-Attributed ASR for Real Distant-Microphone Meeting Transcription [18.151884620928936]
State-of-the-art end-to-end speaker-attributed automatic speech recognition (SA-ASR) architectures lack a multichannel noise and reverberation reduction front-end.
We introduce a joint beamforming and SA-ASR approach for real meeting transcription.
arXiv Detail & Related papers (2024-10-29T08:17:31Z)
- Improving Speaker Assignment in Speaker-Attributed ASR for Real Meeting Applications [18.151884620928936]
We present a novel study aiming to optimize the use of a Speaker-Attributed ASR (SA-ASR) system in real-life scenarios.
We propose a pipeline tailored to real-life applications involving Voice Activity Detection (VAD), Speaker Diarization (SD), and SA-ASR.
arXiv Detail & Related papers (2024-03-11T10:11:29Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation [83.36685075570232]
This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end.
We explore multi-channel separation methods, mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model.
A proposed integration using TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5% word error rate on the reverberant WHAMR! test set.
arXiv Detail & Related papers (2023-07-23T05:39:39Z)
- Analyzing And Improving Neural Speaker Embeddings for ASR [54.30093015525726]
We present our efforts to integrate neural speaker embeddings into a Conformer-based hybrid HMM ASR system.
Our best Conformer-based hybrid ASR system with speaker embeddings achieves 9.0% WER on Hub5'00 and Hub5'01 with training on SWB 300h.
arXiv Detail & Related papers (2023-01-11T16:56:03Z)
- End-to-End Multi-speaker ASR with Independent Vector Analysis [80.83577165608607]
We develop an end-to-end system for multi-channel, multi-speaker automatic speech recognition.
We propose a framework for joint source separation and dereverberation based on independent vector analysis (IVA).
arXiv Detail & Related papers (2022-04-01T05:45:33Z)
- Feature Replacement and Combination for Hybrid ASR Systems [47.74348197215634]
We investigate the usefulness of one of these front-end frameworks, namely wav2vec, for hybrid ASR systems.
In addition to deploying a pre-trained feature extractor, we explore how to make use of an existing acoustic model (AM) trained on the same task with different features.
We obtain a relative improvement of 4% and 6% over our previous best model on the LibriSpeech test-clean and test-other sets.
arXiv Detail & Related papers (2021-04-09T11:04:58Z)
- Large-Scale Pre-Training of End-to-End Multi-Talker ASR for Meeting Transcription with Single Distant Microphone [43.77139614544301]
Transcribing meetings containing overlapped speech with only a single distant microphone (SDM) has been one of the most challenging problems for automatic speech recognition (ASR).
In this paper, we extensively investigate a two-step approach where we first pre-train a serialized output training (SOT)-based multi-talker ASR model.
With fine-tuning on the 70 hours of the AMI-SDM training data, our SOT ASR model achieves a word error rate (WER) of 21.2% for the AMI-SDM evaluation set.
arXiv Detail & Related papers (2021-03-31T02:43:32Z)
- Hypothesis Stitcher for End-to-End Speaker-attributed ASR on Long-form Multi-talker Recordings [42.17790794610591]
An end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR) model was proposed recently to jointly perform speaker counting, speech recognition and speaker identification.
The model achieved a low speaker-attributed word error rate (SA-WER) for monaural overlapped speech comprising an unknown number of speakers.
It has yet to be investigated whether the E2E SA-ASR model works well for recordings that are much longer than samples seen during training.
arXiv Detail & Related papers (2021-01-06T03:36:09Z)
- Minimum Bayes Risk Training for End-to-End Speaker-Attributed ASR [39.36608236418025]
We propose a speaker-attributed minimum Bayes risk (SA-MBR) training method to minimize the speaker-attributed word error rate (SA-WER) over the training data.
Experiments using the LibriSpeech corpus show that the proposed SA-MBR training reduces the SA-WER by 9.0% relative compared with the SA-MMI-trained model.
arXiv Detail & Related papers (2020-11-03T22:28:57Z)
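The "relative" improvements quoted throughout these abstracts (e.g. a 9.0% relative SA-WER reduction) follow the standard relative error-rate reduction formula, sketched below. The absolute WER values in the usage comment are made up for illustration only:

```python
def relative_reduction(baseline_wer, new_wer):
    """Relative error-rate reduction in percent:
    100 * (baseline - new) / baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# Illustrative only: a drop from 10.0% to 9.1% absolute WER
# is a 9.0% relative reduction.
improvement = relative_reduction(10.0, 9.1)
```

Note that a relative reduction is always larger than the corresponding absolute drop in WER, which is why papers typically state which convention they use.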
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.