Large-Scale Pre-Training of End-to-End Multi-Talker ASR for Meeting
Transcription with Single Distant Microphone
- URL: http://arxiv.org/abs/2103.16776v1
- Date: Wed, 31 Mar 2021 02:43:32 GMT
- Title: Large-Scale Pre-Training of End-to-End Multi-Talker ASR for Meeting
Transcription with Single Distant Microphone
- Authors: Naoyuki Kanda, Guoli Ye, Yu Wu, Yashesh Gaur, Xiaofei Wang, Zhong
Meng, Zhuo Chen, Takuya Yoshioka
- Abstract summary: Transcribing meetings containing overlapped speech with only a single distant microphone (SDM) has been one of the most challenging problems for automatic speech recognition (ASR).
In this paper, we extensively investigate a two-step approach where we first pre-train a serialized output training (SOT)-based multi-talker ASR model on large-scale simulated data and then fine-tune it on a small amount of real meeting data.
With fine-tuning on the 70 hours of the AMI-SDM training data, our SOT ASR model achieves a word error rate (WER) of 21.2% for the AMI-SDM evaluation set.
- Score: 43.77139614544301
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transcribing meetings containing overlapped speech with only a single distant
microphone (SDM) has been one of the most challenging problems for automatic
speech recognition (ASR). While various approaches have been proposed, all
previous studies on the monaural overlapped speech recognition problem were
based on either simulation data or small-scale real data. In this paper, we
extensively investigate a two-step approach where we first pre-train a
serialized output training (SOT)-based multi-talker ASR by using large-scale
simulation data and then fine-tune the model with a small amount of real
meeting data. Experiments are conducted by utilizing 75 thousand (K) hours of
our internal single-talker recordings to simulate a total of 900K hours of
multi-talker audio segments for supervised pre-training. With fine-tuning on
the 70 hours of the AMI-SDM training data, our SOT ASR model achieves a word
error rate (WER) of 21.2% for the AMI-SDM evaluation set while automatically
counting speakers in each test segment. This result is not only significantly
better than the previous state-of-the-art WER of 36.4% with oracle utterance
boundary information but also better than the result of a similarly fine-tuned
single-talker ASR model applied to beamformed audio.
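The two-step recipe above rests on simulating overlapped audio with serialized transcriptions from single-talker recordings. Below is a minimal Python sketch of that simulation step, assuming the SOT convention of concatenating the reference transcripts in first-in-first-out (start-time) order, joined by a speaker-change token; the `<sc>` token string, the `simulate_mixture` helper, and all parameter values are illustrative assumptions, not details taken from the paper.

```python
import random
import numpy as np

SC = "<sc>"  # hypothetical speaker-change token for SOT-style labels


def simulate_mixture(utterances, sample_rate=16000, max_delay_sec=5.0):
    """Mix single-talker utterances into one overlapped segment and
    build the serialized (SOT-style) reference transcription.

    utterances: list of (waveform, transcript) pairs, each waveform a
                1-D float array sampled at `sample_rate`.
    Returns (mixture, serialized_transcript).
    """
    # Random start offset per utterance; later speakers may overlap
    # with earlier ones.
    starts = [int(random.uniform(0.0, max_delay_sec) * sample_rate)
              for _ in utterances]

    total_len = max(s + len(w) for s, (w, _) in zip(starts, utterances))
    mixture = np.zeros(total_len, dtype=np.float32)
    for s, (wave, _) in zip(starts, utterances):
        mixture[s:s + len(wave)] += wave  # additive overlap

    # SOT serializes references in first-in-first-out order (sorted by
    # start time), joined by the speaker-change token.
    order = sorted(range(len(utterances)), key=lambda i: starts[i])
    serialized = f" {SC} ".join(utterances[i][1] for i in order)
    return mixture, serialized


if __name__ == "__main__":
    # Two hypothetical single-talker utterances (noise stands in for speech).
    a = (np.random.randn(16000 * 3).astype(np.float32) * 0.01, "hello everyone")
    b = (np.random.randn(16000 * 2).astype(np.float32) * 0.01, "good morning")
    mix, ref = simulate_mixture([a, b])
    print(len(mix), ref)  # e.g. "hello everyone <sc> good morning"
```

At the scale reported above, the same idea would be applied to 75K hours of source audio to produce 900K hours of simulated segments; this sketch covers only additive mixing and label serialization, not gain normalization or reverberation simulation.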
Related papers
- Joint Beamforming and Speaker-Attributed ASR for Real Distant-Microphone Meeting Transcription [18.151884620928936]
State-of-the-art end-to-end speaker-attributed automatic speech recognition (SA-ASR) architectures lack a multichannel noise and reverberation reduction front-end.
We introduce a joint beamforming and SA-ASR approach for real meeting transcription.
arXiv Detail & Related papers (2024-10-29T08:17:31Z)
- A Multimodal Approach to Device-Directed Speech Detection with Large Language Models [41.37311266840156]
We explore whether it is feasible to drop the requirement that users must begin each command with a trigger phrase.
We train classifiers using only acoustic information obtained from the audio waveform.
We take the decoder outputs of an automatic speech recognition system, such as 1-best hypotheses, as input features to a large language model.
arXiv Detail & Related papers (2024-03-21T14:44:03Z)
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- Multimodal Data and Resource Efficient Device-Directed Speech Detection with Large Foundation Models [43.155061160275196]
We explore the possibility of making interactions with virtual assistants more natural by eliminating the need for a trigger phrase.
Our goal is to determine whether a user addressed the virtual assistant based on signals obtained from the streaming audio recorded by the device microphone.
We address this task by combining 1-best hypotheses and decoder signals from an automatic speech recognition system with acoustic representations from an audio encoder.
arXiv Detail & Related papers (2023-12-06T17:29:03Z)
- Adapting Multi-Lingual ASR Models for Handling Multiple Talkers [63.151811561972515]
State-of-the-art large-scale universal speech models (USMs) show decent automatic speech recognition (ASR) performance across multiple domains and languages.
We propose an approach to adapt USMs for multi-talker ASR.
We first develop an enhanced version of serialized output training to jointly perform multi-talker ASR and utterance timestamp prediction.
arXiv Detail & Related papers (2023-05-30T05:05:52Z)
- An Experimental Study on Private Aggregation of Teacher Ensemble Learning for End-to-End Speech Recognition [51.232523987916636]
Differential privacy (DP) is one data-protection avenue to safeguard user information used for training deep models by imposing noisy distortion on private data.
In this work, we extend PATE learning to work with dynamic patterns, namely speech, and perform a first experimental study on ASR to avoid acoustic data leakage.
arXiv Detail & Related papers (2022-10-11T16:55:54Z)
- The RoyalFlush System of Speech Recognition for M2MeT Challenge [5.863625637354342]
This paper describes our RoyalFlush system for the multi-speaker automatic speech recognition (ASR) track of the M2MeT challenge.
We adopted a serialized output training (SOT) based multi-speaker ASR system with large-scale simulation data.
Our system achieved a 12.22% absolute Character Error Rate (CER) reduction on the validation set and 12.11% on the test set; CER is the character-level analogue of WER (see the edit-distance sketch after this list).
arXiv Detail & Related papers (2022-02-03T14:38:26Z)
- BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition [126.5605160882849]
We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency.
We report on the universal benefits gained from using big pre-trained and self-trained models for a large set of downstream tasks.
arXiv Detail & Related papers (2021-09-27T17:59:19Z)
- A Comparative Study of Modular and Joint Approaches for Speaker-Attributed ASR on Monaural Long-Form Audio [45.04646762560459]
Speaker-attributed automatic speech recognition (SA-ASR) is a task to recognize "who spoke what" from multi-talker recordings.
Considering the joint optimization, an end-to-end (E2E) SA-ASR model has recently been proposed with promising results on simulation data.
We present our recent study on the comparison of such modular and joint approaches towards SA-ASR on real monaural recordings.
arXiv Detail & Related papers (2021-07-06T19:36:48Z)
- Audio-visual Multi-channel Recognition of Overlapped Speech [79.21950701506732]
This paper presents an audio-visual multi-channel overlapped speech recognition system featuring a tightly integrated separation front-end and recognition back-end.
Experiments suggest that the proposed multi-channel AVSR system outperforms the baseline audio-only ASR system by up to 6.81% (26.83% relative) and 22.22% (56.87% relative) absolute word error rate (WER) reduction on overlapped speech constructed using either simulation or replaying of the Lip Reading Sentences 2 (LRS2) dataset, respectively.
arXiv Detail & Related papers (2020-05-18T10:31:19Z)
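Several results above are quoted as WER or CER. As a generic reference (a sketch of the standard metric, not code from any of the listed papers), both are the Levenshtein distance between hypothesis and reference divided by reference length; WER counts word tokens and CER counts characters.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance (substitutions + insertions + deletions)
    between two token sequences, computed row by row."""
    prev_row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur_row = [i]
        for j, h in enumerate(hyp, start=1):
            cur_row.append(min(
                prev_row[j] + 1,             # deletion
                cur_row[j - 1] + 1,          # insertion
                prev_row[j - 1] + (r != h),  # substitution (0 if match)
            ))
        prev_row = cur_row
    return prev_row[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by reference length."""
    return edit_distance(ref_tokens, hyp_tokens) / max(len(ref_tokens), 1)


def wer(ref, hyp):
    """Word error rate: tokens are whitespace-separated words."""
    return error_rate(ref.split(), hyp.split())


def cer(ref, hyp):
    """Character error rate: spaces removed for simplicity here;
    conventions vary across benchmarks."""
    return error_rate(list(ref.replace(" ", "")), list(hyp.replace(" ", "")))


if __name__ == "__main__":
    # 1 substitution + 1 insertion over 4 reference words -> WER 0.5
    print(wer("hello everyone good morning", "hello every one good morning"))
```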