Multi-talker ASR for an unknown number of sources: Joint training of
source counting, separation and ASR
- URL: http://arxiv.org/abs/2006.02786v3
- Date: Mon, 21 Dec 2020 12:27:40 GMT
- Title: Multi-talker ASR for an unknown number of sources: Joint training of
source counting, separation and ASR
- Authors: Thilo von Neumann, Christoph Boeddeker, Lukas Drude, Keisuke
Kinoshita, Marc Delcroix, Tomohiro Nakatani, Reinhold Haeb-Umbach
- Abstract summary: We develop an end-to-end multi-talker automatic speech recognition system for an unknown number of active speakers.
Our experiments show very promising performance in counting accuracy, source separation and speech recognition.
Our system generalizes well to a larger number of speakers than it ever saw during training.
- Score: 91.87500543591945
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most approaches to multi-talker overlapped speech separation and recognition
assume that the number of simultaneously active speakers is given, but in
realistic situations, it is typically unknown. To cope with this, we extend an
iterative speech extraction system with mechanisms to count the number of
sources and combine it with a single-talker speech recognizer to form the first
end-to-end multi-talker automatic speech recognition system for an unknown
number of active speakers. Our experiments show very promising performance in
counting accuracy, source separation and speech recognition on simulated clean
mixtures from WSJ0-2mix and WSJ0-3mix. Among others, we set a new
state-of-the-art word error rate on the WSJ0-2mix database. Furthermore, our
system generalizes well to a larger number of speakers than it ever saw during
training, as shown in experiments with the WSJ0-4mix database.
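The abstract describes an iterative extraction system whose stopping decision doubles as the source counter: one speaker is extracted per iteration, and a single-talker recognizer transcribes each extracted signal. The following Python sketch illustrates only that control flow under simplifying assumptions; the separator, stop_estimator and asr callables are hypothetical stand-ins, not the authors' actual interfaces.

    # Minimal sketch of an iterative extract-and-count loop (not the authors'
    # code). Each pass peels one speaker off the remaining signal; a stop
    # estimator decides when no active speakers remain, which implicitly
    # counts the sources. How the separator is conditioned on previously
    # extracted sources is simplified here to a residual signal.
    def iterative_multi_talker_asr(mixture, separator, stop_estimator, asr,
                                   max_speakers=5):
        """Transcribe a mixture with an unknown number of active speakers.

        mixture: 1-D array holding a single-channel waveform.
        Returns a list of (source_waveform, transcript) pairs; the list
        length is the estimated speaker count.
        """
        residual = mixture
        results = []
        for _ in range(max_speakers):  # hard cap as a safety net
            # Counting mechanism: stop when the estimator judges that no
            # speaker is still active in the residual.
            if stop_estimator(residual) > 0.5:
                break
            source, residual = separator(residual)  # extract one speaker
            results.append((source, asr(source)))   # single-talker ASR
        return results

In the paper, the counting, separation and recognition components are trained jointly end-to-end rather than as the independent modules shown here.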
Related papers
- Some voices are too common: Building fair speech recognition systems using the Common Voice dataset [2.28438857884398]
We use the French Common Voice dataset to quantify the biases of a pre-trained wav2vec2.0 model toward several demographic groups.
We also run an in-depth analysis of the Common Voice corpus and identify important shortcomings that should be taken into account.
arXiv Detail & Related papers (2023-06-01T11:42:34Z)
- Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
- End-to-End Multi-speaker ASR with Independent Vector Analysis [80.83577165608607]
We develop an end-to-end system for multi-channel, multi-speaker automatic speech recognition.
We propose a framework for joint source separation and dereverberation based on independent vector analysis (IVA).
arXiv Detail & Related papers (2022-04-01T05:45:33Z)
- ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion [49.617722668505834]
We show that our approach permits the application of speech synthesis and voice conversion to improve ASR systems, obtaining promising results with only a single real target-language speaker during model training.
arXiv Detail & Related papers (2022-03-29T11:55:30Z)
- Streaming Multi-talker Speech Recognition with Joint Speaker Identification [77.46617674133556]
SURIT employs the recurrent neural network transducer (RNN-T) as the backbone for both speech recognition and speaker identification.
We validate our idea on LibriSpeechMix, a multi-talker dataset derived from Librispeech, and present encouraging results.
arXiv Detail & Related papers (2021-04-05T18:37:33Z)
- Time-Domain Speech Extraction with Spatial Information and Multi Speaker Conditioning Mechanism [27.19635746008699]
We present a novel multi-channel speech extraction system to simultaneously extract multiple clean individual sources from a mixture.
The proposed method is built on an improved multi-channel time-domain speech separation network.
Experiments on 2-channel WHAMR! data show that the proposed system improves source separation performance by 9% relative over a strong multi-channel baseline.
arXiv Detail & Related papers (2021-02-07T10:11:49Z)
- Multi-task Language Modeling for Improving Speech Recognition of Rare Words [14.745696312889763]
We propose a second-pass system with multi-task learning, utilizing semantic targets (such as intent and slot prediction) to improve speech recognition performance.
Our best ASR system with a multi-task LM achieves a 4.6% relative word error rate reduction (WERR) on rare words compared with an RNN Transducer-only ASR baseline (a minimal sketch of this multi-task setup appears after this list).
arXiv Detail & Related papers (2020-11-23T20:40:44Z)
- Streaming Multi-speaker ASR with RNN-T [8.701566919381223]
This work focuses on multi-speaker speech recognition based on a recurrent neural network transducer (RNN-T).
We show that guiding separation with speaker order labels enhances the high-level speaker tracking capability of RNN-T.
Our best model achieves a WER of 10.2% on simulated 2-speaker Libri data, which is competitive with the previously reported state-of-the-art nonstreaming model (10.3%).
arXiv Detail & Related papers (2020-11-23T19:10:40Z)
- Audio-visual Multi-channel Recognition of Overlapped Speech [79.21950701506732]
This paper presents an audio-visual multi-channel overlapped speech recognition system featuring tightly integrated separation front-end and recognition back-end.
Experiments suggest that the proposed multi-channel AVSR system outperforms the baseline audio-only ASR system by up to 6.81% (26.83% relative) and 22.22% (56.87% relative) absolute word error rate (WER) reduction on overlapped speech constructed using simulation or replaying of the Lip Reading Sentences 2 (LRS2) dataset, respectively.
arXiv Detail & Related papers (2020-05-18T10:31:19Z)
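As a concrete illustration of the multi-task language modeling entry above (rare-word recognition with semantic targets), here is a minimal PyTorch sketch of a second-pass LM trained jointly on next-word prediction, intent classification and slot tagging. The architecture, head names and loss weights are illustrative assumptions, not the cited paper's design.

    import torch
    import torch.nn as nn

    class MultiTaskLM(nn.Module):
        """Shared encoder with an LM head plus semantic heads (sketch)."""
        def __init__(self, vocab_size, num_intents, num_slots, d=256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d)
            self.encoder = nn.LSTM(d, d, batch_first=True)
            self.lm_head = nn.Linear(d, vocab_size)       # next-word prediction
            self.intent_head = nn.Linear(d, num_intents)  # utterance-level intent
            self.slot_head = nn.Linear(d, num_slots)      # per-token slot tags

        def forward(self, tokens):  # tokens: (batch, time) token ids
            h, _ = self.encoder(self.embed(tokens))
            return self.lm_head(h), self.intent_head(h[:, -1]), self.slot_head(h)

    def multi_task_loss(lm_logits, intent_logits, slot_logits,
                        next_tokens, intents, slots,
                        w_intent=0.3, w_slot=0.3):
        """Weighted sum of LM and semantic-target losses (weights assumed)."""
        ce = nn.functional.cross_entropy
        loss = ce(lm_logits.transpose(1, 2), next_tokens)        # LM loss
        loss += w_intent * ce(intent_logits, intents)            # intent loss
        loss += w_slot * ce(slot_logits.transpose(1, 2), slots)  # slot loss
        return loss

The second-pass rescoring step itself is omitted; the sketch only shows how semantic targets such as intent and slot prediction can enter the training objective alongside the language modeling loss.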