A Sidecar Separator Can Convert a Single-Speaker Speech Recognition
System to a Multi-Speaker One
- URL: http://arxiv.org/abs/2302.09908v1
- Date: Mon, 20 Feb 2023 11:09:37 GMT
- Title: A Sidecar Separator Can Convert a Single-Speaker Speech Recognition
System to a Multi-Speaker One
- Authors: Lingwei Meng, Jiawen Kang, Mingyu Cui, Yuejiao Wang, Xixin Wu, Helen
Meng
- Abstract summary: We develop a Sidecar separator to empower a well-trained ASR model for multi-speaker scenarios.
The proposed approach outperforms the previous state-of-the-art by a large margin for the 2-speaker mixed LibriMix dataset.
- Score: 40.16292149818563
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although automatic speech recognition (ASR) can perform well in common
non-overlapping environments, sustaining performance in multi-speaker
overlapping speech recognition remains challenging. Recent research revealed
that an ASR model's encoder captures different levels of information at
different layers -- the lower layers tend to have more acoustic information,
and the upper layers more linguistic. This inspires us to develop a Sidecar
separator to empower a well-trained ASR model for multi-speaker scenarios by
separating the mixed speech embedding between two suitable layers. We
experimented with a wav2vec 2.0-based ASR model with a Sidecar mounted. By
freezing the parameters of the original model and training only the Sidecar
(8.7 M, 8.4% of all parameters), the proposed approach outperforms the previous
state-of-the-art by a large margin for the 2-speaker mixed LibriMix dataset,
reaching a word error rate (WER) of 10.36%; and obtains comparable results
(7.56%) for the LibriSpeechMix dataset with limited training data.
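The abstract describes mounting a small trainable separator between a lower (more acoustic) and an upper (more linguistic) stack of a frozen, well-trained encoder, so that mixed-speech embeddings are split into per-speaker streams before the shared upper layers. The following PyTorch sketch illustrates that idea under stated assumptions; the module names, mask-based separator design, and CTC head are illustrative guesses, not the authors' actual implementation.

```python
# Minimal sketch of the Sidecar idea: freeze a pretrained ASR encoder,
# insert a small separator between its lower and upper layer stacks.
# Names (SidecarSeparator, MultiSpeakerASR) and the masking design are assumptions.
import torch
import torch.nn as nn


class SidecarSeparator(nn.Module):
    """Small trainable separator mounted between two frozen encoder stacks.

    Maps the mixed-speech hidden states to one stream per speaker,
    analogous to a masking-based separator operating on embeddings
    rather than waveforms (assumed design).
    """

    def __init__(self, dim: int, num_speakers: int = 2, hidden: int = 256):
        super().__init__()
        # One lightweight mask estimator per speaker.
        self.mask_nets = nn.ModuleList(
            nn.Sequential(
                nn.Linear(dim, hidden), nn.ReLU(),
                nn.Linear(hidden, dim), nn.Sigmoid(),
            )
            for _ in range(num_speakers)
        )

    def forward(self, mixed: torch.Tensor) -> list[torch.Tensor]:
        # mixed: (batch, time, dim) hidden states from the lower encoder stack.
        return [mixed * net(mixed) for net in self.mask_nets]


class MultiSpeakerASR(nn.Module):
    """Frozen single-speaker encoder split into lower/upper stacks,
    with a Sidecar in between; only the Sidecar is trained."""

    def __init__(self, lower: nn.Module, upper: nn.Module,
                 ctc_head: nn.Module, dim: int, num_speakers: int = 2):
        super().__init__()
        self.lower, self.upper, self.ctc_head = lower, upper, ctc_head
        for p in self.parameters():
            p.requires_grad = False  # freeze the well-trained ASR model
        self.sidecar = SidecarSeparator(dim, num_speakers)  # trainable part only

    def forward(self, mixed_features: torch.Tensor) -> list[torch.Tensor]:
        h = self.lower(mixed_features)        # mostly acoustic representation
        streams = self.sidecar(h)             # separate embedding per speaker
        # The upper (more linguistic) stack and output head are shared.
        return [self.ctc_head(self.upper(s)) for s in streams]
```

In this sketch, freezing happens before the Sidecar is created, so only the separator's parameters receive gradients, mirroring the paper's claim that roughly 8.4% of the total parameters are trained.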