A Sidecar Separator Can Convert a Single-Speaker Speech Recognition
System to a Multi-Speaker One
- URL: http://arxiv.org/abs/2302.09908v1
- Date: Mon, 20 Feb 2023 11:09:37 GMT
- Title: A Sidecar Separator Can Convert a Single-Speaker Speech Recognition
System to a Multi-Speaker One
- Authors: Lingwei Meng, Jiawen Kang, Mingyu Cui, Yuejiao Wang, Xixin Wu, Helen
Meng
- Abstract summary: We develop a Sidecar separator to empower a well-trained ASR model for multi-speaker scenarios.
The proposed approach outperforms the previous state-of-the-art by a large margin for the 2-speaker mixed LibriMix dataset.
- Score: 40.16292149818563
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although automatic speech recognition (ASR) can perform well in common
non-overlapping environments, sustaining performance in multi-speaker
overlapping speech recognition remains challenging. Recent research revealed
that an ASR model's encoder captures different levels of information at
different layers -- the lower layers tend to carry more acoustic information,
and the upper layers more linguistic. This inspires us to develop a Sidecar
separator to empower a well-trained ASR model for multi-speaker scenarios by
separating the mixed speech embedding between two suitable layers. We
experimented with a wav2vec 2.0-based ASR model with a Sidecar mounted. By
freezing the parameters of the original model and training only the Sidecar
(8.7 M, 8.4% of all parameters), the proposed approach outperforms the previous
state-of-the-art by a large margin on the 2-speaker mixed LibriMix dataset,
reaching a word error rate (WER) of 10.36%, and obtains comparable results
(7.56%) on the LibriSpeechMix dataset with limited training data.
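To make the mounting scheme concrete, below is a minimal PyTorch sketch of the idea: a small separator inserted between two encoder layers of a frozen single-speaker model, producing one embedding stream per speaker. The mask-estimator design, class name, and layer placement are illustrative assumptions, not the paper's exact Sidecar architecture.
```python
import torch
import torch.nn as nn

class Sidecar(nn.Module):
    """Splits one mixed embedding stream (B, T, D) into num_spk streams."""

    def __init__(self, dim: int, num_spk: int = 2):
        super().__init__()
        self.num_spk = num_spk
        # Hypothetical lightweight temporal-convolutional mask estimator.
        self.mask_net = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(dim, dim * num_spk, kernel_size=1),
        )

    def forward(self, x: torch.Tensor):
        b, t, d = x.shape
        masks = self.mask_net(x.transpose(1, 2))            # (B, D*num_spk, T)
        masks = masks.view(b, self.num_spk, d, t).sigmoid()
        # One masked copy of the mixed embedding per speaker.
        return [x * m.transpose(1, 2) for m in masks.unbind(dim=1)]

# The pretrained model stays frozen; only the Sidecar is trained:
#   for p in pretrained_asr.parameters():
#       p.requires_grad = False
sidecar = Sidecar(dim=768, num_spk=2)  # wav2vec 2.0 Base embeddings are 768-dim
mixed = torch.randn(1, 50, 768)        # embedding taken between two encoder layers
streams = sidecar(mixed)               # two (1, 50, 768) streams, one per speaker
# Each stream then passes through the frozen upper layers and the ASR head,
# decoding one transcript per speaker.
```
Freezing the base model keeps the trainable footprint to the separator alone, which is what makes the reported 8.7 M (8.4%) trainable-parameter budget possible.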
Related papers
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- Cross-Speaker Encoding Network for Multi-Talker Speech Recognition [74.97576062152709]
The Cross-Speaker Encoding (CSE) network addresses limitations of SIMO models by aggregating cross-speaker representations.
The network is integrated with SOT (serialized output training) to leverage the advantages of both SIMO and SISO.
arXiv Detail & Related papers (2024-01-08T16:37:45Z)
- Unified Modeling of Multi-Talker Overlapped Speech Recognition and Diarization with a Sidecar Separator [42.8787280791491]
Multi-talker overlapped speech poses a significant challenge for speech recognition and diarization.
We propose a cost-effective method to convert a single-talker automatic speech recognition system into a multi-talker one.
We incorporate a diarization branch into the Sidecar, allowing for unified modeling of both ASR and diarization with a negligible overhead of only 768 parameters.
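The quoted 768-parameter overhead is consistent with a single bias-free linear voice-activity projection on 768-dimensional embeddings; the sketch below works out that arithmetic. The branch design is an assumption for illustration, not taken from the paper.
```python
import torch
import torch.nn as nn

# Hypothetical diarization branch: a per-frame voice-activity projection
# applied to each of the Sidecar's separated streams. With 768-dim embeddings,
# Linear(768, 1, bias=False) has exactly 768 * 1 = 768 parameters, matching
# the stated overhead (the paper's actual branch may differ).
vad_head = nn.Linear(768, 1, bias=False)
print(sum(p.numel() for p in vad_head.parameters()))  # -> 768

stream = torch.randn(1, 50, 768)          # one separated speaker stream
speech_prob = vad_head(stream).sigmoid()  # (1, 50, 1): per-frame activity
```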
arXiv Detail & Related papers (2023-05-25T17:18:37Z)
- From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
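To illustrate the single-label reformulation, here is a generic power-set encoding: each frame is assigned one class corresponding to the set of simultaneously active speakers, instead of independent per-speaker labels. This encoding is a common way to realize the idea and is assumed here; SEND's exact scheme may differ.
```python
from itertools import combinations

# Map each possible set of simultaneously active speakers to one class index,
# turning per-frame multi-label prediction into single-label classification
# (illustrative; not necessarily SEND's exact scheme).
speakers = ["A", "B", "C"]
classes = [frozenset()]  # class 0: silence
for k in range(1, len(speakers) + 1):
    classes += [frozenset(c) for c in combinations(speakers, k)]
label_of = {s: i for i, s in enumerate(classes)}  # 2**3 = 8 classes in total

print(label_of[frozenset()])            # 0 -> silence
print(label_of[frozenset({"A", "B"})])  # a single class for the A+B overlap
```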
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- The Volcspeech system for the ICASSP 2022 multi-channel multi-party meeting transcription challenge [18.33054364289739]
This paper describes our submission to ICASSP 2022 Multi-channel Multi-party Meeting Transcription (M2MeT) Challenge.
For Track 1, we propose several approaches to empower the clustering-based speaker diarization system.
For Track 2, we develop our system using the Conformer model in a joint CTC-attention architecture.
arXiv Detail & Related papers (2022-02-09T03:38:39Z)
- Transcribe-to-Diarize: Neural Speaker Diarization for Unlimited Number of Speakers using End-to-End Speaker-Attributed ASR [44.181755224118696]
Transcribe-to-Diarize is a new approach for neural speaker diarization that uses an end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR) model.
The proposed method achieves significantly better diarization error rate than various existing speaker diarization methods when the number of speakers is unknown.
arXiv Detail & Related papers (2021-10-07T02:48:49Z)
- MixSpeech: Data Augmentation for Low-resource Automatic Speech Recognition [54.84624870942339]
MixSpeech is a simple yet effective data augmentation method based on mixup for automatic speech recognition (ASR).
We apply MixSpeech on two popular end-to-end speech recognition models including LAS (Listen, Attend and Spell) and Transformer.
Experimental results show that MixSpeech achieves better accuracy than the baseline models without data augmentation.
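Since the method is stated to be mixup-based, a minimal sketch follows: two utterances' features are interpolated with a weight lam, and the losses against both transcripts are combined with the same weights. The function signature, the Beta prior, and the equal-length assumption are standard mixup conventions assumed here, not necessarily MixSpeech's exact recipe.
```python
import torch

def mixup_asr_loss(model, feats_a, text_a, feats_b, text_b, alpha=0.5):
    """Generic mixup for ASR (illustrative; MixSpeech's recipe may differ).

    Mixes the input features of two utterances with weight lam, then combines
    the losses against both transcripts using the same weights. `model` is
    assumed to be a callable returning a scalar loss for (features, text).
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    mixed = lam * feats_a + (1.0 - lam) * feats_b  # assumes equal-length feats
    return lam * model(mixed, text_a) + (1.0 - lam) * model(mixed, text_b)
```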
arXiv Detail & Related papers (2021-02-25T03:40:43Z)
- Streaming Multi-speaker ASR with RNN-T [8.701566919381223]
This work focuses on multi-speaker speech recognition based on a recurrent neural network transducer (RNN-T).
We show that guiding separation with speaker order labels in the former case enhances the high-level speaker tracking capability of RNN-T.
Our best model achieves a WER of 10.2% on simulated 2-speaker Libri data, which is competitive with the previously reported state-of-the-art non-streaming model (10.3%).
arXiv Detail & Related papers (2020-11-23T19:10:40Z)
- Supervised Speaker Embedding De-Mixing in Two-Speaker Environment [37.27421131374047]
Instead of separating a two-speaker signal in signal space, as speech source separation does, a speaker embedding de-mixing approach is proposed.
The proposed approach separates different speaker properties from a two-speaker signal in embedding space.
arXiv Detail & Related papers (2020-01-14T20:13:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.