Unified Modeling of Multi-Talker Overlapped Speech Recognition and
Diarization with a Sidecar Separator
- URL: http://arxiv.org/abs/2305.16263v1
- Date: Thu, 25 May 2023 17:18:37 GMT
- Title: Unified Modeling of Multi-Talker Overlapped Speech Recognition and
Diarization with a Sidecar Separator
- Authors: Lingwei Meng, Jiawen Kang, Mingyu Cui, Haibin Wu, Xixin Wu, Helen Meng
- Abstract summary: Multi-talker overlapped speech poses a significant challenge for speech recognition and diarization.
We propose a cost-effective method to convert a single-talker automatic speech recognition system into a multi-talker one.
We incorporate a diarization branch into the Sidecar, allowing for unified modeling of both ASR and diarization with a negligible overhead of only 768 parameters.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-talker overlapped speech poses a significant challenge for speech
recognition and diarization. Recent research has indicated that these two tasks
are interdependent and complementary, motivating us to explore a unified modeling
method to address them in the context of overlapped speech. A recent study
proposed a cost-effective method to convert a single-talker automatic speech
recognition (ASR) system into a multi-talker one, by inserting a Sidecar
separator into the frozen, well-trained ASR model. Building on this, we
incorporate a diarization branch into the Sidecar, allowing for unified
modeling of both ASR and diarization with a negligible overhead of only 768
parameters. The proposed method yields better ASR results compared to the
baseline on LibriMix and LibriSpeechMix datasets. Moreover, without
sophisticated customization on the diarization task, our method achieves
acceptable diarization results on the two-speaker subset of CALLHOME with only
a few adaptation steps.
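The architecture described above can be illustrated with a minimal NumPy sketch: a frozen encoder produces mixed-speaker embeddings, a Sidecar separator predicts per-speaker masks that split them into one stream per talker, and a tiny diarization branch reads speaker activity off those masks. All dimensions, function names, and the mask-based formulation here are hypothetical simplifications for illustration, not the paper's actual implementation (which reports only 768 extra parameters for the diarization branch).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): T frames, D-dim embeddings, S speakers.
T, D, S = 10, 16, 2

# Frozen weights of the well-trained single-talker ASR encoder: never updated.
W_enc = rng.normal(size=(D, D))

def frozen_encoder(mixture):
    # Stand-in for the lower layers of the frozen ASR encoder: maps the
    # input mixture to a sequence of mixed-speaker embeddings.
    return np.tanh(mixture @ W_enc)

# Sidecar separator: the main trainable component. It predicts a soft mask
# per speaker over the mixed embedding, splitting one stream into S streams.
W_sep = rng.normal(size=(D, S))

def sidecar(mixed):
    logits = mixed @ W_sep                                          # (T, S)
    masks = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    # One embedding stream per speaker; each would feed the frozen upper layers.
    streams = np.stack([mixed * masks[:, s:s + 1] for s in range(S)])
    return streams, masks

# Diarization branch: a tiny projection on the Sidecar's internal masks,
# yielding per-frame speaker-activity probabilities. Its overhead is just
# S*S + S parameters here, mirroring the paper's "negligible" footprint.
W_dia = rng.normal(size=(S, S))
b_dia = np.zeros(S)

def diarize(masks):
    return 1 / (1 + np.exp(-(masks @ W_dia + b_dia)))               # (T, S)

mixture = rng.normal(size=(T, D))
mixed = frozen_encoder(mixture)
streams, masks = sidecar(mixed)
activity = diarize(masks)
print(streams.shape, activity.shape)  # (2, 10, 16) (10, 2)
```

Because the encoder weights stay frozen, only the Sidecar and the small diarization head would receive gradients during training, which is what makes the conversion cost-effective.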
Related papers
- Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System [73.34663391495616]
We propose a pioneering approach to tackle joint multi-talker and target-talker speech recognition tasks.
Specifically, we freeze Whisper and plug a Sidecar separator into its encoder to separate mixed embedding for multiple talkers.
We achieve acceptable zero-shot multi-talker ASR performance on the AishellMix Mandarin dataset.
arXiv Detail & Related papers (2024-07-13T09:28:24Z)
- Extending Whisper with prompt tuning to target-speaker ASR [18.31992429200396]
Target-speaker automatic speech recognition (ASR) aims to transcribe the desired speech of a target speaker from overlapped utterances.
Most of the existing target-speaker ASR (TS-ASR) methods involve either training from scratch or fully fine-tuning a pre-trained model.
This work leverages prompt tuning, a parameter-efficient fine-tuning approach, to extend Whisper, a large-scale single-talker ASR model, to TS-ASR.
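Prompt tuning, as named above, can be sketched as prepending a small set of trainable "soft prompt" vectors to the input of a frozen model, so that only those vectors are updated during adaptation. This is a minimal illustrative sketch under assumed dimensions; `frozen_model` and all sizes are hypothetical stand-ins, not Whisper's actual interface.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): P prompt vectors, T frames, D dims.
P, T, D = 4, 10, 8

# Frozen single-talker model weights (a stand-in for Whisper's encoder).
W_frozen = rng.normal(size=(D, D))

def frozen_model(x):
    # Never updated during prompt tuning.
    return np.tanh(x @ W_frozen)

# Trainable soft prompt: the only new parameters (P * D of them). Prepending
# it lets the frozen model condition on the target speaker.
prompt = rng.normal(size=(P, D)) * 0.01

def ts_asr_forward(speech_frames):
    x = np.concatenate([prompt, speech_frames], axis=0)  # (P + T, D)
    h = frozen_model(x)
    return h[P:]  # keep only the positions aligned with the speech frames

out = ts_asr_forward(rng.normal(size=(T, D)))
print(out.shape, prompt.size)  # (10, 8) 32
```

The appeal is the parameter count: here only 32 values would be trained, versus retraining or fully fine-tuning the entire frozen model.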
arXiv Detail & Related papers (2023-12-13T11:49:16Z)
- Mixture Encoder Supporting Continuous Speech Separation for Meeting Recognition [15.610658840718607]
We propose a mixture encoder to mitigate the effect of artifacts introduced by the speech separation.
We extend this approach to more natural meeting contexts featuring an arbitrary number of speakers and dynamic overlaps.
Our experiments show state-of-the-art performance on the LibriCSS dataset and highlight the advantages of the mixture encoder.
arXiv Detail & Related papers (2023-09-15T14:57:28Z)
- Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation [83.36685075570232]
This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end.
We explore multi-channel separation methods, mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model.
A proposed integration using TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5% word error rate on the reverberant WHAMR! test set.
arXiv Detail & Related papers (2023-07-23T05:39:39Z)
- Mixture Encoder for Joint Speech Separation and Recognition [15.13598115379631]
Multi-speaker automatic speech recognition is crucial for many real-world applications.
Existing approaches can be divided into modular and end-to-end methods.
End-to-end models process overlapped speech directly in a single, powerful neural network.
arXiv Detail & Related papers (2023-06-21T11:01:31Z)
- A Sidecar Separator Can Convert a Single-Speaker Speech Recognition System to a Multi-Speaker One [40.16292149818563]
We develop a Sidecar separator to empower a well-trained ASR model for multi-speaker scenarios.
The proposed approach outperforms the previous state-of-the-art by a large margin for the 2-speaker mixed LibriMix dataset.
arXiv Detail & Related papers (2023-02-20T11:09:37Z)
- A Comparative Study on Speaker-attributed Automatic Speech Recognition in Multi-party Meetings [53.120885867427305]
Three approaches are evaluated for speaker-attributed automatic speech recognition (SA-ASR) in a meeting scenario.
The WD-SOT approach achieves a 10.7% relative reduction in averaged speaker-dependent character error rate (SD-CER).
The TS-ASR approach also outperforms the FD-SOT approach, bringing a 16.5% relative average SD-CER reduction.
arXiv Detail & Related papers (2022-03-31T06:39:14Z)
- ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion [49.617722668505834]
We show that our approach permits the application of speech synthesis and voice conversion to improve ASR systems using only one target-language speaker during model training.
It is possible to obtain promising ASR training results with our data augmentation method using only a single real speaker in a target language.
arXiv Detail & Related papers (2022-03-29T11:55:30Z)
- Unified Autoregressive Modeling for Joint End-to-End Multi-Talker Overlapped Speech Recognition and Speaker Attribute Estimation [26.911867847630187]
We present a novel modeling method for single-channel multi-talker overlapped automatic speech recognition (ASR) systems.
We propose unified autoregressive modeling for joint end-to-end multi-talker overlapped ASR and speaker attribute estimation.
arXiv Detail & Related papers (2021-07-04T05:47:18Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.