Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System
- URL: http://arxiv.org/abs/2407.09817v2
- Date: Sat, 24 Aug 2024 17:01:19 GMT
- Title: Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System
- Authors: Lingwei Meng, Jiawen Kang, Yuejiao Wang, Zengrui Jin, Xixin Wu, Xunying Liu, Helen Meng
- Abstract summary: We propose a pioneering approach to tackle joint multi-talker and target-talker speech recognition tasks.
Specifically, we freeze Whisper and plug a Sidecar separator into its encoder to separate the mixed embeddings of multiple talkers.
We deliver acceptable zero-shot performance on multi-talker ASR on the AishellMix Mandarin dataset.
- Score: 73.34663391495616
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-talker speech recognition and target-talker speech recognition, both of which involve transcription in multi-talker contexts, remain significant challenges. However, existing methods rarely attempt to address both tasks simultaneously. In this study, we propose a pioneering approach to empower Whisper, a speech foundation model, to tackle joint multi-talker and target-talker speech recognition tasks. Specifically, (i) we freeze Whisper and plug a Sidecar separator into its encoder to separate the mixed embeddings of multiple talkers; (ii) a Target Talker Identifier is introduced to identify the embedding flow of the target talker on the fly, requiring only a three-second enrollment speech as a cue; (iii) soft prompt tuning for the decoder is explored for better task adaptation. Our method outperforms previous methods on the two- and three-talker LibriMix and LibriSpeechMix datasets for both tasks, and delivers acceptable zero-shot performance on multi-talker ASR on the AishellMix Mandarin dataset.
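To make the pipeline concrete, here is a minimal, hypothetical PyTorch sketch of two components named in the abstract: a mask-based separator operating on a frozen encoder's mixed embeddings, and a Target Talker Identifier reduced to cosine similarity against an enrollment embedding. The mask-based design, module shapes, and similarity-based selection are illustrative assumptions, not the paper's exact implementation; in the actual system the separated streams would be decoded by Whisper's soft-prompt-tuned decoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SidecarSeparator(nn.Module):
    """Lightweight mask-based separator producing one embedding stream per talker."""
    def __init__(self, dim, num_talkers):
        super().__init__()
        # One convolutional mask head per talker over the shared mixed embeddings.
        self.mask_heads = nn.ModuleList(
            [nn.Conv1d(dim, dim, kernel_size=3, padding=1) for _ in range(num_talkers)]
        )

    def forward(self, mixed):
        # mixed: (batch, time, dim) intermediate embeddings from the frozen encoder.
        x = mixed.transpose(1, 2)  # (batch, dim, time) for Conv1d
        masks = [torch.sigmoid(head(x)) for head in self.mask_heads]
        return [(m * x).transpose(1, 2) for m in masks]  # per-talker streams

def pick_target_stream(streams, enrollment_emb):
    # Target Talker Identifier, simplified: choose the stream whose mean
    # embedding is closest (cosine) to the 3-second enrollment embedding.
    scores = torch.stack([
        F.cosine_similarity(s.mean(dim=(0, 1)), enrollment_emb, dim=0)
        for s in streams
    ])
    return streams[int(scores.argmax())]

# Dummy usage: 2 talkers, 100 frames of 512-dim encoder embeddings.
separator = SidecarSeparator(dim=512, num_talkers=2)
streams = separator(torch.randn(1, 100, 512))
target = pick_target_stream(streams, torch.randn(512))
```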
Related papers
- Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions [68.98811048970963]
We present a pioneering effort to investigate the capability of large language models (LLMs) in transcribing speech in multi-talker environments.
Our approach utilizes the WavLM and Whisper encoders to extract multi-faceted speech representations that are sensitive to speaker characteristics and semantic context.
Comprehensive experiments reveal the promising performance of our proposed system, MT-LLM, in cocktail party scenarios.
arXiv Detail & Related papers (2024-09-13T07:28:28Z)
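As a rough illustration of the dual-encoder design described above, the sketch below concatenates WavLM-like and Whisper-like feature streams and projects them into an LLM's embedding width. The dimensions and the single linear adapter are assumptions for illustration; the paper's actual fusion module may differ.

```python
import torch
import torch.nn as nn

class DualEncoderAdapter(nn.Module):
    """Fuses speaker-sensitive and semantic speech features for an LLM."""
    def __init__(self, wavlm_dim=768, whisper_dim=1280, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(wavlm_dim + whisper_dim, llm_dim)

    def forward(self, wavlm_feats, whisper_feats):
        # Both streams: (batch, time, dim), assumed aligned to a common frame rate.
        fused = torch.cat([wavlm_feats, whisper_feats], dim=-1)
        return self.proj(fused)  # soft speech tokens prepended to the text prompt

adapter = DualEncoderAdapter()
speech_tokens = adapter(torch.randn(1, 50, 768), torch.randn(1, 50, 1280))
print(speech_tokens.shape)  # torch.Size([1, 50, 4096])
```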
- DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding [51.32965203977845]
We propose the use of discrete speech units (DSU) instead of continuous-valued speech encoder outputs.
The proposed model shows robust performance on speech inputs from seen/unseen domains and instruction-following capability in spoken question answering.
Our findings suggest that the ASR task and datasets are not crucial in instruction-tuning for spoken question answering tasks.
arXiv Detail & Related papers (2024-06-13T17:28:13Z)
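A common recipe for obtaining discrete speech units is to quantize continuous encoder frames against k-means centroids and collapse consecutive repeats; the sketch below shows that generic recipe, with codebook size and dimensions as placeholders rather than the paper's configuration.

```python
import torch

def to_discrete_units(features, codebook):
    """Map continuous encoder frames to nearest-centroid unit IDs."""
    # features: (time, dim); codebook: (num_units, dim), e.g. k-means centroids.
    unit_ids = torch.cdist(features, codebook).argmin(dim=-1)  # (time,)
    # Collapse consecutive repeats, as is common when tokenizing speech units.
    deduped = [unit_ids[0].item()]
    for u in unit_ids[1:].tolist():
        if u != deduped[-1]:
            deduped.append(u)
    return deduped  # fed to the LLM like ordinary text tokens

units = to_discrete_units(torch.randn(100, 768), torch.randn(500, 768))
```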
- Speaker Mask Transformer for Multi-talker Overlapped Speech Recognition [27.35304346509647]
We introduce speaker labels into an autoregressive transformer-based speech recognition model.
We then propose a novel speaker mask branch to detect the speech segments of individual speakers.
With the proposed model, we can perform both speech recognition and speaker diarization tasks simultaneously.
arXiv Detail & Related papers (2023-12-18T06:29:53Z)
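The summary above does not spell out the mask branch's architecture; as one plausible reading, the hypothetical sketch below scores each encoder frame against a speaker-query vector to produce a per-frame activity mask for that speaker.

```python
import torch
import torch.nn as nn

class SpeakerMaskBranch(nn.Module):
    """Scores each encoder frame for one speaker's activity."""
    def __init__(self, dim=256):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)
        self.frame_proj = nn.Linear(dim, dim)

    def forward(self, speaker_query, encoder_states):
        # speaker_query: (batch, dim), e.g. a decoder state at a speaker-label
        # token; encoder_states: (batch, time, dim).
        q = self.query_proj(speaker_query).unsqueeze(1)   # (batch, 1, dim)
        k = self.frame_proj(encoder_states)               # (batch, time, dim)
        logits = (q * k).sum(-1) / k.shape[-1] ** 0.5     # (batch, time)
        return torch.sigmoid(logits)  # per-frame activity of this speaker

branch = SpeakerMaskBranch()
mask = branch(torch.randn(1, 256), torch.randn(1, 200, 256))
```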
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
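The core training signal in such a two-encoder setup is a symmetric contrastive loss over paired speech and phoneme embeddings; below is a generic InfoNCE sketch of that idea (CTAP's actual frame-level formulation may differ).

```python
import torch
import torch.nn.functional as F

def contrastive_loss(speech_emb, phoneme_emb, temperature=0.07):
    """Symmetric InfoNCE pulling paired speech/phoneme embeddings together."""
    # speech_emb, phoneme_emb: (batch, dim), one positive pair per row.
    s = F.normalize(speech_emb, dim=-1)
    p = F.normalize(phoneme_emb, dim=-1)
    logits = s @ p.t() / temperature        # (batch, batch) similarity matrix
    targets = torch.arange(s.shape[0])      # diagonal entries are the positives
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```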
- Unified Modeling of Multi-Talker Overlapped Speech Recognition and Diarization with a Sidecar Separator [42.8787280791491]
Multi-talker overlapped speech poses a significant challenge for speech recognition and diarization.
We propose a cost-effective method to convert a single-talker automatic speech recognition system into a multi-talker one.
We incorporate a diarization branch into the Sidecar, allowing for unified modeling of both ASR and diarization with a negligible overhead of only 768 parameters.
arXiv Detail & Related papers (2023-05-25T17:18:37Z)
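A branch of exactly 768 parameters is consistent with a single bias-free 768-dimensional linear scorer applied to each separated stream; the sketch below assumes that minimal form (an inference from the reported parameter count, not a confirmed design).

```python
import torch
import torch.nn as nn

class DiarizationBranch(nn.Module):
    """Per-frame speaker-activity scorer for one separated talker stream."""
    def __init__(self, dim=768):
        super().__init__()
        self.score = nn.Linear(dim, 1, bias=False)  # exactly `dim` parameters

    def forward(self, talker_stream):
        # talker_stream: (batch, time, dim), one separated stream per talker.
        return torch.sigmoid(self.score(talker_stream)).squeeze(-1)

branch = DiarizationBranch()
activity = branch(torch.randn(1, 150, 768))  # (1, 150) per-frame probability
print(sum(p.numel() for p in branch.parameters()))  # 768
```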
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
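The masking step described above can be sketched generically: zero out random spectrogram frames and replace random phoneme IDs with a mask token, then train the model to reconstruct both. The mask ratio and mask ID below are placeholder choices, not the paper's settings.

```python
import torch

def mask_for_joint_pretraining(spectrogram, phonemes, mask_id=0, p=0.15):
    """Randomly mask spectrogram frames and phoneme tokens for reconstruction."""
    # spectrogram: (time, mel_bins); phonemes: (seq_len,) integer IDs.
    spec_mask = torch.rand(spectrogram.shape[0]) < p
    phon_mask = torch.rand(phonemes.shape[0]) < p
    masked_spec = spectrogram.clone()
    masked_spec[spec_mask] = 0.0              # zero out masked frames
    masked_phon = phonemes.clone()
    masked_phon[phon_mask] = mask_id          # replace with a [MASK] ID
    return masked_spec, masked_phon, spec_mask, phon_mask

spec, phon, sm, pm = mask_for_joint_pretraining(
    torch.randn(200, 80), torch.randint(1, 100, (50,))
)
```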
- Multi-user VoiceFilter-Lite via Attentive Speaker Embedding [11.321747759474164]
We propose a solution to allow speaker-conditioned speech models to support an arbitrary number of enrolled users in a single pass.
This is achieved by using an attention mechanism on multiple speaker embeddings to compute a single attentive embedding.
With up to four enrolled users, multi-user VoiceFilter-Lite is able to significantly reduce speech recognition and speaker verification errors.
arXiv Detail & Related papers (2021-07-02T17:45:37Z)
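The attentive-embedding idea can be sketched as attention pooling over the enrolled users' speaker embeddings. For brevity this hypothetical version computes one utterance-level query from the input features; a real implementation might attend per frame instead.

```python
import torch
import torch.nn as nn

class AttentiveSpeakerEmbedding(nn.Module):
    """Attends over enrolled speaker embeddings to form one conditioning vector."""
    def __init__(self, emb_dim=256, feat_dim=128):
        super().__init__()
        self.query = nn.Linear(feat_dim, emb_dim)

    def forward(self, frame_feats, enrolled_embs):
        # frame_feats: (batch, time, feat_dim) from the input mixture;
        # enrolled_embs: (batch, num_users, emb_dim), one per enrolled user.
        q = self.query(frame_feats.mean(dim=1, keepdim=True))   # (batch, 1, emb)
        weights = torch.softmax(
            (q @ enrolled_embs.transpose(1, 2)) / q.shape[-1] ** 0.5, dim=-1
        )                                                       # (batch, 1, users)
        return (weights @ enrolled_embs).squeeze(1)             # (batch, emb_dim)

pool = AttentiveSpeakerEmbedding()
attentive = pool(torch.randn(1, 100, 128), torch.randn(1, 4, 256))
```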
- Streaming Multi-talker Speech Recognition with Joint Speaker Identification [77.46617674133556]
SURIT employs the recurrent neural network transducer (RNN-T) as the backbone for both speech recognition and speaker identification.
We validate our idea on the LibriSpeechMix dataset -- a multi-talker dataset derived from LibriSpeech -- and present encouraging results.
arXiv Detail & Related papers (2021-04-05T18:37:33Z)
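To ground the SURIT summary, here is a toy RNN-T-style lattice with two output heads, one for tokens and one for speaker identity. It is a structural sketch only (illustrative sizes, no transducer loss), not SURIT's actual network.

```python
import torch
import torch.nn as nn

class JointASRSpeakerRNNT(nn.Module):
    """RNN-T-style joint network emitting token and speaker logits together."""
    def __init__(self, feat_dim=80, hidden=256, vocab=100, num_speakers=8):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.predictor = nn.Embedding(vocab, hidden)  # stand-in prediction net
        self.joint = nn.Linear(hidden * 2, hidden)
        self.token_head = nn.Linear(hidden, vocab)
        self.speaker_head = nn.Linear(hidden, num_speakers)

    def forward(self, feats, prev_tokens):
        # feats: (batch, time, feat_dim); prev_tokens: (batch, text_len)
        enc, _ = self.encoder(feats)                  # (batch, T, hidden)
        pred = self.predictor(prev_tokens)            # (batch, U, hidden)
        # RNN-T lattice: combine every (T, U) pair of encoder/predictor states.
        grid = torch.cat(
            [enc.unsqueeze(2).expand(-1, -1, pred.shape[1], -1),
             pred.unsqueeze(1).expand(-1, enc.shape[1], -1, -1)], dim=-1
        )
        h = torch.tanh(self.joint(grid))              # (batch, T, U, hidden)
        return self.token_head(h), self.speaker_head(h)

model = JointASRSpeakerRNNT()
tok_logits, spk_logits = model(torch.randn(1, 50, 80), torch.randint(0, 100, (1, 10)))
```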