Adapting Multi-Lingual ASR Models for Handling Multiple Talkers
- URL: http://arxiv.org/abs/2305.18747v1
- Date: Tue, 30 May 2023 05:05:52 GMT
- Title: Adapting Multi-Lingual ASR Models for Handling Multiple Talkers
- Authors: Chenda Li, Yao Qian, Zhuo Chen, Naoyuki Kanda, Dongmei Wang, Takuya Yoshioka, Yanmin Qian, and Michael Zeng
- Abstract summary: State-of-the-art large-scale universal speech models (USMs) show decent automatic speech recognition (ASR) performance across multiple domains and languages.
We propose an approach to adapt USMs for multi-talker ASR.
We first develop an enhanced version of serialized output training to jointly perform multi-talker ASR and utterance timestamp prediction.
- Score: 63.151811561972515
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: State-of-the-art large-scale universal speech models (USMs) show decent
automatic speech recognition (ASR) performance across multiple domains and
languages. However, it remains a challenge for these models to recognize
overlapped speech, which is often seen in meeting conversations. We propose an
approach to adapt USMs for multi-talker ASR. We first develop an enhanced
version of serialized output training to jointly perform multi-talker ASR and
utterance timestamp prediction. That is, we predict the ASR hypotheses for all
speakers, count the speakers, and estimate the utterance timestamps at the same
time. We further introduce a lightweight adapter module to maintain the
multilingual property of the USMs even when we perform the adaptation with only
a single language. Experimental results obtained using the AMI and AliMeeting
corpora show that our proposed approach effectively transfers the USMs to a
strong multilingual multi-talker ASR model with timestamp prediction
capability.
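As an illustration of the serialization idea described in the abstract, here is a minimal sketch of how a multi-talker training target with utterance timestamps might be built. It assumes the usual serialized output training convention of ordering utterances by start time and joining them with a speaker-change token. The token spellings (`<sc>`, `<t_N>`), the 0.1 s frame resolution, and the helper names are hypothetical choices for this example and are not taken from the paper.

```python
from dataclasses import dataclass

SC = "<sc>"  # hypothetical speaker-change token separating utterances


@dataclass
class Utterance:
    start: float  # utterance start time in seconds
    end: float    # utterance end time in seconds
    text: str     # reference transcription


def time_token(seconds: float, resolution: float = 0.1) -> str:
    """Quantize a time in seconds to a frame index and render it as a token."""
    return f"<t_{round(seconds / resolution)}>"


def serialize(utterances: list[Utterance], resolution: float = 0.1) -> str:
    """Build one target string: utterances sorted by start time, each wrapped
    in start/end time tokens, joined by the speaker-change token."""
    ordered = sorted(utterances, key=lambda u: u.start)
    parts = [
        f"{time_token(u.start, resolution)} {u.text} {time_token(u.end, resolution)}"
        for u in ordered
    ]
    return f" {SC} ".join(parts)


def count_utterances(serialized: str) -> int:
    """Recover the number of serialized utterances from the separator count.
    In this sketch each overlapping utterance is assumed to come from a
    different speaker, so this doubles as a speaker count."""
    if not serialized:
        return 0
    return serialized.count(SC) + 1


mixture = [
    Utterance(start=1.23, end=3.0, text="hello there"),
    Utterance(start=0.5, end=2.0, text="good morning"),
]
target = serialize(mixture)
print(target)
# <t_5> good morning <t_20> <sc> <t_12> hello there <t_30>
print(count_utterances(target))
# 2
```

Under these assumptions, a single decoder output carries the transcripts, the utterance count, and the timestamps together, which is the joint prediction the abstract describes.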
Related papers
- Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions [68.98811048970963]
We present a pioneering effort to investigate the capability of large language models (LLMs) in transcribing speech in multi-talker environments.
Our approach uses WavLM and Whisper encoders to extract multi-faceted speech representations that are sensitive to speaker characteristics and semantic context.
Comprehensive experiments reveal the promising performance of our proposed system, MT-LLM, in cocktail party scenarios.
arXiv Detail & Related papers (2024-09-13T07:28:28Z)
- Advancing Multi-talker ASR Performance with Large Language Models [48.52252970956368]
Recognizing overlapping speech from multiple speakers in conversational scenarios is one of the most challenging problems for automatic speech recognition (ASR).
In this paper, we propose an LLM-based SOT approach for multi-talker ASR, leveraging pre-trained speech encoder and LLM.
Our approach surpasses traditional AED-based methods on the simulated dataset LibriMix and achieves state-of-the-art performance on the evaluation set of the real-world dataset AMI.
arXiv Detail & Related papers (2024-08-30T17:29:25Z)
- Efficient Compression of Multitask Multilingual Speech Models [0.0]
DistilWhisper is able to bridge the performance gap in ASR for these languages while retaining the advantages of multitask and multilingual capabilities.
Our approach involves two key strategies: lightweight modular ASR fine-tuning of whisper-small using language-specific experts, and knowledge distillation from whisper-large-v2.
arXiv Detail & Related papers (2024-05-02T03:11:59Z)
- Multilingual DistilWhisper: Efficient Distillation of Multi-task Speech Models via Language-Specific Experts [14.999359332108767]
We propose DistilWhisper to bridge the performance gap in ASR for under-represented languages.
Our approach involves two key strategies: lightweight modular ASR fine-tuning of whisper-small using language-specific experts, and knowledge distillation from whisper-large-v2.
Results demonstrate that our approach is more effective than standard fine-tuning or LoRA adapters.
arXiv Detail & Related papers (2023-11-02T08:37:30Z)
- Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages [76.95115818308918]
We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages.
This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages.
We use multilingual pre-training with random-projection quantization and speech-text modality matching to achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks.
arXiv Detail & Related papers (2023-03-02T07:47:18Z)
- ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion [49.617722668505834]
We show that our approach permits the application of speech synthesis and voice conversion to improve ASR systems using only one target-language speaker during model training.
It is possible to obtain promising ASR training results with our data augmentation method using only a single real speaker in a target language.
arXiv Detail & Related papers (2022-03-29T11:55:30Z)
- Unified Autoregressive Modeling for Joint End-to-End Multi-Talker Overlapped Speech Recognition and Speaker Attribute Estimation [26.911867847630187]
We present a novel modeling method for single-channel multi-talker overlapped automatic speech recognition (ASR) systems.
We propose unified autoregressive modeling for joint end-to-end multi-talker overlapped ASR and speaker attribute estimation.
arXiv Detail & Related papers (2021-07-04T05:47:18Z)
- How Phonotactics Affect Multilingual and Zero-shot ASR Performance [74.70048598292583]
A Transformer encoder-decoder model has been shown to leverage multilingual data well in IPA transcriptions of languages presented during training.
We replace the encoder-decoder with a hybrid ASR system consisting of a separate AM and LM.
We show that the gain from modeling crosslingual phonotactics is limited, and that imposing too strong a model can hurt zero-shot transfer.
arXiv Detail & Related papers (2020-10-22T23:07:24Z)