Adapting Multi-Lingual ASR Models for Handling Multiple Talkers
- URL: http://arxiv.org/abs/2305.18747v1
- Date: Tue, 30 May 2023 05:05:52 GMT
- Title: Adapting Multi-Lingual ASR Models for Handling Multiple Talkers
- Authors: Chenda Li, Yao Qian, Zhuo Chen, Naoyuki Kanda, Dongmei Wang, Takuya
Yoshioka, Yanmin Qian, and Michael Zeng
- Abstract summary: State-of-the-art large-scale universal speech models (USMs) show a decent automatic speech recognition (ASR) performance across multiple domains and languages.
We propose an approach to adapt USMs for multi-talker ASR.
We first develop an enhanced version of serialized output training to jointly perform multi-talker ASR and utterance timestamp prediction.
- Score: 63.151811561972515
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: State-of-the-art large-scale universal speech models (USMs) show a decent
automatic speech recognition (ASR) performance across multiple domains and
languages. However, it remains a challenge for these models to recognize
overlapped speech, which is often seen in meeting conversations. We propose an
approach to adapt USMs for multi-talker ASR. We first develop an enhanced
version of serialized output training to jointly perform multi-talker ASR and
utterance timestamp prediction. That is, we predict the ASR hypotheses for all
speakers, count the speakers, and estimate the utterance timestamps at the same
time. We further introduce a lightweight adapter module to maintain the
multilingual property of the USMs even when we perform the adaptation with only
a single language. Experimental results obtained using the AMI and AliMeeting
corpora show that our proposed approach effectively transfers the USMs to a
strong multilingual multi-talker ASR model with timestamp prediction
capability.
Related papers
- AMPS: ASR with Multimodal Paraphrase Supervision [25.566285376879094]
We present a new technique AMPS that augments a multilingual multimodal ASR system with paraphrase-based supervision for improved conversational ASR in multiple languages.
We use paraphrases of the reference transcriptions as additional supervision while training the multimodal ASR model and selectively invoke this paraphrase objective for utterances with poor ASR performance.
Using AMPS with a state-of-the-art multimodal model SeamlessM4T, we obtain significant relative reductions in word error rates (WERs) of up to 5%.
arXiv Detail & Related papers (2024-11-27T14:16:51Z) - MSA-ASR: Efficient Multilingual Speaker Attribution with frozen ASR Models [59.80042864360884]
Speaker-attributed automatic speech recognition (SA-ASR) aims to transcribe speech while assigning transcripts to the corresponding speakers accurately.
This paper introduces a novel approach, leveraging a frozen multilingual ASR model to incorporate speaker attribution into the transcriptions.
arXiv Detail & Related papers (2024-11-27T09:01:08Z) - Advancing Multi-talker ASR Performance with Large Language Models [48.52252970956368]
Recognizing overlapping speech from multiple speakers in conversational scenarios is one of the most challenging problem for automatic speech recognition (ASR)
In this paper, we propose an LLM-based SOT approach for multi-talker ASR, leveraging pre-trained speech encoder and LLM.
Our approach surpasses traditional AED-based methods on the simulated dataset LibriMix and achieves state-of-the-art performance on the evaluation set of the real-world dataset AMI.
arXiv Detail & Related papers (2024-08-30T17:29:25Z) - Efficient Compression of Multitask Multilingual Speech Models [0.0]
DistilWhisper is able to bridge the performance gap in ASR for these languages while retaining the advantages of multitask and multilingual capabilities.
Our approach involves two key strategies: lightweight modular ASR fine-tuning of whisper-small using language-specific experts, and knowledge distillation from whisper-large-v2.
arXiv Detail & Related papers (2024-05-02T03:11:59Z) - Multilingual DistilWhisper: Efficient Distillation of Multi-task Speech
Models via Language-Specific Experts [14.999359332108767]
We propose DistilWhisper to bridge the performance gap in ASR for under-represented languages.
Our approach involves two key strategies: lightweight modular ASR fine-tuning of whisper-small using language-specific experts, and knowledge distillation from whisper-large-v2.
Results demonstrate that our approach is more effective than standard fine-tuning or LoRA adapters.
arXiv Detail & Related papers (2023-11-02T08:37:30Z) - Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages [76.95115818308918]
We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages.
This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages.
We use multilingual pre-training with random-projection quantization and speech-text modality matching to achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks.
arXiv Detail & Related papers (2023-03-02T07:47:18Z) - ASR data augmentation in low-resource settings using cross-lingual
multi-speaker TTS and cross-lingual voice conversion [49.617722668505834]
We show that our approach permits the application of speech synthesis and voice conversion to improve ASR systems using only one target-language speaker during model training.
It is possible to obtain promising ASR training results with our data augmentation method using only a single real speaker in a target language.
arXiv Detail & Related papers (2022-03-29T11:55:30Z) - How Phonotactics Affect Multilingual and Zero-shot ASR Performance [74.70048598292583]
A Transformer encoder-decoder model has been shown to leverage multilingual data well in IPA transcriptions of languages presented during training.
We replace the encoder-decoder with a hybrid ASR system consisting of a separate AM and LM.
We show that the gain from modeling crosslingual phonotactics is limited, and imposing a too strong model can hurt the zero-shot transfer.
arXiv Detail & Related papers (2020-10-22T23:07:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.