Improving Multilingual ASR in the Wild Using Simple N-best Re-ranking
- URL: http://arxiv.org/abs/2409.18428v1
- Date: Fri, 27 Sep 2024 03:31:32 GMT
- Title: Improving Multilingual ASR in the Wild Using Simple N-best Re-ranking
- Authors: Brian Yan, Vineel Pratap, Shinji Watanabe, Michael Auli
- Abstract summary: We present a simple and effective N-best re-ranking approach to improve multilingual ASR accuracy.
On FLEURS with the MMS and Whisper models, spoken language identification accuracy improves by 8.7% and 6.1%, respectively, and word error rates are 3.3% and 2.0% lower on these benchmarks.
- Score: 68.77659513993507
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multilingual Automatic Speech Recognition (ASR) models are typically evaluated in a setting where the ground-truth language of the speech utterance is known; however, this is often not the case in most practical settings. Automatic Spoken Language Identification (SLID) models are not perfect, and misclassifications have a substantial impact on the final ASR accuracy. In this paper, we present a simple and effective N-best re-ranking approach to improve multilingual ASR accuracy for several prominent acoustic models by employing external features such as language models and text-based language identification models. Our results on FLEURS using the MMS and Whisper models show spoken language identification accuracy improvements of 8.7% and 6.1%, respectively, and word error rates which are 3.3% and 2.0% lower on these benchmarks.
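To make the approach concrete, below is a minimal Python sketch of the kind of N-best re-ranking the abstract describes: each hypothesis carries its first-pass ASR score plus external language-model and text-based LID scores, and the list is re-sorted by a weighted combination of these features. The Hypothesis class, the feature names, and the weights are illustrative assumptions, not the paper's exact implementation.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str
    language: str      # language tag proposed by the first-pass system
    asr_score: float   # first-pass ASR log-probability
    lm_score: float    # external language-model log-probability (assumed precomputed)
    lid_score: float   # text-based LID log-probability of `language` (assumed precomputed)

def rerank(nbest, w_asr=1.0, w_lm=0.3, w_lid=0.5):
    """Sort an N-best list by a weighted sum of first-pass and external scores.

    The feature set and weights are illustrative; in practice the weights
    would be tuned on a held-out development set.
    """
    return sorted(
        nbest,
        key=lambda h: w_asr * h.asr_score + w_lm * h.lm_score + w_lid * h.lid_score,
        reverse=True,
    )

# Toy usage: the same utterance decoded under two candidate languages.
nbest = [
    Hypothesis("hola como estas", "spa", asr_score=-4.1, lm_score=-9.0, lid_score=-0.1),
    Hypothesis("ola como esta", "por", asr_score=-3.9, lm_score=-14.2, lid_score=-2.3),
]
best = rerank(nbest)[0]
print(best.language, best.text)   # the Spanish hypothesis wins after re-ranking
```

In the toy example the Portuguese hypothesis has the better first-pass ASR score, but the external LM and text-LID features flip the decision to Spanish, which is the kind of SLID misclassification the re-ranking is meant to correct.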
Related papers
- Parameter-efficient Adaptation of Multilingual Multimodal Models for Low-resource ASR [25.566285376879094]
Multimodal models can leverage unlabeled text via text-only adaptation, followed by further parameter-efficient ASR fine-tuning.
We show cross-lingual transfer from a high-resource language, achieving up to a relative 17% WER reduction over a baseline in a zero-shot setting.
arXiv Detail & Related papers (2024-10-17T11:19:44Z) - Weighted Cross-entropy for Low-Resource Languages in Multilingual Speech Recognition [2.7247388777405597]
We introduce a novel application of weighted cross-entropy, a technique typically used for class-imbalanced datasets (see the sketch after this list).
We fine-tune the Whisper multilingual ASR model on five high-resource languages and one low-resource language.
arXiv Detail & Related papers (2024-09-25T14:09:09Z) - Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z) - From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z) - Is Attention always needed? A Case Study on Language Identification from Speech [1.162918464251504]
The present study introduces a convolutional recurrent neural network (CRNN) based LID model.
The CRNN-based LID model operates on Mel-frequency cepstral coefficient (MFCC) features of the audio samples (see the sketch after this list).
The LID model exhibits high performance, ranging from 97% to 100%, for languages that are linguistically similar.
arXiv Detail & Related papers (2021-10-05T16:38:57Z) - Specializing Multilingual Language Models: An Empirical Study [50.7526245872855]
Contextualized word representations from pretrained multilingual language models have become the de facto standard for addressing natural language tasks.
For languages rarely or never seen by these models, directly using such models often results in suboptimal representation or use of data.
arXiv Detail & Related papers (2021-06-16T18:13:55Z) - Multi-task Language Modeling for Improving Speech Recognition of Rare Words [14.745696312889763]
We propose a second-pass system with multi-task learning, utilizing semantic targets (such as intent and slot prediction) to improve speech recognition performance.
Our best ASR system with the multi-task LM shows a 4.6% relative word error rate reduction (WERR) compared with an RNN-Transducer-only ASR baseline on rare word recognition.
arXiv Detail & Related papers (2020-11-23T20:40:44Z) - Learning to Learn Morphological Inflection for Resource-Poor Languages [105.11499402984482]
We propose to cast the task of morphological inflection - mapping a lemma to an indicated inflected form - for resource-poor languages as a meta-learning problem.
Treating each language as a separate task, we use data from high-resource source languages to learn a set of model parameters.
Experiments with two model architectures on 29 target languages from 3 families show that our suggested approach outperforms all baselines.
arXiv Detail & Related papers (2020-04-28T05:13:17Z) - Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and subsequent LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
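As a companion to the weighted cross-entropy entry above, here is a minimal PyTorch sketch of a class-weighted cross-entropy loss. The inverse-frequency weighting scheme and the toy class counts are assumptions for illustration, not the cited paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def inverse_frequency_weights(counts):
    """Per-class weights inversely proportional to class frequency."""
    counts = torch.tensor(counts, dtype=torch.float)
    weights = 1.0 / counts
    return weights / weights.mean()   # normalise so the average weight is 1.0

# Toy setup: 4 classes, the last one "low-resource".
class_counts = [9000, 8000, 7000, 500]
weights = inverse_frequency_weights(class_counts)

logits = torch.randn(16, 4)            # (batch, num_classes) model outputs
targets = torch.randint(0, 4, (16,))   # gold class indices

# Errors on the rare class now contribute more to the loss than on frequent ones.
loss = F.cross_entropy(logits, targets, weight=weights)
print(loss.item())
```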
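As a companion to the CRNN-based language identification entry above, here is a minimal PyTorch sketch of a convolutional recurrent classifier over MFCC features. All layer sizes, the pooling choices, and the number of languages are illustrative assumptions; the cited paper's configuration may differ.

```python
import torch
import torch.nn as nn

class CRNNLanguageID(nn.Module):
    """Convolutional recurrent classifier over MFCC features (illustrative sizes)."""

    def __init__(self, n_mfcc=40, n_languages=10, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d((2, 2)),   # halves both the time and MFCC axes
        )
        self.rnn = nn.GRU(input_size=32 * (n_mfcc // 2), hidden_size=hidden,
                          batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, n_languages)

    def forward(self, mfcc):                       # mfcc: (batch, time, n_mfcc)
        x = self.conv(mfcc.unsqueeze(1))           # (batch, 32, time//2, n_mfcc//2)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        _, h = self.rnn(x)                         # final hidden states of the GRU
        utt = torch.cat([h[0], h[1]], dim=-1)      # concatenate both directions
        return self.classifier(utt)                # (batch, n_languages) logits

# Toy usage: a batch of 4 utterances, 200 frames, 40 MFCCs each.
model = CRNNLanguageID()
logits = model(torch.randn(4, 200, 40))
print(logits.shape)                                # torch.Size([4, 10])
```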