Massively Multilingual ASR on 70 Languages: Tokenization, Architecture,
and Generalization Capabilities
- URL: http://arxiv.org/abs/2211.05756v1
- Date: Thu, 10 Nov 2022 18:43:42 GMT
- Title: Massively Multilingual ASR on 70 Languages: Tokenization, Architecture,
and Generalization Capabilities
- Authors: Andros Tjandra, Nayan Singhal, David Zhang, Ozlem Kalinli, Abdelrahman
Mohamed, Duc Le, Michael L. Seltzer
- Abstract summary: This paper explores large-scale multilingual ASR models on 70 languages.
We show that our multilingual ASR generalizes well on an unseen dataset and domain, achieving 9.5% and 7.5% WER on Multilingual Librispeech (MLS) with zero-shot and finetuning, respectively.
- Score: 35.15674061731237
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end multilingual ASR has become more appealing for several
reasons, such as simplifying the training and deployment process and enabling
positive performance transfer from high-resource to low-resource languages.
However, scaling up the number of languages, total hours, and number of unique
tokens is not a trivial task. This paper explores large-scale multilingual ASR
models on 70 languages. We inspect two architectures: (1) a shared embedding
and output model and (2) a multiple embedding and output model. In the
shared-model experiments, we show the importance of the tokenization strategy
across different languages. Later, we use our optimal tokenization strategy to
train the multiple embedding and output model to further improve our results.
Our multilingual ASR achieves a 13.9%-15.6% average relative WER improvement
compared to monolingual models. We show that our multilingual ASR generalizes
well to an unseen dataset and domain, achieving 9.5% and 7.5% WER on
Multilingual Librispeech (MLS) with zero-shot and finetuning, respectively.
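The two decoder variants compared in the abstract can be illustrated with a minimal PyTorch-style sketch. The class names, dimensions, and vocabulary sizes below (e.g. `SharedHead`, `PerLanguageHead`, `vocab_sizes`) are hypothetical and chosen for illustration only; the paper's actual architecture details differ.

```python
# Minimal sketch of the two output-layer designs compared in the paper.
# All class/argument names are illustrative, not the authors' implementation.
import torch
import torch.nn as nn


class SharedHead(nn.Module):
    """(1) Shared embedding and output: one tokenizer vocabulary covering
    all 70 languages, one softmax projection."""

    def __init__(self, d_model: int, shared_vocab_size: int):
        super().__init__()
        self.out = nn.Linear(d_model, shared_vocab_size)

    def forward(self, decoder_states: torch.Tensor, lang_id) -> torch.Tensor:
        # lang_id is ignored: every language shares the same projection.
        return self.out(decoder_states)


class PerLanguageHead(nn.Module):
    """(2) Multiple embedding and output: a separate vocabulary and output
    projection per language, selected by language id."""

    def __init__(self, d_model: int, vocab_sizes: dict):
        super().__init__()
        self.out = nn.ModuleDict(
            {lang: nn.Linear(d_model, v) for lang, v in vocab_sizes.items()}
        )

    def forward(self, decoder_states: torch.Tensor, lang_id: str) -> torch.Tensor:
        return self.out[lang_id](decoder_states)


# Toy usage: 256-dim decoder states, batch of 2, length 5.
states = torch.randn(2, 5, 256)
shared = SharedHead(256, shared_vocab_size=8000)
multi = PerLanguageHead(256, vocab_sizes={"en": 4000, "vi": 2000})
print(shared(states, lang_id=None).shape)   # (2, 5, 8000)
print(multi(states, lang_id="vi").shape)    # (2, 5, 2000)
```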
Related papers
- Multilingual De-Duplication Strategies: Applying scalable similarity search with monolingual & multilingual embedding models [0.0]
This paper addresses the deduplication of multilingual textual data using advanced NLP tools.
We compare a two-step method, translation to English followed by embedding with mpnet, against a direct multilingual embedding model (distiluse).
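As a rough illustration of the embedding-based de-duplication described above, the sketch below embeds texts with a multilingual sentence-transformers model and flags near-duplicates by cosine similarity. The model name and the 0.9 threshold are assumptions rather than values taken from the paper, and a production system would use an approximate-nearest-neighbour index instead of brute-force comparison.

```python
# Hedged sketch: multilingual embedding + cosine-similarity de-duplication.
# Model name and threshold are illustrative assumptions.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    "The cat sat on the mat.",
    "Le chat était assis sur le tapis.",   # French near-duplicate
    "Stock prices fell sharply today.",
]

model = SentenceTransformer("distiluse-base-multilingual-cased-v2")
emb = model.encode(texts, normalize_embeddings=True)

sim = cosine_similarity(emb)               # pairwise similarity matrix
THRESHOLD = 0.9                            # assumed value, tune per corpus

duplicates = [
    (i, j)
    for i in range(len(texts))
    for j in range(i + 1, len(texts))
    if sim[i, j] >= THRESHOLD
]
print(duplicates)  # index pairs judged to be cross-lingual duplicates
```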
arXiv Detail & Related papers (2024-06-19T16:48:14Z)
- Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations [59.056367787688146]
This paper pioneers the exploration and training of powerful Multilingual Math Reasoning (xMR) LLMs.
By utilizing translation, we construct the first multilingual math reasoning instruction dataset, MGSM8KInstruct, encompassing ten distinct languages.
arXiv Detail & Related papers (2023-10-31T08:09:20Z)
- PolyLM: An Open Source Polyglot Large Language Model [57.64420154135178]
We present PolyLM, a multilingual large language model (LLM) trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B.
To enhance its multilingual capabilities, we 1) integrate bilingual data into training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training.
Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
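A minimal sketch of the curriculum described above: the sampling probability of non-English data is interpolated from 30% to 60% over the course of pre-training. The linear schedule and function names are assumptions for illustration; PolyLM's actual schedule may differ.

```python
# Hedged sketch of a curriculum that raises the non-English sampling ratio
# from 30% at the start of pre-training to 60% at the end.
import random

def non_english_ratio(step: int, total_steps: int,
                      start: float = 0.30, end: float = 0.60) -> float:
    """Linearly interpolate the non-English sampling probability (assumed schedule)."""
    frac = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * frac

def sample_language_bucket(step: int, total_steps: int) -> str:
    """Pick which data bucket the next batch is drawn from."""
    p = non_english_ratio(step, total_steps)
    return "non_english" if random.random() < p else "english"

total = 100_000
for step in (0, 50_000, 100_000):
    print(step, round(non_english_ratio(step, total), 2))
# 0 -> 0.3, 50000 -> 0.45, 100000 -> 0.6
```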
arXiv Detail & Related papers (2023-07-12T09:00:37Z)
- From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures that focus on learnable enhancement of pre-trained features.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
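Parameter-efficient reprogramming of this kind can be sketched as a small trainable module that perturbs the inputs (or pre-trained features) of a frozen encoder, with only the reprogramming parameters updated for the new language. The module below is a generic illustration under that assumption, not the paper's exact design.

```python
# Hedged sketch: neural reprogramming of a frozen pretrained acoustic encoder.
# Only the small "Reprogrammer" is trained for the new target language.
import torch
import torch.nn as nn


class Reprogrammer(nn.Module):
    """Learnable additive enhancement applied to pre-trained features (illustrative)."""

    def __init__(self, feat_dim: int, hidden: int = 64):
        super().__init__()
        self.delta = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, feat_dim)
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return feats + self.delta(feats)   # residual "enhancement" of the features


frozen_encoder = nn.GRU(input_size=80, hidden_size=256, batch_first=True)
for p in frozen_encoder.parameters():      # keep the pretrained encoder fixed
    p.requires_grad_(False)

reprogram = Reprogrammer(feat_dim=80)      # only these weights are trainable
feats = torch.randn(4, 200, 80)            # batch of 80-dim filterbank frames
enc_out, _ = frozen_encoder(reprogram(feats))
print(enc_out.shape)                       # (4, 200, 256)
```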
arXiv Detail & Related papers (2023-01-19T02:37:56Z)
- Distilling a Pretrained Language Model to a Multilingual ASR Model [3.4012007729454816]
We distill the rich knowledge embedded inside a well-trained teacher text model to the student speech model.
We show the superiority of our method on 20 low-resource languages of the CommonVoice dataset with less than 100 hours of speech data.
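One common way to realise this kind of teacher-to-student transfer is to add a KL-divergence term between the teacher's token distribution and the student's, on top of the usual supervised loss. The sketch below shows that generic recipe; it is an assumed mechanism for illustration, not the paper's exact objective.

```python
# Hedged sketch: distilling token-level knowledge from a text teacher
# into a speech student via a soft-target KL term (generic recipe).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      targets: torch.Tensor,
                      alpha: float = 0.5,
                      temperature: float = 2.0) -> torch.Tensor:
    """alpha and temperature are illustrative hyperparameters."""
    # Hard-label loss against the ground-truth transcripts.
    ce = F.cross_entropy(student_logits.transpose(1, 2), targets)
    # Soft-label loss against the teacher's temperature-scaled distribution.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return (1 - alpha) * ce + alpha * kd

student = torch.randn(2, 10, 500)           # (batch, time, vocab)
teacher = torch.randn(2, 10, 500)
labels = torch.randint(0, 500, (2, 10))
print(distillation_loss(student, teacher, labels))
```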
arXiv Detail & Related papers (2022-06-25T12:36:11Z)
- Multilingual Transfer Learning for QA Using Translation as Data Augmentation [13.434957024596898]
We explore strategies that improve cross-lingual transfer by bringing the multilingual embeddings closer in the semantic space.
We propose two novel strategies, language adversarial training and language arbitration framework, which significantly improve the (zero-resource) cross-lingual transfer performance.
Empirically, we show that the proposed models outperform the previous zero-shot baseline on the recently introduced multilingual MLQA and TyDiQA datasets.
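Language adversarial training is typically implemented with a gradient-reversal layer feeding a language discriminator, so the encoder is pushed toward language-invariant representations. The sketch below shows that standard construction as a generic illustration, not the authors' exact architecture.

```python
# Hedged sketch: language-adversarial training with a gradient-reversal layer.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, reversed (scaled) gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd: float):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


encoder = nn.Linear(768, 256)                       # stand-in for a multilingual encoder
lang_discriminator = nn.Linear(256, 7)              # predicts which of 7 languages

features = encoder(torch.randn(8, 768))
reversed_feats = GradReverse.apply(features, 1.0)   # flips gradients flowing into the encoder
lang_logits = lang_discriminator(reversed_feats)
adv_loss = nn.functional.cross_entropy(lang_logits, torch.randint(0, 7, (8,)))
adv_loss.backward()   # encoder is trained to hide language identity from the discriminator
```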
arXiv Detail & Related papers (2020-12-10T20:29:34Z)
- Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models to a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z)
- Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language.
We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks like translation and monolingual tasks like masked language models.
Our model achieves an improvement of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 scores over state-of-the-art results.
arXiv Detail & Related papers (2020-10-18T00:21:53Z)
- Massively Multilingual ASR: 50 Languages, 1 Model, 1 Billion Parameters [31.705705891482594]
We study training a single acoustic model for multiple languages with the aim of improving automatic speech recognition (ASR) performance on low-resource languages.
We compare three variants of multilingual training from a single joint model without knowing the input language, to using this information, to multiple heads.
We show that multilingual training of ASR models on several languages can improve recognition performance, in particular on low-resource languages.
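The middle variant above, a single joint model that is told the input language, is often realised by concatenating a learned language-ID embedding to every acoustic frame before a shared encoder. The sketch below illustrates that generic conditioning scheme; dimensions and names are assumptions, not the paper's configuration.

```python
# Hedged sketch of language-ID conditioning: a learned language embedding is
# concatenated to every acoustic frame before the shared encoder (illustrative).
import torch
import torch.nn as nn

NUM_LANGS, FEAT_DIM, LID_DIM = 50, 80, 8

lang_embed = nn.Embedding(NUM_LANGS, LID_DIM)
encoder = nn.GRU(input_size=FEAT_DIM + LID_DIM, hidden_size=256, batch_first=True)

feats = torch.randn(4, 200, FEAT_DIM)              # 4 utterances, 200 frames each
lang_ids = torch.tensor([0, 3, 3, 17])             # per-utterance language index
lid = lang_embed(lang_ids).unsqueeze(1).expand(-1, feats.size(1), -1)
enc_out, _ = encoder(torch.cat([feats, lid], dim=-1))
print(enc_out.shape)                               # (4, 200, 256)
```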
arXiv Detail & Related papers (2020-07-06T18:43:38Z)
- CoSDA-ML: Multi-Lingual Code-Switching Data Augmentation for Zero-Shot Cross-Lingual NLP [68.2650714613869]
We propose a data augmentation framework to generate multi-lingual code-switching data to fine-tune mBERT.
Compared with the existing work, our method does not rely on bilingual sentences for training, and requires only one training process for multiple target languages.
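Code-switching augmentation of this kind can be approximated by randomly swapping words for their translations in bilingual dictionaries, so a single sentence mixes several languages. The toy dictionaries and swap probability below are illustrative assumptions, not CoSDA-ML's actual resources.

```python
# Hedged sketch: multi-lingual code-switching augmentation with toy dictionaries.
import random

# Toy bilingual dictionaries (illustrative; CoSDA-ML uses real lexicons).
DICTS = {
    "de": {"i": "ich", "like": "mag", "music": "Musik"},
    "es": {"i": "yo", "like": "gusto", "music": "música"},
}

def code_switch(sentence: str, swap_prob: float = 0.5, seed: int = 0) -> str:
    """Replace each word with a random-language translation with probability swap_prob."""
    rng = random.Random(seed)
    out = []
    for word in sentence.lower().split():
        if rng.random() < swap_prob:
            lang = rng.choice(list(DICTS))
            out.append(DICTS[lang].get(word, word))
        else:
            out.append(word)
    return " ".join(out)

print(code_switch("I like music"))   # e.g. "ich like música"
```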
arXiv Detail & Related papers (2020-06-11T13:15:59Z)