Scaling End-to-End Models for Large-Scale Multilingual ASR
- URL: http://arxiv.org/abs/2104.14830v1
- Date: Fri, 30 Apr 2021 08:24:11 GMT
- Title: Scaling End-to-End Models for Large-Scale Multilingual ASR
- Authors: Bo Li, Ruoming Pang, Tara N. Sainath, Anmol Gulati, Yu Zhang, James
Qin, Parisa Haghani, W. Ronny Huang, Min Ma
- Abstract summary: Building ASR models across many language families is a challenging multi-task learning problem due to large language variations and heavily unbalanced data.
We conduct a capacity study on a 15-language task, with the amount of data per language varying from 7.7K to 54.7K hours.
- Score: 44.89961662796597
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Building ASR models across many language families is a challenging multi-task
learning problem due to large language variations and heavily unbalanced data.
Existing work has shown positive transfer from high resource to low resource
languages. However, degradations on high resource languages are commonly
observed due to interference from the heterogeneous multilingual data and
reduction in per-language capacity. We conduct a capacity study on a
15-language task, with the amount of data per language varying from 7.7K to
54.7K hours. We adopt GShard [1] to efficiently scale up to 10B parameters.
Empirically, we find that (1) scaling the number of model parameters is an
effective way to solve the capacity bottleneck - our 500M-param model is
already better than monolingual baselines and scaling it to 1B and 10B brought
further quality gains; (2) larger models are not only more data efficient, but
also more efficient in terms of training cost as measured in TPU days - the
1B-param model reaches the same accuracy as the 500M-param model with only 34%
of the training time; (3) given a fixed capacity budget, adding depth usually works
better than width and large encoders tend to do better than large decoders.
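As a back-of-the-envelope illustration of finding (3), the sketch below compares two hypothetical encoder stacks that spend roughly the same parameter budget on depth versus width. The per-layer formula and the example dimensions are assumptions for illustration only, not the paper's actual Conformer configurations.

```python
# Back-of-the-envelope parameter counting for the depth-vs-width trade-off in
# finding (3). The per-layer formula (attention + feed-forward only, biases and
# norms ignored) and the example dimensions are illustrative assumptions, not
# the paper's actual models.

def layer_params(d_model: int, d_ff: int) -> int:
    """Approximate parameters of one Transformer-style layer:
    4*d_model^2 for the Q/K/V/output projections plus 2*d_model*d_ff
    for the two feed-forward projections."""
    return 4 * d_model * d_model + 2 * d_model * d_ff


def stack_params(num_layers: int, d_model: int, d_ff: int) -> int:
    """Total parameters of a stack of identical layers."""
    return num_layers * layer_params(d_model, d_ff)


if __name__ == "__main__":
    # Two ways to spend roughly the same ~1.4B-parameter encoder budget:
    deep_narrow = stack_params(num_layers=48, d_model=1536, d_ff=6144)
    shallow_wide = stack_params(num_layers=12, d_model=3072, d_ff=12288)
    print(f"deep/narrow  (48 x 1536): {deep_narrow / 1e9:.2f}B params")
    print(f"shallow/wide (12 x 3072): {shallow_wide / 1e9:.2f}B params")
```

Under this simplified count both variants land on about 1.36B parameters, which is the kind of fixed-budget comparison in which the paper reports that the deeper, narrower configuration usually wins.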
Related papers
- Scaling Laws for Multilingual Language Models [41.6318470003173]
A primary challenge in studying multilingual scaling is the difficulty of analyzing individual language performance due to cross-lingual transfer.
We introduce and validate a hypothesis that the test cross-entropy loss for each language family is determined solely by its own sampling ratio.
We derive a power-law relationship that links performance with dataset size, model size and sampling ratios (an illustrative sketch of such a law appears after this list).
arXiv Detail & Related papers (2024-10-15T20:29:38Z)
- InkubaLM: A small language model for low-resource African languages [9.426968756845389]
InkubaLM is a small language model with 0.4 billion parameters.
It achieves performance comparable to models with significantly larger parameter counts.
It demonstrates remarkable consistency across multiple languages.
arXiv Detail & Related papers (2024-08-30T05:42:31Z)
- Unlocking the Potential of Model Merging for Low-Resource Languages [66.7716891808697]
Adapting large language models to new languages typically involves continual pre-training (CT) followed by supervised fine-tuning (SFT).
We propose model merging as an alternative for low-resource languages, combining models with distinct capabilities into a single model without additional training.
Experiments based on Llama-2-7B demonstrate that model merging effectively endows LLMs for low-resource languages with task-solving abilities, outperforming CT-then-SFT in scenarios with extremely scarce data.
arXiv Detail & Related papers (2024-07-04T15:14:17Z)
- Relay Decoding: Concatenating Large Language Models for Machine Translation [21.367605327742027]
We propose an innovative approach called RD (Relay Decoding), which entails concatenating two distinct large models that individually support the source and target languages.
By incorporating a simple mapping layer to facilitate the connection between these two models and utilizing a limited amount of parallel data for training, we successfully achieve superior results in the machine translation task.
arXiv Detail & Related papers (2024-05-05T13:42:25Z)
- ColBERT-XM: A Modular Multi-Vector Representation Model for Zero-Shot Multilingual Information Retrieval [10.664434993386523]
Current approaches circumvent the lack of high-quality labeled data in non-English languages.
We present a novel modular dense retrieval model that learns from the rich data of a single high-resource language.
arXiv Detail & Related papers (2024-02-23T02:21:24Z)
- PaLM: Scaling Language Modeling with Pathways [180.69584031908113]
We trained a 540-billion parameter, densely activated Transformer language model, which we call the Pathways Language Model (PaLM).
We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods.
We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks.
arXiv Detail & Related papers (2022-04-05T16:11:45Z)
- GLaM: Efficient Scaling of Language Models with Mixture-of-Experts [84.33607245023049]
We propose and develop a family of language models named GLaM (Generalist Language Model).
GLaM uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants.
It consumes only 1/3 of the energy used to train GPT-3 and requires half of the flops for inference, while still achieving better overall zero-shot and one-shot performance across 29 NLP tasks.
arXiv Detail & Related papers (2021-12-13T18:58:19Z)
- Scaling ASR Improves Zero and Few Shot Learning [23.896440724468246]
We propose data selection techniques to efficiently scale training data to find the most valuable samples in massive datasets.
By training 1-10B parameter universal English ASR models, we push the limits of speech recognition performance across many domains.
For speakers with disorders due to brain damage, our best zero-shot and few-shot models achieve 22% and 60% relative improvement on the AphasiaBank test set, respectively.
arXiv Detail & Related papers (2021-11-10T21:18:59Z)
- Scalable and Efficient MoE Training for Multitask Multilingual Models [55.987536562357086]
We develop a system capable of scaling MoE models efficiently to trillions of parameters.
We also present new training methods to improve MoE sample efficiency and leverage expert pruning strategy to improve time efficiency.
A model trained with 10 billion parameters on 50 languages can achieve state-of-the-art performance in Machine Translation (MT) and multilingual natural language generation tasks.
arXiv Detail & Related papers (2021-09-22T00:57:46Z)
- Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-centric models brings gains of more than 10 BLEU when directly translating between non-English directions, while performing competitively with the best single systems from WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z)
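The "Scaling Laws for Multilingual Language Models" entry above states that a power law links performance to dataset size, model size and sampling ratio, but does not give its form. The sketch below shows one plausible shape for such a law, written purely as an assumption: the functional form, the constants, and the `family_loss` name are hypothetical and not taken from that paper.

```python
# Illustrative only: a generic power law in model size, data size and
# per-family sampling ratio. Neither the functional form nor the constants
# come from the cited paper; all values are placeholders.

def family_loss(n_params: float, n_tokens: float, sampling_ratio: float,
                e: float = 1.7, a: float = 4e2, alpha: float = 0.34,
                b: float = 2e3, beta: float = 0.28) -> float:
    """Hypothetical test cross-entropy for one language family: an
    irreducible term, a model-size term, and a data term in which the
    family's effective data is its sampling ratio times the token budget."""
    return e + a / n_params ** alpha + b / (sampling_ratio * n_tokens) ** beta


if __name__ == "__main__":
    # At a fixed 1B-parameter / 1T-token budget, raising a family's sampling
    # ratio lowers its (hypothetical) loss:
    print(family_loss(1e9, 1e12, sampling_ratio=0.05))
    print(family_loss(1e9, 1e12, sampling_ratio=0.10))
```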
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.