Fixing MoE Over-Fitting on Low-Resource Languages in Multilingual Machine Translation
- URL: http://arxiv.org/abs/2212.07571v1
- Date: Thu, 15 Dec 2022 01:06:55 GMT
- Title: Fixing MoE Over-Fitting on Low-Resource Languages in Multilingual Machine Translation
- Authors: Maha Elbayad and Anna Sun and Shruti Bhosale
- Abstract summary: Sparsely gated Mixture of Experts (MoE) models have been shown to be a compute-efficient method to scale model capacity for multilingual machine translation.
We show effective regularization strategies that prevent over-fitting and improve the performance of MoE models on low-resource tasks.
- Score: 8.7660229706359
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Sparsely gated Mixture of Experts (MoE) models have been shown to be a
compute-efficient method to scale model capacity for multilingual machine
translation. However, for low-resource tasks, MoE models severely over-fit. We
show effective regularization strategies, namely dropout techniques for MoE
layers (EOM and FOM), Conditional MoE Routing, and Curriculum Learning
methods, which prevent over-fitting and improve the performance of MoE models on
low-resource tasks without adversely affecting high-resource tasks. On a
massively multilingual machine translation benchmark, our strategies result in
about +1 chrF++ improvement in very low resource language pairs. We perform an
extensive analysis of the learned MoE routing to better understand the impact
of our regularization methods and how we can improve them.
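To make the dropout-style regularization concrete, below is a minimal PyTorch sketch of one of the named techniques, reading EOM as Expert Output Masking: during training, the combined expert output is zeroed for a random fraction of tokens before the residual connection. The layer sizes, top-1 routing, and the `eom_p` rate are illustrative placeholders, not the paper's actual configuration or code.

```python
import torch
import torch.nn as nn

class MoELayerWithEOM(nn.Module):
    """Toy MoE feed-forward block with Expert Output Masking (EOM).
    Sketch only: shapes, routing, and hyperparameters are placeholders."""

    def __init__(self, d_model=512, d_ff=2048, num_experts=8, eom_p=0.2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)  # token-to-expert gate
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        self.eom_p = eom_p  # probability of masking a token's expert output

    def forward(self, x):                       # x: (batch, seq, d_model)
        gates = self.router(x).softmax(dim=-1)  # (batch, seq, num_experts)
        top1 = gates.argmax(dim=-1)             # simple top-1 routing
        moe_out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            routed = (top1 == i).unsqueeze(-1)  # tokens sent to expert i
            moe_out = moe_out + routed * gates[..., i:i + 1] * expert(x)

        if self.training and self.eom_p > 0:
            # EOM: drop the expert output (not the residual path) for a random
            # fraction of tokens, rescaling the survivors as in dropout.
            keep = (torch.rand(x.shape[:2], device=x.device) > self.eom_p).unsqueeze(-1)
            moe_out = moe_out * keep / (1.0 - self.eom_p)

        return x + moe_out                      # residual connection
```

FOM, Conditional MoE Routing, and the curriculum schedule are not reproduced here; the sketch only illustrates the general idea of limiting how much individual tokens rely on the expert outputs during training.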
Related papers
- Mitigating Catastrophic Forgetting in Language Transfer via Model Merging [16.845734486667226]
Branch-and-Merge (BaM) is a new adaptation method based on iteratively merging multiple models.
BaM is based on the insight that iterative merging yields lower-magnitude but higher-quality weight changes.
We demonstrate in an empirical study on Bulgarian and German that BaM can significantly reduce forgetting while matching or even improving target domain performance.
arXiv Detail & Related papers (2024-07-11T17:32:40Z)
- Unlocking the Potential of Model Merging for Low-Resource Languages [66.7716891808697]
Adapting large language models to new languages typically involves continual pre-training (CT) followed by supervised fine-tuning (SFT).
We propose model merging as an alternative for low-resource languages, combining models with distinct capabilities into a single model without additional training (a minimal weight-averaging sketch is given after this list).
Experiments based on Llama-2-7B demonstrate that model merging effectively endows LLMs for low-resource languages with task-solving abilities, outperforming CT-then-SFT in scenarios with extremely scarce data.
arXiv Detail & Related papers (2024-07-04T15:14:17Z)
- MoE-CT: A Novel Approach For Large Language Models Training With Resistance To Catastrophic Forgetting [53.77590764277568]
We introduce a novel MoE-CT architecture that separates the base model's learning from the multilingual expansion process.
Our design freezes the original LLM parameters, thus safeguarding its performance in high-resource languages, while an appended MoE module, trained on diverse language datasets, augments low-resource language proficiency.
arXiv Detail & Related papers (2024-06-25T11:03:45Z)
- Too Brittle To Touch: Comparing the Stability of Quantization and Distillation Towards Developing Lightweight Low-Resource MT Models [12.670354498961492]
State-of-the-art machine translation models are often able to adapt to the paucity of data for low-resource languages.
Knowledge Distillation is one popular technique to develop competitive, lightweight models.
arXiv Detail & Related papers (2022-10-27T05:30:13Z)
- PaLM: Scaling Language Modeling with Pathways [180.69584031908113]
We trained a 540-billion parameter, densely activated, Transformer language model, which we call the Pathways Language Model (PaLM).
We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods.
We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks.
arXiv Detail & Related papers (2022-04-05T16:11:45Z)
- Scalable and Efficient MoE Training for Multitask Multilingual Models [55.987536562357086]
We develop a system capable of scaling MoE models efficiently to trillions of parameters.
We also present new training methods to improve MoE sample efficiency and leverage an expert pruning strategy to improve time efficiency.
A model trained with 10 billion parameters on 50 languages can achieve state-of-the-art performance in Machine Translation (MT) and multilingual natural language generation tasks.
arXiv Detail & Related papers (2021-09-22T00:57:46Z)
- Distributionally Robust Multilingual Machine Translation [94.51866646879337]
We propose a new learning objective for multilingual neural machine translation (MNMT) based on distributionally robust optimization.
We show how to practically optimize this objective for large translation corpora using an iterated best response scheme.
Our method consistently outperforms strong baseline methods in terms of average and per-language performance under both many-to-one and one-to-many translation settings.
arXiv Detail & Related papers (2021-09-09T03:48:35Z)
- Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation [81.7786241489002]
Massively multilingual models for neural machine translation (NMT) are theoretically attractive, but often underperform bilingual models and deliver poor zero-shot translations.
We argue that multilingual NMT requires stronger modeling capacity to support language pairs with varying typological characteristics.
We propose random online backtranslation to enforce the translation of unseen training language pairs.
arXiv Detail & Related papers (2020-04-24T17:21:32Z)
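As referenced in the model-merging entry above, here is a minimal sketch of the weight-averaging flavor of model merging: parameters from checkpoints that share one architecture are combined as a (weighted) average in weight space. This is a generic illustration under that assumption, not the exact merging scheme of Branch-and-Merge or the Llama-2-7B experiments; the checkpoint names in the usage comment are hypothetical.

```python
import torch

def merge_state_dicts(state_dicts, weights=None):
    """Average parameter tensors across checkpoints with identical keys.
    Sketch only: uniform or user-supplied weights, no per-layer logic."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for key in state_dicts[0]:
        # Cast to float so integer buffers (e.g. step counters) do not break the sum.
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged

# Hypothetical usage: merge a continually pre-trained checkpoint with an
# instruction-tuned one, then load the result into a model of the same shape.
# merged = merge_state_dicts([torch.load("cpt.pt"), torch.load("sft.pt")])
# model.load_state_dict(merged)
```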
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.