Scaling Laws for Multilingual Language Models
- URL: http://arxiv.org/abs/2410.12883v1
- Date: Tue, 15 Oct 2024 20:29:38 GMT
- Title: Scaling Laws for Multilingual Language Models
- Authors: Yifei He, Alon Benhaim, Barun Patra, Praneetha Vaddamanu, Sanchit Ahuja, Parul Chopra, Vishrav Chaudhary, Han Zhao, Xia Song,
- Abstract summary: A primary challenge in studying multilingual scaling is the difficulty of analyzing individual language performance due to cross-lingual transfer.
We introduce and validate a hypothesis that the test cross-entropy loss for each language family is determined solely by its own sampling ratio.
We derive a power-law relationship that links performance with dataset size, model size and sampling ratios.
- Score: 41.6318470003173
- License:
- Abstract: We propose a novel scaling law for general-purpose decoder-only language models (LMs) trained on multilingual data, addressing the problem of balancing languages during multilingual pretraining. A primary challenge in studying multilingual scaling is the difficulty of analyzing individual language performance due to cross-lingual transfer. To address this, we shift the focus from individual languages to language families. We introduce and validate a hypothesis that the test cross-entropy loss for each language family is determined solely by its own sampling ratio, independent of other languages in the mixture. This insight simplifies the complexity of multilingual scaling and make the analysis scalable to an arbitrary number of languages. Building on this hypothesis, we derive a power-law relationship that links performance with dataset size, model size and sampling ratios. This relationship enables us to predict performance across various combinations of the above three quantities, and derive the optimal sampling ratios at different model scales. To demonstrate the effectiveness and accuracy of our proposed scaling law, we perform a large-scale empirical study, training more than 100 models on 23 languages spanning 5 language families. Our experiments show that the optimal sampling ratios derived from small models (85M parameters) generalize effectively to models that are several orders of magnitude larger (1.2B parameters), offering a resource-efficient approach for multilingual LM training at scale.
Related papers
- Optimizing Low-Resource Language Model Training: Comprehensive Analysis of Multi-Epoch, Multi-Lingual, and Two-Stage Approaches [3.809045695573932]
Existing works adopt multi-epoch, multi-lingual, and two-stage training to utilize the limited target language corpus efficiently.
We exhaustively explore training setups for low-resource language LLM, combining these three approaches.
As the amount of target language corpus decreases, the optimal training approach shifts from monolingual single-stage training to multi-lingual two-stage training at a compute budget dependent threshold.
arXiv Detail & Related papers (2024-10-16T07:45:56Z) - Do Multilingual Large Language Models Mitigate Stereotype Bias? [9.31741279000585]
This study systematically trains six LLMs of identical size and architecture in English, German, French, Italian, and Spanish.
We observe that multilingual models achieve not only lower bias but also superior prediction accuracy when compared to monolingual models.
arXiv Detail & Related papers (2024-07-08T08:46:50Z) - Targeted Multilingual Adaptation for Low-resource Language Families [17.212424929235624]
We study best practices for adapting a pre-trained model to a language family.
Our adapted models significantly outperform mono- and multilingual baselines.
Low-resource languages can be aggressively up-sampled during training at little detriment to performance in high-resource languages.
arXiv Detail & Related papers (2024-05-20T23:38:06Z) - The Power of Question Translation Training in Multilingual Reasoning: Broadened Scope and Deepened Insights [108.40766216456413]
We propose a question alignment framework to bridge the gap between large language models' English and non-English performance.
Experiment results show it can boost multilingual performance across diverse reasoning scenarios, model families, and sizes.
We analyze representation space, generated response and data scales, and reveal how question translation training strengthens language alignment within LLMs.
arXiv Detail & Related papers (2024-05-02T14:49:50Z) - UniMax: Fairer and more Effective Language Sampling for Large-Scale
Multilingual Pretraining [92.3702056505905]
We propose a new sampling method, UniMax, that delivers more uniform coverage of head languages while mitigating overfitting on tail languages.
We find that UniMax outperforms standard temperature-based sampling, and the benefits persist as scale increases.
arXiv Detail & Related papers (2023-04-18T17:45:50Z) - Scaling Laws for Multilingual Neural Machine Translation [45.620062316968976]
We study how increases in the model size affect the model performance and investigate the role of the training mixture composition on the scaling behavior.
We find that changing the weightings of the individual language pairs in the training mixture only affect the multiplicative factor of the scaling law.
We leverage our observations to predict the performance of multilingual models trained with any language weighting at any scale.
arXiv Detail & Related papers (2023-02-19T18:43:24Z) - Analyzing the Mono- and Cross-Lingual Pretraining Dynamics of
Multilingual Language Models [73.11488464916668]
This study investigates the dynamics of the multilingual pretraining process.
We probe checkpoints taken from throughout XLM-R pretraining, using a suite of linguistic tasks.
Our analysis shows that the model achieves high in-language performance early on, with lower-level linguistic skills acquired before more complex ones.
arXiv Detail & Related papers (2022-05-24T03:35:00Z) - Towards Zero-Shot Multilingual Synthetic Question and Answer Generation
for Cross-Lingual Reading Comprehension [20.570539023748424]
We propose a simple method to generate multilingual question and answer pairs on a large scale.
These synthetic samples can be used to improve the zero-shot performance of multilingual QA models on target languages.
arXiv Detail & Related papers (2020-10-22T19:59:37Z) - Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language.
We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks like translation and monolingual tasks like masked language models.
Our model achieves an improvement of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 scores over state-of-the-art results.
arXiv Detail & Related papers (2020-10-18T00:21:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.