Scaling Laws for Multilingual Neural Machine Translation
- URL: http://arxiv.org/abs/2302.09650v1
- Date: Sun, 19 Feb 2023 18:43:24 GMT
- Title: Scaling Laws for Multilingual Neural Machine Translation
- Authors: Patrick Fernandes, Behrooz Ghorbani, Xavier Garcia, Markus Freitag,
Orhan Firat
- Abstract summary: We study how increases in model size affect performance and investigate the role of the training mixture composition in the scaling behavior.
We find that changing the weightings of the individual language pairs in the training mixture only affects the multiplicative factor of the scaling law.
We leverage our observations to predict the performance of multilingual models trained with any language weighting at any scale.
- Score: 45.620062316968976
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we provide a large-scale empirical study of the scaling
properties of multilingual neural machine translation models. We examine how
increases in the model size affect the model performance and investigate the
role of the training mixture composition on the scaling behavior. We find that
changing the weightings of the individual language pairs in the training
mixture only affects the multiplicative factor of the scaling law. In
particular, we observe that multilingual models trained using different mixing
rates all exhibit the same scaling exponent. Through a novel joint scaling law
formulation, we compute the effective number of parameters allocated to each
language pair and examine the role of language similarity in the scaling
behavior of our models. We find little evidence that language similarity has
any impact. In contrast, the direction of the multilinguality plays a
significant role, with models translating from multiple languages into English
having a larger number of effective parameters per task than their reversed
counterparts. Finally, we leverage our observations to predict the performance
of multilingual models trained with any language weighting at any scale,
significantly reducing efforts required for language balancing in large
multilingual models. Our findings apply to both in-domain and out-of-domain
test sets and to multiple evaluation metrics, such as ChrF and BLEURT.
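The abstract gives the qualitative shape of the law (a scaling exponent shared across mixtures, with the language weighting entering only through a multiplicative factor) but not its exact equation. The following is a minimal, hypothetical sketch of fitting such a form; the assumed expression L(N) = beta_p * N**(-alpha) + L_inf, the toy loss values, and all names are illustrative assumptions, not the paper's actual formulation or data.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical illustration of the behavior described in the abstract:
# per-language-pair loss follows a power law in model size N whose exponent
# alpha is shared across training mixtures, while only the multiplicative
# factor beta_p depends on the mixture weighting p.
def scaling_law(N, beta_p, alpha, L_inf):
    """Assumed form for illustration: L(N) = beta_p * N**(-alpha) + L_inf."""
    return beta_p * N ** (-alpha) + L_inf

# Toy losses at several model sizes for two different mixture weightings.
model_sizes = np.array([1e7, 1e8, 1e9, 1e10])
losses_by_weighting = {
    "p = 0.7 towards En->X": np.array([3.2, 2.6, 2.2, 1.95]),
    "p = 0.3 towards En->X": np.array([3.5, 2.8, 2.3, 2.05]),
}

# Fit each weighting separately; under the paper's finding the fitted alphas
# should roughly agree, and only beta_p should differ between weightings.
for name, losses in losses_by_weighting.items():
    (beta_p, alpha, L_inf), _ = curve_fit(
        scaling_law, model_sizes, losses, p0=[30.0, 0.2, 1.5], maxfev=10_000
    )
    print(f"{name}: beta_p={beta_p:.2f}, alpha={alpha:.3f}, L_inf={L_inf:.2f}")
```

Under the paper's claim, fits like these would recover approximately the same alpha for every weighting, so only beta_p would need re-estimating when the mixture changes.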
Related papers
- Scaling Laws for Multilingual Language Models [41.6318470003173]
A primary challenge in studying multilingual scaling is the difficulty of analyzing individual language performance due to cross-lingual transfer.
We introduce and validate a hypothesis that the test cross-entropy loss for each language family is determined solely by its own sampling ratio.
We derive a power-law relationship that links performance with dataset size, model size and sampling ratios (an illustrative form is sketched below).
arXiv Detail & Related papers (2024-10-15T20:29:38Z)
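The summary names a power law linking loss to dataset size, model size and sampling ratio, but not its exact form. Purely as an illustrative assumption (not the paper's fitted equation), one family consistent with that description is:

```latex
% Illustrative assumption only, not the paper's equation.
% L_i: test cross-entropy of language family i; N: model size; D: dataset size;
% p_i: sampling ratio of family i; E_i, A_i, B_i, \alpha, \beta: fitted constants.
\[
  L_i(N, D, p_i) \,=\, E_i + \frac{A_i}{N^{\alpha}} + \frac{B_i}{\left(p_i D\right)^{\beta}}
\]
```

The key property carried over from the summary is that each family's loss depends on the mixture only through its own sampling ratio p_i.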
- Observational Scaling Laws and the Predictability of Language Model Performance [51.2336010244645]
We propose an observational approach that bypasses model training and instead builds scaling laws from 100 publicly available models.
We show that several emergent phenomena follow a smooth, sigmoidal behavior and are predictable from small models (a sigmoid-fit sketch follows this entry).
We show how to predict the impact of post-training interventions like Chain-of-Thought and Self-Consistency as language model capabilities continue to improve.
arXiv Detail & Related papers (2024-05-17T17:49:44Z)
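As a rough illustration of the "smooth, sigmoidal and predictable" claim, the snippet below fits a sigmoid relating a toy benchmark score to a scalar scale proxy and extrapolates it. The proxy (log10 compute), the data, and the functional details are assumptions for illustration, not the paper's method.

```python
import numpy as np
from scipy.optimize import curve_fit

# Illustrative sketch: fit a sigmoid relating a downstream benchmark score to
# a scalar capability proxy (here log10 compute), the kind of smooth,
# predictable curve the summary describes. All values are toy data.
def sigmoid(x, lo, hi, midpoint, slope):
    return lo + (hi - lo) / (1.0 + np.exp(-slope * (x - midpoint)))

log_compute = np.array([20.0, 21.0, 22.0, 23.0, 24.0])  # toy log10(FLOPs)
accuracy    = np.array([0.26, 0.29, 0.41, 0.63, 0.74])  # toy benchmark scores

params, _ = curve_fit(sigmoid, log_compute, accuracy,
                      p0=[0.25, 0.8, 22.5, 1.0], maxfev=10_000)

# Extrapolate to a larger (hypothetical) scale to illustrate predictability.
print("predicted score at log10(FLOPs)=25:", sigmoid(25.0, *params))
```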
- The Power of Question Translation Training in Multilingual Reasoning: Broadened Scope and Deepened Insights [108.40766216456413]
We propose a question alignment framework to bridge the gap between large language models' English and non-English performance.
Experimental results show it can boost multilingual performance across diverse reasoning scenarios, model families, and sizes.
We analyze the representation space, generated responses, and data scales, and reveal how question translation training strengthens language alignment within LLMs.
arXiv Detail & Related papers (2024-05-02T14:49:50Z)
- Understanding the effects of language-specific class imbalance in multilingual fine-tuning [0.0]
We show that fine-tuning a transformer-based Large Language Model (LLM) on a dataset with such an imbalance leads to worse performance.
We modify the traditional class weighting approach to imbalance by calculating class weights separately for each language, as sketched below.
arXiv Detail & Related papers (2024-02-20T13:59:12Z)
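A minimal sketch of the per-language weighting idea, assuming a simple (language, label) data layout; the helper name and the "balanced" weighting formula are illustrative choices, not necessarily the paper's exact procedure.

```python
from collections import Counter, defaultdict

# Illustrative sketch (assumed data layout): compute class weights separately
# for each language instead of once over the pooled multilingual dataset.
def per_language_class_weights(examples):
    """examples: iterable of (language, label) pairs."""
    counts = defaultdict(Counter)
    for lang, label in examples:
        counts[lang][label] += 1

    weights = {}
    for lang, label_counts in counts.items():
        n_lang = sum(label_counts.values())
        n_classes = len(label_counts)
        # Standard "balanced" weighting, restricted to this language's examples.
        weights[lang] = {
            label: n_lang / (n_classes * count)
            for label, count in label_counts.items()
        }
    return weights

data = [("de", "pos"), ("de", "pos"), ("de", "neg"),
        ("fr", "pos"), ("fr", "neg"), ("fr", "neg"), ("fr", "neg")]
print(per_language_class_weights(data))
```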
- On the Analysis of Cross-Lingual Prompt Tuning for Decoder-based Multilingual Model [49.81429697921861]
We study the interaction between parameter-efficient fine-tuning (PEFT) and cross-lingual tasks in multilingual autoregressive models.
We show that prompt tuning is more effective in enhancing the performance of low-resource languages than fine-tuning.
arXiv Detail & Related papers (2023-11-14T00:43:33Z)
- Tokenizer Choice For LLM Training: Negligible or Crucial? [30.33170936148845]
We study the influence of tokenizer choice on the downstream performance of Large Language Models (LLMs) by training 24 mono- and multilingual LLMs.
We find that the tokenizer choice can significantly impact the model's downstream performance and training costs.
We show that multilingual tokenizers trained on the five most frequent European languages require a vocabulary roughly three times larger than an English tokenizer's (a tokenizer-comparison sketch follows this entry).
arXiv Detail & Related papers (2023-10-12T22:44:19Z)
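A small sketch of one way to compare tokenizers on multilingual text (tokens per word, often called fertility); the checkpoint names and sample sentences are placeholders, and this is only a proxy for the cost effects the summary describes, not the paper's evaluation.

```python
from transformers import AutoTokenizer  # Hugging Face Transformers

# Illustrative sketch: measure how many tokens each tokenizer needs per word
# of the same multilingual text. Checkpoint names below are placeholders.
samples = {
    "en": "The tokenizer choice can significantly impact training costs.",
    "de": "Die Wahl des Tokenizers kann die Trainingskosten erheblich beeinflussen.",
    "fr": "Le choix du tokenizer peut fortement influencer les couts d'entrainement.",
}

for name in ("gpt2", "bert-base-multilingual-cased"):
    tok = AutoTokenizer.from_pretrained(name)
    for lang, text in samples.items():
        tokens_per_word = len(tok.tokenize(text)) / len(text.split())
        print(f"{name:32s} {lang}: {tokens_per_word:.2f} tokens per word")
```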
- Analyzing the Mono- and Cross-Lingual Pretraining Dynamics of Multilingual Language Models [73.11488464916668]
This study investigates the dynamics of the multilingual pretraining process.
We probe checkpoints taken from throughout XLM-R pretraining, using a suite of linguistic tasks.
Our analysis shows that the model achieves high in-language performance early on, with lower-level linguistic skills acquired before more complex ones.
arXiv Detail & Related papers (2022-05-24T03:35:00Z)
- How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models [96.32118305166412]
We study nine typologically diverse languages with readily available pretrained monolingual models on five diverse monolingual downstream tasks.
We find that languages which are adequately represented in the multilingual model's vocabulary exhibit negligible performance decreases over their monolingual counterparts.
arXiv Detail & Related papers (2020-12-31T14:11:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.