Scaling Laws for Multilingual Neural Machine Translation
- URL: http://arxiv.org/abs/2302.09650v1
- Date: Sun, 19 Feb 2023 18:43:24 GMT
- Title: Scaling Laws for Multilingual Neural Machine Translation
- Authors: Patrick Fernandes, Behrooz Ghorbani, Xavier Garcia, Markus Freitag,
Orhan Firat
- Abstract summary: We study how increases in model size affect performance and investigate the role of the training mixture composition in the scaling behavior.
We find that changing the weightings of the individual language pairs in the training mixture only affects the multiplicative factor of the scaling law.
We leverage our observations to predict the performance of multilingual models trained with any language weighting at any scale.
- Score: 45.620062316968976
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we provide a large-scale empirical study of the scaling
properties of multilingual neural machine translation models. We examine how
increases in the model size affect the model performance and investigate the
role of the training mixture composition on the scaling behavior. We find that
changing the weightings of the individual language pairs in the training
mixture only affects the multiplicative factor of the scaling law. In
particular, we observe that multilingual models trained using different mixing
rates all exhibit the same scaling exponent. Through a novel joint scaling law
formulation, we compute the effective number of parameters allocated to each
language pair and examine the role of language similarity in the scaling
behavior of our models. We find little evidence that language similarity has
any impact. In contrast, the direction of the multilinguality plays a
significant role, with models translating from multiple languages into English
having a larger number of effective parameters per task than their reversed
counterparts. Finally, we leverage our observations to predict the performance
of multilingual models trained with any language weighting at any scale,
significantly reducing efforts required for language balancing in large
multilingual models. Our findings apply to both in-domain and out-of-domain
test sets and to multiple evaluation metrics, such as ChrF and BLEURT.
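The abstract gives the qualitative shape of the law (a scaling exponent shared across mixtures, with the language weighting entering only through a multiplicative factor) but not its exact equation. The following is a minimal, hypothetical sketch of fitting such a form; the assumed expression L(N) = beta_p * N**(-alpha) + L_inf, the toy loss values, and all names are illustrative assumptions, not the paper's actual formulation or data.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical illustration of the behavior described in the abstract:
# per-language-pair loss follows a power law in model size N whose exponent
# alpha is shared across training mixtures, while only the multiplicative
# factor beta_p depends on the mixture weighting p.
def scaling_law(N, beta_p, alpha, L_inf):
    """Assumed form for illustration: L(N) = beta_p * N**(-alpha) + L_inf."""
    return beta_p * N ** (-alpha) + L_inf

# Toy losses at several model sizes for two different mixture weightings.
model_sizes = np.array([1e7, 1e8, 1e9, 1e10])
losses_by_weighting = {
    "p = 0.7 towards En->X": np.array([3.2, 2.6, 2.2, 1.95]),
    "p = 0.3 towards En->X": np.array([3.5, 2.8, 2.3, 2.05]),
}

# Fit each weighting separately; under the paper's finding the fitted alphas
# should roughly agree, and only beta_p should differ between weightings.
for name, losses in losses_by_weighting.items():
    (beta_p, alpha, L_inf), _ = curve_fit(
        scaling_law, model_sizes, losses, p0=[30.0, 0.2, 1.5], maxfev=10_000
    )
    print(f"{name}: beta_p={beta_p:.2f}, alpha={alpha:.3f}, L_inf={L_inf:.2f}")
```

Under the paper's claim, fits like these would recover approximately the same alpha for every weighting, so only beta_p would need re-estimating when the mixture changes.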
Related papers
- Scaling Laws for Multilingual Language Models [41.6318470003173]
A primary challenge in studying multilingual scaling is the difficulty of analyzing individual language performance due to cross-lingual transfer.
We introduce and validate a hypothesis that the test cross-entropy loss for each language family is determined solely by its own sampling ratio.
We derive a power-law relationship that links performance with dataset size, model size and sampling ratios (an illustrative form is sketched below).
arXiv Detail & Related papers (2024-10-15T20:29:38Z)
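The summary names a power law linking loss to dataset size, model size and sampling ratio, but not its exact form. Purely as an illustrative assumption (not the paper's fitted equation), one family consistent with that description is:

```latex
% Illustrative assumption only, not the paper's equation.
% L_i: test cross-entropy of language family i; N: model size; D: dataset size;
% p_i: sampling ratio of family i; E_i, A_i, B_i, \alpha, \beta: fitted constants.
\[
  L_i(N, D, p_i) \,=\, E_i + \frac{A_i}{N^{\alpha}} + \frac{B_i}{\left(p_i D\right)^{\beta}}
\]
```

The key property carried over from the summary is that each family's loss depends on the mixture only through its own sampling ratio p_i.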
- Observational Scaling Laws and the Predictability of Language Model Performance [51.2336010244645]
We propose an observational approach that bypasses model training and instead builds scaling laws from 100 publicly available models.
We show that several emergent phenomena follow a smooth, sigmoidal behavior and are predictable from small models (a sigmoid-fit sketch follows this entry).
We show how to predict the impact of post-training interventions like Chain-of-Thought and Self-Consistency as language model capabilities continue to improve.
arXiv Detail & Related papers (2024-05-17T17:49:44Z)
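As a rough illustration of the "smooth, sigmoidal and predictable" claim, the snippet below fits a sigmoid relating a toy benchmark score to a scalar scale proxy and extrapolates it. The proxy (log10 compute), the data, and the functional details are assumptions for illustration, not the paper's method.

```python
import numpy as np
from scipy.optimize import curve_fit

# Illustrative sketch: fit a sigmoid relating a downstream benchmark score to
# a scalar capability proxy (here log10 compute), the kind of smooth,
# predictable curve the summary describes. All values are toy data.
def sigmoid(x, lo, hi, midpoint, slope):
    return lo + (hi - lo) / (1.0 + np.exp(-slope * (x - midpoint)))

log_compute = np.array([20.0, 21.0, 22.0, 23.0, 24.0])  # toy log10(FLOPs)
accuracy    = np.array([0.26, 0.29, 0.41, 0.63, 0.74])  # toy benchmark scores

params, _ = curve_fit(sigmoid, log_compute, accuracy,
                      p0=[0.25, 0.8, 22.5, 1.0], maxfev=10_000)

# Extrapolate to a larger (hypothetical) scale to illustrate predictability.
print("predicted score at log10(FLOPs)=25:", sigmoid(25.0, *params))
```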
- The Power of Question Translation Training in Multilingual Reasoning: Broadened Scope and Deepened Insights [108.40766216456413]
We propose a question alignment framework to bridge the gap between large language models' English and non-English performance.
Experimental results show it can boost multilingual performance across diverse reasoning scenarios, model families, and sizes.
We analyze the representation space, generated responses, and data scales, and reveal how question translation training strengthens language alignment within LLMs.
arXiv Detail & Related papers (2024-05-02T14:49:50Z)
- Understanding the effects of language-specific class imbalance in multilingual fine-tuning [0.0]
We show that fine-tuning a transformer-based Large Language Model (LLM) on a dataset with such an imbalance leads to worse performance.
We modify the traditional class weighting approach to imbalance by calculating class weights separately for each language, as sketched below.
arXiv Detail & Related papers (2024-02-20T13:59:12Z)
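A minimal sketch of the per-language weighting idea, assuming a simple (language, label) data layout; the helper name and the "balanced" weighting formula are illustrative choices, not necessarily the paper's exact procedure.

```python
from collections import Counter, defaultdict

# Illustrative sketch (assumed data layout): compute class weights separately
# for each language instead of once over the pooled multilingual dataset.
def per_language_class_weights(examples):
    """examples: iterable of (language, label) pairs."""
    counts = defaultdict(Counter)
    for lang, label in examples:
        counts[lang][label] += 1

    weights = {}
    for lang, label_counts in counts.items():
        n_lang = sum(label_counts.values())
        n_classes = len(label_counts)
        # Standard "balanced" weighting, restricted to this language's examples.
        weights[lang] = {
            label: n_lang / (n_classes * count)
            for label, count in label_counts.items()
        }
    return weights

data = [("de", "pos"), ("de", "pos"), ("de", "neg"),
        ("fr", "pos"), ("fr", "neg"), ("fr", "neg"), ("fr", "neg")]
print(per_language_class_weights(data))
```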
- On the Analysis of Cross-Lingual Prompt Tuning for Decoder-based Multilingual Model [49.81429697921861]
We study the interaction between parameter-efficient fine-tuning (PEFT) and cross-lingual tasks in multilingual autoregressive models.
We show that prompt tuning is more effective in enhancing the performance of low-resource languages than fine-tuning.
arXiv Detail & Related papers (2023-11-14T00:43:33Z)
- Tokenizer Choice For LLM Training: Negligible or Crucial? [30.33170936148845]
We study the influence of tokenizer choice on the downstream performance of Large Language Models (LLMs) by training 24 mono- and multilingual LLMs.
We find that the tokenizer choice can significantly impact the model's downstream performance and training costs.
We show that multilingual tokenizers trained on the five most frequent European languages require a vocabulary roughly three times larger than an English tokenizer's (a tokenizer-comparison sketch follows this entry).
arXiv Detail & Related papers (2023-10-12T22:44:19Z)
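A small sketch of one way to compare tokenizers on multilingual text (tokens per word, often called fertility); the checkpoint names and sample sentences are placeholders, and this is only a proxy for the cost effects the summary describes, not the paper's evaluation.

```python
from transformers import AutoTokenizer  # Hugging Face Transformers

# Illustrative sketch: measure how many tokens each tokenizer needs per word
# of the same multilingual text. Checkpoint names below are placeholders.
samples = {
    "en": "The tokenizer choice can significantly impact training costs.",
    "de": "Die Wahl des Tokenizers kann die Trainingskosten erheblich beeinflussen.",
    "fr": "Le choix du tokenizer peut fortement influencer les couts d'entrainement.",
}

for name in ("gpt2", "bert-base-multilingual-cased"):
    tok = AutoTokenizer.from_pretrained(name)
    for lang, text in samples.items():
        tokens_per_word = len(tok.tokenize(text)) / len(text.split())
        print(f"{name:32s} {lang}: {tokens_per_word:.2f} tokens per word")
```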
- Analyzing the Mono- and Cross-Lingual Pretraining Dynamics of Multilingual Language Models [73.11488464916668]
This study investigates the dynamics of the multilingual pretraining process.
We probe checkpoints taken from throughout XLM-R pretraining, using a suite of linguistic tasks.
Our analysis shows that the model achieves high in-language performance early on, with lower-level linguistic skills acquired before more complex ones.
arXiv Detail & Related papers (2022-05-24T03:35:00Z)
- How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models [96.32118305166412]
We study nine typologically diverse languages with readily available pretrained monolingual models on five diverse monolingual downstream tasks.
We find that languages which are adequately represented in the multilingual model's vocabulary exhibit negligible performance decreases over their monolingual counterparts.
arXiv Detail & Related papers (2020-12-31T14:11:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.