ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality
- URL: http://arxiv.org/abs/2510.22037v1
- Date: Fri, 24 Oct 2025 21:45:22 GMT
- Title: ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality
- Authors: Shayne Longpre, Sneha Kudugunta, Niklas Muennighoff, I-Hung Hsu, Isaac Caswell, Alex Pentland, Sercan Arik, Chen-Yu Lee, Sayna Ebrahimi,
- Abstract summary: We undertake the largest multilingual scaling laws study to date, totaling 774 multilingual training experiments. We introduce the Adaptive Transfer Scaling Law (ATLAS) for both monolingual and multilingual pretraining. Our analyses shed light on multilingual learning dynamics, transfer properties between languages, and the curse of multilinguality.
- Score: 45.16490310398125
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scaling laws research has focused overwhelmingly on English -- yet the most prominent AI models explicitly serve billions of international users. In this work, we undertake the largest multilingual scaling laws study to date, totaling 774 multilingual training experiments, spanning 10M-8B model parameters, 400+ training languages, and 48 evaluation languages. We introduce the Adaptive Transfer Scaling Law (ATLAS) for both monolingual and multilingual pretraining, which improves out-of-sample generalization over existing scaling laws, often by more than 0.3 R^2. Our analyses of the experiments shed light on multilingual learning dynamics, transfer properties between languages, and the curse of multilinguality. First, we derive a cross-lingual transfer matrix, empirically measuring mutual-benefit scores between 38 x 38 = 1444 language pairs. Second, we derive a language-agnostic scaling law that reveals how to optimally scale model size and data when adding languages without sacrificing performance. Third, we identify the computational crossover points for when to pretrain from scratch versus finetune from multilingual checkpoints. We hope these findings provide the scientific foundation for democratizing scaling laws across languages, and enable practitioners to efficiently scale models -- beyond English-first AI.
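The abstract does not give the ATLAS functional form, so the sketch below is only a rough, hypothetical illustration of the kind of fit such a scaling-laws study performs: a generic Chinchilla-style surface L(N, D) = E + A/N^alpha + B/D^beta fitted with SciPy to invented (parameters, tokens, loss) triples. Neither the functional form nor any of the numbers come from the paper.

```python
# Minimal sketch: fit a generic power-law loss surface to training runs.
# Illustrative stand-in only -- NOT the ATLAS law; all data are placeholders.
import numpy as np
from scipy.optimize import curve_fit

def loss_surface(x, E, A, alpha, B, beta):
    """Generic scaling law: L(N, D) = E + A / N^alpha + B / D^beta."""
    N, D = x  # model parameters, training tokens
    return E + A / N**alpha + B / D**beta

# Hypothetical runs: (params, tokens) -> evaluation loss.
N = np.array([1e7, 1e8, 1e9, 8e9, 1e8, 1e9])
D = np.array([2e9, 2e10, 2e11, 4e11, 1e10, 1e11])
L = np.array([4.1, 3.2, 2.6, 2.3, 3.4, 2.8])

popt, _ = curve_fit(
    loss_surface, (N, D), L,
    p0=[2.0, 400.0, 0.3, 400.0, 0.3],
    bounds=([0, 0, 0, 0, 0], [10, 1e6, 1, 1e6, 1]),
)
E, A, alpha, B, beta = popt
print(f"Fitted: E={E:.2f}, A={A:.1f}, alpha={alpha:.2f}, B={B:.1f}, beta={beta:.2f}")

# Goodness of fit; the paper's R^2 comparisons are made on held-out runs.
pred = loss_surface((N, D), *popt)
r2 = 1 - np.sum((L - pred) ** 2) / np.sum((L - L.mean()) ** 2)
print(f"R^2 = {r2:.3f}")
```

In practice one would fit on a subset of runs and report R^2 on held-out model sizes and data scales, which is the out-of-sample comparison the abstract refers to.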
Related papers
- Scaling Laws for Code: Every Programming Language Matters [73.6302896079007]
Code large language models (Code LLMs) are powerful but costly to train. Different programming languages (PLs) have varying impacts during pre-training that significantly affect base model performance. We present the first systematic exploration of scaling laws for multilingual code pre-training.
arXiv Detail & Related papers (2025-12-15T16:07:34Z)
- Revisiting Multilingual Data Mixtures in Language Model Pretraining [20.282622416939997]
We study the impact of different multilingual data mixtures in pretraining large language models. We find that combining English and multilingual data does not necessarily degrade the in-language performance of either group. We do not observe a significant "curse of multilinguality" as the number of training languages increases.
arXiv Detail & Related papers (2025-10-29T20:46:03Z)
- Learning the Topic, Not the Language: How LLMs Classify Online Immigration Discourse Across Languages [0.0]
Large language models (LLMs) are transforming social-science research by enabling scalable, precise analysis. We fine-tune lightweight LLaMA 3.2-3B models on monolingual, bilingual, or multilingual data sets to classify immigration-related tweets. We evaluate whether minimal language-specific fine-tuning enables cross-lingual topic detection and whether adding targeted languages corrects pre-training biases.
arXiv Detail & Related papers (2025-08-08T16:23:24Z)
- OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models [55.63479003621053]
We introduce OWLS, an open-access suite of multilingual speech recognition and translation models. We use OWLS to derive neural scaling laws, showing how final performance can be reliably predicted when scaling. We show how OWLS can be used to power new research directions by discovering emergent abilities in large-scale speech models.
arXiv Detail & Related papers (2025-02-14T18:51:40Z)
- Scaling Laws for Multilingual Language Models [41.6318470003173]
A primary challenge in studying multilingual scaling is the difficulty of analyzing individual language performance due to cross-lingual transfer. We introduce and validate a hypothesis that the test cross-entropy loss for each language family is determined solely by its own sampling ratio. We derive a power-law relationship that links performance with dataset size, model size, and sampling ratios.
arXiv Detail & Related papers (2024-10-15T20:29:38Z)
- Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora. But can these models relate corresponding concepts across languages, i.e., be crosslingual? This study evaluates state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z)
- KBioXLM: A Knowledge-anchored Biomedical Multilingual Pretrained Language Model [37.69464822182714]
Most biomedical pretrained language models are monolingual and cannot handle the growing cross-lingual requirements.
We propose a model called KBioXLM, which transforms the multilingual pretrained model XLM-R into the biomedical domain using a knowledge-anchored approach.
arXiv Detail & Related papers (2023-11-20T07:02:35Z)
- Extrapolating Large Language Models to Non-English by Aligning Languages [109.09051737966178]
Existing large language models show disparate capability across different languages.
In this paper, we empower pre-trained LLMs on non-English languages by building semantic alignment across languages.
arXiv Detail & Related papers (2023-08-09T13:32:06Z)
- Scaling Laws for Multilingual Neural Machine Translation [45.620062316968976]
We study how increases in the model size affect the model performance and investigate the role of the training mixture composition on the scaling behavior.
We find that changing the weightings of the individual language pairs in the training mixture affects only the multiplicative factor of the scaling law.
We leverage our observations to predict the performance of multilingual models trained with any language weighting at any scale.
arXiv Detail & Related papers (2023-02-19T18:43:24Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)