Related papers: Massively Multilingual Adaptation of Large Language Models Using Bilingual Translation Data

URL: http://arxiv.org/abs/2506.00469v1
Date: Sat, 31 May 2025 08:37:17 GMT
Title: Massively Multilingual Adaptation of Large Language Models Using Bilingual Translation Data
Authors: Shaoxiong Ji, Zihao Li, Jaakko Paavola, Indraneil Paul, Hengyu Luo, Jörg Tiedemann
Abstract summary: We study the impact of bilingual translation data for massively multilingual language adaptation of the Llama3 family of models to 500 languages. We construct the MaLA bilingual translation corpus, containing data from more than 2,500 language pairs. We develop the EMMA-500 Llama 3 suite of four massively multilingual models.
Score: 11.636375417636904
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper investigates a critical design decision in the practice of massively multilingual continual pre-training -- the inclusion of parallel data. Specifically, we study the impact of bilingual translation data for massively multilingual language adaptation of the Llama3 family of models to 500 languages. To this end, we construct the MaLA bilingual translation corpus, containing data from more than 2,500 language pairs. Subsequently, we develop the EMMA-500 Llama 3 suite of four massively multilingual models -- continually pre-trained from the Llama 3 family of base models extensively on diverse data mixes up to 671B tokens -- and explore the effect of continual pre-training with or without bilingual translation data. Comprehensive evaluation across 7 tasks and 12 benchmarks demonstrates that bilingual data tends to enhance language transfer and performance, particularly for low-resource languages. We open-source the MaLA corpus, EMMA-500 Llama 3 suite artefacts, code, and model generations.

Related papers

Revisiting Multilingual Data Mixtures in Language Model Pretraining [20.282622416939997]
We study the impact of different multilingual data mixtures in pretraining large language models. We find that combining English and multilingual data does not necessarily degrade the in-language performance of either group. We do not observe a significant "curse of multilinguality" as the number of training languages increases.
arXiv Detail & Related papers (2025-10-29T20:46:03Z)
From Unaligned to Aligned: Scaling Multilingual LLMs with Multi-Way Parallel Corpora [85.44082712798553]
We introduce a large-scale, high-quality multi-way parallel corpus, TED2025, based on TED Talks. This dataset spans 113 languages, with up to 50 languages aligned in parallel, ensuring extensive multilingual coverage. Experiments show that models trained on multi-way parallel data consistently outperform those trained on unaligned multilingual data.
arXiv Detail & Related papers (2025-05-20T07:43:45Z)
Multilingual Pretraining Using a Large Corpus Machine-Translated from a Single Source Language [34.54405113575568]
Machine-translated text from a single high-quality source language can contribute significantly to the pretraining of multilingual models. We show that CuatroLLM matches or outperforms state-of-the-art multilingual models trained using closed data. We release our corpus, models, and training pipeline under open licenses at hf.co/britllm/CuatroLLM.
arXiv Detail & Related papers (2024-10-31T14:09:50Z)
EMMA-500: Enhancing Massively Multilingual Adaptation of Large Language Models [50.459861376459656]
EMMA-500 is a large-scale multilingual language model continue-trained on texts across 546 languages. Our results highlight the effectiveness of continual pre-training in expanding large language models' language capacity.
arXiv Detail & Related papers (2024-09-26T14:40:45Z)
When Is Multilinguality a Curse? Language Modeling for 250 High- and Low-Resource Languages [25.52470575274251]
We pre-train over 10,000 monolingual and multilingual language models for over 250 languages. We find that in moderation, adding multilingual data improves low-resource language modeling performance. As dataset sizes increase, adding multilingual data begins to hurt performance for both low-resource and high-resource languages.
arXiv Detail & Related papers (2023-11-15T18:47:42Z)
MADLAD-400: A Multilingual And Document-Level Large Audited Dataset [66.12330208082442]
We introduce MADLAD-400, a manually audited, general domain 3T token monolingual dataset based on CommonCrawl. We discuss the limitations revealed by self-auditing MADLAD-400, and the role data auditing had in the dataset creation process.
arXiv Detail & Related papers (2023-09-09T02:34:01Z)
PolyLM: An Open Source Polyglot Large Language Model [57.64420154135178]
We present PolyLM, a multilingual large language model (LLM) trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B. To enhance its multilingual capabilities, we 1) integrate bilingual data into training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training. Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
arXiv Detail & Related papers (2023-07-12T09:00:37Z)
Bootstrapping Multilingual Semantic Parsers using Large Language Models [28.257114724384806]
The translate-train paradigm of transferring English datasets across multiple languages remains the key ingredient for training task-specific multilingual models. We consider the task of multilingual semantic parsing and demonstrate the effectiveness and flexibility offered by large language models (LLMs) for translating English datasets into several languages via few-shot prompting.
arXiv Detail & Related papers (2022-10-13T19:34:14Z)
Building Machine Translation Systems for the Next Thousand Languages [102.24310122155073]
We describe results in three research domains: building clean, web-mined datasets for 1500+ languages, developing practical MT models for under-served languages, and studying the limitations of evaluation metrics for these languages. We hope that our work provides useful insights to practitioners working towards building MT systems for currently understudied languages, and highlights research directions that can complement the weaknesses of massively multilingual models in data-sparse settings.
arXiv Detail & Related papers (2022-05-09T00:24:13Z)
Multilingual Translation with Extensible Multilingual Pretraining and Finetuning [77.33262578776291]
Previous work has demonstrated that machine translation systems can be created by finetuning on bitext. We show that multilingual translation models can be created through multilingual finetuning. We demonstrate that pretrained models can be extended to incorporate additional languages without loss of performance.
arXiv Detail & Related papers (2020-08-02T05:36:55Z)

This list is automatically generated from the titles and abstracts of the papers on this site.