Related papers: Multilingual Pretraining Using a Large Corpus Machine-Translated from a Single Source Language

Multilingual Pretraining Using a Large Corpus Machine-Translated from a Single Source Language

URL: http://arxiv.org/abs/2410.23956v2
Date: Wed, 06 Nov 2024 01:35:36 GMT
Title: Multilingual Pretraining Using a Large Corpus Machine-Translated from a Single Source Language
Authors: Jiayi Wang, Yao Lu, Maurice Weber, Max Ryabinin, Yihong Chen, Raphael Tang, Pontus Stenetorp,
Abstract summary: Machine-translated text from a single high-quality source language can contribute significantly to the pretraining of multilingual models. We show that CuatroLLM matches or outperforms state-of-the-art multilingual models trained using closed data. We release our corpus, models, and training pipeline under open licenses at hf.co/britllm/CuatroLLM.
Score: 34.54405113575568
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: English, as a very high-resource language, enables the pretraining of high-quality large language models (LLMs). The same cannot be said for most other languages, as leading LLMs still underperform for non-English languages, likely due to a gap in the quality and diversity of the available multilingual pretraining corpora. In this work, we find that machine-translated text from a single high-quality source language can contribute significantly to the pretraining of multilingual LLMs. We translate FineWeb-Edu, a high-quality English web dataset, into French, German, and Spanish, resulting in a final 300B-token dataset, which we call TransWeb-Edu, and pretrain a 1.3B-parameter model, CuatroLLM, from scratch on this dataset. Across five non-English reasoning tasks, we show that CuatroLLM matches or outperforms state-of-the-art multilingual models trained using closed data, such as Llama3.2 and Gemma2, despite using an order of magnitude less data, such as about 6% of the tokens used for Llama3.2's training. We further demonstrate that with additional domain-specific pretraining, amounting to less than 1% of TransWeb-Edu, CuatroLLM surpasses the state of the art in multilingual reasoning. To promote reproducibility, we release our corpus, models, and training pipeline under open licenses at hf.co/britllm/CuatroLLM.

Related papers

Multilingual Language Model Pretraining using Machine-translated Data [33.373858866989536]
We translate FineWeb-Edu, a high-quality English web dataset, into nine languages. We show that TransWebLLM matches or outperforms state-of-the-art multilingual models trained using closed data.
arXiv Detail & Related papers (2025-02-18T19:27:53Z)
EMMA-500: Enhancing Massively Multilingual Adaptation of Large Language Models [50.459861376459656]
EMMA-500 is a large-scale multilingual language model continue-trained on texts across 546 languages. Our results highlight the effectiveness of continual pre-training in expanding large language models' language capacity.
arXiv Detail & Related papers (2024-09-26T14:40:45Z)
Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora. But can these models relate corresponding concepts across languages, effectively being crosslingual? This study evaluates six state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z)
PolyLM: An Open Source Polyglot Large Language Model [57.64420154135178]
We present PolyLM, a multilingual large language model (LLMs) trained on 640 billion (B) tokens, avaliable in two model sizes: 1.7B and 13B. To enhance its multilingual capabilities, we 1) integrate bilingual data into training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training. Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
arXiv Detail & Related papers (2023-07-12T09:00:37Z)
Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally. Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
Languages You Know Influence Those You Learn: Impact of Language Characteristics on Multi-Lingual Text-to-Text Transfer [4.554080966463776]
Multi-lingual language models (LM) have been remarkably successful in enabling natural language tasks in low-resource languages. We try to better understand how such models, specifically mT5, transfer *any* linguistic and semantic knowledge across languages. A key finding of this work is that similarity of syntax, morphology and phonology are good predictors of cross-lingual transfer.
arXiv Detail & Related papers (2022-12-04T07:22:21Z)
Multilingual Multimodal Learning with Machine Translated Text [27.7207234512674]
We investigate whether machine translating English multimodal data can be an effective proxy for the lack of readily available multilingual data. We propose two metrics for automatically removing such translations from the resulting datasets. In experiments on five tasks across 20 languages in the IGLUE benchmark, we show that translated data can provide a useful signal for multilingual multimodal learning.
arXiv Detail & Related papers (2022-10-24T11:41:20Z)
Bootstrapping Multilingual Semantic Parsers using Large Language Models [28.257114724384806]
translate-train paradigm of transferring English datasets across multiple languages remains to be the key ingredient for training task-specific multilingual models. We consider the task of multilingual semantic parsing and demonstrate the effectiveness and flexibility offered by large language models (LLMs) for translating English datasets into several languages via few-shot prompting.
arXiv Detail & Related papers (2022-10-13T19:34:14Z)
Generalizing Multimodal Pre-training into Multilingual via Language Acquisition [54.69707237195554]
English-based Vision-Language Pre-training has achieved great success in various downstream tasks. Some efforts have been taken to generalize this success to non-English languages through Multilingual Vision-Language Pre-training. We propose a textbfMultitextbfLingual textbfAcquisition (MLA) framework that can easily generalize a monolingual Vision-Language Pre-training model into multilingual.
arXiv Detail & Related papers (2022-05-29T08:53:22Z)
Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages. We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining. Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly translating between non-English directions while performing competitively to the best single systems of WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z)
Multilingual Translation with Extensible Multilingual Pretraining and Finetuning [77.33262578776291]
Previous work has demonstrated that machine translation systems can be created by finetuning on bitext. We show that multilingual translation models can be created through multilingual finetuning. We demonstrate that pretrained models can be extended to incorporate additional languages without loss of performance.
arXiv Detail & Related papers (2020-08-02T05:36:55Z)

This list is automatically generated from the titles and abstracts of the papers in this site.