Related papers: Do Not Worry if You Do Not Have Data: Building Pretrained Language Models Using Translationese

Do Not Worry if You Do Not Have Data: Building Pretrained Language Models Using Translationese

URL: http://arxiv.org/abs/2403.13638v2
Date: Thu, 21 Mar 2024 04:03:59 GMT
Title: Do Not Worry if You Do Not Have Data: Building Pretrained Language Models Using Translationese
Authors: Meet Doshi, Raj Dabre, Pushpak Bhattacharyya,
Abstract summary: Pre-training requires vast amounts of monolingual data, which is mostly unavailable for languages other than English. We take the case of English and Indic languages and translate web-crawled monolingual documents (clean) into the target language. Then, we train language models containing 28M and 85M parameters on this translationese data (synthetic) We show that their performance on downstream natural language understanding and generative tasks is only 3.56% poorer on NLU tasks and 1.51% on NLG tasks than LMs pre-trained on clean data.
Score: 47.45957604683302
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this paper, we explore the utility of Translationese as synthetic data created using machine translation for pre-training language models (LMs). Pre-training requires vast amounts of monolingual data, which is mostly unavailable for languages other than English. Recently, there has been a growing interest in using synthetic data to address this data scarcity. We take the case of English and Indic languages and translate web-crawled monolingual documents (clean) into the target language. Then, we train language models containing 28M and 85M parameters on this translationese data (synthetic). We show that their performance on downstream natural language understanding and generative tasks is only 3.56% poorer on NLU tasks and 1.51% on NLG tasks than LMs pre-trained on clean data. Further, we propose the use of lightweight TinyLMs pre-trained on clean data to filter synthetic data efficiently which significantly improves the performance of our models. We also find that LMs trained on synthetic data strongly benefit from extended pretraining on a tiny fraction (10%) of clean data. We release the data we collected and created as a part of this work, IndicMonoDoc, the largest collection of monolingual document-level corpora, which we hope will help bridge the gap between English and non-English performance for large language models.

Related papers

Get away with less: Need of source side data curation to build parallel corpus for low resource Machine Translation [3.3393607383304253]
We develop framework that screens source sentences to form efficient parallel text.<n>We find that by training mostly on complex sentences from both existing and synthetic datasets, our method significantly improves translation quality.<n>This approach not only reduces MT systems training cost by reducing training data requirement, but also showcases LALITA's utility in data augmentation.
arXiv Detail & Related papers (2026-01-13T15:05:19Z)
Improving Indigenous Language Machine Translation with Synthetic Data and Language-Specific Preprocessing [5.781452568978427]
We augment curated parallel datasets for indigenous languages of the Americas with synthetic sentence pairs generated using a high-capacity multilingual translation model.<n>Experiments on Guarani-Spanish and Quechua-Spanish translation show consistent chrF++ improvements from synthetic data augmentation.<n> diagnostic experiments on Aymara highlight the limitations of generic preprocessing for highly agglutinative languages.
arXiv Detail & Related papers (2026-01-06T16:06:42Z)
Contrastive Decoding for Synthetic Data Generation in Low-Resource Language Modeling [9.380879437204277]
We investigate the benefits of contrastive decoding for generating synthetic corpora.<n>By amplifying the signal from a model that has better performance, we create a synthetic corpus and mix it with the original training data.<n>Our findings show that training on a mixture of synthesized and real data improves performance on the language modeling objective and a range of downstream tasks.
arXiv Detail & Related papers (2025-10-09T14:04:52Z)
Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models [107.24906866038431]
We propose REWIRE, REcycling the Web with guIded REwrite, to enrich low-quality documents so that they could become useful for training.<n>We show that mixing high-quality raw texts and our rewritten texts lead to 1.0, 1.3 and 2.5 percentage points improvement respectively across 22 diverse tasks.
arXiv Detail & Related papers (2025-06-05T07:12:12Z)
Scaling Low-Resource MT via Synthetic Data Generation with LLMs [13.10398947215569]
This study focuses on seven diverse target languages.<n>We construct a document-level synthetic corpus from English Europarl, and extend it via pivoting to 147 additional language pairs.<n>We study its practical application by (i) identifying effective training regimes, (ii) comparing our data with the HPLT dataset, and (iii) testing its utility beyond English-centric MT.
arXiv Detail & Related papers (2025-05-20T14:31:54Z)
Multilingual Language Model Pretraining using Machine-translated Data [33.373858866989536]
We translate FineWeb-Edu, a high-quality English web dataset, into nine languages. We show that TransWebLLM matches or outperforms state-of-the-art multilingual models trained using closed data.
arXiv Detail & Related papers (2025-02-18T19:27:53Z)
Efficient Continual Pre-training of LLMs for Low-resource Languages [45.44796295841526]
We develop a new algorithm to select a subset of texts from a larger corpus.<n>In search of further improvement, we design a new algorithm to select tokens to include in the LLM vocabulary.
arXiv Detail & Related papers (2024-12-13T16:13:35Z)
JAPAGEN: Efficient Few/Zero-shot Learning via Japanese Training Dataset Generation with LLM [2.642698101441705]
Large Language Models (LLMs) offer advantages such as enhanced inference efficiency and reduced costs associated with data collection.<n>In this paper, we address the fundamental research question: Can LLMs serve as proficient training data generators for other language tasks?<n> Specifically, we leverage LLMs to synthesize supervised training data under few-shot and zero-shot learning scenarios.<n>We utilize this synthesized data to train compact models (e.g., BERT)
arXiv Detail & Related papers (2024-12-09T18:27:32Z)
Dictionary Insertion Prompting for Multilingual Reasoning on Multilingual Large Language Models [52.00446751692225]
We present a novel and simple yet effective method called textbfDictionary textbfInsertion textbfPrompting (textbfDIP) When providing a non-English prompt, DIP looks up a word dictionary and inserts words' English counterparts into the prompt for LLMs. It then enables better translation into English and better English model thinking steps which leads to obviously better results.
arXiv Detail & Related papers (2024-11-02T05:10:50Z)
Rephrasing natural text data with different languages and quality levels for Large Language Model pre-training [12.29061850090405]
We build upon previous work by replicating existing results on C4 and extending them with our optimized rephrasing pipeline. Our pipeline leads to increased performance on standard evaluation benchmarks in both the mono- and multilingual setup.
arXiv Detail & Related papers (2024-10-28T07:30:05Z)
Improving Language Models Trained on Translated Data with Continual Pre-Training and Dictionary Learning Analysis [3.16714407449467]
We investigate the role of translation and synthetic data in training language models. We translate TinyStories, a dataset of 2.2M short stories for 3-4 year old children, from English to Arabic using the open NLLB-3B MT model. To rectify these issues, we pre-train the models with a small dataset of synthesized high-quality Arabic stories.
arXiv Detail & Related papers (2024-05-23T07:53:04Z)
Tagengo: A Multilingual Chat Dataset [3.8073142980733]
We present a high quality dataset of more than 70k prompt-response pairs in 74 languages. We use this dataset to train a state-of-the-art open source English LLM to chat multilingually.
arXiv Detail & Related papers (2024-05-21T09:06:36Z)
Cross-lingual Transfer or Machine Translation? On Data Augmentation for Monolingual Semantic Textual Similarity [2.422759879602353]
Cross-lingual transfer of Wikipedia data exhibits improved performance for monolingual STS. We find a superiority of the Wikipedia domain over the NLI domain for these languages, in contrast to prior studies that focused on NLI as training data.
arXiv Detail & Related papers (2024-03-08T12:28:15Z)
CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages [86.90220551111096]
Training datasets for large language models (LLMs) are often not fully disclosed. We present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages.
arXiv Detail & Related papers (2023-09-17T23:49:10Z)
Language Contamination Explains the Cross-lingual Capabilities of English Pretrained Models [79.38278330678965]
We find that common English pretraining corpora contain significant amounts of non-English text. This leads to hundreds of millions of foreign language tokens in large-scale datasets. We then demonstrate that even these small percentages of non-English data facilitate cross-lingual transfer for models trained on them.
arXiv Detail & Related papers (2022-04-17T23:56:54Z)
Multilingual Neural Semantic Parsing for Low-Resourced Languages [1.6244541005112747]
We introduce a new multilingual semantic parsing dataset in English, Italian and Japanese. We show that joint multilingual training with pretrained encoders substantially outperforms our baselines on the TOP dataset. We find that a semantic trained only on English data achieves a zero-shot performance of 44.9% exact-match accuracy on Italian sentences.
arXiv Detail & Related papers (2021-06-07T09:53:02Z)
Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages. We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining. Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly translating between non-English directions while performing competitively to the best single systems of WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z)
Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language. We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks like translation and monolingual tasks like masked language models. Our model achieves an improvement of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 scores over state-of-the-art results.
arXiv Detail & Related papers (2020-10-18T00:21:53Z)
Reusing a Pretrained Language Model on Languages with Limited Corpora for Unsupervised NMT [129.99918589405675]
We present an effective approach that reuses an LM that is pretrained only on the high-resource language. The monolingual LM is fine-tuned on both languages and is then used to initialize a UNMT model. Our approach, RE-LM, outperforms a competitive cross-lingual pretraining model (XLM) in English-Macedonian (En-Mk) and English-Albanian (En-Sq)
arXiv Detail & Related papers (2020-09-16T11:37:10Z)
Language Model Prior for Low-Resource Neural Machine Translation [85.55729693003829]
We propose a novel approach to incorporate a LM as prior in a neural translation model (TM) We add a regularization term, which pushes the output distributions of the TM to be probable under the LM prior. Results on two low-resource machine translation datasets show clear improvements even with limited monolingual data.
arXiv Detail & Related papers (2020-04-30T16:29:56Z)

This list is automatically generated from the titles and abstracts of the papers in this site.