Language Contamination Explains the Cross-lingual Capabilities of English Pretrained Models
- URL: http://arxiv.org/abs/2204.08110v1
- Date: Sun, 17 Apr 2022 23:56:54 GMT
- Title: Language Contamination Explains the Cross-lingual Capabilities of English Pretrained Models
- Authors: Terra Blevins and Luke Zettlemoyer
- Abstract summary: We find that common English pretraining corpora contain significant amounts of non-English text.
This leads to hundreds of millions of foreign language tokens in large-scale datasets.
We then demonstrate that even these small percentages of non-English data facilitate cross-lingual transfer for models trained on them.
- Score: 79.38278330678965
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: English pretrained language models, which make up the backbone of many modern
NLP systems, require huge amounts of unlabeled training data. These models are
generally presented as being trained only on English text but have been found
to transfer surprisingly well to other languages. We investigate this
phenomenon and find that common English pretraining corpora actually contain
significant amounts of non-English text: even when less than 1% of data is not
English (well within the error rate of strong language classifiers), this leads
to hundreds of millions of foreign language tokens in large-scale datasets. We
then demonstrate that even these small percentages of non-English data
facilitate cross-lingual transfer for models trained on them, with target
language performance strongly correlated to the amount of in-language data seen
during pretraining. In light of these findings, we argue that no model is truly
monolingual when pretrained at scale, which should be considered when
evaluating cross-lingual transfer.
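The two quantitative claims above (sub-1% contamination still amounting to hundreds of millions of non-English tokens, and target-language performance tracking the amount of in-language data) can be approximated with off-the-shelf tooling. Below is a minimal sketch, assuming fastText's public lid.176 language-ID model, a plain-text corpus file, and placeholder evaluation scores; none of these choices is claimed to match the paper's exact setup.

```python
# Minimal sketch: estimate non-English contamination in a nominally English
# pretraining corpus and relate per-language token counts to downstream
# target-language scores. lid.176.bin is fastText's public language-ID model;
# the corpus path and the scores dict below are placeholders.
import math
from collections import Counter

import fasttext                    # pip install fasttext
from scipy.stats import spearmanr  # pip install scipy

lid = fasttext.load_model("lid.176.bin")

def count_tokens_by_language(corpus_path: str) -> Counter:
    """Whitespace-token counts per predicted language, classified line by line."""
    counts = Counter()
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            (label,), _ = lid.predict(line)
            counts[label.replace("__label__", "")] += len(line.split())
    return counts

counts = count_tokens_by_language("english_pretraining_corpus.txt")  # placeholder path
total = sum(counts.values())
non_english = total - counts.get("en", 0)
print(f"non-English fraction: {non_english / total:.4%}")
# Back-of-envelope check of the abstract's claim: even 0.5% of a 150B-token
# corpus is 0.005 * 150e9 = 750M tokens, i.e. hundreds of millions of
# foreign-language tokens despite the corpus being ">99% English".

# Correlate (log) in-language token counts with target-language task scores.
target_scores = {"fr": 71.2, "de": 69.8, "sw": 41.5, "ur": 38.0}  # hypothetical numbers
langs = [l for l in target_scores if counts.get(l, 0) > 0]
rho, p = spearmanr([math.log(counts[l]) for l in langs],
                   [target_scores[l] for l in langs])
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```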
Related papers
- PreAlign: Boosting Cross-Lingual Transfer by Early Establishment of Multilingual Alignment [68.20851615263953]
Large language models demonstrate reasonable multilingual abilities, despite predominantly English-centric pretraining.
The spontaneous multilingual alignment in these models is shown to be weak, leading to unsatisfactory cross-lingual transfer and knowledge sharing.
We propose PreAlign, a framework that establishes multilingual alignment prior to language model pretraining.
arXiv Detail & Related papers (2024-07-23T06:59:53Z)
- Self-Translate-Train: Enhancing Cross-Lingual Transfer of Large Language Models via Inherent Capability [31.025371443719404]
Self-Translate-Train is a method that lets a large language model translate its training data into the target language and then fine-tunes the model on its own generated data (a minimal sketch of this procedure appears after this list).
By demonstrating that Self-Translate-Train outperforms zero-shot transfer, we encourage further exploration of better methods to elicit cross-lingual capabilities of LLMs.
arXiv Detail & Related papers (2024-06-29T14:40:23Z)
- Analyzing the Mono- and Cross-Lingual Pretraining Dynamics of Multilingual Language Models [73.11488464916668]
This study investigates the dynamics of the multilingual pretraining process.
We probe checkpoints taken from throughout XLM-R pretraining, using a suite of linguistic tasks.
Our analysis shows that the model achieves high in-language performance early on, with lower-level linguistic skills acquired before more complex ones.
arXiv Detail & Related papers (2022-05-24T03:35:00Z)
- A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African News Translation [25.05948665615943]
We create a new African news corpus covering 16 languages, eight of which are not part of any existing evaluation dataset.
We show that the most effective strategy for transferring both to additional languages and to additional domains is to fine-tune large pre-trained models on small quantities of high-quality translation data.
arXiv Detail & Related papers (2022-05-04T12:11:47Z)
- Language Models are Few-shot Multilingual Learners [66.11011385895195]
We evaluate the multilingual skills of the GPT and T5 models in conducting multi-class classification on non-English languages.
We show that, given a few English examples as context, pre-trained language models can predict not only English test samples but also non-English ones.
arXiv Detail & Related papers (2021-09-16T03:08:22Z)
- Revisiting the Primacy of English in Zero-shot Cross-lingual Transfer [39.360667403003745]
Zero-shot cross-lingual transfer is emerging as a practical solution.
English is the dominant source language for transfer, as reinforced by popular zero-shot benchmarks.
We find that other high-resource languages such as German and Russian often transfer more effectively.
arXiv Detail & Related papers (2021-06-30T16:05:57Z)
- Pre-Training a Language Model Without Human Language [74.11825654535895]
We study how the intrinsic nature of pre-training data contributes to the fine-tuned downstream performance.
We find that models pre-trained on unstructured data beat those trained directly from scratch on downstream tasks.
Surprisingly, we find that pre-training on certain non-human language data yields GLUE performance close to that obtained by pre-training on another non-English language.
arXiv Detail & Related papers (2020-12-22T13:38:06Z)
- When Being Unseen from mBERT is just the Beginning: Handling New Languages With Multilingual Language Models [2.457872341625575]
Transfer learning based on pretraining language models on large amounts of raw data has become the new norm for reaching state-of-the-art performance in NLP.
We show that such models behave in markedly different ways on unseen languages.
arXiv Detail & Related papers (2020-10-24T10:15:03Z)
- Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-centric models brings gains of more than 10 BLEU when translating directly between non-English directions, while performing competitively with the best single systems from WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z)
- Parsing with Multilingual BERT, a Small Corpus, and a Small Treebank [46.626315158735615]
Pretrained multilingual contextual representations have shown great success, but due to the limits of their pretraining data, their benefits do not apply equally to all language varieties.
This presents a challenge for language varieties unfamiliar to these models, whose labeled and unlabeled data is too limited to train a monolingual model effectively.
We propose the use of additional language-specific pretraining and vocabulary augmentation to adapt multilingual models to low-resource settings.
arXiv Detail & Related papers (2020-09-29T16:12:52Z)
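To make the Self-Translate-Train entry above concrete, here is a minimal sketch of its translation step, assuming a generic instruction-capable causal LM loaded via Hugging Face transformers; the checkpoint name, prompt format, and generation settings are illustrative assumptions, not the paper's exact setup.

```python
# Sketch of the Self-Translate-Train idea: the model translates its own English
# training examples into the target language, and those outputs are then used
# as ordinary fine-tuning data. Checkpoint name and prompt are illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumption: any reasonably multilingual LLM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def self_translate(text_en: str, target_lang: str = "German") -> str:
    """Ask the model itself to translate an English training example."""
    prompt = f"Translate the following text into {target_lang}:\n{text_en}\nTranslation:"
    inputs = tok(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    return tok.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# The self-generated translations would then be fine-tuned on with the usual
# causal-LM or task loss (e.g. via transformers.Trainer) instead of relying on
# zero-shot transfer from the English data alone.
```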