Crosslingual Structural Priming and the Pre-Training Dynamics of
Bilingual Language Models
- URL: http://arxiv.org/abs/2310.07929v1
- Date: Wed, 11 Oct 2023 22:57:03 GMT
- Title: Crosslingual Structural Priming and the Pre-Training Dynamics of
Bilingual Language Models
- Authors: Catherine Arnett, Tyler A. Chang, James A. Michaelov, Benjamin K.
Bergen
- Abstract summary: We use structural priming to test for abstract grammatical representations with causal effects on model outputs.
We extend the approach to a Dutch-English bilingual setting, and we evaluate a Dutch-English language model during pre-training.
We find that crosslingual structural priming effects emerge early after exposure to the second language, with less than 1M tokens of data in that language.
- Score: 6.845954748361076
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Do multilingual language models share abstract grammatical representations
across languages, and if so, when do these develop? Following Sinclair et al.
(2022), we use structural priming to test for abstract grammatical
representations with causal effects on model outputs. We extend the approach to
a Dutch-English bilingual setting, and we evaluate a Dutch-English language
model during pre-training. We find that crosslingual structural priming effects
emerge early after exposure to the second language, with less than 1M tokens of
data in that language. We discuss implications for data contamination,
low-resource transfer, and how abstract grammatical representations emerge in
multilingual models.
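As a rough illustration of the structural priming paradigm described in the abstract, the sketch below compares the log-probability a causal language model assigns to a target sentence after a structurally congruent versus an incongruent prime. It is a minimal sketch only: GPT-2 stands in for the paper's Dutch-English model, and the dative-alternation stimuli are illustrative examples rather than the paper's materials.

```python
# Minimal sketch of a crosslingual structural priming measurement.
# Assumptions: GPT-2 is a stand-in for the paper's Dutch-English model,
# and the prime/target sentences are illustrative, not the paper's stimuli.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def target_logprob(prime: str, target: str) -> float:
    """Summed log-probability of the target tokens given the prime as context."""
    prime_ids = tokenizer(prime, return_tensors="pt").input_ids
    target_ids = tokenizer(" " + target, return_tensors="pt").input_ids
    input_ids = torch.cat([prime_ids, target_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Each position's logits predict the *next* token.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    start = prime_ids.shape[1] - 1  # position predicting the first target token
    target_positions = input_ids[0, prime_ids.shape[1]:].unsqueeze(1)
    return log_probs[start:].gather(1, target_positions).sum().item()

# Illustrative dative-alternation stimuli: Dutch primes, English target.
prime_po = "De vrouw gaf het boek aan de man."      # prepositional-object (PO) prime
prime_do = "De vrouw gaf de man het boek."          # double-object (DO) prime
target_po = "The boy sent a letter to his friend."  # PO target

# A positive difference indicates a crosslingual priming effect for PO targets.
effect = target_logprob(prime_po, target_po) - target_logprob(prime_do, target_po)
print(f"PO-target priming effect (log-prob difference): {effect:.3f}")
```

In practice, such comparisons are aggregated over many stimulus pairs and over both structural alternants before any priming effect is claimed.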
Related papers
- Exploring syntactic information in sentence embeddings through multilingual subject-verb agreement [1.4335183427838039]
We take the approach of developing curated synthetic data on a large scale, with specific properties.
We use a new multiple-choice task and datasets, Blackbird Language Matrices, to focus on a specific grammatical structural phenomenon.
We show that despite having been trained on multilingual texts in a consistent manner, multilingual pretrained language models have language-specific differences.
arXiv Detail & Related papers (2024-09-10T14:58:55Z)
- Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that multilingual models trained on more data outperform monolingual ones, but when the amount of data is kept fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z)
- Analyzing the Mono- and Cross-Lingual Pretraining Dynamics of Multilingual Language Models [73.11488464916668]
This study investigates the dynamics of the multilingual pretraining process.
We probe checkpoints taken throughout XLM-R pretraining using a suite of linguistic tasks.
Our analysis shows that the model achieves high in-language performance early on, with lower-level linguistic skills acquired before more complex ones.
arXiv Detail & Related papers (2022-05-24T03:35:00Z)
- Language Contamination Explains the Cross-lingual Capabilities of English Pretrained Models [79.38278330678965]
We find that common English pretraining corpora contain significant amounts of non-English text.
This leads to hundreds of millions of foreign language tokens in large-scale datasets.
We then demonstrate that even these small percentages of non-English data facilitate cross-lingual transfer for models trained on them.
arXiv Detail & Related papers (2022-04-17T23:56:54Z)
- Cross-lingual Transfer of Monolingual Models [2.332247755275824]
We introduce a cross-lingual transfer method for monolingual models based on domain adaptation.
We study the effects of such transfer from four different languages to English.
arXiv Detail & Related papers (2021-09-15T15:00:53Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representations from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
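As a rough sketch of the clustering idea in the entry above, the snippet below derives one vector per language from a multilingual encoder and groups the languages by similarity. XLM-R, the tiny sentence samples, and the cluster count are illustrative assumptions, not the paper's setup.

```python
# Rough sketch: derive a representation per language from a multilingual
# encoder and cluster languages by similarity. Model choice, sample sentences,
# and number of clusters are illustrative assumptions.
import torch
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")
model.eval()

# Tiny illustrative samples; in practice one would average over many sentences.
samples = {
    "en": ["The cat sleeps on the sofa.", "She reads a book every evening."],
    "nl": ["De kat slaapt op de bank.", "Zij leest elke avond een boek."],
    "de": ["Die Katze schläft auf dem Sofa.", "Sie liest jeden Abend ein Buch."],
    "es": ["El gato duerme en el sofá.", "Ella lee un libro cada noche."],
}

def language_vector(sentences):
    """Mean-pooled encoder representation, averaged over the sample sentences."""
    vecs = []
    for sent in sentences:
        inputs = tokenizer(sent, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
        vecs.append(hidden.mean(dim=1).squeeze(0))
    return torch.stack(vecs).mean(dim=0).numpy()

langs = list(samples)
matrix = [language_vector(samples[lang]) for lang in langs]
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(matrix)
for lang, label in zip(langs, labels):
    print(lang, "-> cluster", label)
```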
- Improving the Lexical Ability of Pretrained Language Models for Unsupervised Neural Machine Translation [127.81351683335143]
Cross-lingual pretraining requires models to align the lexical- and high-level representations of the two languages.
Previous research has shown that poor performance in low-resource and distant language pairs arises because these representations are not sufficiently aligned.
In this paper, we enhance the bilingual masked language model pretraining with lexical-level information by using type-level cross-lingual subword embeddings.
arXiv Detail & Related papers (2021-03-18T21:17:58Z)
- How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models [96.32118305166412]
We study nine typologically diverse languages with readily available pretrained monolingual models on five diverse monolingual downstream tasks.
We find that languages which are adequately represented in the multilingual model's vocabulary exhibit negligible performance decreases over their monolingual counterparts.
arXiv Detail & Related papers (2020-12-31T14:11:00Z)
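One simple way to quantify how well a language is represented in a multilingual vocabulary, in the spirit of the entry above, is subword fertility: the average number of subword tokens per word. The sketch below compares a multilingual tokenizer with an assumed monolingual Dutch one on an illustrative sentence.

```python
# Rough sketch of subword fertility (subwords per whitespace word) for a
# multilingual vs. a monolingual tokenizer. The sentence and the monolingual
# Dutch model name are illustrative assumptions.
from transformers import AutoTokenizer

def fertility(tokenizer, text: str) -> float:
    """Average number of subword tokens per whitespace-separated word."""
    return len(tokenizer.tokenize(text)) / len(text.split())

text = "De vrouw gaf het boek aan de man in de bibliotheek."  # illustrative Dutch sentence
multilingual = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
monolingual = AutoTokenizer.from_pretrained("GroNLP/bert-base-dutch-cased")  # assumed Dutch BERT

print("multilingual fertility:", round(fertility(multilingual, text), 2))
print("monolingual fertility:", round(fertility(monolingual, text), 2))
```

Higher fertility under the multilingual tokenizer suggests the language is split into more, less meaningful pieces, which this line of work links to weaker monolingual performance.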
- Multilingual AMR-to-Text Generation [22.842874899794996]
We create multilingual AMR-to-text models that generate in twenty-one different languages.
For eighteen languages, based on automatic metrics, our multilingual models surpass baselines that generate into a single language.
We analyse the ability of our multilingual models to accurately capture morphology and word order using human evaluation, and find that native speakers judge our generations to be fluent.
arXiv Detail & Related papers (2020-11-10T22:47:14Z)
- Parsing with Multilingual BERT, a Small Corpus, and a Small Treebank [46.626315158735615]
Pretrained multilingual contextual representations have shown great success, but due to the limits of their pretraining data, their benefits do not apply equally to all language varieties.
This presents a challenge for language varieties unfamiliar to these models, whose labeled and unlabeled data is too limited to train a monolingual model effectively.
We propose the use of additional language-specific pretraining and vocabulary augmentation to adapt multilingual models to low-resource settings.
arXiv Detail & Related papers (2020-09-29T16:12:52Z)
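The entry above proposes vocabulary augmentation plus additional language-specific pretraining to adapt a multilingual model to a low-resource variety. The sketch below shows only the vocabulary-augmentation step with Hugging Face Transformers; the token list is a hypothetical placeholder, and the new embeddings would subsequently be trained by continued masked language modeling on unlabeled target-variety text.

```python
# Minimal sketch of vocabulary augmentation before language-specific pretraining.
# Assumptions: mBERT as the base model; the added tokens are hypothetical
# placeholders, not taken from the paper.
from transformers import AutoModelForMaskedLM, AutoTokenizer

base_model = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForMaskedLM.from_pretrained(base_model)

# Hypothetical whole-word tokens frequent in the target variety but absent from
# the pretrained vocabulary; in practice these would be mined from unlabeled text.
new_tokens = ["placeholder_word_a", "placeholder_word_b"]
num_added = tokenizer.add_tokens(new_tokens)

# Give the new tokens trainable embedding rows; they are then learned during
# additional masked-LM pretraining on target-variety text.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
```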