Ancestor-to-Creole Transfer is Not a Walk in the Park
- URL: http://arxiv.org/abs/2206.04371v1
- Date: Thu, 9 Jun 2022 09:28:10 GMT
- Title: Ancestor-to-Creole Transfer is Not a Walk in the Park
- Authors: Heather Lent, Emanuele Bugliarello, Anders Søgaard
- Abstract summary: We aim to learn language models for Creole languages for which large volumes of data are not readily available.
We find that standard transfer methods do not facilitate ancestry transfer.
Surprisingly, different from other non-Creole languages, a very distinct two-phase pattern emerges for Creoles.
- Score: 9.926231893220061
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We aim to learn language models for Creole languages for which large volumes
of data are not readily available, and therefore explore the potential transfer
from ancestor languages (the 'Ancestry Transfer Hypothesis'). We find that
standard transfer methods do not facilitate ancestry transfer. Surprisingly,
different from other non-Creole languages, a very distinct two-phase pattern
emerges for Creoles: As our training losses plateau, and language models begin
to overfit on their source languages, perplexity on the Creoles drops. We
explore if this compression phase can lead to practically useful language
models (the 'Ancestry Bottleneck Hypothesis'), but also falsify this. Moreover,
we show that Creoles exhibit this two-phase pattern even when training on
random, unrelated languages. Thus Creoles seem to be typological outliers and
we speculate whether there is a link between the two observations.
Related papers
- Molyé: A Corpus-based Approach to Language Contact in Colonial France [10.054303678856536]
The Molyé corpus combines stereotypical representations of language variation in Europe with early attested French-based Creole languages.
It is intended to facilitate future research on the continuity between contact situations in Europe and Creolophone (former) colonies.
arXiv Detail & Related papers (2024-08-08T16:09:40Z)
- Measuring Cross-lingual Transfer in Bytes [9.011910726620538]
We show that models from diverse languages perform similarly to a target language in a cross-lingual setting.
We also found evidence that this transfer is not related to language contamination or language proximity.
Our experiments have opened up new possibilities for measuring how much data represents the language-agnostic representations learned during pretraining.
arXiv Detail & Related papers (2024-04-12T01:44:46Z)
- CreoleVal: Multilingual Multitask Benchmarks for Creoles [46.50887462355172]
We present CreoleVal, a collection of benchmark datasets spanning 8 different NLP tasks.
It is an aggregate of novel development datasets for reading comprehension, relation classification, and machine translation for Creoles.
arXiv Detail & Related papers (2023-10-30T14:24:20Z)
- Transfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese [54.00582760714034]
Cross-lingual NLP transfer can be improved by exploiting data and models of high-resource languages.
We release a new web corpus of Faroese and Faroese datasets for named entity recognition (NER), semantic text similarity (STS) and new language models trained on all Scandinavian languages.
arXiv Detail & Related papers (2023-04-18T08:42:38Z)
- JamPatoisNLI: A Jamaican Patois Natural Language Inference Dataset [7.940548890754674]
JamPatoisNLI provides the first dataset for natural language inference in a creole language, Jamaican Patois.
Many of the most-spoken low-resource languages are creoles.
Our experiments show considerably better results from few-shot learning of JamPatoisNLI than for such unrelated languages.
arXiv Detail & Related papers (2022-12-07T03:07:02Z)
- Same Neurons, Different Languages: Probing Morphosyntax in Multilingual Pre-trained Models [84.86942006830772]
We conjecture that multilingual pre-trained models can derive language-universal abstractions about grammar.
We conduct the first large-scale empirical study over 43 languages and 14 morphosyntactic categories with a state-of-the-art neuron-level probe.
arXiv Detail & Related papers (2022-05-04T12:22:31Z)
- Language Contamination Explains the Cross-lingual Capabilities of English Pretrained Models [79.38278330678965]
We find that common English pretraining corpora contain significant amounts of non-English text.
This leads to hundreds of millions of foreign language tokens in large-scale datasets.
We then demonstrate that even these small percentages of non-English data facilitate cross-lingual transfer for models trained on them.
arXiv Detail & Related papers (2022-04-17T23:56:54Z)
- On Language Models for Creoles [8.577162764242845]
Creole languages such as Nigerian Pidgin English and Haitian Creole are under-resourced and largely ignored in the NLP literature.
What grammatical and lexical features are transferred to the creole is a complex process.
While creoles are generally stable, the prominence of some features may be much stronger with certain demographics or in some linguistic situations.
arXiv Detail & Related papers (2021-09-13T15:51:15Z)
- Constructing a Family Tree of Ten Indo-European Languages with Delexicalized Cross-linguistic Transfer Patterns [57.86480614673034]
We formalize the delexicalized transfer as interpretable tree-to-string and tree-to-tree patterns.
This allows us to quantitatively probe cross-linguistic transfer and extend inquiries of Second Language Acquisition.
arXiv Detail & Related papers (2020-07-17T15:56:54Z)
- Translation Artifacts in Cross-lingual Transfer Learning [51.66536640084888]
We show that machine translation can introduce subtle artifacts that have a notable impact in existing cross-lingual models.
In natural language inference, translating the premise and the hypothesis independently can reduce the lexical overlap between them.
We also improve the state-of-the-art in XNLI for the translate-test and zero-shot approaches by 4.3 and 2.8 points, respectively.
arXiv Detail & Related papers (2020-04-09T17:54:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.