On Language Models for Creoles
- URL: http://arxiv.org/abs/2109.06074v1
- Date: Mon, 13 Sep 2021 15:51:15 GMT
- Title: On Language Models for Creoles
- Authors: Heather Lent, Emanuele Bugliarello, Miryam de Lhoneux, Chen Qiu and Anders Søgaard
- Abstract summary: Creole languages such as Nigerian Pidgin English and Haitian Creole are under-resourced and largely ignored in the NLP literature.
The transfer of grammatical and lexical features to the creole is a complex process.
While creoles are generally stable, the prominence of some features may be much stronger with certain demographics or in some linguistic situations.
- Score: 8.577162764242845
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Creole languages such as Nigerian Pidgin English and Haitian Creole are
under-resourced and largely ignored in the NLP literature. Creoles typically
result from the fusion of a foreign language with multiple local languages, and
the transfer of grammatical and lexical features to the creole is a
complex process. While creoles are generally stable, the prominence of some
features may be much stronger with certain demographics or in some linguistic
situations. This paper makes several contributions: We collect existing corpora
and release models for Haitian Creole, Nigerian Pidgin English, and Singaporean
Colloquial English. We evaluate these models on intrinsic and extrinsic tasks.
Motivated by the above literature, we compare standard language models with
distributionally robust ones and find that, somewhat surprisingly, the standard
language models are superior to the distributionally robust ones. We
investigate whether this is an effect of over-parameterization or relative
distributional stability, and find that the difference persists in the absence
of over-parameterization, and that drift is limited, confirming the relative
stability of creole languages.
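The "distributionally robust" language models compared above are typically trained with a group-DRO-style objective that upweights the worst-performing groups (e.g., demographic or situational subcorpora). A minimal sketch of one common variant using exponential group reweighting; the function name and the `eta` temperature are illustrative, not taken from the paper:

```python
import math

def group_dro_loss(group_losses, eta=1.0):
    """Distributionally robust objective over K groups.

    Illustrative sketch: each group's weight is proportional to
    exp(eta * loss), so groups with higher loss dominate the objective.
    With eta -> 0 this recovers the plain average; with large eta it
    approaches the worst-group (max) loss.
    """
    weights = [math.exp(eta * loss) for loss in group_losses]
    total = sum(weights)
    q = [w / total for w in weights]
    # Weighted combination of per-group losses.
    return sum(qg * lg for qg, lg in zip(q, group_losses))
```

Under this objective, the robust loss always lies between the mean and the maximum of the per-group losses, which is why a standard model outperforming a robust one (as the paper reports) suggests the group distributions are relatively stable.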
Related papers
- CreoleVal: Multilingual Multitask Benchmarks for Creoles [46.50887462355172]
We present CreoleVal, a collection of benchmark datasets spanning 8 different NLP tasks.
It is an aggregate of novel development datasets for reading comprehension, relation classification, and machine translation for Creoles.
arXiv Detail & Related papers (2023-10-30T14:24:20Z) - NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - JamPatoisNLI: A Jamaican Patois Natural Language Inference Dataset [7.940548890754674]
JamPatoisNLI provides the first dataset for natural language inference in a creole language, Jamaican Patois.
Many of the most-spoken low-resource languages are creoles.
Our experiments show considerably better results from few-shot learning on JamPatoisNLI than on datasets for unrelated languages.
arXiv Detail & Related papers (2022-12-07T03:07:02Z) - Transparency Helps Reveal When Language Models Learn Meaning [71.96920839263457]
Our systematic experiments with synthetic data reveal that, with languages where all expressions have context-independent denotations, both autoregressive and masked language models learn to emulate semantic relations between expressions.
Turning to natural language, our experiments with a specific phenomenon -- referential opacity -- add to the growing body of evidence that current language models do not represent natural language semantics well.
arXiv Detail & Related papers (2022-10-14T02:35:19Z) - Ancestor-to-Creole Transfer is Not a Walk in the Park [9.926231893220061]
We aim to learn language models for Creole languages for which large volumes of data are not readily available.
We find that standard transfer methods do not facilitate ancestry transfer.
Surprisingly, unlike non-Creole languages, a very distinct two-phase pattern emerges for Creoles.
arXiv Detail & Related papers (2022-06-09T09:28:10Z) - What a Creole Wants, What a Creole Needs [1.985426476051888]
We consider a group of low-resource languages, Creole languages. Creoles are both largely absent from the NLP literature, and also often ignored by society at large due to stigma.
We demonstrate, through conversations with Creole experts and surveys of Creole-speaking communities, how the things needed from language technology can change dramatically from one language to another.
arXiv Detail & Related papers (2022-06-01T12:22:34Z) - Do Multilingual Language Models Capture Differing Moral Norms? [71.52261949766101]
Massively multilingual sentence representations are trained on large corpora of uncurated data.
This may cause the models to absorb cultural values, including moral judgments, from the high-resource languages.
The lack of data in certain languages can also lead the models to develop arbitrary and thus potentially harmful beliefs.
arXiv Detail & Related papers (2022-03-18T12:26:37Z) - Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z) - Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages.
We infer this distribution from a sample of typologically diverse training languages.
We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z) - Neural Polysynthetic Language Modelling [15.257624461339867]
In high-resource languages, a common approach is to treat morphologically-distinct variants of a common root as completely independent word types.
This assumes that there are limited inflections per root, and that the majority will appear in a sufficiently large corpus.
We examine the current state-of-the-art in language modelling, machine translation, and text prediction for four polysynthetic languages.
arXiv Detail & Related papers (2020-05-11T22:57:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.