JamPatoisNLI: A Jamaican Patois Natural Language Inference Dataset
- URL: http://arxiv.org/abs/2212.03419v1
- Date: Wed, 7 Dec 2022 03:07:02 GMT
- Title: JamPatoisNLI: A Jamaican Patois Natural Language Inference Dataset
- Authors: Ruth-Ann Armstrong, John Hewitt and Christopher Manning
- Abstract summary: JamPatoisNLI provides the first dataset for natural language inference in a creole language, Jamaican Patois.
Many of the most-spoken low-resource languages are creoles.
Our experiments show considerably better results from few-shot learning of JamPatoisNLI than for low-resource languages unrelated to the models' pretraining languages.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: JamPatoisNLI provides the first dataset for natural language inference in a creole language, Jamaican Patois. Many of the most-spoken low-resource languages are creoles. These languages commonly have a lexicon derived from a major world language and a distinctive grammar reflecting the languages of the original speakers and the process of language birth by creolization. This gives them a distinctive place in exploring the effectiveness of transfer from large monolingual or multilingual pretrained models. While our work, along with previous work, shows that transfer from these models to low-resource languages that are unrelated to languages in their training set is not very effective, we would expect stronger results from transfer to creoles. Indeed, our experiments show considerably better results from few-shot learning of JamPatoisNLI than for such unrelated languages, and help us begin to understand how the unique relationship between creoles and their high-resource base languages affects cross-lingual transfer. JamPatoisNLI, which consists of naturally-occurring premises and expert-written hypotheses, is a step towards steering research into a traditionally underserved language and a useful benchmark for understanding cross-lingual NLP.
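As a concrete picture of the few-shot transfer setup the abstract describes, the sketch below fine-tunes a pretrained multilingual encoder on a handful of premise-hypothesis pairs. This is an illustration under assumptions, not the authors' code: the label set, placeholder examples, model choice, and hyperparameters are all illustrative.

```python
# Minimal few-shot NLI fine-tuning sketch. Assumptions: a 3-way label set
# and placeholder examples; real data would come from JamPatoisNLI splits.
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = {"entailment": 0, "neutral": 1, "contradiction": 2}  # assumed label set
MODEL = "xlm-roberta-base"  # a standard multilingual baseline, chosen for illustration

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=3)

# Few-shot training triples: (premise, hypothesis, label).
# Placeholders below; actual premises are naturally-occurring Patois text.
few_shot = [
    ("<Patois premise>", "<hypothesis>", "entailment"),
    ("<Patois premise>", "<hypothesis>", "contradiction"),
]

optimizer = AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(10):  # several passes over the tiny training set
    for premise, hypothesis, label in few_shot:
        batch = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
        loss = model(**batch, labels=torch.tensor([LABELS[label]])).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```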
Related papers
- Transfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese [54.00582760714034]
Cross-lingual NLP transfer can be improved by exploiting data and models of high-resource languages.
We release a new web corpus of Faroese, Faroese datasets for named entity recognition (NER) and semantic text similarity (STS), and new language models trained on all Scandinavian languages.
arXiv Detail & Related papers (2023-04-18T08:42:38Z)
- Language Chameleon: Transformation analysis between languages using Cross-lingual Post-training based on Pre-trained language models [4.731313022026271]
In this study, we focus on a single low-resource language and perform extensive evaluation and probing experiments using cross-lingual post-training (XPT).
Results show that XPT not only outperforms or performs on par with monolingual models trained with orders of magnitude more data, but is also highly efficient in the transfer process.
arXiv Detail & Related papers (2022-09-14T05:20:52Z)
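The exact XPT recipe is more involved than this, but its core idea, continuing to train an already-pretrained masked language model on raw target-language text before task fine-tuning, can be sketched as follows. The corpus path is a placeholder, not part of the paper.

```python
# Generic continued-pretraining sketch (not the paper's exact XPT recipe):
# keep training a pretrained masked LM on raw text in the target language.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# Placeholder path: a plain-text corpus in the target low-resource language.
raw = load_dataset("text", data_files={"train": "target_language_corpus.txt"})
tokenized = raw["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens, the standard masked-LM objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="xpt-sketch", num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```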
- Phylogeny-Inspired Adaptation of Multilingual Models to New Languages [43.62238334380897]
We show how we can use language phylogenetic information to improve cross-lingual transfer by leveraging closely related languages.
We perform adapter-based training on languages from diverse language families (Germanic, Uralic, Tupian, Uto-Aztecan) and evaluate on both syntactic and semantic tasks.
arXiv Detail & Related papers (2022-05-19T15:49:19Z)
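For readers unfamiliar with the adapter-based training mentioned in the phylogeny paper above: adapters are small bottleneck layers inserted into a frozen pretrained network, so only a tiny fraction of parameters is trained per language or task. A minimal PyTorch sketch, with illustrative dimensions:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: project down, apply a nonlinearity, project back
    up, and add a residual connection. Only these weights are trained; the
    surrounding pretrained transformer stays frozen."""
    def __init__(self, hidden_size: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

# Usage: wrap each transformer layer's output, e.g.
# hidden = adapter(transformer_layer(hidden))
```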
- Linking Emergent and Natural Languages via Corpus Transfer [98.98724497178247]
We propose a novel way to establish a link between emergent languages and natural languages via corpus transfer.
Our approach showcases non-trivial transfer benefits for two different tasks -- language modeling and image captioning.
We also introduce a novel metric to predict the transferability of an emergent language by translating emergent messages to natural language captions grounded on the same images.
arXiv Detail & Related papers (2022-03-24T21:24:54Z)
- Cross-Lingual Ability of Multilingual Masked Language Models: A Study of Language Structure [54.01613740115601]
We study three language properties: constituent order, composition and word co-occurrence.
Our main conclusion is that the contributions of constituent order and word co-occurrence are limited, while composition is more crucial to the success of cross-lingual transfer.
arXiv Detail & Related papers (2022-03-16T07:09:35Z)
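The study's own protocol differs, but the standard way to probe how much a property like word order contributes is ablation: remove the property from the input and measure the change in transfer performance. A toy sketch of an order-destroying perturbation that preserves word co-occurrence within each sentence:

```python
import random

def shuffle_word_order(sentence: str, seed: int = 0) -> str:
    """Destroy word order while keeping the bag of words (and hence
    within-sentence word co-occurrence statistics) intact."""
    words = sentence.split()
    rng = random.Random(seed)
    rng.shuffle(words)
    return " ".join(words)

# Comparing a model's transfer performance on original vs. shuffled
# corpora isolates how much it relies on order information.
print(shuffle_word_order("the quick brown fox jumps over the lazy dog"))
```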
- Role of Language Relatedness in Multilingual Fine-tuning of Language Models: A Case Study in Indo-Aryan Languages [34.79533646549939]
We explore the impact of leveraging the relatedness of languages that belong to the same family in NLP models using multilingual fine-tuning.
Low-resource languages, such as Oriya and Punjabi, are found to be the largest beneficiaries of multilingual fine-tuning.
arXiv Detail & Related papers (2021-09-22T06:37:39Z)
- On Language Models for Creoles [8.577162764242845]
Creole languages such as Nigerian Pidgin English and Haitian Creole are under-resourced and largely ignored in the NLP literature.
Which grammatical and lexical features are transferred to a creole is the outcome of a complex process.
While creoles are generally stable, the prominence of some features may be much stronger with certain demographics or in some linguistic situations.
arXiv Detail & Related papers (2021-09-13T15:51:15Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representations from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks, and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
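The grouping step described above can be approximated with off-the-shelf tooling: represent each language as a vector (e.g., mean-pooled encoder states over text in that language) and cluster the vectors, with each cluster playing the role of a representation sprachbund. A sketch with random stand-in vectors:

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder: one vector per language, e.g. mean-pooled encoder states
# over a sample of that language's text (values here are random stand-ins).
languages = ["en", "de", "hi", "ur", "fi", "et"]
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(languages), 768))

# Each cluster plays the role of a "representation sprachbund".
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)
for lang, group in zip(languages, kmeans.labels_):
    print(lang, "-> group", group)
```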
- Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages.
We infer this inductive bias, as a distribution over neural weights, from a sample of typologically diverse training languages.
We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
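For context on the COPA format that XCOPA extends to 11 languages: each instance pairs a premise with two alternatives and asks which is the more plausible cause or effect. A sketch of the instance structure and a simple accuracy computation; the field names follow the original English COPA and should be treated as assumptions about the release format:

```python
from dataclasses import dataclass

@dataclass
class CopaInstance:
    premise: str
    choice1: str
    choice2: str
    question: str  # "cause" or "effect"
    label: int     # 0 -> choice1, 1 -> choice2

# Illustrative English-style example; XCOPA itself covers 11 other languages.
example = CopaInstance(
    premise="The man turned on the faucet.",
    choice1="The toilet filled with water.",
    choice2="Water flowed from the spout.",
    question="effect",
    label=1,
)

def accuracy(predictions, instances):
    """Fraction of instances where the model picked the right alternative."""
    correct = sum(p == inst.label for p, inst in zip(predictions, instances))
    return correct / len(instances)
```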