Related papers: Guylingo: The Republic of Guyana Creole Corpora

URL: http://arxiv.org/abs/2405.03832v3
Date: Tue, 2 Jul 2024 21:23:32 GMT
Title: Guylingo: The Republic of Guyana Creole Corpora
Authors: Christopher Clarke, Roland Daynauth, Charlene Wilkinson, Hubert Devonish, Jason Mars
Abstract summary: We present a comprehensive corpus designed for advancing NLP research in the domain of Creolese (Guyanese English-lexicon Creole). We first outline our framework for gathering and digitizing this diverse corpus, inclusive of colloquial expressions, idioms, and regional variations in a low-resource language. We then demonstrate the challenges of training and evaluating NLP models for machine translation in Creole.
Score: 6.582021376649199
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While major languages often enjoy substantial attention and resources, the linguistic diversity across the globe encompasses a multitude of smaller, indigenous, and regional languages that lack the same level of computational support. One such region is the Caribbean. While commonly labeled as "English speaking", the ex-British Caribbean region consists of a myriad of Creole languages thriving alongside English. In this paper, we present Guylingo: a comprehensive corpus designed for advancing NLP research in the domain of Creolese (Guyanese English-lexicon Creole), the most widely spoken language in the culturally rich nation of Guyana. We first outline our framework for gathering and digitizing this diverse corpus, inclusive of colloquial expressions, idioms, and regional variations in a low-resource language. We then demonstrate the challenges of training and evaluating NLP models for machine translation in Creole. Lastly, we discuss the unique opportunities presented by recent NLP advancements for accelerating the formal adoption of Creole languages as official languages in the Caribbean.

Related papers

Is It Navajo? Accurate Language Detection in Endangered Athabaskan Languages [34.78841410279943]
Endangered languages, such as Navajo, are significantly underrepresented in contemporary language technologies. This study evaluates Google's Language Identification (LangID) tool, which does not currently support any Native American languages.
arXiv Detail & Related papers (2025-01-27T04:43:18Z)
Indigenous Languages Spoken in Argentina: A Survey of NLP and Speech Resources [45.07333085270152]
Argentina has a large yet little-known Indigenous linguistic diversity, encompassing at least 40 different languages. We present a systematization of the Indigenous languages spoken in Argentina, classifying them into seven language families. For each one, we present an estimation of the national Indigenous population size, based on the most recent Argentinian census.
arXiv Detail & Related papers (2025-01-17T03:47:19Z)
Molyé: A Corpus-based Approach to Language Contact in Colonial France [10.054303678856536]
The Molyé corpus combines stereotypical representations of language variation in Europe with early attested French-based Creole languages. It is intended to facilitate future research on the continuity between contact situations in Europe and Creolophone (former) colonies.
arXiv Detail & Related papers (2024-08-08T16:09:40Z)
SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages [77.75535024869224]
We present SeaLLMs 3, the latest iteration of the SeaLLMs model family, tailored for Southeast Asian languages. SeaLLMs 3 aims to bridge this gap by covering a comprehensive range of languages spoken in this region, including English, Chinese, Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese. Our model excels in tasks such as world knowledge, mathematical reasoning, translation, and instruction following, achieving state-of-the-art performance among similarly sized models.
arXiv Detail & Related papers (2024-07-29T03:26:22Z)
Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects [72.18753241750964]
Yorùbá is an African language with roughly 47 million speakers. Recent efforts to develop NLP technologies for African languages have focused on their standard dialects. We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus.
arXiv Detail & Related papers (2024-06-27T22:38:04Z)
CreoleVal: Multilingual Multitask Benchmarks for Creoles [46.50887462355172]
We present CreoleVal, a collection of benchmark datasets spanning 8 different NLP tasks. It is an aggregate of novel development datasets for reading comprehension, relation classification, and machine translation for Creoles.
arXiv Detail & Related papers (2023-10-30T14:24:20Z)
NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
Multi-lingual and Multi-cultural Figurative Language Understanding [69.47641938200817]
Figurative language permeates human communication, but is relatively understudied in NLP. We create a dataset for seven diverse languages associated with a variety of cultures: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili and Yoruba. Our dataset reveals that each language relies on cultural and regional concepts for figurative expressions, with the highest overlap between languages originating from the same region. All languages exhibit a significant deficiency compared to English, with variations in performance reflecting the availability of pre-training and fine-tuning data.
arXiv Detail & Related papers (2023-05-25T15:30:31Z)
JamPatoisNLI: A Jamaican Patois Natural Language Inference Dataset [7.940548890754674]
JamPatoisNLI provides the first dataset for natural language inference in a creole language, Jamaican Patois. Many of the most-spoken low-resource languages are creoles. Our experiments show considerably better results from few-shot learning of JamPatoisNLI than for such unrelated languages.
arXiv Detail & Related papers (2022-12-07T03:07:02Z)
What a Creole Wants, What a Creole Needs [1.985426476051888]
We consider a group of low-resource languages, Creole languages. Creoles are both largely absent from the NLP literature, and also often ignored by society at large due to stigma. We demonstrate, through conversations with Creole experts and surveys of Creole-speaking communities, how the things needed from language technology can change dramatically from one language to another.
arXiv Detail & Related papers (2022-06-01T12:22:34Z)
On Language Models for Creoles [8.577162764242845]
Creole languages such as Nigerian Pidgin English and Haitian Creole are under-resourced and largely ignored in the NLP literature. What grammatical and lexical features are transferred to the creole is a complex process. While creoles are generally stable, the prominence of some features may be much stronger with certain demographics or in some linguistic situations.
arXiv Detail & Related papers (2021-09-13T15:51:15Z)
Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis. We cluster all the target languages into multiple groups and name each group as a representation sprachbund. Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)

This list is automatically generated from the titles and abstracts of the papers on this site.