What a Creole Wants, What a Creole Needs
- URL: http://arxiv.org/abs/2206.00437v1
- Date: Wed, 1 Jun 2022 12:22:34 GMT
- Title: What a Creole Wants, What a Creole Needs
- Authors: Heather Lent, Kelechi Ogueji, Miryam de Lhoneux, Orevaoghene Ahia, Anders Søgaard
- Abstract summary: We consider a group of low-resource languages, Creole languages. Creoles are both largely absent from the NLP literature, and also often ignored by society at large due to stigma.
We demonstrate, through conversations with Creole experts and surveys of Creole-speaking communities, how the things needed from language technology can change dramatically from one language to another.
- Score: 1.985426476051888
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, the natural language processing (NLP) community has given
increased attention to the disparity of efforts directed towards high-resource
languages over low-resource ones. Efforts to remedy this delta often begin with
translations of existing English datasets into other languages. However, this
approach ignores that different language communities have different needs. We
consider a group of low-resource languages, Creole languages. Creoles are both
largely absent from the NLP literature, and also often ignored by society at
large due to stigma, despite these languages having sizable and vibrant
communities. We demonstrate, through conversations with Creole experts and
surveys of Creole-speaking communities, how the things needed from language
technology can change dramatically from one language to another, even when the
languages are considered to be very similar to each other, as with Creoles. We
discuss the prominent themes arising from these conversations, and ultimately
demonstrate that useful language technology cannot be built without involving
the relevant community.
Related papers
- Socially Responsible Data for Large Multilingual Language Models [12.338723881042926]
Large Language Models (LLMs) have rapidly increased in size and apparent capabilities in the last three years.
Various efforts are striving for models to accommodate languages of communities outside of the Global North.
arXiv Detail & Related papers (2024-09-08T23:51:04Z)
- Guylingo: The Republic of Guyana Creole Corpora [6.582021376649199]
We present a comprehensive corpus designed for advancing NLP research in the domain of Creolese (Guyanese English-lexicon Creole).
We first outline our framework for gathering and digitizing this diverse corpus, inclusive of colloquial expressions, idioms, and regional variations in a low-resource language.
We then demonstrate the challenges of training and evaluating NLP models for machine translation in Creole.
arXiv Detail & Related papers (2024-05-06T20:30:14Z)
- What Do Dialect Speakers Want? A Survey of Attitudes Towards Language Technology for German Dialects [60.8361859783634]
We survey speakers of dialects and regional languages related to German.
We find that respondents are especially in favour of potential NLP tools that work with dialectal input.
arXiv Detail & Related papers (2024-02-19T09:15:28Z)
- Multilingual Word Embeddings for Low-Resource Languages using Anchors and a Chain of Related Languages [54.832599498774464]
We propose to build multilingual word embeddings (MWEs) via a novel language chain-based approach.
We build MWEs one language at a time by starting from the resource rich source and sequentially adding each language in the chain till we reach the target.
We evaluate our method on bilingual lexicon induction for 4 language families, involving 4 very low-resource (5M tokens) and 4 moderately low-resource (50M) target languages.
arXiv Detail & Related papers (2023-11-21T09:59:29Z)
- CreoleVal: Multilingual Multitask Benchmarks for Creoles [46.50887462355172]
We present CreoleVal, a collection of benchmark datasets spanning 8 different NLP tasks.
It is an aggregate of novel development datasets for reading comprehension, relation classification, and machine translation for Creoles.
arXiv Detail & Related papers (2023-10-30T14:24:20Z)
- Multi-lingual and Multi-cultural Figurative Language Understanding [69.47641938200817]
Figurative language permeates human communication, but is relatively understudied in NLP.
We create a dataset for seven diverse languages associated with a variety of cultures: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili and Yoruba.
Our dataset reveals that each language relies on cultural and regional concepts for figurative expressions, with the highest overlap between languages originating from the same region.
All languages exhibit a significant deficiency compared to English, with variations in performance reflecting the availability of pre-training and fine-tuning data.
arXiv Detail & Related papers (2023-05-25T15:30:31Z)
- How can NLP Help Revitalize Endangered Languages? A Case Study and Roadmap for the Cherokee Language [91.79339725967073]
More than 43% of the languages spoken in the world are endangered.
In this work, we focus on discussing how NLP can help revitalize endangered languages.
We take Cherokee, a severely-endangered Native American language, as a case study.
arXiv Detail & Related papers (2022-04-25T18:25:57Z)
- Not always about you: Prioritizing community needs when developing endangered language technology [5.670857685983896]
We discuss the unique technological, cultural, practical, and ethical challenges that researchers and indigenous speech community members face.
We report the perspectives of language teachers, Master Speakers and elders from indigenous communities, as well as the point of view of academics.
arXiv Detail & Related papers (2022-04-12T05:59:39Z)
- On Language Models for Creoles [8.577162764242845]
Creole languages such as Nigerian Pidgin English and Haitian Creole are under-resourced and largely ignored in the NLP literature.
What grammatical and lexical features are transferred to the creole is a complex process.
While creoles are generally stable, the prominence of some features may be much stronger with certain demographics or in some linguistic situations.
arXiv Detail & Related papers (2021-09-13T15:51:15Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.