How can NLP Help Revitalize Endangered Languages? A Case Study and
Roadmap for the Cherokee Language
- URL: http://arxiv.org/abs/2204.11909v1
- Date: Mon, 25 Apr 2022 18:25:57 GMT
- Title: How can NLP Help Revitalize Endangered Languages? A Case Study and
Roadmap for the Cherokee Language
- Authors: Shiyue Zhang, Ben Frey, Mohit Bansal
- Abstract summary: More than 43% of the languages spoken in the world are endangered.
In this work, we focus on discussing how NLP can help revitalize endangered languages.
We take Cherokee, a severely-endangered Native American language, as a case study.
- Score: 91.79339725967073
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: More than 43% of the languages spoken in the world are endangered, and
language loss currently occurs at an accelerated rate because of globalization
and neocolonialism. Saving and revitalizing endangered languages has become
very important for maintaining the cultural diversity on our planet. In this
work, we focus on discussing how NLP can help revitalize endangered languages.
We first suggest three principles that may help NLP practitioners to foster
mutual understanding and collaboration with language communities, and we
discuss three ways in which NLP can potentially assist in language education.
We then take Cherokee, a severely-endangered Native American language, as a
case study. After reviewing the language's history, linguistic features, and
existing resources, we (in collaboration with Cherokee community members)
arrive at a few meaningful ways NLP practitioners can collaborate with
community partners. We suggest two approaches to enrich the Cherokee language's
resources with machine-in-the-loop processing, and discuss several NLP tools
that people from the Cherokee community have shown interest in. We hope that
our work serves not only to inform the NLP community about Cherokee, but also
to provide inspiration for future work on endangered languages in general. Our
code and data will be open-sourced at
https://github.com/ZhangShiyue/RevitalizeCherokee
Related papers
- Socially Responsible Data for Large Multilingual Language Models [12.338723881042926]
Large Language Models (LLMs) have rapidly increased in size and apparent capabilities in the last three years.
Various efforts are striving for models to accommodate languages of communities outside of the Global North.
(2024-09-08T23:51:04Z)
- "It's how you do things that matters": Attending to Process to Better Serve Indigenous Communities with Language Technologies [2.821682550792172]
This position paper explores ethical considerations in building NLP technologies for Indigenous languages.
We report on interviews with 17 researchers working in or with Aboriginal and/or Torres Strait Islander communities.
We recommend practices for NLP researchers to increase attention to the process of engagements with Indigenous communities.
(2024-02-04T23:23:51Z)
- Multilingual Word Embeddings for Low-Resource Languages using Anchors and a Chain of Related Languages [54.832599498774464]
We propose to build multilingual word embeddings (MWEs) via a novel language chain-based approach.
We build MWEs one language at a time by starting from the resource rich source and sequentially adding each language in the chain till we reach the target.
We evaluate our method on bilingual lexicon induction for 4 language families, involving 4 very low-resource (5M tokens) and 4 moderately low-resource (50M) target languages.
(2023-11-21T09:59:29Z)
- Multi-lingual and Multi-cultural Figurative Language Understanding [69.47641938200817]
Figurative language permeates human communication, but is relatively understudied in NLP.
We create a dataset for seven diverse languages associated with a variety of cultures: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili and Yoruba.
Our dataset reveals that each language relies on cultural and regional concepts for figurative expressions, with the highest overlap between languages originating from the same region.
All languages exhibit a significant deficiency compared to English, with variations in performance reflecting the availability of pre-training and fine-tuning data.
(2023-05-25T15:30:31Z)
- What a Creole Wants, What a Creole Needs [1.985426476051888]
We consider a group of low-resource languages, Creole languages. Creoles are largely absent from the NLP literature and are often ignored by society at large due to stigma.
We demonstrate, through conversations with Creole experts and surveys of Creole-speaking communities, how the things needed from language technology can change dramatically from one language to another.
(2022-06-01T12:22:34Z)
- NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages [100.59889279607432]
We focus on developing resources for languages in Indonesia.
Most languages in Indonesia are categorized as endangered and some are even extinct.
We develop the first-ever parallel resource for 10 low-resource languages in Indonesia.
(2022-05-31T17:03:50Z)
- Ensuring the Inclusive Use of Natural Language Processing in the Global Response to COVID-19 [58.720142291102135]
We discuss ways in which current and future NLP approaches can be made more inclusive by covering low-resource languages.
We suggest several future directions for researchers interested in maximizing the positive societal impacts of NLP.
(2021-08-11T12:54:26Z)
- The first large scale collection of diverse Hausa language datasets [0.0]
Hausa is considered a well-studied and well-documented language among the sub-Saharan African languages.
It is estimated that over 100 million people speak the language.
We provide an expansive collection of curated datasets consisting of both formal and informal forms of the language.
(2021-02-13T19:34:20Z)
- ChrEn: Cherokee-English Machine Translation for Endangered Language Revitalization [91.96528006301654]
Cherokee is a highly endangered Native American language spoken by the Cherokee people.
Only approximately 2,000 fluent first-language Cherokee speakers remain in the world.
(2020-10-09T20:28:06Z)