How can NLP Help Revitalize Endangered Languages? A Case Study and
Roadmap for the Cherokee Language
- URL: http://arxiv.org/abs/2204.11909v1
- Date: Mon, 25 Apr 2022 18:25:57 GMT
- Title: How can NLP Help Revitalize Endangered Languages? A Case Study and
Roadmap for the Cherokee Language
- Authors: Shiyue Zhang, Ben Frey, Mohit Bansal
- Abstract summary: More than 43% of the languages spoken in the world are endangered.
In this work, we focus on discussing how NLP can help revitalize endangered languages.
We take Cherokee, a severely-endangered Native American language, as a case study.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: More than 43% of the languages spoken in the world are endangered, and
language loss currently occurs at an accelerated rate because of globalization
and neocolonialism. Saving and revitalizing endangered languages has become
very important for maintaining the cultural diversity on our planet. In this
work, we focus on discussing how NLP can help revitalize endangered languages.
We first suggest three principles that may help NLP practitioners to foster
mutual understanding and collaboration with language communities, and we
discuss three ways in which NLP can potentially assist in language education.
We then take Cherokee, a severely-endangered Native American language, as a
case study. After reviewing the language's history, linguistic features, and
existing resources, we (in collaboration with Cherokee community members)
arrive at a few meaningful ways NLP practitioners can collaborate with
community partners. We suggest two approaches to enrich the Cherokee language's
resources with machine-in-the-loop processing, and discuss several NLP tools
that people from the Cherokee community have shown interest in. We hope that
our work serves not only to inform the NLP community about Cherokee, but also
to provide inspiration for future work on endangered languages in general. Our
code and data will be open-sourced at
https://github.com/ZhangShiyue/RevitalizeCherokee
Related papers
- Is It Navajo? Accurate Language Detection in Endangered Athabaskan Languages [34.78841410279943]
Endangered languages, such as Navajo, are significantly underrepresented in contemporary language technologies.
This study evaluates Google's Language Identification (LangID) tool, which does not currently support any Native American languages.
arXiv Detail & Related papers (2025-01-27T04:43:18Z)
- Building low-resource African language corpora: A case study of Kidawida, Kalenjin and Dholuo [0.815557531820863]
This paper presents a case study on the development of linguistic corpora for three under-resourced Kenyan languages, Kidaw'ida, Kalenjin, and Dholuo.
Our project employed a selective crowd-sourcing methodology to collect text and speech data from native speakers of these languages.
We made these resources freely accessible via open-research platforms, namely Zenodo for the parallel text corpora and Mozilla Common Voice for the speech datasets.
arXiv Detail & Related papers (2025-01-19T10:17:21Z)
- "It's how you do things that matters": Attending to Process to Better Serve Indigenous Communities with Language Technologies [2.821682550792172]
This position paper explores ethical considerations in building NLP technologies for Indigenous languages.
We report on interviews with 17 researchers working in or with Aboriginal and/or Torres Strait Islander communities.
We recommend practices for NLP researchers to increase attention to the process of engagements with Indigenous communities.
arXiv Detail & Related papers (2024-02-04T23:23:51Z)
- Multilingual Word Embeddings for Low-Resource Languages using Anchors and a Chain of Related Languages [54.832599498774464]
We propose to build multilingual word embeddings (MWEs) via a novel language chain-based approach.
We build MWEs one language at a time by starting from the resource rich source and sequentially adding each language in the chain till we reach the target.
We evaluate our method on bilingual lexicon induction for 4 language families, involving 4 very low-resource (5M tokens) and 4 moderately low-resource (50M) target languages.
arXiv Detail & Related papers (2023-11-21T09:59:29Z)
- Multi-lingual and Multi-cultural Figurative Language Understanding [69.47641938200817]
Figurative language permeates human communication, but is relatively understudied in NLP.
We create a dataset for seven diverse languages associated with a variety of cultures: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili and Yoruba.
Our dataset reveals that each language relies on cultural and regional concepts for figurative expressions, with the highest overlap between languages originating from the same region.
All languages exhibit a significant deficiency compared to English, with variations in performance reflecting the availability of pre-training and fine-tuning data.
arXiv Detail & Related papers (2023-05-25T15:30:31Z)
- What a Creole Wants, What a Creole Needs [1.985426476051888]
We consider a group of low-resource languages, Creole languages. Creoles are largely absent from the NLP literature and are often ignored by society at large due to stigma.
We demonstrate, through conversations with Creole experts and surveys of Creole-speaking communities, how the things needed from language technology can change dramatically from one language to another.
arXiv Detail & Related papers (2022-06-01T12:22:34Z)
- NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages [100.59889279607432]
We focus on developing resources for languages in Indonesia.
Most languages in Indonesia are categorized as endangered and some are even extinct.
We develop the first-ever parallel resource for 10 low-resource languages in Indonesia.
arXiv Detail & Related papers (2022-05-31T17:03:50Z)
- Ensuring the Inclusive Use of Natural Language Processing in the Global Response to COVID-19 [58.720142291102135]
We discuss ways in which current and future NLP approaches can be made more inclusive by covering low-resource languages.
We suggest several future directions for researchers interested in maximizing the positive societal impacts of NLP.
arXiv Detail & Related papers (2021-08-11T12:54:26Z)
- ChrEn: Cherokee-English Machine Translation for Endangered Language Revitalization [91.96528006301654]
Cherokee is a highly endangered Native American language spoken by the Cherokee people.
There are only approximately 2,000 fluent first-language Cherokee speakers remaining in the world.
arXiv Detail & Related papers (2020-10-09T20:28:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.