Harnessing the Power of Artificial Intelligence to Vitalize Endangered Indigenous Languages: Technologies and Experiences
- URL: http://arxiv.org/abs/2407.12620v2
- Date: Mon, 29 Jul 2024 17:19:43 GMT
- Title: Harnessing the Power of Artificial Intelligence to Vitalize Endangered Indigenous Languages: Technologies and Experiences
- Authors: Claudio Pinhanez, Paulo Cavalin, Luciana Storto, Thomas Finbow, Alexander Cobbinah, Julio Nogima, Marisa Vasconcelos, Pedro Domingues, Priscila de Souza Mizukami, Nicole Grell, Majoí Gongora, Isabel Gonçalves
- Abstract summary: We discuss the decreasing diversity of languages in the world and how working with Indigenous languages poses unique ethical challenges for AI and NLP.
We report encouraging results in the development of high-quality machine learning translators for Indigenous languages.
We present prototypes we have built in projects done in 2023 and 2024 with Indigenous communities in Brazil, aimed at facilitating writing.
- Score: 31.62071644137294
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Since 2022 we have been exploring application areas and technologies in which Artificial Intelligence (AI) and modern Natural Language Processing (NLP), such as Large Language Models (LLMs), can be employed to foster the usage and facilitate the documentation of Indigenous languages which are in danger of disappearing. We start by discussing the decreasing diversity of languages in the world and how working with Indigenous languages poses unique ethical challenges for AI and NLP. To address those challenges, we propose an alternative development AI cycle based on community engagement and usage. Then, we report encouraging results in the development of high-quality machine learning translators for Indigenous languages by fine-tuning state-of-the-art (SOTA) translators with tiny amounts of data and discuss how to avoid some common pitfalls in the process. We also present prototypes we have built in projects done in 2023 and 2024 with Indigenous communities in Brazil, aimed at facilitating writing, and discuss the development of Indigenous Language Models (ILMs) as a replicable and scalable way to create spell-checkers, next-word predictors, and similar tools. Finally, we discuss how we envision a future for language documentation where dying languages are preserved as interactive language models.
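The abstract describes Indigenous Language Models (ILMs) as a replicable way to build spell-checkers and next-word predictors. As a minimal, hypothetical sketch of the simplest form such tools can take (a bigram next-word predictor and an edit-distance spell-checker; the toy corpus below uses illustrative English stand-ins, not real Indigenous-language data, which should only be gathered in partnership with the community):

```python
from collections import Counter, defaultdict

# Hypothetical tiny corpus standing in for community-curated text.
corpus = "the river flows the river sings the forest listens".split()

# Next-word predictor: bigram counts over the corpus.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = bigrams.get(word)
    return followers.most_common(1)[0][0] if followers else None

# Spell-checker: suggest the in-vocabulary word with the smallest edit distance.
vocab = set(corpus)

def edit_distance(a, b):
    """Levenshtein distance via a single-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def correct(word):
    """Return `word` if known, else the closest vocabulary word."""
    return word if word in vocab else min(vocab, key=lambda w: edit_distance(word, w))
```

A production ILM would of course replace the bigram counts with a trained language model, but the interface (predict, correct) is the same.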
Related papers
- Improving Speech Emotion Recognition in Under-Resourced Languages via Speech-to-Speech Translation with Bootstrapping Data Selection [49.27067541740956]
Speech Emotion Recognition (SER) is a crucial component in developing general-purpose AI agents capable of natural human-computer interaction.
Building robust multilingual SER systems remains challenging due to the scarcity of labeled data in languages other than English and Chinese.
We propose an approach to enhance SER performance in low SER resource languages by leveraging data from high-resource languages.
arXiv Detail & Related papers (2024-09-17T08:36:45Z)
- From Bytes to Borsch: Fine-Tuning Gemma and Mistral for the Ukrainian Language Representation [0.0]
Generative large language models (LLMs) stand at the forefront of innovation, showcasing unparalleled abilities in text understanding and generation.
However, the limited representation of low-resource languages like Ukrainian poses a notable challenge, restricting the reach and relevance of this technology.
Our paper addresses this by fine-tuning the open-source Gemma and Mistral LLMs with Ukrainian datasets, aiming to improve their linguistic proficiency.
arXiv Detail & Related papers (2024-04-14T04:25:41Z)
- Distributed agency in second language learning and teaching through generative AI [0.0]
ChatGPT can provide informal second language practice through chats in written or voice forms.
Instructors can use AI to build learning and assessment materials in a variety of media.
arXiv Detail & Related papers (2024-03-29T14:55:40Z)
- Building a Language-Learning Game for Brazilian Indigenous Languages: A Case of Study [0.0]
We describe a process to automatically generate language exercises and questions from a dependency treebank and a lexical database for Tupian languages.
We conclude that new data gathering processes should be established in partnership with indigenous communities and oriented for educational purposes.
arXiv Detail & Related papers (2024-03-21T16:11:44Z)
- "It's how you do things that matters": Attending to Process to Better Serve Indigenous Communities with Language Technologies [2.821682550792172]
This position paper explores ethical considerations in building NLP technologies for Indigenous languages.
We report on interviews with 17 researchers working in or with Aboriginal and/or Torres Strait Islander communities.
We recommend practices for NLP researchers to increase attention to the process of engagements with Indigenous communities.
arXiv Detail & Related papers (2024-02-04T23:23:51Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Towards Bridging the Digital Language Divide [4.234367850767171]
Multilingual language processing systems often exhibit a hardwired, yet usually involuntary and hidden, representational preference towards certain languages.
We show that biased technology is often the result of research and development methodologies that do not do justice to the complexity of the languages being represented.
We present a new initiative that aims at reducing linguistic bias through both technological design and methodology.
arXiv Detail & Related papers (2023-07-25T10:53:20Z)
- Neural Machine Translation for the Indigenous Languages of the Americas: An Introduction [102.13536517783837]
Most languages of the Americas are low-resource, with limited amounts of parallel and monolingual data, if any.
We discuss recent advances, findings, and open questions, the product of the NLP community's increased interest in these languages.
arXiv Detail & Related papers (2023-06-11T23:27:47Z)
- No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z)
- Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages.
We infer this distribution from a sample of typologically diverse training languages.
We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z)
- Including Signed Languages in Natural Language Processing [48.62744923724317]
Signed languages are the primary means of communication for many deaf and hard of hearing individuals.
This position paper calls on the NLP community to include signed languages as a research area with high social and scientific impact.
arXiv Detail & Related papers (2021-05-11T17:37:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.