Related papers: Not always about you: Prioritizing community needs when developing endangered language technology

Not always about you: Prioritizing community needs when developing endangered language technology

URL: http://arxiv.org/abs/2204.05541v1
Date: Tue, 12 Apr 2022 05:59:39 GMT
Title: Not always about you: Prioritizing community needs when developing endangered language technology
Authors: Zoey Liu, Crystal Richardson, Richard Hatcher Jr and Emily Prud'hommeaux
Abstract summary: We discuss the unique technological, cultural, practical, and ethical challenges that researchers and indigenous speech community members face. We report the perspectives of language teachers, Master Speakers and elders from indigenous communities, as well as the point of view of academics.
Score: 5.670857685983896
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Languages are classified as low-resource when they lack the quantity of data necessary for training statistical and machine learning tools and models. Causes of resource scarcity vary but can include poor access to technology for developing these resources, a relatively small population of speakers, or a lack of urgency for collecting such resources in bilingual populations where the second language is high-resource. As a result, the languages described as low-resource in the literature are as different as Finnish on the one hand, with millions of speakers using it in every imaginable domain, and Seneca, with only a small-handful of fluent speakers using the language primarily in a restricted domain. While issues stemming from the lack of resources necessary to train models unite this disparate group of languages, many other issues cut across the divide between widely-spoken low resource languages and endangered languages. In this position paper, we discuss the unique technological, cultural, practical, and ethical challenges that researchers and indigenous speech community members face when working together to develop language technology to support endangered language documentation and revitalization. We report the perspectives of language teachers, Master Speakers and elders from indigenous communities, as well as the point of view of academics. We describe an ongoing fruitful collaboration and make recommendations for future partnerships between academic researchers and language community stakeholders.

Related papers

Low-Resource, High-Impact: Building Corpora for Inclusive Language Technologies [11.52881045684005]
This tutorial is designed for NLP practitioners, researchers, and developers working with multilingual and low-resource languages.<n>Participants will walk away with a practical toolkit for building end-to-end NLP pipelines for underrepresented languages.
arXiv Detail & Related papers (2025-12-16T16:44:17Z)
Generative AI and Large Language Models in Language Preservation: Opportunities and Challenges [0.0]
Generative AI and large-scale language models (LLM) have emerged as powerful tools in language preservation. This paper examines the role of generative AIs and LLMs in preserving endangered languages, highlighting the risks and challenges associated with their use.
arXiv Detail & Related papers (2025-01-20T14:03:40Z)
LIMBA: An Open-Source Framework for the Preservation and Valorization of Low-Resource Languages using Generative Models [62.47865866398233]
This white paper proposes a framework to generate linguistic tools for low-resource languages. By addressing the data scarcity that hinders intelligent applications for such languages, we contribute to promoting linguistic diversity.
arXiv Detail & Related papers (2024-11-20T16:59:41Z)
A Capabilities Approach to Studying Bias and Harm in Language Technologies [4.135516576952934]
We consider fairness, bias, and inclusion in Language Technologies through the lens of the Capabilities Approach. The Capabilities Approach centers on what people are capable of achieving, given their intersectional social, political, and economic contexts. We detail the Capabilities Approach, its relationship to multilingual and multicultural evaluation, and how the framework affords meaningful collaboration with community members in defining and measuring the harms of Language Technologies.
arXiv Detail & Related papers (2024-11-06T22:46:13Z)
Building Better: Avoiding Pitfalls in Developing Language Resources when Data is Scarce [27.918975040084387]
Rigorous data collection and labeling practices are essential for developing more human-centered and socially aware technologies.<n>We collect feedback from individuals directly involved in and impacted by NLP artefacts for medium- and low-resource languages.
arXiv Detail & Related papers (2024-10-16T15:51:18Z)
Content-Localization based Neural Machine Translation for Informal Dialectal Arabic: Spanish/French to Levantine/Gulf Arabic [5.2957928879391]
We propose a framework that localizes contents of high-resource languages to a low-resource language/dialects by utilizing AI power. We are the first work to provide a parallel translation dataset from/to informal Spanish and French to/from informal Arabic dialects.
arXiv Detail & Related papers (2023-12-12T01:42:41Z)
Multilingual Word Embeddings for Low-Resource Languages using Anchors and a Chain of Related Languages [54.832599498774464]
We propose to build multilingual word embeddings (MWEs) via a novel language chain-based approach. We build MWEs one language at a time by starting from the resource rich source and sequentially adding each language in the chain till we reach the target. We evaluate our method on bilingual lexicon induction for 4 language families, involving 4 very low-resource (5M tokens) and 4 moderately low-resource (50M) target languages.
arXiv Detail & Related papers (2023-11-21T09:59:29Z)
Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work will lay the foundation for furthering the field of dialectal NLP by laying out evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z)
NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
Towards Bridging the Digital Language Divide [4.234367850767171]
multilingual language processing systems often exhibit a hardwired, yet usually involuntary and hidden representational preference towards certain languages. We show that biased technology is often the result of research and development methodologies that do not do justice to the complexity of the languages being represented. We present a new initiative that aims at reducing linguistic bias through both technological design and methodology.
arXiv Detail & Related papers (2023-07-25T10:53:20Z)
Neural Machine Translation for the Indigenous Languages of the Americas: An Introduction [102.13536517783837]
Most languages from the Americas are among them, having a limited amount of parallel and monolingual data, if any. We discuss the recent advances and findings and open questions, product of an increased interest of the NLP community in these languages.
arXiv Detail & Related papers (2023-06-11T23:27:47Z)
NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages [100.59889279607432]
We focus on developing resources for languages in Indonesia. Most languages in Indonesia are categorized as endangered and some are even extinct. We develop the first-ever parallel resource for 10 low-resource languages in Indonesia.
arXiv Detail & Related papers (2022-05-31T17:03:50Z)
Overcoming Language Disparity in Online Content Classification with Multimodal Learning [22.73281502531998]
Large language models are now the standard to develop state-of-the-art solutions for text detection and classification tasks. The development of advanced computational techniques and resources is disproportionately focused on the English language. We explore the promise of incorporating the information contained in images via multimodal machine learning.
arXiv Detail & Related papers (2022-05-19T17:56:02Z)

This list is automatically generated from the titles and abstracts of the papers in this site.