Not always about you: Prioritizing community needs when developing
endangered language technology
- URL: http://arxiv.org/abs/2204.05541v1
- Date: Tue, 12 Apr 2022 05:59:39 GMT
- Title: Not always about you: Prioritizing community needs when developing
endangered language technology
- Authors: Zoey Liu, Crystal Richardson, Richard Hatcher Jr and Emily
Prud'hommeaux
- Abstract summary: We discuss the unique technological, cultural, practical, and ethical challenges that researchers and indigenous speech community members face.
We report the perspectives of language teachers, Master Speakers and elders from indigenous communities, as well as the point of view of academics.
- Score: 5.670857685983896
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Languages are classified as low-resource when they lack the quantity of data
necessary for training statistical and machine learning tools and models.
Causes of resource scarcity vary but can include poor access to technology for
developing these resources, a relatively small population of speakers, or a
lack of urgency for collecting such resources in bilingual populations where
the second language is high-resource. As a result, the languages described as
low-resource in the literature are as different as Finnish on the one hand,
with millions of speakers using it in every imaginable domain, and Seneca, with
only a small-handful of fluent speakers using the language primarily in a
restricted domain. While issues stemming from the lack of resources necessary
to train models unite this disparate group of languages, many other issues cut
across the divide between widely-spoken low resource languages and endangered
languages. In this position paper, we discuss the unique technological,
cultural, practical, and ethical challenges that researchers and indigenous
speech community members face when working together to develop language
technology to support endangered language documentation and revitalization. We
report the perspectives of language teachers, Master Speakers and elders from
indigenous communities, as well as the point of view of academics. We describe
an ongoing fruitful collaboration and make recommendations for future
partnerships between academic researchers and language community stakeholders.
Related papers
- LIMBA: An Open-Source Framework for the Preservation and Valorization of Low-Resource Languages using Generative Models [62.47865866398233]
This white paper proposes a framework to generate linguistic tools for low-resource languages.
By addressing the data scarcity that hinders intelligent applications for such languages, we contribute to promoting linguistic diversity.
arXiv Detail & Related papers (2024-11-20T16:59:41Z) - A Capabilities Approach to Studying Bias and Harm in Language Technologies [4.135516576952934]
We consider fairness, bias, and inclusion in Language Technologies through the lens of the Capabilities Approach.
The Capabilities Approach centers on what people are capable of achieving, given their intersectional social, political, and economic contexts.
We detail the Capabilities Approach, its relationship to multilingual and multicultural evaluation, and how the framework affords meaningful collaboration with community members in defining and measuring the harms of Language Technologies.
arXiv Detail & Related papers (2024-11-06T22:46:13Z) - Content-Localization based Neural Machine Translation for Informal
Dialectal Arabic: Spanish/French to Levantine/Gulf Arabic [5.2957928879391]
We propose a framework that localizes contents of high-resource languages to a low-resource language/dialects by utilizing AI power.
We are the first work to provide a parallel translation dataset from/to informal Spanish and French to/from informal Arabic dialects.
arXiv Detail & Related papers (2023-12-12T01:42:41Z) - Multilingual Word Embeddings for Low-Resource Languages using Anchors
and a Chain of Related Languages [54.832599498774464]
We propose to build multilingual word embeddings (MWEs) via a novel language chain-based approach.
We build MWEs one language at a time by starting from the resource rich source and sequentially adding each language in the chain till we reach the target.
We evaluate our method on bilingual lexicon induction for 4 language families, involving 4 very low-resource (5M tokens) and 4 moderately low-resource (50M) target languages.
arXiv Detail & Related papers (2023-11-21T09:59:29Z) - Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work will lay the foundation for furthering the field of dialectal NLP by laying out evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z) - NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - Towards Bridging the Digital Language Divide [4.234367850767171]
multilingual language processing systems often exhibit a hardwired, yet usually involuntary and hidden representational preference towards certain languages.
We show that biased technology is often the result of research and development methodologies that do not do justice to the complexity of the languages being represented.
We present a new initiative that aims at reducing linguistic bias through both technological design and methodology.
arXiv Detail & Related papers (2023-07-25T10:53:20Z) - Neural Machine Translation for the Indigenous Languages of the Americas:
An Introduction [102.13536517783837]
Most languages from the Americas are among them, having a limited amount of parallel and monolingual data, if any.
We discuss the recent advances and findings and open questions, product of an increased interest of the NLP community in these languages.
arXiv Detail & Related papers (2023-06-11T23:27:47Z) - NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local
Languages [100.59889279607432]
We focus on developing resources for languages in Indonesia.
Most languages in Indonesia are categorized as endangered and some are even extinct.
We develop the first-ever parallel resource for 10 low-resource languages in Indonesia.
arXiv Detail & Related papers (2022-05-31T17:03:50Z) - Overcoming Language Disparity in Online Content Classification with
Multimodal Learning [22.73281502531998]
Large language models are now the standard to develop state-of-the-art solutions for text detection and classification tasks.
The development of advanced computational techniques and resources is disproportionately focused on the English language.
We explore the promise of incorporating the information contained in images via multimodal machine learning.
arXiv Detail & Related papers (2022-05-19T17:56:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.