Awal -- Community-Powered Language Technology for Tamazight
- URL: http://arxiv.org/abs/2510.27407v1
- Date: Fri, 31 Oct 2025 11:53:05 GMT
- Title: Awal -- Community-Powered Language Technology for Tamazight
- Authors: Alp Öktem, Farida Boudichat
- Abstract summary: Awal is a community-powered initiative for developing language technology resources for Tamazight. We analyze 18 months of community engagement, revealing significant barriers to participation. The modest scale of community contributions highlights the limitations of applying standard crowdsourcing approaches to languages with complex sociolinguistic contexts.
- Score: 0.21687011163378758
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents Awal, a community-powered initiative for developing language technology resources for Tamazight. We provide a comprehensive review of the NLP landscape for Tamazight, examining recent progress in computational resources, and the emergence of community-driven approaches to address persistent data scarcity. Launched in 2024, awaldigital.org platform addresses the underrepresentation of Tamazight in digital spaces through a collaborative platform enabling speakers to contribute translation and voice data. We analyze 18 months of community engagement, revealing significant barriers to participation including limited confidence in written Tamazight and ongoing standardization challenges. Despite widespread positive reception, actual data contribution remained concentrated among linguists and activists. The modest scale of community contributions -- 6,421 translation pairs and 3 hours of speech data -- highlights the limitations of applying standard crowdsourcing approaches to languages with complex sociolinguistic contexts. We are working on improved open-source MT models using the collected data.
Related papers
- Opportunities and Challenges of Natural Language Processing for Low-Resource Senegalese Languages in Social Science Research [0.6016863427924156]
This paper provides the first comprehensive overview of progress and challenges for the six national languages officially recognized by the Senegalese Constitution: Wolof, Pulaar, Sereer, Joola, Mandingue, and Soninke.
We synthesize linguistic, sociotechnical, and infrastructural factors that shape their digital readiness and identify gaps in data, tools, and benchmarks.
The paper concludes by outlining a roadmap toward sustainable, community-centered NLP ecosystems for Senegalese languages.
arXiv Detail & Related papers (2025-12-24T20:20:31Z)
- AdiBhashaa: A Community-Curated Benchmark for Machine Translation into Indian Tribal Languages [3.2873201228433846]
We present AdiBhashaa, a community-driven initiative that constructs the first open parallel corpora and baseline MT systems for four major Indian tribal languages.
This work combines participatory data creation with native speakers, human-in-the-loop validation, and systematic evaluation of both encoder-decoder MT models and large language models.
arXiv Detail & Related papers (2025-12-04T13:01:17Z)
- The African Languages Lab: A Collaborative Approach to Advancing Low-Resource African NLP [4.188487384419692]
Despite representing nearly one-third of the world's languages, African languages remain critically underserved by modern NLP technologies.
We present the African Languages Lab, a comprehensive research initiative that addresses this technological gap through systematic data collection, model development, and capacity building.
arXiv Detail & Related papers (2025-10-07T07:42:52Z)
- Towards Open-Ended Discovery for Low-Resource NLP [2.31792878608513]
We argue for a paradigm shift toward open-ended, interactive language discovery.
We propose a framework grounded in joint human-machine uncertainty.
This paper is a call to action: we advocate a rethinking of how AI engages with human knowledge in under-documented languages.
arXiv Detail & Related papers (2025-09-22T01:19:04Z)
- BTPD: A Multilingual Hand-curated Dataset of Bengali Transnational Political Discourse Across Online Communities [25.55378198149251]
We present a multilingual dataset of Bengali political discourse (BTPD) collected from three online platforms.
This paper also provides a general overview of its topics and multilingual content.
arXiv Detail & Related papers (2025-06-07T14:43:35Z)
- Enhancing NER Performance in Low-Resource Pakistani Languages using Cross-Lingual Data Augmentation [7.383944919243126]
We propose a data augmentation technique that generates culturally plausible sentences and experiments on four low-resource Pakistani languages.
By fine-tuning multilingual masked Large Language Models (LLMs), our approach demonstrates significant improvements in NER performance for Shahmukhi and Pashto.
arXiv Detail & Related papers (2025-04-07T15:18:34Z)
- Bridging the Data Provenance Gap Across Text, Speech and Video [67.72097952282262]
We conduct the largest and first-of-its-kind longitudinal audit across modalities of popular text, speech, and video datasets.
Our manual analysis covers nearly 4000 public datasets between 1990-2024, spanning 608 languages, 798 sources, 659 organizations, and 67 countries.
We find that multimodal machine learning applications have overwhelmingly turned to web-crawled, synthetic, and social media platforms, such as YouTube, for their training sets.
arXiv Detail & Related papers (2024-12-19T01:30:19Z)
- LIMBA: An Open-Source Framework for the Preservation and Valorization of Low-Resource Languages using Generative Models [62.47865866398233]
This white paper proposes a framework to generate linguistic tools for low-resource languages.
By addressing the data scarcity that hinders intelligent applications for such languages, we contribute to promoting linguistic diversity.
arXiv Detail & Related papers (2024-11-20T16:59:41Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation [79.66359274050885]
We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models.
Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks.
arXiv Detail & Related papers (2023-05-24T07:42:15Z)
- No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z)
- Learnings from Technological Interventions in a Low Resource Language: A Case-Study on Gondi [13.9876704685177]
Gondi is a low-resource vulnerable language spoken by around 2.3 million tribal people in south and central India.
At the end of these interventions, we collected a little less than 12,000 translated words and/or sentences.
The larger goal of the project is collecting enough data in Gondi to build and deploy viable language technologies.
arXiv Detail & Related papers (2020-04-21T20:03:57Z)
- CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It is diversified, with over 11,000 speakers and over 60 accents.
CoVoST is released under CC0 license and free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.