Dealing with the Hard Facts of Low-Resource African NLP
- URL: http://arxiv.org/abs/2511.18557v1
- Date: Sun, 23 Nov 2025 18:08:36 GMT
- Title: Dealing with the Hard Facts of Low-Resource African NLP
- Authors: Yacouba Diarra, Nouhoum Souleymane Coulibaly, Panga Azazia Kamaté, Madani Amadou Tall, Emmanuel Élisé Koné, Aymane Dembélé, Michael Leventhal,
- Abstract summary: This paper reports on the field collection of 612 hours of spontaneous speech in Bambara, a low-resource West African language.<n>We offer practical suggestions for data collection protocols, annotation, and model design, as well as evidence for the importance of performing human evaluation.
- Score: 0.6700983301090583
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Creating speech datasets, models, and evaluation frameworks for low-resource languages remains challenging given the lack of a broad base of pertinent experience to draw from. This paper reports on the field collection of 612 hours of spontaneous speech in Bambara, a low-resource West African language; the semi-automated annotation of that dataset with transcriptions; the creation of several monolingual ultra-compact and small models using the dataset; and the automatic and human evaluation of their output. We offer practical suggestions for data collection protocols, annotation, and model design, as well as evidence for the importance of performing human evaluation. In addition to the main dataset, multiple evaluation datasets, models, and code are made publicly available.
Related papers
- Natural language processing for African languages [7.884789325654572]
dissertation focuses on languages spoken in Sub-Saharan Africa where all the indigenous languages can be regarded as low-resourced.<n>We show that the quality of semantic representations learned in word embeddings does not only depend on the amount of data but on the quality of pre-training data.<n>We develop large scale human-annotated labelled datasets for 21 African languages in two impactful NLP tasks.
arXiv Detail & Related papers (2025-06-30T22:26:36Z) - Culturally-Nuanced Story Generation for Reasoning in Low-Resource Languages: The Case of Javanese and Sundanese [12.208154616426052]
We test whether large language models (LLMs) can generate culturally nuanced narratives in Javanese and Sundanese.<n>We compare three data creation strategies: (1) LLM-assisted stories prompted with cultural cues, (2) machine translation from Indonesian benchmarks, and (3) native-written stories.<n>We fine-tune models on each dataset and evaluate on a human-authored test set for classification and generation.
arXiv Detail & Related papers (2025-02-18T15:14:58Z) - Matina: A Large-Scale 73B Token Persian Text Corpus [1.396406461086233]
Existing Persian datasets are typically small and lack content diversity, consisting mainly of weblogs and news articles.<n>Matina corpus is a new Persian dataset of 72.9B tokens, carefully preprocessed and deduplicated to ensure high data quality.
arXiv Detail & Related papers (2025-02-13T11:22:19Z) - InkubaLM: A small language model for low-resource African languages [9.426968756845389]
InkubaLM is a small language model with 0.4 billion parameters.
It achieves performance comparable to models with significantly larger parameter counts.
It demonstrates remarkable consistency across multiple languages.
arXiv Detail & Related papers (2024-08-30T05:42:31Z) - Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as textscLlama-2 and textscMistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z) - Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.<n>This survey delves into an important attribute of these datasets: the dialect of a language.<n>Motivated by the performance degradation of NLP models for dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z) - XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented
Languages [105.54207724678767]
Data scarcity is a crucial issue for the development of highly multilingual NLP systems.
We propose XTREME-UP, a benchmark defined by its focus on the scarce-data scenario rather than zero-shot.
XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies.
arXiv Detail & Related papers (2023-05-19T18:00:03Z) - AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators [98.11286353828525]
GPT-3.5 series models have demonstrated remarkable few-shot and zero-shot ability across various NLP tasks.
We propose AnnoLLM, which adopts a two-step approach, explain-then-annotate.
We build the first conversation-based information retrieval dataset employing AnnoLLM.
arXiv Detail & Related papers (2023-03-29T17:03:21Z) - mFACE: Multilingual Summarization with Factual Consistency Evaluation [79.60172087719356]
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets.
Despite promising results, current models still suffer from generating factually inconsistent summaries.
We leverage factual consistency evaluation models to improve multilingual summarization.
arXiv Detail & Related papers (2022-12-20T19:52:41Z) - Cross-lingual Approach to Abstractive Summarization [0.0]
Cross-lingual model transfers are successfully applied in low-resource languages.
We used a pretrained English summarization model based on deep neural networks and sequence-to-sequence architecture.
We developed several models with different proportions of target language data for fine-tuning.
arXiv Detail & Related papers (2020-12-08T09:30:38Z) - Comparison of Interactive Knowledge Base Spelling Correction Models for
Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work shows a comparison of a neural model and character language models with varying amounts on target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z) - The Tatoeba Translation Challenge -- Realistic Data Sets for Low
Resource and Multilingual MT [0.0]
This paper describes the development of a new benchmark for machine translation that provides training and test data for thousands of language pairs.
The main goal is to trigger the development of open translation tools and models with a much broader coverage of the World's languages.
arXiv Detail & Related papers (2020-10-13T13:12:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.