Related papers: HausaNLP: Current Status, Challenges and Future Directions for Hausa Natural Language Processing

URL: http://arxiv.org/abs/2505.14311v3
Date: Tue, 22 Jul 2025 10:36:47 GMT
Title: HausaNLP: Current Status, Challenges and Future Directions for Hausa Natural Language Processing
Authors: Shamsuddeen Hassan Muhammad, Ibrahim Said Ahmad, Idris Abdulmumin, Falalu Ibrahim Lawan, Babangida Sani, Sukairaj Hafiz Imam, Yusuf Aliyu, Sani Abdullahi Sani, Ali Usman Umar, Tajuddeen Gwadabe, Kenneth Church, Vukosi Marivate
Abstract summary: Hausa is a low-resource language with over 120 million first-language (L1) and 80 million second-language (L2) speakers worldwide. This paper presents an overview of the current state of Hausa NLP, systematically examining existing resources, research contributions, and gaps across fundamental NLP tasks. We introduce HausaNLP, a curated catalog that aggregates datasets, tools, and research works to enhance accessibility and drive further development.
Score: 5.5473811549393774
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Hausa Natural Language Processing (NLP) has gained increasing attention in recent years, yet remains understudied as a low-resource language despite having over 120 million first-language (L1) and 80 million second-language (L2) speakers worldwide. While significant advances have been made in high-resource languages, Hausa NLP faces persistent challenges, including limited open-source datasets and inadequate model representation. This paper presents an overview of the current state of Hausa NLP, systematically examining existing resources, research contributions, and gaps across fundamental NLP tasks: text classification, machine translation, named entity recognition, speech recognition, and question answering. We introduce HausaNLP (https://catalog.hausanlp.org), a curated catalog that aggregates datasets, tools, and research works to enhance accessibility and drive further development. Furthermore, we discuss challenges in integrating Hausa into large language models (LLMs), addressing issues of suboptimal tokenization and dialectal variation. Finally, we propose strategic research directions emphasizing dataset expansion, improved language modeling approaches, and strengthened community collaboration to advance Hausa NLP. Our work provides both a foundation for accelerating Hausa NLP progress and valuable insights for broader multilingual NLP research.

Related papers

Opportunities and Challenges of Natural Language Processing for Low-Resource Senegalese Languages in Social Science Research [0.6016863427924156]
This paper provides the first comprehensive overview of progress and challenges for the six national languages officially recognized by the Senegalese Constitution: Wolof, Pulaar, Sereer, Joola, Mandingue, and Soninke. We synthesize linguistic, sociotechnical, and infrastructural factors that shape their digital readiness and identify gaps in data, tools, and benchmarks. The paper concludes by outlining a roadmap toward sustainable, community-centered NLP ecosystems for Senegalese languages.
arXiv Detail & Related papers (2025-12-24T20:20:31Z)
Charting the Landscape of African NLP: Mapping Progress and Shaping the Road Ahead [24.670007883062475]
Africa represents one of the richest linguistic regions in the world with over 2,000 languages. This diversity is scarcely reflected in state-of-the-art natural language processing systems. We analyze 734 research papers on NLP for African languages published over the past five years.
arXiv Detail & Related papers (2025-05-27T15:13:08Z)
NaijaNLP: A Survey of Nigerian Low-Resource Languages [0.0]
Three languages -- Hausa, Yorùbá and Igbo -- account for about 60% of the spoken languages in Nigeria. These languages are categorised as low-resource due to insufficient resources to support tasks in computational linguistics. This study presents the first comprehensive review of advancements in low-resource NLP (LR-NLP) research across the three major Nigerian languages.
arXiv Detail & Related papers (2025-02-27T05:48:51Z)
Bridging Gaps in Natural Language Processing for Yorùbá: A Systematic Review of a Decade of Progress and Prospects [0.6554326244334868]
This review highlights the scarcity of annotated corpora, limited availability of pre-trained language models, and linguistic challenges like tonal complexity and diacritic dependency as significant obstacles. The findings reveal a growing body of multilingual and monolingual resources, even though the field is constrained by socio-cultural factors such as code-switching and desertion of language for digital usage.
arXiv Detail & Related papers (2025-02-24T17:41:48Z)
A Survey on Large Language Models with Multilingualism: Recent Advances and New Frontiers [51.8203871494146]
The rapid development of Large Language Models (LLMs) demonstrates remarkable multilingual capabilities in natural language processing. Despite the breakthroughs of LLMs, the investigation into the multilingual scenario remains insufficient. This survey aims to help the research community address multilingual problems and provide a comprehensive understanding of the core concepts, key techniques, and latest developments in multilingual natural language processing based on LLMs.
arXiv Detail & Related papers (2024-05-17T17:47:39Z)
Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers [81.47046536073682]
We present a review and provide a unified perspective to summarize the recent progress as well as emerging trends in multilingual large language models (MLLMs) literature. We hope our work can provide the community with quick access and spur breakthrough research in MLLMs.
arXiv Detail & Related papers (2024-04-07T11:52:44Z)
Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets. This survey delves into an important attribute of these datasets: the dialect of a language. Motivated by the performance degradation of NLP models for dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
Systematic Inequalities in Language Technology Performance across the World's Languages [94.65681336393425]
We introduce a framework for estimating the global utility of language technologies. Our analyses involve the field at large, but also more in-depth studies on both user-facing technologies and more linguistic NLP tasks.
arXiv Detail & Related papers (2021-10-13T14:03:07Z)
Ensuring the Inclusive Use of Natural Language Processing in the Global Response to COVID-19 [58.720142291102135]
We discuss ways in which current and future NLP approaches can be made more inclusive by covering low-resource languages. We suggest several future directions for researchers interested in maximizing the positive societal impacts of NLP.
arXiv Detail & Related papers (2021-08-11T12:54:26Z)
Including Signed Languages in Natural Language Processing [48.62744923724317]
Signed languages are the primary means of communication for many deaf and hard of hearing individuals. This position paper calls on the NLP community to include signed languages as a research area with high social and scientific impact.
arXiv Detail & Related papers (2021-05-11T17:37:55Z)

This list is automatically generated from the titles and abstracts of the papers on this site.