Bootstrapping NLP tools across low-resourced African languages: an
overview and prospects
- URL: http://arxiv.org/abs/2210.12027v1
- Date: Fri, 21 Oct 2022 15:16:45 GMT
- Title: Bootstrapping NLP tools across low-resourced African languages: an
overview and prospects
- Authors: C. Maria Keet
- Abstract summary: This paper surveys efforts to bootstrap NLP tools for one African language from another. Bootstrapping grammars for geographically distant languages has been shown to still yield positive outcomes for morphology and for rule- or grammar-based natural language generation.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Computing and Internet access are substantially growing markets in Southern
Africa, which brings with it increasing demands for local content and tools in
indigenous African languages. Since most of those languages are low-resourced,
efforts have gone into the notion of bootstrapping tools for one African
language from another. This paper provides an overview of these efforts for
Niger-Congo B (`Bantu') languages. Bootstrapping grammars for geographically
distant languages has been shown to still have positive outcomes for morphology
and rule- or grammar-based natural language generation. Bootstrapping with
data-driven approaches to NLP tasks is difficult to apply meaningfully
regardless of geographic proximity, largely because of lexical diversity in
both orthography and vocabulary. Cladistic approaches in comparative
linguistics may inform bootstrapping strategies, and similarity measures might
serve as a proxy for bootstrapping potential as well; both are fertile ground
for further research.
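The abstract suggests that similarity measures might serve as a proxy for bootstrapping potential. A minimal sketch of one such measure is given below: Jaccard overlap of character trigrams pooled over two wordlists. The wordlists and the choice of trigram Jaccard are illustrative assumptions, not the measure proposed in the paper.

```python
# Hypothetical sketch: character-trigram Jaccard similarity between small
# wordlists, as a crude proxy for lexical closeness between two languages.

def char_ngrams(word: str, n: int = 3) -> set[str]:
    """Return the set of character n-grams of a word, padded at the edges."""
    padded = f"#{word}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def lexical_similarity(wordlist_a: list[str], wordlist_b: list[str]) -> float:
    """Jaccard overlap of character trigrams pooled over each wordlist."""
    grams_a = set().union(*(char_ngrams(w) for w in wordlist_a))
    grams_b = set().union(*(char_ngrams(w) for w in wordlist_b))
    return len(grams_a & grams_b) / len(grams_a | grams_b)

# Illustrative cognates ('person', 'people') in isiZulu and isiXhosa.
zulu = ["umuntu", "abantu"]
xhosa = ["umntu", "abantu"]
print(round(lexical_similarity(zulu, xhosa), 2))  # 0.58
```

A real study would pool grams over much larger wordlists and would also have to control for orthographic conventions, which the abstract identifies as a key source of lexical diversity.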
Related papers
- Bridging Gaps in Natural Language Processing for Yorùbá: A Systematic Review of a Decade of Progress and Prospects [0.6554326244334868]
This review highlights the scarcity of annotated corpora, limited availability of pre-trained language models, and linguistic challenges like tonal complexity and diacritic dependency as significant obstacles.
The findings reveal a growing body of multilingual and monolingual resources, even though the field is constrained by socio-cultural factors such as code-switching and the abandonment of the language in digital use.
arXiv Detail & Related papers (2025-02-24T17:41:48Z) - Exploring transfer learning for Deep NLP systems on rarely annotated languages [0.0]
This thesis investigates the application of transfer learning for Part-of-Speech (POS) tagging between Hindi and Nepali.
We assess whether multitask learning in Hindi, with auxiliary tasks such as gender and singular/plural tagging, can contribute to improved POS tagging accuracy.
arXiv Detail & Related papers (2024-10-15T13:33:54Z) - Lens: Rethinking Multilingual Enhancement for Large Language Models [70.85065197789639]
Lens is a novel approach to enhance multilingual capabilities of large language models (LLMs)
It operates by manipulating the hidden representations within the language-agnostic and language-specific subspaces from top layers of LLMs.
It achieves superior results with much fewer computational resources compared to existing post-training approaches.
arXiv Detail & Related papers (2024-10-06T08:51:30Z) - Connecting Ideas in 'Lower-Resource' Scenarios: NLP for National Varieties, Creoles and Other Low-resource Scenarios [11.460959151493055]
Despite excellent results on benchmarks over a small subset of languages, large language models struggle to process text from languages situated in 'lower-resource' scenarios.
This tutorial will identify common challenges, approaches, and themes in natural language processing (NLP) research for confronting and overcoming the obstacles inherent to data-poor contexts.
arXiv Detail & Related papers (2024-09-19T11:48:42Z) - Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects [72.18753241750964]
Yorùbá is an African language with roughly 47 million speakers.
Recent efforts to develop NLP technologies for African languages have focused on their standard dialects.
We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus.
arXiv Detail & Related papers (2024-06-27T22:38:04Z) - Constructing and Expanding Low-Resource and Underrepresented Parallel Datasets for Indonesian Local Languages [0.0]
We introduce Bhinneka Korpus, a multilingual parallel corpus featuring five Indonesian local languages.
Our goal is to enhance access and utilization of these resources, extending their reach within the country.
arXiv Detail & Related papers (2024-04-01T09:24:06Z) - Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora and report superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z) - Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work will lay the foundation for furthering the field of dialectal NLP by laying out evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z) - NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
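The comparison above ranks corpus-construction methods by lexical diversity. A minimal sketch of one standard lexical-diversity statistic, the type-token ratio, is shown below; it is an illustrative assumption, not necessarily the metric used in that paper.

```python
# Hypothetical sketch: type-token ratio (TTR), a simple lexical-diversity
# measure of the kind used to compare corpus-construction methods.

def type_token_ratio(text: str) -> float:
    """Distinct tokens divided by total tokens (lowercased, whitespace split)."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

print(type_token_ratio("the cat sat on the mat"))  # 5 distinct / 6 tokens
```

TTR is sensitive to text length, so comparisons across datasets of different sizes typically use length-normalized variants.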
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - A Survey of Knowledge Enhanced Pre-trained Language Models [78.56931125512295]
We present a comprehensive review of Knowledge Enhanced Pre-trained Language Models (KE-PLMs)
For NLU, we divide the types of knowledge into four categories: linguistic knowledge, text knowledge, knowledge graph (KG) and rule knowledge.
The KE-PLMs for NLG are categorized into KG-based and retrieval-based methods.
arXiv Detail & Related papers (2022-11-11T04:29:02Z) - The first large scale collection of diverse Hausa language datasets [0.0]
Hausa is considered a well-studied and well-documented language among sub-Saharan African languages.
It is estimated that over 100 million people speak the language.
We provide an expansive collection of curated datasets consisting of both formal and informal forms of the language.
arXiv Detail & Related papers (2021-02-13T19:34:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.