Marito: Structuring and Building Open Multilingual Terminologies for South African NLP
- URL: http://arxiv.org/abs/2508.03529v1
- Date: Tue, 05 Aug 2025 15:00:02 GMT
- Title: Marito: Structuring and Building Open Multilingual Terminologies for South African NLP
- Authors: Vukosi Marivate, Isheanesu Dzingirai, Fiskani Banda, Richard Lastrucci, Thapelo Sindane, Keabetswe Madumo, Kayode Olaleye, Abiodun Modupe, Unarine Netshifhefhe, Herkulaas Combrink, Mohlatlego Nakeng, Matome Ledwaba,
- Abstract summary: Lack of structured terminological data for South Africa's official languages hampers progress in multilingual NLP. We introduce the foundational Marito dataset, released under the equitable, Africa-centered NOODL framework. Experiments show substantial improvements in the accuracy and domain-specific consistency of English-to-Tshivenda machine translation.
- Score: 0.9273919298354661
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The critical lack of structured terminological data for South Africa's official languages hampers progress in multilingual NLP, despite the existence of numerous government and academic terminology lists. These valuable assets remain fragmented and locked in non-machine-readable formats, rendering them unusable for computational research and development. Marito addresses this challenge by systematically aggregating, cleaning, and standardising these scattered resources into open, interoperable datasets. We introduce the foundational Marito dataset, released under the equitable, Africa-centered NOODL framework. To demonstrate its immediate utility, we integrate the terminology into a Retrieval-Augmented Generation (RAG) pipeline. Experiments show substantial improvements in the accuracy and domain-specific consistency of English-to-Tshivenda machine translation for large language models. Marito provides a scalable foundation for developing robust and equitable NLP technologies, ensuring South Africa's rich linguistic diversity is represented in the digital age.
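The abstract describes integrating terminology into a RAG pipeline for translation but does not give implementation details. The following is only a minimal sketch of the general idea, assuming a simple whole-word glossary lookup in place of whatever retriever the authors actually use. The `TermEntry` type, the `GLOSSARY` contents, and the prompt wording are all hypothetical, and the target strings are placeholders rather than real Tshivenda terms.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TermEntry:
    source: str  # English term
    target: str  # target-language equivalent (placeholder strings here)
    domain: str  # subject domain of the term list


# Hypothetical glossary for illustration only; not taken from the Marito dataset.
GLOSSARY = [
    TermEntry("interest rate", "<TARGET_TERM_1>", "finance"),
    TermEntry("inflation", "<TARGET_TERM_2>", "finance"),
    TermEntry("census", "<TARGET_TERM_3>", "statistics"),
]


def retrieve_terms(sentence, glossary):
    """Return glossary entries whose source term occurs in the sentence as whole words."""
    hits = []
    for entry in glossary:
        pattern = r"\b" + re.escape(entry.source) + r"\b"
        if re.search(pattern, sentence, flags=re.IGNORECASE):
            hits.append(entry)
    return hits


def build_prompt(sentence, hits, target_language="Tshivenda"):
    """Build a translation prompt that lists the retrieved terminology."""
    lines = [f"Translate the following English sentence into {target_language}."]
    if hits:
        lines.append("Use these terminology equivalents where they apply:")
        lines.extend(f"- {e.source} -> {e.target} ({e.domain})" for e in hits)
    lines.append(f"Sentence: {sentence}")
    return "\n".join(lines)


sentence = "The interest rate rose as inflation increased."
hits = retrieve_terms(sentence, GLOSSARY)
prompt = build_prompt(sentence, hits)
print(prompt)
```

The resulting prompt would then be passed to a language model. A real system would likely replace the exact-match lookup with fuzzy or embedding-based retrieval to handle inflected forms.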
Related papers
- Natural language processing for African languages [7.884789325654572]
This dissertation focuses on languages spoken in Sub-Saharan Africa, where all the indigenous languages can be regarded as low-resourced. We show that the quality of semantic representations learned in word embeddings depends not only on the amount of data but also on the quality of the pre-training data. We develop large-scale human-annotated labelled datasets for 21 African languages in two impactful NLP tasks.
arXiv Detail & Related papers (2025-06-30T22:26:36Z) - Bridging Gaps in Natural Language Processing for Yorùbá: A Systematic Review of a Decade of Progress and Prospects [0.6554326244334868]
This review highlights the scarcity of annotated corpora, limited availability of pre-trained language models, and linguistic challenges like tonal complexity and diacritic dependency as significant obstacles. The findings reveal a growing body of multilingual and monolingual resources, even though the field is constrained by socio-cultural factors such as code-switching and desertion of language for digital usage.
arXiv Detail & Related papers (2025-02-24T17:41:48Z) - A Multilingual Sentiment Lexicon for Low-Resource Language Translation using Large Languages Models and Explainable AI [0.0]
South Africa and the DRC present a complex linguistic landscape with languages such as Zulu, Sepedi, Afrikaans, French, English, and Tshiluba.
This study develops a multilingual lexicon designed for French and Tshiluba, now expanded to include translations in English, Afrikaans, Sepedi, and Zulu.
A comprehensive testing corpus is created to support translation and sentiment analysis tasks, with machine learning models trained to predict sentiment.
arXiv Detail & Related papers (2024-11-06T23:41:18Z) - Constructing and Expanding Low-Resource and Underrepresented Parallel Datasets for Indonesian Local Languages [0.0]
We introduce Bhinneka Korpus, a multilingual parallel corpus featuring five Indonesian local languages.
Our goal is to enhance access to and utilization of these resources, extending their reach within the country.
arXiv Detail & Related papers (2024-04-01T09:24:06Z) - Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora and report superlative performance on evaluation datasets. This survey delves into an important attribute of these datasets: the dialect of a language. Motivated by the performance degradation of NLP models for dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z) - NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - Bootstrapping NLP tools across low-resourced African languages: an overview and prospects [1.5736899098702972]
bootstrapping tools for one African language from another.
Bootstrapping grammars for geographically distant languages has been shown to still have positive outcomes for morphology and for rule-based or grammar-based natural language generation.
arXiv Detail & Related papers (2022-10-21T15:16:45Z) - Multilingual Word Sense Disambiguation with Unified Sense Representation [55.3061179361177]
We propose building knowledge-based and supervised Multilingual Word Sense Disambiguation (MWSD) systems.
We build unified sense representations for multiple languages and address the annotation scarcity problem for MWSD by transferring annotations from resource-rich languages to resource-poor ones.
Evaluations on the SemEval-13 and SemEval-15 datasets demonstrate the effectiveness of our methodology.
arXiv Detail & Related papers (2022-10-14T01:24:03Z) - AfroMT: Pretraining Strategies and Reproducible Benchmarks for Translation of 8 African Languages [94.75849612191546]
AfroMT is a standardized, clean, and reproducible machine translation benchmark for eight widely spoken African languages.
We develop a suite of analysis tools for system diagnosis taking into account the unique properties of these languages.
We demonstrate significant improvements when pretraining on 11 languages, with gains of up to 2 BLEU points over strong baselines.
arXiv Detail & Related papers (2021-09-10T07:45:21Z) - Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from resource-rich languages to low-resource languages.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z) - MasakhaNER: Named Entity Recognition for African Languages [48.34339599387944]
We create the first large publicly available high-quality dataset for named entity recognition in ten African languages.
We detail characteristics of the languages to help researchers understand the challenges that these languages pose for NER.
arXiv Detail & Related papers (2021-03-22T13:12:44Z)