Related papers: Shona spaCy: A Morphological Analyzer for an Under-Resourced Bantu Language

URL: http://arxiv.org/abs/2511.16680v1
Date: Wed, 12 Nov 2025 09:19:49 GMT
Title: Shona spaCy: A Morphological Analyzer for an Under-Resourced Bantu Language
Authors: Happymore Masoka
Abstract summary: Shona spaCy is an open-source, rule-based morphological analysis tool for Shona, a Bantu language. It combines a lexicon with rules to model noun-class prefixes, verbal subject concords, tense-aspect markers, ideophones, and clitics. It achieves 90% POS-tagging accuracy and 88% morphological-feature accuracy.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Despite rapid advances in multilingual natural language processing (NLP), the Bantu language Shona remains under-served in terms of morphological analysis and language-aware tools. This paper presents Shona spaCy, an open-source, rule-based morphological pipeline for Shona built on the spaCy framework. The system combines a curated JSON lexicon with linguistically grounded rules to model noun-class prefixes (Mupanda 1-18), verbal subject concords, tense-aspect markers, ideophones, and clitics, integrating these into token-level annotations for lemma, part-of-speech, and morphological features. The toolkit is available via pip install shona-spacy, with source code at https://github.com/HappymoreMasoka/shona-spacy and a PyPI release at https://pypi.org/project/shona-spacy/0.1.4/. Evaluation on formal and informal Shona corpora yields 90% POS-tagging accuracy and 88% morphological-feature accuracy, while maintaining transparency in its linguistic decisions. By bridging descriptive grammar and computational implementation, Shona spaCy advances NLP accessibility and digital inclusion for Shona speakers and provides a template for morphological analysis tools for other under-resourced Bantu languages.
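The lexicon-plus-rules design described in the abstract can be illustrated with a minimal Python sketch. This is not the shona-spacy API or its lexicon schema, neither of which is shown here. The JSON layout, the `analyze` function, and the field names are all hypothetical, and the prefix table covers only a small, commonly cited subset of Shona noun-class prefixes. The sketch consults the lexicon first and falls back to longest-match prefix rules, returning every candidate class when a prefix is ambiguous.

```python
import json

# Hypothetical miniature lexicon; the real toolkit's JSON schema is an assumption here.
LEXICON_JSON = """
{
  "munhu": {"lemma": "munhu", "pos": "NOUN", "noun_class": 1},
  "vanhu": {"lemma": "munhu", "pos": "NOUN", "noun_class": 2}
}
"""

# Illustrative subset of noun-class prefixes mapped to candidate classes.
# "mu" is ambiguous (classes 1, 3, and 18), so all candidates are listed.
PREFIX_RULES = {
    "mu": [1, 3, 18],
    "va": [2],
    "mi": [4],
    "ma": [6],
    "chi": [7],
    "zvi": [8],
}


def analyze(token, lexicon, prefix_rules=PREFIX_RULES):
    """Return a small annotation dict for one token (hypothetical format)."""
    word = token.lower()

    # 1. A lexicon entry wins over any rule.
    entry = lexicon.get(word)
    if entry is not None:
        return {
            "lemma": entry["lemma"],
            "pos": entry["pos"],
            "noun_classes": [entry["noun_class"]],
            "source": "lexicon",
        }

    # 2. Fall back to prefix rules, trying longer prefixes first.
    #    This is a naive heuristic: it will also fire on non-nouns
    #    that happen to start with one of these strings.
    for prefix in sorted(prefix_rules, key=len, reverse=True):
        if word.startswith(prefix) and len(word) > len(prefix):
            return {
                "lemma": word,
                "pos": "NOUN",
                "noun_classes": list(prefix_rules[prefix]),
                "source": "rule",
            }

    # 3. Nothing matched: leave the token unanalyzed.
    return {"lemma": word, "pos": "X", "noun_classes": [], "source": "unknown"}


lexicon = json.loads(LEXICON_JSON)
for tok in ["vanhu", "zvigaro", "muti"]:
    print(tok, analyze(tok, lexicon))
```

Because every decision traces back to either a lexicon entry or a named prefix rule, the `source` field makes each annotation inspectable, which mirrors the transparency the abstract attributes to a rule-based design.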

Related papers

Simultaneous Speech-to-Speech Translation Without Aligned Data [52.467808474293605]
Simultaneous speech translation requires translating source speech into a target language in real-time. We propose Hibiki-Zero, which eliminates the need for word-level alignments entirely. Hibiki-Zero achieves state-of-the-art performance in translation accuracy, latency, voice transfer, and naturalness across five X-to-English tasks.
arXiv Detail & Related papers (2026-02-11T17:41:01Z)
Corpus-Based Approaches to Igbo Diacritic Restoration [0.23552726065717702]
The capacity of computers to process natural languages is increasing because NLP researchers are pushing its boundaries. Over 95% of the world's 7000 languages are low-resourced for NLP, i.e. they have little or no data, tools, and techniques for NLP work. We present an overview of diacritic ambiguity and a review of previous diacritic disambiguation approaches on other languages.
arXiv Detail & Related papers (2026-01-26T11:30:36Z)
SampoNLP: A Self-Referential Toolkit for Morphological Analysis of Subword Tokenizers [0.0]
We introduce SampoNLP, a corpus-free toolkit for morphological lexicon creation. Using the high-purity lexicons generated by SampoNLP for Finnish, Hungarian, and Estonian, we conduct a systematic evaluation of BPE tokenizers. We propose a unified metric, the Integrated Performance Score (IPS), to navigate the trade-off between morpheme coverage and over-splitting.
arXiv Detail & Related papers (2026-01-08T01:05:51Z)
MoVoC: Morphology-Aware Subword Construction for Geez Script Languages [7.7761618950496265]
Subword-based tokenization methods often fail to preserve morphological boundaries. We present MoVoC (Morpheme-aware Subword Vocabulary Construction) and train MoVoC-Tok, a tokenizer that integrates supervised morphological analysis into the subword vocabulary.
arXiv Detail & Related papers (2025-09-10T17:45:10Z)
Tokens with Meaning: A Hybrid Tokenization Approach for NLP [0.2826977330147589]
Tokenization plays a pivotal role in natural language processing (NLP). We introduce a hybrid tokenization framework that combines rule-based morphological analysis with statistical subword segmentation. The method uses phono normalization, root-affix, and a novel algorithm that balances morpheme preservation with vocabulary efficiency.
arXiv Detail & Related papers (2025-08-19T22:17:42Z)
CMULAB: An Open-Source Framework for Training and Deployment of Natural Language Processing Models [59.91221728187576]
This paper introduces the CMU Linguistic Annotation Backend, an open-source framework that simplifies model deployment and continuous human-in-the-loop fine-tuning of NLP models. CMULAB enables users to leverage the power of multilingual models to quickly adapt and extend existing tools for speech recognition, OCR, translation, and syntactic analysis to new languages.
arXiv Detail & Related papers (2024-04-03T02:21:46Z)
Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets. This survey delves into an important attribute of these datasets: the dialect of a language. Motivated by the performance degradation of NLP models for dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
SanskritShala: A Neural Sanskrit NLP Toolkit with Web-Based Interface for Pedagogical and Annotation Purposes [13.585440544031584]
We present a neural Sanskrit Natural Language Processing (NLP) toolkit named SanskritShala. Our systems report state-of-the-art performance on available benchmark datasets for all tasks. SanskritShala is deployed as a web-based application, which allows a user to get real-time analysis for the given input.
arXiv Detail & Related papers (2023-02-19T09:58:55Z)
Linking Emergent and Natural Languages via Corpus Transfer [98.98724497178247]
We propose a novel way to establish a link by corpus transfer between emergent languages and natural languages. Our approach showcases non-trivial transfer benefits for two different tasks -- language modeling and image captioning. We also introduce a novel metric to predict the transferability of an emergent language by translating emergent messages to natural language captions grounded on the same images.
arXiv Detail & Related papers (2022-03-24T21:24:54Z)
Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages. We infer this distribution from a sample of typologically diverse training languages. We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z)
Revisiting Language Encoding in Learning Multilingual Representations [70.01772581545103]
We propose a new approach called Cross-lingual Language Projection (XLP) to replace language embedding. XLP projects the word embeddings into language-specific semantic space, and then the projected embeddings will be fed into the Transformer model. Experiments show that XLP can freely and significantly boost the model performance on extensive multilingual benchmark datasets.
arXiv Detail & Related papers (2021-02-16T18:47:10Z)
ESPnet-ST: All-in-One Speech Translation Toolkit [57.76342114226599]
ESPnet-ST is a new project inside the end-to-end speech processing toolkit ESPnet. It implements automatic speech recognition, machine translation, and text-to-speech functions for speech translation. We provide all-in-one recipes including data pre-processing, feature extraction, training, and decoding pipelines.
arXiv Detail & Related papers (2020-04-21T18:38:38Z)
Stanza: A Python Natural Language Processing Toolkit for Many Human Languages [44.8226642800919]
We introduce Stanza, an open-source Python natural language processing toolkit supporting 66 human languages. Stanza features a language-agnostic fully neural pipeline for text analysis, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging. We have trained Stanza on a total of 112 datasets, including the Universal Dependencies treebanks and other multilingual corpora.
arXiv Detail & Related papers (2020-03-16T09:05:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.