AfroScope: A Framework for Studying the Linguistic Landscape of Africa
- URL: http://arxiv.org/abs/2601.13346v1
- Date: Mon, 19 Jan 2026 19:30:35 GMT
- Title: AfroScope: A Framework for Studying the Linguistic Landscape of Africa
- Authors: Sang Yun Kwon, AbdelRahim Elmadany, Muhammad Abdul-Mageed,
- Abstract summary: We introduce AfroScope, a unified framework for African LID, including AfroScope-Data and AfroScope-Models. We propose a hierarchical classification approach that leverages Mirror-Serengeti, a specialized embedding model targeting 29 closely related or geographically proximate languages. We analyze cross-lingual transfer and domain effects, offering guidance for building robust African LID systems.
- Score: 27.262469904340836
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language Identification (LID) is the task of determining the language of a given text and is a fundamental preprocessing step that affects the reliability of downstream NLP applications. While recent work has expanded LID coverage for African languages, existing approaches remain limited in (i) the number of supported languages and (ii) their ability to make fine-grained distinctions among closely related varieties. We introduce AfroScope, a unified framework for African LID that includes AfroScope-Data, a dataset covering 713 African languages, and AfroScope-Models, a suite of strong LID models with broad language coverage. To better distinguish highly confusable languages, we propose a hierarchical classification approach that leverages Mirror-Serengeti, a specialized embedding model targeting 29 closely related or geographically proximate languages. This approach improves macro F1 by 4.55 on this confusable subset compared to our best base model. Finally, we analyze cross-linguistic transfer and domain effects, offering guidance for building robust African LID systems. We position African LID as an enabling technology for large-scale measurement of Africa's linguistic landscape in digital text and release AfroScope-Data and AfroScope-Models publicly.
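The hierarchical classification idea described in the abstract can be illustrated with a minimal sketch: a broad-coverage base LID model labels the text first, and only predictions that fall into the confusable subset are re-classified with the specialized embedding model. The class structure, the scikit-learn-style classifier head, and all names below are illustrative assumptions, not the released AfroScope-Models code.

```python
# Minimal sketch of a two-stage (hierarchical) LID pipeline. The base model,
# the embedding model, and the classifier head are passed in as duck-typed
# objects; their concrete implementations are assumptions for illustration.
from typing import Set

import numpy as np


class HierarchicalLID:
    def __init__(self, base_model, embed_model, confusable_head, confusable_langs: Set[str]):
        self.base_model = base_model              # broad-coverage LID over all supported languages
        self.embed_model = embed_model            # specialized embedding model for confusable varieties
        self.confusable_head = confusable_head    # classifier trained only on the confusable subset
        self.confusable_langs = confusable_langs  # language codes routed to the second stage

    def predict(self, text: str) -> str:
        # Stage 1: classify over the full label set.
        lang = self.base_model.predict(text)
        # Stage 2: if the label is one of the highly confusable varieties,
        # re-classify using the specialized embeddings.
        if lang in self.confusable_langs:
            emb = np.asarray(self.embed_model.encode(text)).reshape(1, -1)
            lang = self.confusable_head.predict(emb)[0]
        return lang
```

In this reading, the second stage only ever sees texts that the base model already places inside the confusable group, which is one plausible way a subset-specific embedding model could produce the macro-F1 gain reported on that subset.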
Related papers
- Scaling HuBERT for African Languages: From Base to Large and XL [0.5825599299113071]
This work introduces SSA-HuBERT-Large (317M parameters) and SSA-HuBERT-XL (964M parameters), the first large models trained solely on African speech, alongside a BASE-size counterpart. By conducting a carefully controlled experimental study focused exclusively on Sub-Saharan languages, we demonstrate that larger architectures significantly improve performance by effectively leveraging large audio datasets.
arXiv Detail & Related papers (2025-11-28T17:17:40Z)
- Mafoko: Structuring and Building Open Multilingual Terminologies for South African NLP [3.0720023574418622]
We introduce the foundational Mafoko dataset, released under the equitable, Africa-centered NOODL framework. Experiments show substantial improvements in the accuracy and domain-specific consistency of English-to-Tshivenda machine translation.
arXiv Detail & Related papers (2025-08-05T15:00:02Z)
- Designing and Contextualising Probes for African Languages [3.161415847253143]
This paper presents the first systematic investigation into probing PLMs for linguistic knowledge about African languages. We train layer-wise probes for six typologically diverse African languages to analyse how linguistic features are distributed. We find that PLMs adapted for African languages encode more linguistic information about target languages than massively multilingual PLMs.
arXiv Detail & Related papers (2025-05-15T08:35:14Z)
- Lugha-Llama: Adapting Large Language Models for African Languages [48.97516583523523]
Large language models (LLMs) have achieved impressive results in a wide range of natural language applications. We consider how to adapt LLMs to low-resource African languages. We find that combining curated data from African languages with high-quality English educational texts results in a training mix that substantially improves the model's performance on these languages.
arXiv Detail & Related papers (2025-04-09T02:25:53Z)
- Zero-Shot Cross-Lingual Reranking with Large Language Models for Low-Resource Languages [51.301942056881146]
We investigate how large language models (LLMs) function as rerankers in cross-lingual information retrieval systems for African languages. Our implementation covers English and four African languages (Hausa, Somali, Swahili, and Yoruba). We examine cross-lingual reranking with queries in English and passages in the African languages.
arXiv Detail & Related papers (2023-12-26T18:38:54Z)
- AfroBench: How Good are Large Language Models on African Languages? [55.35674466745322]
AfroBench is a benchmark for evaluating the performance of LLMs across 64 African languages. AfroBench consists of nine natural language understanding datasets, six text generation datasets, six knowledge and question answering tasks, and one mathematical reasoning task.
arXiv Detail & Related papers (2023-11-14T08:10:14Z)
- MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition [55.95128479289923]
African languages are spoken by over a billion people, but are underrepresented in NLP research and development.
We create the largest human-annotated NER dataset for 20 African languages.
We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points.
arXiv Detail & Related papers (2022-10-22T08:53:14Z)
- AfroLID: A Neural Language Identification Tool for African Languages [5.945320097465418]
AfroLID is a neural LID toolkit for 517 African languages and varieties.
It exploits a multi-domain web dataset, manually curated from across 14 language families and utilizing five orthographic systems.
arXiv Detail & Related papers (2022-10-21T05:45:50Z)
- AfroMT: Pretraining Strategies and Reproducible Benchmarks for Translation of 8 African Languages [94.75849612191546]
AfroMT is a standardized, clean, and reproducible machine translation benchmark for eight widely spoken African languages.
We develop a suite of analysis tools for system diagnosis taking into account the unique properties of these languages.
We demonstrate significant improvements when pretraining on 11 languages, with gains of up to 2 BLEU points over strong baselines.
arXiv Detail & Related papers (2021-09-10T07:45:21Z)
- MasakhaNER: Named Entity Recognition for African Languages [48.34339599387944]
We create the first large publicly available high-quality dataset for named entity recognition in ten African languages.
We detail characteristics of the languages to help researchers understand the challenges that these languages pose for NER.
arXiv Detail & Related papers (2021-03-22T13:12:44Z)