Related papers: AfroLID: A Neural Language Identification Tool for African Languages

URL: http://arxiv.org/abs/2210.11744v2
Date: Mon, 24 Oct 2022 18:25:36 GMT
Title: AfroLID: A Neural Language Identification Tool for African Languages
Authors: Ife Adebara, AbdelRahim Elmadany, Muhammad Abdul-Mageed and Alcides Alcoba Inciarte
Abstract summary: AfroLID is a neural LID toolkit for 517 African languages and varieties. It exploits a multi-domain web dataset manually curated from across 14 language families utilizing five orthographic systems.
Score: 5.945320097465418
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Language identification (LID) is a crucial precursor for NLP, especially for mining web data. Problematically, most of the world's 7000+ languages today are not covered by LID technologies. We address this pressing issue for Africa by introducing AfroLID, a neural LID toolkit for 517 African languages and varieties. AfroLID exploits a multi-domain web dataset manually curated from across 14 language families utilizing five orthographic systems. When evaluated on our blind Test set, AfroLID achieves 95.89 F1-score. We also compare AfroLID to five existing LID tools that each cover a small number of African languages, finding it to outperform them on most languages. We further show the utility of AfroLID in the wild by testing it on the acutely under-served Twitter domain. Finally, we offer a number of controlled case studies and perform a linguistically-motivated error analysis that allow us to both showcase AfroLID's powerful capabilities and limitations.

Related papers

CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data [56.043078390377076]
We introduce CommonLID, a community-driven, human-annotated LID benchmark for the web domain. We show CommonLID's value by using it, alongside five other common evaluation sets, to test eight popular LID models. We highlight that existing evaluations overestimate LID accuracy for many languages in the web domain.
arXiv Detail & Related papers (2026-01-25T22:49:30Z)
AfroScope: A Framework for Studying the Linguistic Landscape of Africa [27.262469904340836]
We introduce AfroScope, a unified framework for African LID, including AfroScope-Data and AfroScope-Models. We propose a hierarchical classification approach that leverages Mirror-Serengeti, a specialized embedding model targeting 29 closely related or geographically proximate languages. We analyze cross-linguistic transfer and domain effects, offering guidance for building robust African LID systems.
arXiv Detail & Related papers (2026-01-19T19:30:35Z)
The State of Large Language Models for African Languages: Progress and Challenges [4.065633096286487]
This paper comparatively analyzes African language coverage across six Large Language Models (LLMs), eight Small Language Models (SLMs), and six Specialized SLMs (SSLMs). The evaluation covers language coverage, training sets, technical limitations, script problems, and language modelling roadmaps.
arXiv Detail & Related papers (2025-06-02T21:39:40Z)
Lugha-Llama: Adapting Large Language Models for African Languages [48.97516583523523]
Large language models (LLMs) have achieved impressive results in a wide range of natural language applications. We consider how to adapt LLMs to low-resource African languages. We find that combining curated data from African languages with high-quality English educational texts results in a training mix that substantially improves the model's performance on these languages.
arXiv Detail & Related papers (2025-04-09T02:25:53Z)
Cheetah: Natural Language Generation for 517 African Languages [21.347462833831223]
We develop Cheetah, a massively multilingual NLG language model for African languages. Cheetah supports 517 African languages and language varieties. The introduction of Cheetah has far-reaching benefits for linguistic diversity.
arXiv Detail & Related papers (2024-01-02T06:24:13Z)
Zero-Shot Cross-Lingual Reranking with Large Language Models for Low-Resource Languages [51.301942056881146]
We investigate how large language models (LLMs) function as rerankers in cross-lingual information retrieval systems for African languages. Our implementation covers English and four African languages (Hausa, Somali, Swahili, and Yoruba). We examine cross-lingual reranking with queries in English and passages in the African languages.
arXiv Detail & Related papers (2023-12-26T18:38:54Z)
AfroBench: How Good are Large Language Models on African Languages? [55.35674466745322]
AfroBench is a benchmark for evaluating the performance of LLMs across 64 African languages. AfroBench consists of nine natural language understanding datasets, six text generation datasets, six knowledge and question answering tasks, and one mathematical reasoning task.
arXiv Detail & Related papers (2023-11-14T08:10:14Z)
NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
AfroDigits: A Community-Driven Spoken Digit Dataset for African Languages [32.23306825605942]
AfroDigits is a minimalist dataset of spoken digits for African languages. We conduct audio digit classification experiments on six African languages. AfroDigits is the first published audio digit dataset for African languages.
arXiv Detail & Related papers (2023-03-22T14:09:20Z)
AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages [45.88640066767242]
Africa is home to over 2,000 languages from more than six language families and has the highest linguistic diversity among all continents. Yet, there is little NLP research conducted on African languages. Crucial to enabling such research is the availability of high-quality annotated datasets. In this paper, we introduce AfriSenti, a sentiment analysis benchmark that contains a total of >110,000 tweets in 14 African languages.
arXiv Detail & Related papers (2023-02-17T15:40:12Z)
MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition [55.95128479289923]
African languages are spoken by over a billion people, but are underrepresented in NLP research and development. We create the largest human-annotated NER dataset for 20 African languages. We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points.
arXiv Detail & Related papers (2022-10-22T08:53:14Z)
AfroMT: Pretraining Strategies and Reproducible Benchmarks for Translation of 8 African Languages [94.75849612191546]
AfroMT is a standardized, clean, and reproducible machine translation benchmark for eight widely spoken African languages. We develop a suite of analysis tools for system diagnosis taking into account the unique properties of these languages. We demonstrate significant improvements when pretraining on 11 languages, with gains of up to 2 BLEU points over strong baselines.
arXiv Detail & Related papers (2021-09-10T07:45:21Z)
MasakhaNER: Named Entity Recognition for African Languages [48.34339599387944]
We create the first large publicly available high-quality dataset for named entity recognition in ten African languages. We detail characteristics of the languages to help researchers understand the challenges that these languages pose for NER.
arXiv Detail & Related papers (2021-03-22T13:12:44Z)
Lanfrica: A Participatory Approach to Documenting Machine Translation Research on African Languages [0.012691047660244334]
Africa has the highest language diversity, with 1500-2000 documented languages and many more undocumented or extinct languages. This makes it hard to keep track of the MT research, models and datasets that have been developed for some of them. Online platforms can help make research, benchmarks and datasets in these African languages more accessible.
arXiv Detail & Related papers (2020-08-03T18:14:04Z)
AI4D -- African Language Dataset Challenge [1.4922337373437886]
This work details the organisation of the AI4D - African Language dataset Challenge. It is an effort to incentivize the creation, organization and discovery of African language datasets. We particularly encouraged the submission of annotated datasets which can be used for training task-specific supervised machine learning models.
arXiv Detail & Related papers (2020-07-23T08:48:06Z)

This list is automatically generated from the titles and abstracts of the papers on this site.