AfroLID: A Neural Language Identification Tool for African Languages
- URL: http://arxiv.org/abs/2210.11744v2
- Date: Mon, 24 Oct 2022 18:25:36 GMT
- Title: AfroLID: A Neural Language Identification Tool for African Languages
- Authors: Ife Adebara, AbdelRahim Elmadany, Muhammad Abdul-Mageed and Alcides Alcoba Inciarte
- Abstract summary: AfroLID is a neural LID toolkit for 517 African languages and varieties.
It exploits a multi-domain web dataset manually curated from across 14 language families utilizing five orthographic systems.
- Score: 5.945320097465418
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language identification (LID) is a crucial precursor for NLP, especially for
mining web data. Problematically, most of the world's 7000+ languages today are
not covered by LID technologies. We address this pressing issue for Africa by
introducing AfroLID, a neural LID toolkit for 517 African languages and
varieties. AfroLID exploits a multi-domain web dataset manually curated from
across 14 language families utilizing five orthographic systems. When evaluated
on our blind Test set, AfroLID achieves an F1 score of 95.89. We also compare
AfroLID to five existing LID tools that each cover a small number of African
languages, finding it to outperform them on most languages. We further show the
utility of AfroLID in the wild by testing it on the acutely under-served
Twitter domain. Finally, we offer a number of controlled case studies and
perform a linguistically motivated error analysis that allow us to
showcase both AfroLID's powerful capabilities and its limitations.
Related papers
- Cheetah: Natural Language Generation for 517 African Languages [21.347462833831223]
We develop Cheetah, a massively multilingual NLG language model for African languages.
Cheetah supports 517 African languages and language varieties.
The introduction of Cheetah has far-reaching benefits for linguistic diversity.
arXiv Detail & Related papers (2024-01-02T06:24:13Z)
- Zero-Shot Cross-Lingual Reranking with Large Language Models for Low-Resource Languages [51.301942056881146]
We investigate how large language models (LLMs) function as rerankers in cross-lingual information retrieval systems for African languages.
Our implementation covers English and four African languages (Hausa, Somali, Swahili, and Yoruba).
We examine cross-lingual reranking with queries in English and passages in the African languages.
arXiv Detail & Related papers (2023-12-26T18:38:54Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- AfroDigits: A Community-Driven Spoken Digit Dataset for African Languages [32.23306825605942]
AfroDigits is a minimalist dataset of spoken digits for African languages.
We conduct audio digit classification experiments on six African languages.
AfroDigits is the first published audio digit dataset for African languages.
arXiv Detail & Related papers (2023-03-22T14:09:20Z)
- AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages [45.88640066767242]
Africa is home to over 2,000 languages from more than six language families and has the highest linguistic diversity among all continents.
Yet, there is little NLP research conducted on African languages. Crucial to enabling such research is the availability of high-quality annotated datasets.
In this paper, we introduce AfriSenti, a sentiment analysis benchmark that contains more than 110,000 tweets in 14 African languages.
arXiv Detail & Related papers (2023-02-17T15:40:12Z)
- MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition [55.95128479289923]
African languages are spoken by over a billion people, but are underrepresented in NLP research and development.
We create the largest human-annotated NER dataset for 20 African languages.
We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points.
arXiv Detail & Related papers (2022-10-22T08:53:14Z)
- AfroMT: Pretraining Strategies and Reproducible Benchmarks for Translation of 8 African Languages [94.75849612191546]
AfroMT is a standardized, clean, and reproducible machine translation benchmark for eight widely spoken African languages.
We develop a suite of analysis tools for system diagnosis taking into account the unique properties of these languages.
We demonstrate significant improvements when pretraining on 11 languages, with gains of up to 2 BLEU points over strong baselines.
arXiv Detail & Related papers (2021-09-10T07:45:21Z)
- MasakhaNER: Named Entity Recognition for African Languages [48.34339599387944]
We create the first large publicly available high-quality dataset for named entity recognition in ten African languages.
We detail characteristics of the languages to help researchers understand the challenges that these languages pose for NER.
arXiv Detail & Related papers (2021-03-22T13:12:44Z)
- Lanfrica: A Participatory Approach to Documenting Machine Translation Research on African Languages [0.012691047660244334]
Africa has the highest language diversity, with 1500-2000 documented languages and many more undocumented or extinct languages.
This makes it hard to keep track of the MT research, models and datasets that have been developed for some of them.
Online platforms can help make research, benchmarks and datasets in these African languages accessible.
arXiv Detail & Related papers (2020-08-03T18:14:04Z)
- AI4D -- African Language Dataset Challenge [1.4922337373437886]
This work details the organisation of the AI4D - African Language Dataset Challenge.
It is an effort to incentivize the creation, organization and discovery of African language datasets.
We particularly encouraged the submission of annotated datasets which can be used for training task-specific supervised machine learning models.
arXiv Detail & Related papers (2020-07-23T08:48:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.