MasakhaNEWS: News Topic Classification for African languages
- URL: http://arxiv.org/abs/2304.09972v2
- Date: Wed, 20 Sep 2023 17:14:40 GMT
- Title: MasakhaNEWS: News Topic Classification for African languages
- Authors: David Ifeoluwa Adelani, Marek Masiak, Israel Abebe Azime, Jesujoba
Alabi, Atnafu Lambebo Tonja, Christine Mwase, Odunayo Ogundepo, Bonaventure
F. P. Dossou, Akintunde Oladipo, Doreen Nixdorf, Chris Chinenye Emezue, Sana
al-azzawi, Blessing Sibanda, Davis David, Lolwethu Ndolela, Jonathan Mukiibi,
Tunde Ajayi, Tatiana Moteu, Brian Odhiambo, Abraham Owodunni, Nnaemeka
Obiefuna, Muhidin Mohamed, Shamsuddeen Hassan Muhammad, Teshome Mulugeta
Ababu, Saheed Abdullahi Salahudeen, Mesay Gemeda Yigezu, Tajuddeen Gwadabe,
Idris Abdulmumin, Mahlet Taye, Oluwabusayo Awoyomi, Iyanuoluwa Shode,
Tolulope Adelani, Habiba Abdulganiyu, Abdul-Hakeem Omotayo, Adetola Adeeko,
Abeeb Afolabi, Anuoluwapo Aremu, Olanrewaju Samuel, Clemencia Siro, Wangari
Kimotho, Onyekachi Ogbu, Chinedu Mbonu, Chiamaka Chukwuneke, Samuel Fanijo,
Jessica Ojo, Oyinkansola Awosan, Tadesse Kebede, Toadoum Sari Sakayo, Pamela
Nyatsine, Freedmore Sidume, Oreen Yousuf, Mardiyyah Oduwole, Tshinu Tshinu,
Ussen Kimanuka, Thina Diko, Siyanda Nxakama, Sinodos Nigusse, Abdulmejid
Johar, Shafie Mohamed, Fuad Mire Hassan, Moges Ahmed Mehamed, Evrard Ngabire,
Jules Jules, Ivan Ssenkungu and Pontus Stenetorp
- Abstract summary: African languages are severely under-represented in NLP research due to a lack of datasets covering several NLP tasks.
We develop MasakhaNEWS -- a new benchmark dataset for news topic classification covering 16 languages widely spoken in Africa.
- Score: 15.487928928173098
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: African languages are severely under-represented in NLP research due to a lack
of datasets covering several NLP tasks. While there are individual language-specific
datasets that are being expanded to different tasks, only a handful of
NLP tasks (e.g. named entity recognition and machine translation) have
standardized benchmark datasets covering several geographical and
typologically-diverse African languages. In this paper, we develop MasakhaNEWS
-- a new benchmark dataset for news topic classification covering 16 languages
widely spoken in Africa. We provide an evaluation of baseline models by
training classical machine learning models and fine-tuning several language
models. Furthermore, we explore several alternatives to full fine-tuning of
language models that are better suited for zero-shot and few-shot learning such
as cross-lingual parameter-efficient fine-tuning (like MAD-X), pattern
exploiting training (PET), prompting language models (like ChatGPT), and
prompt-free sentence transformer fine-tuning (SetFit and Cohere Embedding API).
Our evaluation in the zero-shot setting shows the potential of prompting ChatGPT
for news topic classification in low-resource African languages, achieving an
average performance of 70 F1 points without leveraging additional supervision
like MAD-X. In the few-shot setting, we show that with as few as 10 examples per
label, the PET approach achieves more than 90% (i.e. 86.0 F1 points) of the
performance of fully supervised training (92.6 F1 points).
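To make the zero-shot setting concrete, below is a minimal sketch of prompting a chat model to pick a single news topic. The model name, label set, prompt wording, and example article are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal zero-shot topic classification by prompting a chat model, in the
# spirit of the ChatGPT evaluation described in the abstract.
# Model name, label set, prompt wording, and the example article are
# illustrative assumptions, not the paper's exact setup.
from openai import OpenAI

LABELS = ["business", "entertainment", "health", "politics",
          "religion", "sports", "technology"]  # assumed topic inventory

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def classify_zero_shot(article: str) -> str:
    """Ask the chat model for exactly one topic label, with no demonstrations."""
    prompt = (
        "Classify the following news article into exactly one of these topics: "
        + ", ".join(LABELS)
        + ". Answer with the topic name only.\n\nArticle:\n"
        + article
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()


print(classify_zero_shot("The central bank raised interest rates to curb inflation."))
```

Running this over a test split and comparing predictions against gold labels would yield the kind of F1 score reported above.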
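For the few-shot setting, the sketch below uses SetFit (one of the prompt-free alternatives mentioned in the abstract) with 10 sampled examples per label. The Hub dataset id, language config, column names, and sentence-transformer backbone are assumptions for illustration, and the classic SetFitTrainer API is used.

```python
# Minimal few-shot topic classification with SetFit, sampling 10 examples per
# label as in the few-shot setting described in the abstract.
# Dataset id, language config ("hau" for Hausa), column names, and the
# multilingual backbone are assumptions, not the paper's exact choices.
from datasets import load_dataset
from setfit import SetFitModel, SetFitTrainer, sample_dataset

dataset = load_dataset("masakhane/masakhanews", "hau")  # assumed Hub id
train_ds = sample_dataset(dataset["train"], label_column="label", num_samples=10)
eval_ds = dataset["test"]

model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
)

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    column_mapping={"text": "text", "label": "label"},  # assumed column names
)
trainer.train()
print(trainer.evaluate())  # accuracy by default; compute macro F1 separately if needed
```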
Related papers
- IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models [18.260317326787035]
This paper introduces IrokoBench -- a human-translated benchmark dataset for 16 typologically-diverse low-resource African languages.
We use IrokoBench to evaluate zero-shot, few-shot, and translate-test settings (where test sets are translated into English) across 10 open and four proprietary language models.
We observe a significant performance gap between open and proprietary models, with the highest-performing open model, Aya-101, reaching only 58% of the performance of the best-performing proprietary model, GPT-4o.
arXiv Detail & Related papers (2024-06-05T15:23:08Z)
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is their ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora and report superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past NLP research on dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- SERENGETI: Massively Multilingual Language Models for Africa [5.945320097465418]
We develop SERENGETI, a massively multilingual language model that covers 517 African languages and language varieties.
We evaluate our novel models on eight natural language understanding tasks across 20 datasets, comparing them to 4 mPLMs that cover 4-23 African languages.
arXiv Detail & Related papers (2022-12-21T05:54:14Z)
- AfroLM: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 African Languages [0.021987601456703476]
We present AfroLM, a multilingual language model pretrained from scratch on 23 African languages.
AfroLM is pretrained on a dataset 14x smaller than those used by existing baselines.
It is able to generalize well across various domains.
arXiv Detail & Related papers (2022-11-07T02:15:25Z)
- MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition [55.95128479289923]
African languages are spoken by over a billion people, but are underrepresented in NLP research and development.
We create the largest human-annotated NER dataset for 20 African languages.
We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points.
arXiv Detail & Related papers (2022-10-22T08:53:14Z)
- AfroMT: Pretraining Strategies and Reproducible Benchmarks for Translation of 8 African Languages [94.75849612191546]
AfroMT is a standardized, clean, and reproducible machine translation benchmark for eight widely spoken African languages.
We develop a suite of analysis tools for system diagnosis taking into account the unique properties of these languages.
We demonstrate significant improvements when pretraining on 11 languages, with gains of up to 2 BLEU points over strong baselines.
arXiv Detail & Related papers (2021-09-10T07:45:21Z)
- Self-Training Pre-Trained Language Models for Zero- and Few-Shot Multi-Dialectal Arabic Sequence Labeling [7.310390479801139]
We self-train pre-trained language models in zero- and few-shot scenarios to improve performance on data-scarce language varieties.
Our work opens up opportunities for developing dialectal Arabic (DA) models that exploit only Modern Standard Arabic (MSA) resources.
arXiv Detail & Related papers (2021-01-12T21:29:30Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- MTOP: A Comprehensive Multilingual Task-Oriented Semantic Parsing Benchmark [31.91964553419665]
We present a new multilingual dataset, called MTOP, comprising 100k annotated utterances in 6 languages across 11 domains.
We achieve an average improvement of +6.3 points on Slot F1 for the two existing multilingual datasets over the best results reported in their experiments.
We demonstrate strong zero-shot performance using pre-trained models combined with automatic translation and alignment, and a proposed distant supervision method to reduce the noise in slot label projection.
arXiv Detail & Related papers (2020-08-21T07:02:11Z)