MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity
Recognition
- URL: http://arxiv.org/abs/2210.12391v1
- Date: Sat, 22 Oct 2022 08:53:14 GMT
- Authors: David Ifeoluwa Adelani, Graham Neubig, Sebastian Ruder, Shruti
Rijhwani, Michael Beukman, Chester Palen-Michel, Constantine Lignos, Jesujoba
O. Alabi, Shamsuddeen H. Muhammad, Peter Nabende, Cheikh M. Bamba Dione,
Andiswa Bukula, Rooweither Mabuya, Bonaventure F. P. Dossou, Blessing
Sibanda, Happy Buzaaba, Jonathan Mukiibi, Godson Kalipe, Derguene Mbaye,
Amelia Taylor, Fatoumata Kabore, Chris Chinenye Emezue, Anuoluwapo Aremu,
Perez Ogayo, Catherine Gitau, Edwin Munkoh-Buabeng, Victoire M. Koagne,
Allahsera Auguste Tapo, Tebogo Macucwa, Vukosi Marivate, Elvis Mboning,
Tajuddeen Gwadabe, Tosin Adewumi, Orevaoghene Ahia, Joyce Nakatumba-Nabende,
Neo L. Mokono, Ignatius Ezeani, Chiamaka Chukwuneke, Mofetoluwa Adeyemi,
Gilles Q. Hacheme, Idris Abdulmumin, Odunayo Ogundepo, Oreen Yousuf, Tatiana
Moteu Ngoli, Dietrich Klakow
- Abstract summary: African languages are spoken by over a billion people, but are underrepresented in NLP research and development.
We create the largest human-annotated NER dataset for 20 African languages.
We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: African languages are spoken by over a billion people, but are
underrepresented in NLP research and development. The challenges impeding
progress include the limited availability of annotated datasets, as well as a
lack of understanding of the settings where current methods are effective. In
this paper, we make progress towards solutions for these challenges, focusing
on the task of named entity recognition (NER). We create the largest
human-annotated NER dataset for 20 African languages, and we study the behavior
of state-of-the-art cross-lingual transfer methods in an Africa-centric
setting, demonstrating that the choice of source language significantly affects
performance. We show that choosing the best transfer language improves
zero-shot F1 scores by an average of 14 points across 20 languages compared to
using English. Our results highlight the need for benchmark datasets and models
that cover typologically-diverse African languages.
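The paper's headline result is that picking the best transfer (source) language per target, rather than defaulting to English, raises zero-shot F1 substantially. That selection step can be sketched as a simple argmax over a matrix of zero-shot scores. The language codes and F1 values below are hypothetical placeholders for illustration, not the paper's actual numbers:

```python
# Illustrative sketch: given zero-shot F1 scores for each (source, target)
# pair, pick the best source language per target and measure the average
# gain over always transferring from English. All numbers are made up.

# zero_shot_f1[source][target] -> F1 of a model fine-tuned on `source`,
# evaluated zero-shot on `target`
zero_shot_f1 = {
    "eng": {"hau": 52.0, "ibo": 48.5, "swa": 61.0},
    "swa": {"hau": 60.5, "ibo": 55.0, "swa": 88.0},
    "hau": {"hau": 90.0, "ibo": 57.5, "swa": 63.0},
}

def best_source(target, scores, exclude_self=True):
    """Return (source, f1) of the best transfer language for `target`."""
    candidates = {
        src: tgt_scores[target]
        for src, tgt_scores in scores.items()
        if target in tgt_scores and not (exclude_self and src == target)
    }
    src = max(candidates, key=candidates.get)
    return src, candidates[src]

targets = ["hau", "ibo"]
gains = []
for tgt in targets:
    src, f1 = best_source(tgt, zero_shot_f1)
    gain = f1 - zero_shot_f1["eng"][tgt]  # improvement over English transfer
    gains.append(gain)
    print(f"{tgt}: best source = {src} (F1 {f1:.1f}, +{gain:.1f} over eng)")

avg_gain = sum(gains) / len(gains)
print(f"average gain over English: {avg_gain:.1f} F1 points")
```

In the paper this selection is done over 20 African target languages, where the best source per target yields an average gain of 14 F1 points over English; the sketch only shows the bookkeeping, not the model training itself.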
Related papers
- Natural Language Processing for Dialects of a Language: A Survey
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv: 2024-01-11
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv: 2023-09-19
- DN at SemEval-2023 Task 12: Low-Resource Language Text Classification via Multilingual Pretrained Language Model Fine-tuning
Most existing models and datasets for sentiment analysis are developed for high-resource languages, such as English and Chinese.
The AfriSenti-SemEval 2023 Shared Task 12 aims to fill this gap by evaluating sentiment analysis models on low-resource African languages.
We present our solution to the shared task, where we employed multilingual XLM-R models with a classification head, fine-tuned on various data.
arXiv: 2023-05-04
- NLNDE at SemEval-2023 Task 12: Adaptive Pretraining and Source Language Selection for Low-Resource Multilingual Sentiment Analysis
This paper describes our system developed for the SemEval-2023 Task 12, "Sentiment Analysis for Low-resource African languages using Twitter dataset".
Our key finding is that adapting the pretrained model to the target language and task using a small yet relevant corpus improves performance by more than 10 F1 points.
In the shared task, our system wins 8 out of 15 tracks and, in particular, performs best in the multilingual evaluation.
arXiv: 2023-04-28
- MasakhaNEWS: News Topic Classification for African languages
African languages are severely under-represented in NLP research due to a lack of datasets covering several NLP tasks.
We develop MasakhaNEWS -- a new benchmark dataset for news topic classification covering 16 languages widely spoken in Africa.
arXiv: 2023-04-19
- AfroMT: Pretraining Strategies and Reproducible Benchmarks for Translation of 8 African Languages
AfroMT is a standardized, clean, and reproducible machine translation benchmark for eight widely spoken African languages.
We develop a suite of analysis tools for system diagnosis taking into account the unique properties of these languages.
We demonstrate significant improvements when pretraining on 11 languages, with gains of up to 2 BLEU points over strong baselines.
arXiv: 2021-09-10
- MasakhaNER: Named Entity Recognition for African Languages
We create the first large publicly available high-quality dataset for named entity recognition in ten African languages.
We detail characteristics of the languages to help researchers understand the challenges that these languages pose for NER.
arXiv: 2021-03-22
- Lanfrica: A Participatory Approach to Documenting Machine Translation Research on African Languages
Africa has the highest language diversity, with 1500-2000 documented languages and many more undocumented or extinct languages.
This makes it hard to keep track of the MT research, models, and datasets that have been developed for some of them.
Online platforms can help make research, benchmarks, and datasets in these African languages accessible.
arXiv: 2020-08-03
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv: 2020-05-01