MasakhaNER: Named Entity Recognition for African Languages
- URL: http://arxiv.org/abs/2103.11811v1
- Date: Mon, 22 Mar 2021 13:12:44 GMT
- Title: MasakhaNER: Named Entity Recognition for African Languages
- Authors: David Ifeoluwa Adelani, Jade Abbott, Graham Neubig, Daniel D'souza,
Julia Kreutzer, Constantine Lignos, Chester Palen-Michel, Happy Buzaaba,
Shruti Rijhwani, Sebastian Ruder, Stephen Mayhew, Israel Abebe Azime,
Shamsuddeen Muhammad, Chris Chinenye Emezue, Joyce Nakatumba-Nabende, Perez
Ogayo, Anuoluwapo Aremu, Catherine Gitau, Derguene Mbaye, Jesujoba Alabi,
Seid Muhie Yimam, Tajuddeen Gwadabe, Ignatius Ezeani, Rubungo Andre
Niyongabo, Jonathan Mukiibi, Verrah Otiende, Iroro Orife, Davis David, Samba
Ngom, Tosin Adewumi, Paul Rayson, Mofetoluwa Adeyemi, Gerald Muriuki,
Emmanuel Anebi, Chiamaka Chukwuneke, Nkiruka Odu, Eric Peter Wairagala,
Samuel Oyerinde, Clemencia Siro, Tobius Saul Bateesa, Temilola Oloyede,
Yvonne Wambui, Victor Akinode, Deborah Nabagereka, Maurice Katusiime, Ayodele
Awokoya, Mouhamadane MBOUP, Dibora Gebreyohannes, Henok Tilaye, Kelechi
Nwaike, Degaga Wolde, Abdoulaye Faye, Blessing Sibanda, Orevaoghene Ahia,
Bonaventure F. P. Dossou, Kelechi Ogueji, Thierno Ibrahima DIOP, Abdoulaye
Diallo, Adewale Akinfaderin, Tendai Marengereke, and Salomey Osei
- Abstract summary: We create the first large publicly available high-quality dataset for named entity recognition in ten African languages.
We detail characteristics of the languages to help researchers understand the challenges that these languages pose for NER.
- Score: 48.34339599387944
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We take a step towards addressing the under-representation of the African
continent in NLP research by creating the first large publicly available
high-quality dataset for named entity recognition (NER) in ten African
languages, bringing together a variety of stakeholders. We detail
characteristics of the languages to help researchers understand the challenges
that these languages pose for NER. We analyze our datasets and conduct an
extensive empirical evaluation of state-of-the-art methods across both
supervised and transfer learning settings. We release the data, code, and
models in order to inspire future research on African NLP.
Related papers
- State of NLP in Kenya: A Survey [0.25454395163615406]
Kenya, known for its linguistic diversity, faces unique challenges and promising opportunities in advancing Natural Language Processing.
This survey provides a detailed assessment of the current state of NLP in Kenya.
The paper uncovers significant gaps by critically evaluating the available datasets and existing NLP models.
arXiv Detail & Related papers (2024-10-13T18:08:24Z) - The Ghanaian NLP Landscape: A First Look [9.17372840572907]
Ghanaian languages, in particular, face an alarming decline, with documented extinction and several at risk.
This study pioneers a comprehensive survey of Natural Language Processing (NLP) research focused on Ghanaian languages.
arXiv Detail & Related papers (2024-05-10T21:39:09Z) - Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models for dialectic datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z) - Capture the Flag: Uncovering Data Insights with Large Language Models [90.47038584812925]
This study explores the potential of using Large Language Models (LLMs) to automate the discovery of insights in data.
We propose a new evaluation methodology based on a "capture the flag" principle, measuring the ability of such models to recognize meaningful and pertinent information (flags) in a dataset.
arXiv Detail & Related papers (2023-12-21T14:20:06Z) - NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - NusaCrowd: Open Source Initiative for Indonesian NLP Resources [104.5381571820792]
NusaCrowd is a collaborative initiative to collect and unify existing resources for Indonesian languages.
Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.
arXiv Detail & Related papers (2022-12-19T17:28:22Z) - MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity
Recognition [55.95128479289923]
African languages are spoken by over a billion people, but are underrepresented in NLP research and development.
We create the largest human-annotated NER dataset for 20 African languages.
We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points.
arXiv Detail & Related papers (2022-10-22T08:53:14Z) - OkwuGb\'e: End-to-End Speech Recognition for Fon and Igbo [0.015863809575305417]
We present a state-of-art ASR model for Fon, as well as benchmark ASR model results for Igbo.
We conduct a comprehensive linguistic analysis of each language and describe the creation of end-to-end, deep neural network-based speech recognition models for both languages.
arXiv Detail & Related papers (2021-03-13T18:02:44Z) - AI4D -- African Language Dataset Challenge [1.4922337373437886]
This work details the organisation of the AI4D - African Language dataset Challenge.
It is an effort to incentivize the creation, organization and discovery of African language datasets.
We particularly encouraged the submission of annotated datasets which can be used for training task-specific supervised machine learning models.
arXiv Detail & Related papers (2020-07-23T08:48:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.