Masakhane -- Machine Translation For Africa
- URL: http://arxiv.org/abs/2003.11529v1
- Date: Fri, 13 Mar 2020 09:01:02 GMT
- Title: Masakhane -- Machine Translation For Africa
- Authors: Iroro Orife, Julia Kreutzer, Blessing Sibanda, Daniel Whitenack,
Kathleen Siminyu, Laura Martinus, Jamiil Toure Ali, Jade Abbott, Vukosi
Marivate, Salomon Kabongo, Musie Meressa, Espoir Murhabazi, Orevaoghene Ahia,
Elan van Biljon, Arshath Ramkilowan, Adewale Akinfaderin, Alp Öktem, Wole
Akin, Ghollah Kioko, Kevin Degila, Herman Kamper, Bonaventure Dossou, Chris
Emezue, Kelechi Ogueji, Abdallah Bashir
- Abstract summary: MASAKHANE is an open-source, continent-wide, distributed, online research effort for machine translation for African languages.
We discuss our methodology for building the community and spurring research from the African continent.
- Score: 16.66010516114378
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Africa has over 2000 languages. Despite this, African languages account for a
small portion of available resources and publications in Natural Language
Processing (NLP). This is due to multiple factors, including: a lack of focus
from government and funding, discoverability, a lack of community, sheer
language complexity, difficulty in reproducing papers and no benchmarks to
compare techniques. To begin to address the identified problems, MASAKHANE, an
open-source, continent-wide, distributed, online research effort for machine
translation for African languages, was founded. In this paper, we discuss our
methodology for building the community and spurring research from the African
continent, as well as outline the success of the community in terms of
addressing the identified problems affecting African NLP.
Related papers
- The Ghanaian NLP Landscape: A First Look [9.17372840572907]
Ghanaian languages, in particular, face an alarming decline, with documented extinction and several at risk.
This study pioneers a comprehensive survey of Natural Language Processing (NLP) research focused on Ghanaian languages.
arXiv Detail & Related papers (2024-05-10T21:39:09Z)
- Zero-Shot Cross-Lingual Reranking with Large Language Models for Low-Resource Languages [51.301942056881146]
We investigate how large language models (LLMs) function as rerankers in cross-lingual information retrieval systems for African languages.
Our implementation covers English and four African languages (Hausa, Somali, Swahili, and Yoruba).
We examine cross-lingual reranking with queries in English and passages in the African languages.
arXiv Detail & Related papers (2023-12-26T18:38:54Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages [45.88640066767242]
Africa is home to over 2,000 languages from more than six language families and has the highest linguistic diversity among all continents.
Yet, there is little NLP research conducted on African languages. Crucial to enabling such research is the availability of high-quality annotated datasets.
In this paper, we introduce AfriSenti, a sentiment analysis benchmark that contains more than 110,000 tweets in 14 African languages.
arXiv Detail & Related papers (2023-02-17T15:40:12Z)
- MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition [55.95128479289923]
African languages are spoken by over a billion people, but are underrepresented in NLP research and development.
We create the largest human-annotated NER dataset for 20 African languages.
We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points.
arXiv Detail & Related papers (2022-10-22T08:53:14Z)
- NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages [100.59889279607432]
We focus on developing resources for languages in Indonesia.
Most languages in Indonesia are categorized as endangered and some are even extinct.
We develop the first-ever parallel resource for 10 low-resource languages in Indonesia.
arXiv Detail & Related papers (2022-05-31T17:03:50Z)
- Towards Afrocentric NLP for African Languages: Where We Are and Where We Can Go [7.893831644671974]
Situating African languages in a typological framework, we discuss how the particulars of these languages can be harnessed.
Our main objective is to motivate and advocate for an Afrocentric approach to technology development.
arXiv Detail & Related papers (2022-03-16T02:14:57Z)
- MasakhaNER: Named Entity Recognition for African Languages [48.34339599387944]
We create the first large publicly available high-quality dataset for named entity recognition in ten African languages.
We detail characteristics of the languages to help researchers understand the challenges that these languages pose for NER.
arXiv Detail & Related papers (2021-03-22T13:12:44Z)
- The first large scale collection of diverse Hausa language datasets [0.0]
Hausa is considered a well-studied and documented language among the sub-Saharan African languages.
It is estimated that over 100 million people speak the language.
We provide an expansive collection of curated datasets consisting of both formal and informal forms of the language.
arXiv Detail & Related papers (2021-02-13T19:34:20Z)
- Lanfrica: A Participatory Approach to Documenting Machine Translation Research on African Languages [0.012691047660244334]
Africa has the highest language diversity, with 1500-2000 documented languages and many more undocumented or extinct languages.
This makes it hard to keep track of the MT research, models, and datasets that have been developed for some of them.
Online platforms can help make research, benchmarks, and datasets in these African languages accessible.
arXiv Detail & Related papers (2020-08-03T18:14:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.