Lanfrica: A Participatory Approach to Documenting Machine Translation
Research on African Languages
- URL: http://arxiv.org/abs/2008.07302v1
- Date: Mon, 3 Aug 2020 18:14:04 GMT
- Title: Lanfrica: A Participatory Approach to Documenting Machine Translation
Research on African Languages
- Authors: Chris C. Emezue and Bonaventure F.P. Dossou
- Abstract summary: Africa has the highest language diversity, with 1500-2000 documented languages and many more undocumented or extinct languages.
This makes it hard to keep track of the MT research, models and datasets that have been developed for some of them.
Online platforms can be useful in making research, benchmarks and datasets in these African languages accessible.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Over the years, there have been campaigns to include the African languages in
the growing research on machine translation (MT) in particular, and natural
language processing (NLP) in general. Africa has the highest language
diversity, with 1500-2000 documented languages and many more undocumented or
extinct languages (Lewis, 2009; Bendor-Samuel, 2017). This makes it hard to keep
track of the MT research, models and datasets that have been developed for some
of them. As the internet and social media are part of the daily lives of more than
half of the world (Lin, 2020), as well as over 40% of Africans (Campbell, 2019),
online platforms can be useful in making research, benchmarks and datasets in
these African languages accessible, thereby improving
reproducibility and sharing of existing research and its results. In this
paper, we introduce Lanfrica, a novel, ongoing framework that employs a
participatory approach to documenting research, projects, benchmarks and
datasets on African languages.
Related papers
- EthioMT: Parallel Corpus for Low-resource Ethiopian Languages [49.80726355048843]
We introduce EthioMT -- a new parallel corpus for 15 languages.
We also create a new benchmark by collecting a dataset for better-researched languages in Ethiopia.
We evaluate the newly collected corpus and the benchmark dataset for 23 Ethiopian languages using transformer and fine-tuning approaches.
arXiv Detail & Related papers (2024-03-28T12:26:45Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages [45.88640066767242]
Africa is home to over 2,000 languages from more than six language families and has the highest linguistic diversity among all continents.
Yet, there is little NLP research conducted on African languages. Crucial to enabling such research is the availability of high-quality annotated datasets.
In this paper, we introduce AfriSenti, a sentiment analysis benchmark that contains a total of >110,000 tweets in 14 African languages.
arXiv Detail & Related papers (2023-02-17T15:40:12Z)
- MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity
Recognition [55.95128479289923]
African languages are spoken by over a billion people, but are underrepresented in NLP research and development.
We create the largest human-annotated NER dataset for 20 African languages.
We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points.
arXiv Detail & Related papers (2022-10-22T08:53:14Z)
- AfroMT: Pretraining Strategies and Reproducible Benchmarks for
Translation of 8 African Languages [94.75849612191546]
AfroMT is a standardized, clean, and reproducible machine translation benchmark for eight widely spoken African languages.
We develop a suite of analysis tools for system diagnosis taking into account the unique properties of these languages.
We demonstrate significant improvements when pretraining on 11 languages, with gains of up to 2 BLEU points over strong baselines.
arXiv Detail & Related papers (2021-09-10T07:45:21Z)
- AI4D -- African Language Program [0.21960481478626018]
This work details the AI4D - African Language Program, a 3-part project that incentivised the crowd-sourcing, collection and curation of language datasets.
Key outcomes of the work so far include 1) the creation of 9+ open source, African language datasets annotated for a variety of ML tasks, and 2) the creation of baseline models for these datasets.
arXiv Detail & Related papers (2021-04-06T13:51:16Z)
- MasakhaNER: Named Entity Recognition for African Languages [48.34339599387944]
We create the first large publicly available high-quality dataset for named entity recognition in ten African languages.
We detail characteristics of the languages to help researchers understand the challenges that these languages pose for NER.
arXiv Detail & Related papers (2021-03-22T13:12:44Z)
- The first large scale collection of diverse Hausa language datasets [0.0]
Hausa is considered a well-studied and documented language among the sub-Saharan African languages.
It is estimated that over 100 million people speak the language.
We provide an expansive collection of curated datasets consisting of both formal and informal forms of the language.
arXiv Detail & Related papers (2021-02-13T19:34:20Z)
- Masakhane -- Machine Translation For Africa [16.66010516114378]
MASAKHANE is an open-source, continent-wide, distributed, online research effort for machine translation for African languages.
We discuss our methodology for building the community and spurring research from the African continent.
arXiv Detail & Related papers (2020-03-13T09:01:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.