Igbo-English Machine Translation: An Evaluation Benchmark
- URL: http://arxiv.org/abs/2004.00648v1
- Date: Wed, 1 Apr 2020 18:06:21 GMT
- Title: Igbo-English Machine Translation: An Evaluation Benchmark
- Authors: Ignatius Ezeani, Paul Rayson, Ikechukwu Onyenwe, Chinedu Uchechukwu,
Mark Hepple
- Abstract summary: We discuss our effort toward building a standard machine translation benchmark dataset for Igbo.
Igbo is spoken by more than 50 million people globally, with over 50% of the speakers in southeastern Nigeria.
- Score: 3.0151383439513753
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Although researchers and practitioners are pushing the boundaries and
enhancing the capacities of NLP tools and methods, work on African languages
is lagging. Much of the focus is on well-resourced languages such as English,
Japanese, German, French, Russian, Mandarin Chinese, etc. Over 97% of the
world's 7000 languages, including African languages, are low-resourced for NLP,
i.e. they have little or no data, tools, and techniques for NLP research. For
instance, only 5 out of 2965 (0.19%) authors of full-text papers in the ACL
Anthology, extracted from the 5 major conferences in 2018 (ACL, NAACL, EMNLP,
COLING and CoNLL), are affiliated with African institutions. In this work, we
discuss our effort toward building a standard machine translation benchmark
dataset for Igbo, one of the 3 major Nigerian languages. Igbo is spoken by more
than 50 million people globally, with over 50% of the speakers in
southeastern Nigeria. Igbo is low-resourced, although there have been some
efforts toward developing IgboNLP, such as part-of-speech tagging and diacritic
restoration.
Related papers
- Bridging the Gap: Enhancing LLM Performance for Low-Resource African Languages with New Benchmarks, Fine-Tuning, and Cultural Adjustments [0.9214083577876088]
This paper creates approximately 1 million human-translated words of new benchmark data in 8 low-resource African languages.
Our benchmarks are translations of Winogrande and three sections of MMLU: college medicine, clinical knowledge, and virology.
Using the translated benchmarks, we report previously unknown performance gaps between state-of-the-art (SOTA) LLMs in English and African languages.
arXiv Detail & Related papers (2024-12-16T23:50:21Z)
- Harnessing Transfer Learning from Swahili: Advancing Solutions for Comorian Dialects [0.0]
We aim to pioneer NLP technologies for Comorian, a group of four languages or dialects belonging to the Bantu family.
Our approach is motivated by the hypothesis that if a human can understand a different language from their native language with little or no effort, it would be entirely possible to model this process on a machine.
arXiv Detail & Related papers (2024-12-09T22:47:41Z)
- Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects [72.18753241750964]
Yorùbá is an African language with roughly 47 million speakers.
Recent efforts to develop NLP technologies for African languages have focused on their standard dialects.
We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus.
arXiv Detail & Related papers (2024-06-27T22:38:04Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages [45.88640066767242]
Africa is home to over 2,000 languages from more than six language families and has the highest linguistic diversity among all continents.
Yet, there is little NLP research conducted on African languages. Crucial to enabling such research is the availability of high-quality annotated datasets.
In this paper, we introduce AfriSenti, a sentiment analysis benchmark that contains a total of >110,000 tweets in 14 African languages.
arXiv Detail & Related papers (2023-02-17T15:40:12Z)
- MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition [55.95128479289923]
African languages are spoken by over a billion people, but are underrepresented in NLP research and development.
We create the largest human-annotated NER dataset for 20 African languages.
We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points.
arXiv Detail & Related papers (2022-10-22T08:53:14Z)
- No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z)
- NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages [100.59889279607432]
We focus on developing resources for languages in Indonesia.
Most languages in Indonesia are categorized as endangered and some are even extinct.
We develop the first-ever parallel resource for 10 low-resource languages in Indonesia.
arXiv Detail & Related papers (2022-05-31T17:03:50Z)
- Towards End-to-End Training of Automatic Speech Recognition for Nigerian Pidgin [0.0]
Nigerian Pidgin is one of the most popular languages in West Africa.
We present the first parallel (speech-to-text) data on Nigerian Pidgin.
We also trained the first end-to-end speech recognition system on this language.
arXiv Detail & Related papers (2020-10-21T16:32:58Z)
- Lanfrica: A Participatory Approach to Documenting Machine Translation Research on African Languages [0.012691047660244334]
Africa has the highest language diversity, with 1500-2000 documented languages and many more undocumented or extinct languages.
This makes it hard to keep track of the MT research, models and datasets that have been developed for some of them.
Online platforms can be useful in making research, benchmarks and datasets in these African languages accessible.
arXiv Detail & Related papers (2020-08-03T18:14:04Z)