Igbo-English Machine Translation: An Evaluation Benchmark
- URL: http://arxiv.org/abs/2004.00648v1
- Date: Wed, 1 Apr 2020 18:06:21 GMT
- Title: Igbo-English Machine Translation: An Evaluation Benchmark
- Authors: Ignatius Ezeani, Paul Rayson, Ikechukwu Onyenwe, Chinedu Uchechukwu, Mark Hepple
- Abstract summary: We discuss our effort toward building a standard machine translation benchmark dataset for Igbo.
Igbo is spoken by more than 50 million people globally, with over 50% of the speakers in southeastern Nigeria.
- Score: 3.0151383439513753
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Although researchers and practitioners are pushing the boundaries and
enhancing the capacities of NLP tools and methods, work on African languages
is lagging. Much of the focus is on well-resourced languages such as English,
Japanese, German, French, Russian, Mandarin Chinese, etc. Over 97% of the
world's 7000 languages, including African languages, are low-resourced for NLP,
i.e. they have little or no data, tools, and techniques for NLP research. For
instance, only 5 out of 2965 (0.19%) authors of full-text papers in the ACL
Anthology extracted from the 5 major conferences in 2018 (ACL, NAACL, EMNLP,
COLING and CoNLL) are affiliated with African institutions. In this work, we
discuss our effort toward building a standard machine translation benchmark
dataset for Igbo, one of the 3 major Nigerian languages. Igbo is spoken by more
than 50 million people globally, with over 50% of the speakers in
southeastern Nigeria. Igbo is low-resourced, although there have been some
efforts toward developing IgboNLP, such as part-of-speech tagging and diacritic
restoration.
Related papers
- Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects [72.18753241750964]
Yorùbá is an African language with roughly 47 million speakers.
Recent efforts to develop NLP technologies for African languages have focused on their standard dialects.
We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus.
arXiv Detail & Related papers (2024-06-27T22:38:04Z)
- IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models [18.260317326787035]
This paper introduces IrokoBench -- a human-translated benchmark dataset for 16 typologically-diverse low-resource African languages.
We use IrokoBench to evaluate zero-shot, few-shot, and translate-test settings (where test sets are translated into English) across 10 open and four proprietary language models.
We observe a significant performance gap between open and proprietary models, with the highest-performing open model, Aya-101, reaching only 58% of the performance of the best proprietary model, GPT-4o.
arXiv Detail & Related papers (2024-06-05T15:23:08Z) - CCAE: A Corpus of Chinese-based Asian Englishes [8.563253881619124]
This paper represents one of the few initial efforts to utilize the NLP technology in the paradigm of World Englishes.
We present an overview of the CCAE -- Corpus of Chinese-based Asian English, a suite of corpora comprising six Chinese-based Asian English varieties.
arXiv Detail & Related papers (2023-10-09T03:34:15Z) - NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages [45.88640066767242]
Africa is home to over 2,000 languages from more than six language families and has the highest linguistic diversity among all continents.
Yet, there is little NLP research conducted on African languages. Crucial to enabling such research is the availability of high-quality annotated datasets.
In this paper, we introduce AfriSenti, a sentiment analysis benchmark that contains a total of >110,000 tweets in 14 African languages.
arXiv Detail & Related papers (2023-02-17T15:40:12Z) - MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity
Recognition [55.95128479289923]
African languages are spoken by over a billion people, but are underrepresented in NLP research and development.
We create the largest human-annotated NER dataset for 20 African languages.
We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points.
arXiv Detail & Related papers (2022-10-22T08:53:14Z) - No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z) - NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local
Languages [100.59889279607432]
We focus on developing resources for languages in Indonesia.
Most languages in Indonesia are categorized as endangered and some are even extinct.
We develop the first-ever parallel resource for 10 low-resource languages in Indonesia.
arXiv Detail & Related papers (2022-05-31T17:03:50Z) - Towards End-to-End Training of Automatic Speech Recognition for Nigerian
Pidgin [0.0]
Nigerian pidgin is one of the most popular languages in West Africa.
We present the first parallel (speech-to-text) data on Nigerian pidgin.
We also trained the first end-to-end speech recognition system on this language.
arXiv Detail & Related papers (2020-10-21T16:32:58Z) - Lanfrica: A Participatory Approach to Documenting Machine Translation
Research on African Languages [0.012691047660244334]
Africa has the highest language diversity, with 1500-2000 documented languages and many more undocumented or extinct languages.
This makes it hard to keep track of the MT research, models and datasets that have been developed for some of them.
Online platforms can be useful in making research, benchmarks and datasets in these African languages accessible.
arXiv Detail & Related papers (2020-08-03T18:14:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.