Related papers: Igbo-English Machine Translation: An Evaluation Benchmark

URL: http://arxiv.org/abs/2004.00648v1
Date: Wed, 1 Apr 2020 18:06:21 GMT
Title: Igbo-English Machine Translation: An Evaluation Benchmark
Authors: Ignatius Ezeani, Paul Rayson, Ikechukwu Onyenwe, Chinedu Uchechukwu, Mark Hepple
Abstract summary: We discuss our effort toward building a standard machine translation benchmark dataset for Igbo. Igbo is spoken by more than 50 million people globally, with over 50% of the speakers in southeastern Nigeria.
Score: 3.0151383439513753
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: Although researchers and practitioners are pushing the boundaries and enhancing the capacities of NLP tools and methods, work on African languages is lagging. Much of the focus is on well-resourced languages such as English, Japanese, German, French, Russian, Mandarin Chinese, etc. Over 97% of the world's 7000 languages, including African languages, are low-resourced for NLP, i.e. they have little or no data, tools, and techniques for NLP research. For instance, only 5 out of 2965 (0.19%) authors of full-text papers in the ACL Anthology extracted from the 5 major conferences in 2018 (ACL, NAACL, EMNLP, COLING and CoNLL) are affiliated with African institutions. In this work, we discuss our effort toward building a standard machine translation benchmark dataset for Igbo, one of the 3 major Nigerian languages. Igbo is spoken by more than 50 million people globally, with over 50% of the speakers in southeastern Nigeria. Igbo is low-resourced, although there have been some efforts toward developing IgboNLP, such as part-of-speech tagging and diacritic restoration.

Related papers

Ibom NLP: A Step Toward Inclusive Natural Language Processing for Nigeria's Minority Languages [5.5078606217036965]
Nigeria is the most populous country in Africa with a population of more than 200 million people. More than 500 languages are spoken in Nigeria and it is one of the most linguistically diverse countries in the world. Despite this, natural language processing (NLP) research has mostly focused on the following four languages: Hausa, Igbo, Nigerian-Pidgin, and Yoruba.
arXiv Detail & Related papers (2025-11-09T20:33:39Z)
Bridging the Gap: Enhancing LLM Performance for Low-Resource African Languages with New Benchmarks, Fine-Tuning, and Cultural Adjustments [0.9214083577876088]
This paper creates approximately 1 million human-translated words of new benchmark data in 8 low-resource African languages. Our benchmarks are translations of Winogrande and three sections of MMLU: college medicine, clinical knowledge, and virology. Using the benchmarks translated, we report previously unknown performance gaps between state-of-the-art (SOTA) LLMs in English and African languages.
arXiv Detail & Related papers (2024-12-16T23:50:21Z)
Harnessing Transfer Learning from Swahili: Advancing Solutions for Comorian Dialects [0.0]
We aim to pioneer NLP technologies for Comorian, a group of four languages or dialects belonging to the Bantu family. Our approach is motivated by the hypothesis that if a human can understand a different language from their native language with little or no effort, it would be entirely possible to model this process on a machine.
arXiv Detail & Related papers (2024-12-09T22:47:41Z)
Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects [72.18753241750964]
Yorùbá is an African language with roughly 47 million speakers. Recent efforts to develop NLP technologies for African languages have focused on their standard dialects. We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus.
arXiv Detail & Related papers (2024-06-27T22:38:04Z)
IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models [18.260317326787035]
This paper introduces IrokoBench -- a human-translated benchmark dataset for 16 typologically-diverse low-resource African languages. We use IrokoBench to evaluate zero-shot, few-shot, and translate-test settings (where test sets are translated into English) across 10 open and four proprietary language models. We observe a significant performance gap between open and proprietary models, with the highest-performing open model, Aya-101, at only 58% of the performance of the best-performing proprietary model, GPT-4o.
arXiv Detail & Related papers (2024-06-05T15:23:08Z)
AfroBench: How Good are Large Language Models on African Languages? [55.35674466745322]
AfroBench is a benchmark for evaluating the performance of LLMs across 64 African languages. AfroBench consists of nine natural language understanding datasets, six text generation datasets, six knowledge and question answering tasks, and one mathematical reasoning task.
arXiv Detail & Related papers (2023-11-14T08:10:14Z)
CCAE: A Corpus of Chinese-based Asian Englishes [8.563253881619124]
This paper represents one of the few initial efforts to utilize the NLP technology in the paradigm of World Englishes. We present an overview of the CCAE -- Corpus of Chinese-based Asian English, a suite of corpora comprising six Chinese-based Asian English varieties.
arXiv Detail & Related papers (2023-10-09T03:34:15Z)
NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages [45.88640066767242]
Africa is home to over 2,000 languages from more than six language families and has the highest linguistic diversity among all continents. Yet, there is little NLP research conducted on African languages. Crucial to enabling such research is the availability of high-quality annotated datasets. In this paper, we introduce AfriSenti, a sentiment analysis benchmark that contains a total of >110,000 tweets in 14 African languages.
arXiv Detail & Related papers (2023-02-17T15:40:12Z)
MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition [55.95128479289923]
African languages are spoken by over a billion people, but are underrepresented in NLP research and development. We create the largest human-annotated NER dataset for 20 African languages. We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points.
arXiv Detail & Related papers (2022-10-22T08:53:14Z)
naab: A ready-to-use plug-and-play corpus for Farsi [1.381198851698147]
naab is the largest publicly available, cleaned, and ready-to-use Farsi textual corpus. Naab consists of 130GB of data, comprising over 250 million paragraphs and 15 billion words. Naab-raw is an unprocessed version of the dataset, along with a pre-processing toolkit.
arXiv Detail & Related papers (2022-08-29T10:40:58Z)
No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages. We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks. Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z)
NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages [100.59889279607432]
We focus on developing resources for languages in Indonesia. Most languages in Indonesia are categorized as endangered and some are even extinct. We develop the first-ever parallel resource for 10 low-resource languages in Indonesia.
arXiv Detail & Related papers (2022-05-31T17:03:50Z)
Towards End-to-End Training of Automatic Speech Recognition for Nigerian Pidgin [0.0]
Nigerian pidgin is one of the most popular languages in West Africa. We present the first parallel (speech-to-text) data on Nigerian pidgin. We also trained the first end-to-end speech recognition system on this language.
arXiv Detail & Related papers (2020-10-21T16:32:58Z)
Lanfrica: A Participatory Approach to Documenting Machine Translation Research on African Languages [0.012691047660244334]
Africa has the highest language diversity, with 1500-2000 documented languages and many more undocumented or extinct languages. This makes it hard to keep track of the MT research, models and datasets that have been developed for some of them. Online platforms can be useful in making research, benchmarks and datasets in these African languages accessible.
arXiv Detail & Related papers (2020-08-03T18:14:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.