Related papers: Towards Neural Machine Translation for Edoid Languages

URL: http://arxiv.org/abs/2003.10704v1
Date: Tue, 24 Mar 2020 07:53:41 GMT
Title: Towards Neural Machine Translation for Edoid Languages
Authors: Iroro Orife
Abstract summary: Many Nigerian languages have relinquished their previous prestige and purpose in modern society to English and Nigerian Pidgin. This work explores the feasibility of Neural Machine Translation for the Edoid language family of Southern Nigeria.
Score: 2.144787054581292
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Many Nigerian languages have relinquished their previous prestige and purpose in modern society to English and Nigerian Pidgin. For the millions of L1 speakers of indigenous languages, there are inequalities that manifest themselves as unequal access to information, communications, health care, security as well as attenuated participation in political and civic life. To minimize exclusion and promote socio-linguistic and economic empowerment, this work explores the feasibility of Neural Machine Translation (NMT) for the Edoid language family of Southern Nigeria. Using the new JW300 public dataset, we trained and evaluated baseline translation models for four widely spoken languages in this group: Èdó, Ésán, Urhobo and Isoko. Trained models, code and datasets have been open-sourced to advance future research efforts on Edoid language technology.

Related papers

Ibom NLP: A Step Toward Inclusive Natural Language Processing for Nigeria's Minority Languages [5.5078606217036965]
Nigeria is the most populous country in Africa, with a population of more than 200 million people. More than 500 languages are spoken in Nigeria, and it is one of the most linguistically diverse countries in the world. Despite this, natural language processing (NLP) research has mostly focused on the following four languages: Hausa, Igbo, Nigerian Pidgin, and Yoruba.
arXiv Detail & Related papers (2025-11-09T20:33:39Z)
Building low-resource African language corpora: A case study of Kidawida, Kalenjin and Dholuo [0.815557531820863]
This paper presents a case study on the development of linguistic corpora for three under-resourced Kenyan languages, Kidaw'ida, Kalenjin, and Dholuo. Our project employed a selective crowd-sourcing methodology to collect text and speech data from native speakers of these languages. We made these resources freely accessible via open-research platforms, namely Zenodo for the parallel text corpora and Mozilla Common Voice for the speech datasets.
arXiv Detail & Related papers (2025-01-19T10:17:21Z)
Harnessing the Power of Artificial Intelligence to Vitalize Endangered Indigenous Languages: Technologies and Experiences [31.62071644137294]
We discuss the decreasing diversity of languages in the world and how working with Indigenous languages poses unique ethical challenges for AI and NLP. We report encouraging results in the development of high-quality machine learning translators for Indigenous languages. We present prototypes we have built in projects done in 2023 and 2024 with Indigenous communities in Brazil, aimed at facilitating writing.
arXiv Detail & Related papers (2024-07-17T14:46:37Z)
Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects [72.18753241750964]
Yorùbá is an African language with roughly 47 million speakers. Recent efforts to develop NLP technologies for African languages have focused on their standard dialects. We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus.
arXiv Detail & Related papers (2024-06-27T22:38:04Z)
EthioMT: Parallel Corpus for Low-resource Ethiopian Languages [49.80726355048843]
We introduce EthioMT -- a new parallel corpus for 15 languages. We also create a new benchmark by collecting a dataset for better-researched languages in Ethiopia. We evaluate the newly collected corpus and the benchmark dataset for 23 Ethiopian languages using transformer and fine-tuning approaches.
arXiv Detail & Related papers (2024-03-28T12:26:45Z)
NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
Neural Machine Translation for the Indigenous Languages of the Americas: An Introduction [102.13536517783837]
Most languages from the Americas are low-resource, having a limited amount of parallel and monolingual data, if any. We discuss recent advances, findings, and open questions arising from the NLP community's increased interest in these languages.
arXiv Detail & Related papers (2023-06-11T23:27:47Z)
Multi-lingual and Multi-cultural Figurative Language Understanding [69.47641938200817]
Figurative language permeates human communication, but is relatively understudied in NLP. We create a dataset for seven diverse languages associated with a variety of cultures: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili and Yoruba. Our dataset reveals that each language relies on cultural and regional concepts for figurative expressions, with the highest overlap between languages originating from the same region. All languages exhibit a significant deficiency compared to English, with variations in performance reflecting the availability of pre-training and fine-tuning data.
arXiv Detail & Related papers (2023-05-25T15:30:31Z)
No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages. We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks. Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z)
The first large scale collection of diverse Hausa language datasets [0.0]
Hausa is considered a well-studied and documented language among the sub-Saharan African languages. It is estimated that over 100 million people speak the language. We provide an expansive collection of curated datasets consisting of both formal and informal forms of the language.
arXiv Detail & Related papers (2021-02-13T19:34:20Z)
Towards End-to-End Training of Automatic Speech Recognition for Nigerian Pidgin [0.0]
Nigerian Pidgin is one of the most popular languages in West Africa. We present the first parallel (speech-to-text) data on Nigerian Pidgin. We also trained the first end-to-end speech recognition system on this language.
arXiv Detail & Related papers (2020-10-21T16:32:58Z)
Towards Supervised and Unsupervised Neural Machine Translation Baselines for Nigerian Pidgin [0.2792030485253753]
Nigerian Pidgin is arguably the most widely spoken language in Nigeria. Variants of this language are also spoken across West and Central Africa. This work aims to establish supervised and unsupervised neural machine translation baselines between English and Nigerian Pidgin.
arXiv Detail & Related papers (2020-03-27T22:40:01Z)

This list is automatically generated from the titles and abstracts of the papers on this site.