Towards Neural Machine Translation for Edoid Languages
- URL: http://arxiv.org/abs/2003.10704v1
- Date: Tue, 24 Mar 2020 07:53:41 GMT
- Title: Towards Neural Machine Translation for Edoid Languages
- Authors: Iroro Orife
- Abstract summary: Many Nigerian languages have relinquished their previous prestige and purpose in modern society to English and Nigerian Pidgin.
This work explores the feasibility of Neural Machine Translation for the Edoid language family of Southern Nigeria.
- Score: 2.144787054581292
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many Nigerian languages have relinquished their previous prestige and purpose
in modern society to English and Nigerian Pidgin. For the millions of L1
speakers of indigenous languages, there are inequalities that manifest
themselves as unequal access to information, communications, health care,
security as well as attenuated participation in political and civic life. To
minimize exclusion and promote socio-linguistic and economic empowerment, this
work explores the feasibility of Neural Machine Translation (NMT) for the Edoid
language family of Southern Nigeria. Using the new JW300 public dataset, we
trained and evaluated baseline translation models for four widely spoken
languages in this group: Èdó, Ésán, Urhobo and Isoko. Trained models,
code and datasets have been open-sourced to advance future research efforts on
Edoid language technology.
Related papers
- Building low-resource African language corpora: A case study of Kidawida, Kalenjin and Dholuo [0.815557531820863]
This paper presents a case study on the development of linguistic corpora for three under-resourced Kenyan languages, Kidaw'ida, Kalenjin, and Dholuo.
Our project employed a selective crowd-sourcing methodology to collect text and speech data from native speakers of these languages.
We made these resources freely accessible via open-research platforms, namely Zenodo for the parallel text corpora and Mozilla Common Voice for the speech datasets.
arXiv Detail & Related papers (2025-01-19T10:17:21Z)
- Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects [72.18753241750964]
Yorùbá is an African language with roughly 47 million speakers.
Recent efforts to develop NLP technologies for African languages have focused on their standard dialects.
We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus.
arXiv Detail & Related papers (2024-06-27T22:38:04Z)
- EthioMT: Parallel Corpus for Low-resource Ethiopian Languages [49.80726355048843]
We introduce EthioMT -- a new parallel corpus for 15 languages.
We also create a new benchmark by collecting a dataset for better-researched languages in Ethiopia.
We evaluate the newly collected corpus and the benchmark dataset for 23 Ethiopian languages using transformer and fine-tuning approaches.
arXiv Detail & Related papers (2024-03-28T12:26:45Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Neural Machine Translation for the Indigenous Languages of the Americas: An Introduction [102.13536517783837]
Most languages from the Americas are low-resource, having a limited amount of parallel and monolingual data, if any.
We discuss recent advances, findings, and open questions that have resulted from the NLP community's increased interest in these languages.
arXiv Detail & Related papers (2023-06-11T23:27:47Z)
- Multi-lingual and Multi-cultural Figurative Language Understanding [69.47641938200817]
Figurative language permeates human communication, but is relatively understudied in NLP.
We create a dataset for seven diverse languages associated with a variety of cultures: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili and Yoruba.
Our dataset reveals that each language relies on cultural and regional concepts for figurative expressions, with the highest overlap between languages originating from the same region.
All languages exhibit a significant deficiency compared to English, with variations in performance reflecting the availability of pre-training and fine-tuning data.
arXiv Detail & Related papers (2023-05-25T15:30:31Z)
- No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z)
- The first large scale collection of diverse Hausa language datasets [0.0]
Hausa is considered a well-studied and documented language among the sub-Saharan African languages.
It is estimated that over 100 million people speak the language.
We provide an expansive collection of curated datasets consisting of both formal and informal forms of the language.
arXiv Detail & Related papers (2021-02-13T19:34:20Z)
- Towards End-to-End Training of Automatic Speech Recognition for Nigerian Pidgin [0.0]
Nigerian Pidgin is one of the most popular languages in West Africa.
We present the first parallel (speech-to-text) data on Nigerian Pidgin.
We also trained the first end-to-end speech recognition system on this language.
arXiv Detail & Related papers (2020-10-21T16:32:58Z)
- Towards Supervised and Unsupervised Neural Machine Translation Baselines for Nigerian Pidgin [0.2792030485253753]
Nigerian Pidgin is arguably the most widely spoken language in Nigeria. Variants of this language are also spoken across West and Central Africa.
This work aims to establish supervised and unsupervised neural machine translation baselines between English and Nigerian Pidgin.
arXiv Detail & Related papers (2020-03-27T22:40:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.