Crowdsourcing Parallel Corpus for English-Oromo Neural Machine
Translation using Community Engagement Platform
- URL: http://arxiv.org/abs/2102.07539v1
- Date: Mon, 15 Feb 2021 13:22:30 GMT
- Title: Crowdsourcing Parallel Corpus for English-Oromo Neural Machine
Translation using Community Engagement Platform
- Authors: Sisay Chala, Bekele Debisa, Amante Diriba, Silas Getachew, Chala Getu,
Solomon Shiferaw
- Abstract summary: The paper deals with implementing translation from English to Afaan Oromo and vice versa using Neural Machine Translation.
Using a bilingual corpus of just over 40k sentence pairs that we collected, this study shows a promising result.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Even though Afaan Oromo is the most widely spoken language in the Cushitic
family, with more than fifty million speakers in the Horn of Africa and East Africa, it is
surprisingly resource-scarce from a technological point of view. The increasing
amount of useful documents written in English motivates the development of
machine translation systems that can make those documents easily
accessible in the local language. The paper deals with implementing translation
from English to Afaan Oromo and vice versa using Neural Machine Translation.
This direction has not been well explored due to the limited amount and
diversity of the available corpus. However, using a bilingual corpus of just over 40k
sentence pairs that we collected, this study shows a promising result. About a
quarter of this corpus was collected via a Community Engagement Platform (CEP)
that was implemented to enrich the parallel corpus through crowdsourced
translations.
Related papers
- Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects [72.18753241750964]
Yorùbá is an African language with roughly 47 million speakers.
Recent efforts to develop NLP technologies for African languages have focused on their standard dialects.
We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus.
arXiv Detail & Related papers (2024-06-27T22:38:04Z)
- SeamlessM4T: Massively Multilingual & Multimodal Machine Translation [90.71078166159295]
We introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-text translation, and automatic speech recognition for up to 100 languages.
We developed the first multilingual system capable of translating from and into English for both speech and text.
On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation.
arXiv Detail & Related papers (2023-08-22T17:44:18Z)
- A Bilingual Parallel Corpus with Discourse Annotations [82.07304301996562]
This paper describes BWB, a large parallel corpus first introduced in Jiang et al. (2022), along with an annotated test set.
The BWB corpus consists of Chinese novels translated by experts into English, and the annotated test set is designed to probe the ability of machine translation systems to model various discourse phenomena.
arXiv Detail & Related papers (2022-10-26T12:33:53Z)
- No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z)
- Lahjoita puhetta -- a large-scale corpus of spoken Finnish with some benchmarks [9.160401226886947]
The Donate Speech campaign has so far succeeded in gathering approximately 3600 hours of ordinary, colloquial Finnish speech.
The primary goals of the collection were to create a representative, large-scale resource to study spontaneous spoken Finnish and to accelerate the development of language technology and speech-based services.
We present the collection process and the collected corpus, and showcase its versatility through multiple use cases.
arXiv Detail & Related papers (2022-03-24T07:50:25Z)
- Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models into a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z)
- Leveraging Multilingual News Websites for Building a Kurdish Parallel Corpus [0.6445605125467573]
We present a corpus containing 12,327 translation pairs in the two major dialects of Kurdish, Sorani and Kurmanji.
We also provide 1,797 and 650 translation pairs in English-Kurmanji and English-Sorani.
arXiv Detail & Related papers (2020-10-04T11:52:50Z)
- A Corpus for Large-Scale Phonetic Typology [112.19288631037055]
We present VoxClamantis v1.0, the first large-scale corpus for phonetic typology.
The corpus consists of aligned segments and estimated phoneme-level labels in 690 readings spanning 635 languages, along with acoustic-phonetic measures of vowels and sibilants.
arXiv Detail & Related papers (2020-05-28T13:03:51Z)
- PHINC: A Parallel Hinglish Social Media Code-Mixed Corpus for Machine Translation [1.2301855531996841]
This paper presents a parallel corpus of 13,738 code-mixed English-Hindi sentences and their corresponding translations in English.
The sentences were translated manually by the annotators.
We are releasing the parallel corpus to facilitate future research opportunities in code-mixed machine translation.
arXiv Detail & Related papers (2020-04-20T17:04:22Z)
- Neural Machine Translation for Low-Resourced Indian Languages [4.726777092009554]
Machine translation is an effective approach to convert text to a different language without any human involvement.
In this paper, we apply NMT to two language pairs involving highly morphologically rich Indian languages: English-Tamil and English-Malayalam.
We propose a novel NMT model using multi-head self-attention along with pre-trained Byte-Pair-Encoded (BPE) and MultiBPE embeddings to develop an efficient translation system.
arXiv Detail & Related papers (2020-04-19T17:29:34Z)
- Investigating Language Impact in Bilingual Approaches for Computational Language Documentation [28.838960956506018]
This paper investigates how the choice of translation language affects the posterior documentation work.
We create 56 bilingual pairs that we apply to the task of low-resource unsupervised word segmentation and alignment.
Our results suggest that incorporating clues into the neural models' input representation increases their translation and alignment quality.
arXiv Detail & Related papers (2020-03-30T10:30:34Z)