Central Kurdish machine translation: First large scale parallel corpus
and experiments
- URL: http://arxiv.org/abs/2106.09325v1
- Date: Thu, 17 Jun 2021 08:41:53 GMT
- Title: Central Kurdish machine translation: First large scale parallel corpus
and experiments
- Authors: Zhila Amini, Mohammad Mohammadamini (LIA), Hawre Hosseini, Mehran
Mansouri, Daban Jaff
- Abstract summary: We present the first large scale parallel corpus of Central Kurdish-English, Awta, containing 229,222 pairs of manually aligned translations.
Our best performing systems achieve 22.72 and 16.81 in BLEU score for Ku$\rightarrow$EN and En$\rightarrow$Ku, respectively.
- Score: 2.099922236065961
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While the computational processing of Kurdish has experienced a relative
increase, the machine translation of this language seems to be lacking a
considerable body of scientific work. This is in part due to the lack of
resources especially curated for this task. In this paper, we present the first
large scale parallel corpus of Central Kurdish-English, Awta, containing
229,222 pairs of manually aligned translations. Our corpus is collected from
different text genres and domains in an attempt to build more robust and
real-world applications of machine translation. We make a portion of this
corpus publicly available in order to foster research in this area. Further, we
build several neural machine translation models in order to benchmark the task
of Kurdish machine translation. Additionally, we perform extensive experimental
analysis of results in order to identify the major challenges that Central
Kurdish machine translation faces. These challenges include language-dependent
and language-independent ones, as categorized in this paper; the first group
relates to Central Kurdish linguistic properties at the morphological,
syntactic and semantic levels. Our best performing systems achieve 22.72 and
16.81 in BLEU score for Ku$\rightarrow$EN and En$\rightarrow$Ku, respectively.
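For readers unfamiliar with the metric behind the reported scores, the following is a minimal sketch of corpus-level BLEU (clipped n-gram precision up to 4-grams, combined by a geometric mean and a brevity penalty). It is not the authors' evaluation script: the function names `ngrams` and `corpus_bleu`, the whitespace tokenization, the single reference per sentence, and the absence of smoothing are all assumptions made for illustration. Published scores are normally produced with standard tools and their own tokenization, so numbers from this sketch are not comparable to the 22.72 and 16.81 reported above.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with one reference per hypothesis, scaled to 0-100.

    Assumptions (not from the paper): whitespace tokenization, no smoothing.
    """
    matched = [0] * max_n   # clipped n-gram matches, per n-gram order
    total = [0] * max_n     # n-grams produced by the system, per n-gram order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # Counter intersection clips each hypothesis n-gram count
            # by its count in the reference.
            matched[n - 1] += sum((hyp_counts & ref_counts).values())
            total[n - 1] += sum(hyp_counts.values())

    if hyp_len == 0 or min(matched) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU = 0

    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Penalize system output that is not longer than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)
```

Calling `corpus_bleu(["the cat sat on the mat"], ["the cat sat on the mat"])` returns 100.0 for an exact match; any hypothesis set with no matching 4-gram scores 0.0 under this unsmoothed variant.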
Related papers
- Shifting from endangerment to rebirth in the Artificial Intelligence Age: An Ensemble Machine Learning Approach for Hawrami Text Classification [1.174020933567308]
Hawrami, a dialect of Kurdish, is classified as an endangered language.
This paper introduces various text classification models using a dataset of 6,854 articles in Hawrami labeled into 15 categories by two native speakers.
arXiv Detail & Related papers (2024-09-25T12:52:21Z)
- Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine Translation of Lecture Transcripts [50.00305136008848]
We propose a framework for parallel corpus mining, which provides a quick and effective way to mine a parallel corpus from publicly available lectures on Coursera.
For both English--Japanese and English--Chinese lecture translations, we extracted parallel corpora of approximately 50,000 lines and created development and test sets.
This study also suggests guidelines for gathering and cleaning corpora, mining parallel sentences, cleaning noise in the mined data, and creating high-quality evaluation splits.
arXiv Detail & Related papers (2023-11-07T03:50:25Z)
- The Effect of Normalization for Bi-directional Amharic-English Neural Machine Translation [53.907805815477126]
This paper presents the first relatively large-scale Amharic-English parallel sentence dataset.
We build bi-directional Amharic-English translation models by fine-tuning the existing Facebook M2M100 pre-trained model.
The results show that the normalization of Amharic homophone characters increases the performance of Amharic-English machine translation in both directions.
arXiv Detail & Related papers (2022-10-27T07:18:53Z)
- A Bilingual Parallel Corpus with Discourse Annotations [82.07304301996562]
This paper describes BWB, a large parallel corpus first introduced in Jiang et al. (2022), along with an annotated test set.
The BWB corpus consists of Chinese novels translated by experts into English, and the annotated test set is designed to probe the ability of machine translation systems to model various discourse phenomena.
arXiv Detail & Related papers (2022-10-26T12:33:53Z)
- No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z)
- DEEP: DEnoising Entity Pre-training for Neural Machine Translation [123.6686940355937]
It has been shown that machine translation models usually generate poor translations for named entities that are infrequent in the training corpus.
We propose DEEP, a DEnoising Entity Pre-training method that leverages large amounts of monolingual data and a knowledge base to improve named entity translation accuracy within sentences.
arXiv Detail & Related papers (2021-11-14T17:28:09Z)
- Building the Language Resource for a Cebuano-Filipino Neural Machine Translation System [0.0]
We present the efforts made to build a parallel corpus for Cebuano and Filipino from two different domains: biblical texts and the web.
For the biblical resource, subword unit translation for verbs and a copy-able approach for nouns were applied to correct inconsistencies in the translation.
For Wikipedia, commonly occurring topic segments were extracted from both the source and the target languages.
arXiv Detail & Related papers (2021-10-05T23:03:09Z)
- Extended Parallel Corpus for Amharic-English Machine Translation [0.0]
It will be useful for machine translation of an under-resourced language, Amharic.
We trained neural machine translation and phrase-based statistical machine translation models using the corpus.
arXiv Detail & Related papers (2021-04-08T06:51:08Z)
- Towards Machine Translation for the Kurdish Language [0.0]
Machine translation is the task of translating texts from one language to another using computers.
Kurdish, an Indo-European language, has received little attention in this realm due to the language being less-resourced.
We describe the available scarce parallel data suitable for training a neural machine translation model for Sorani Kurdish-English translation.
arXiv Detail & Related papers (2020-10-12T21:28:57Z)
- Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information [72.2412707779571]
mRASP is an approach to pre-train a universal multilingual neural machine translation model.
We carry out experiments on 42 translation directions across diverse settings, including low-, medium-, and rich-resource pairs, as well as transfer to exotic language pairs.
arXiv Detail & Related papers (2020-10-07T03:57:54Z)
- Leveraging Multilingual News Websites for Building a Kurdish Parallel Corpus [0.6445605125467573]
We present a corpus containing 12,327 translation pairs in the two major dialects of Kurdish, Sorani and Kurmanji.
We also provide 1,797 and 650 translation pairs in English-Kurmanji and English-Sorani, respectively.
arXiv Detail & Related papers (2020-10-04T11:52:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.