Leveraging Multilingual News Websites for Building a Kurdish Parallel
Corpus
- URL: http://arxiv.org/abs/2010.01554v1
- Date: Sun, 4 Oct 2020 11:52:50 GMT
- Title: Leveraging Multilingual News Websites for Building a Kurdish Parallel
Corpus
- Authors: Sina Ahmadi, Hossein Hassani, Daban Q. Jaff
- Abstract summary: We present a corpus containing 12,327 translation pairs in the two major dialects of Kurdish, Sorani and Kurmanji.
We also provide 1,797 and 650 translation pairs in English-Kurmanji and English-Sorani.
- Score: 0.6445605125467573
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Machine translation has been a major motivation of development in natural
language processing. Despite the burgeoning achievements in creating more
efficient machine translation systems thanks to deep learning methods, parallel
corpora have remained indispensable for progress in the field. In an attempt to
create parallel corpora for the Kurdish language, in this paper, we describe
our approach in retrieving potentially-alignable news articles from
multi-language websites and manually align them across dialects and languages
based on lexical similarity and transliteration of scripts. We present a corpus
containing 12,327 translation pairs in the two major dialects of Kurdish,
Sorani and Kurmanji. We also provide 1,797 and 650 translation pairs in
English-Kurmanji and English-Sorani. The corpus is publicly available under the
CC BY-NC-SA 4.0 license.
Related papers
- Decoupled Vocabulary Learning Enables Zero-Shot Translation from Unseen Languages [55.157295899188476]
neural machine translation systems learn to map sentences of different languages into a common representation space.
In this work, we test this hypothesis by zero-shot translating from unseen languages.
We demonstrate that this setup enables zero-shot translation from entirely unseen languages.
arXiv Detail & Related papers (2024-08-05T07:58:58Z) - Enhancing Language Learning through Technology: Introducing a New English-Azerbaijani (Arabic Script) Parallel Corpus [0.9051256541674136]
This paper introduces a pioneering English-Azerbaijani (Arabic Script) parallel corpus.
It is designed to bridge the technological gap in language learning and machine translation for under-resourced languages.
arXiv Detail & Related papers (2024-07-06T21:23:20Z) - Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects [72.18753241750964]
Yorub'a is an African language with roughly 47 million speakers.
Recent efforts to develop NLP technologies for African languages have focused on their standard dialects.
We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus.
arXiv Detail & Related papers (2024-06-27T22:38:04Z) - Language and Speech Technology for Central Kurdish Varieties [27.751434601712]
Kurdish, an Indo-European language spoken by over 30 million speakers, is considered a dialect continuum.
Previous studies addressing language and speech technology for Kurdish handle it in a monolithic way as a macro-language.
In this paper, we take a step towards developing resources for language and speech technology for varieties of Central Kurdish.
arXiv Detail & Related papers (2024-03-04T12:27:32Z) - Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine
Translation of Lecture Transcripts [50.00305136008848]
We propose a framework for parallel corpus mining, which provides a quick and effective way to mine a parallel corpus from publicly available lectures on Coursera.
For both English--Japanese and English--Chinese lecture translations, we extracted parallel corpora of approximately 50,000 lines and created development and test sets.
This study also suggests guidelines for gathering and cleaning corpora, mining parallel sentences, cleaning noise in the mined data, and creating high-quality evaluation splits.
arXiv Detail & Related papers (2023-11-07T03:50:25Z) - SAHAAYAK 2023 -- the Multi Domain Bilingual Parallel Corpus of Sanskrit
to Hindi for Machine Translation [0.0]
The corpus contains total of 1.5M sentence pairs between Sanskrit and Hindi.
Data from multiple domain has been incorporated into the corpus that includes, News, Daily conversations, Politics, History, Sport, and Ancient Indian Literature.
arXiv Detail & Related papers (2023-06-27T11:06:44Z) - Central Kurdish machine translation: First large scale parallel corpus
and experiments [2.099922236065961]
We present the first large scale parallel corpus of Central Kurdish-English, Awta, containing 229,222 pairs of manually aligned translations.
Our best performing systems achieve 22.72 and 16.81 in BLEU score for Ku$rightarrow$EN and En$rightarrow$Ku, respectively.
arXiv Detail & Related papers (2021-06-17T08:41:53Z) - Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural
Machine Translation [53.22775597051498]
We present a continual pre-training framework on mBART to effectively adapt it to unseen languages.
Results show that our method can consistently improve the fine-tuning performance upon the mBART baseline.
Our approach also boosts the performance on translation pairs where both languages are seen in the original mBART's pre-training.
arXiv Detail & Related papers (2021-05-09T14:49:07Z) - Learning Contextualised Cross-lingual Word Embeddings and Alignments for
Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z) - Towards Machine Translation for the Kurdish Language [0.0]
Machine translation is the task of translating texts from one language to another using computers.
Kurdish, an Indo-European language, has received little attention in this realm due to the language being less-resourced.
We describe the available scarce parallel data suitable for training a neural machine translation model for Sorani Kurdish-English translation.
arXiv Detail & Related papers (2020-10-12T21:28:57Z) - CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It diversified with over 11,000 speakers and over 60 accents.
CoVoST is released under CC0 license and free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.