A Large-Scale Study of Machine Translation in the Turkic Languages
- URL: http://arxiv.org/abs/2109.04593v1
- Date: Thu, 9 Sep 2021 23:56:30 GMT
- Title: A Large-Scale Study of Machine Translation in the Turkic Languages
- Authors: Jamshidbek Mirzakhalov, Anoop Babu, Duygu Ataman, Sherzod Kariev,
Francis Tyers, Otabek Abduraufov, Mammad Hajili, Sardana Ivanova, Abror
Khaytbaev, Antonio Laverghetta Jr., Behzodbek Moydinboyev, Esra Onal,
Shaxnoza Pulatova, Ahsan Wahab, Orhan Firat, Sriram Chellappan
- Abstract summary: Recent advances in neural machine translation (NMT) have pushed the quality of machine translation systems to the point where they are being widely adopted to build competitive systems.
However, a large number of languages have yet to reap the benefits of NMT.
This paper provides the first large-scale case study of the practical application of MT in the Turkic language family.
- Score: 7.3458368273762815
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in neural machine translation (NMT) have pushed the quality
of machine translation systems to the point where they are being widely adopted
to build competitive systems. However, a large number of languages have yet to
reap the benefits of NMT. In this paper, we provide the first large-scale case
study of the practical application of MT in the Turkic language family in order
to realize the gains of NMT for Turkic languages under high-resource to
extremely low-resource scenarios. In addition to presenting an extensive
analysis that identifies the bottlenecks in building competitive systems and
ways to ameliorate data scarcity, our study makes several key contributions,
including: i) a large parallel corpus covering 22 Turkic languages, consisting
of common public datasets combined with new datasets of approximately 2 million
parallel sentences; ii) bilingual baselines for 26 language pairs; iii) novel
high-quality test sets in three different translation domains; and iv) human
evaluation scores. All models, scripts, and data will be released to the
public.
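As a rough illustration of how such bilingual baselines are typically scored with automatic metrics (the paper pairs these with human evaluation), here is a minimal sketch using the sacrebleu library; the file names and the language pair are hypothetical placeholders, not the released data:

# Minimal sketch: corpus-level BLEU and chrF for one baseline's output.
# The hyp/ref file names and the uz-tr pair are hypothetical placeholders.
import sacrebleu

with open("hyp.uz-tr.txt", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("ref.uz-tr.txt", encoding="utf-8") as f:
    references = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}  chrF: {chrf.score:.2f}")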
Related papers
- EthioMT: Parallel Corpus for Low-resource Ethiopian Languages [49.80726355048843]
We introduce EthioMT -- a new parallel corpus for 15 languages.
We also create a new benchmark by collecting a dataset for better-researched languages in Ethiopia.
We evaluate the newly collected corpus and the benchmark dataset for 23 Ethiopian languages using transformer and fine-tuning approaches.
arXiv Detail & Related papers (2024-03-28T12:26:45Z)
- T3L: Translate-and-Test Transfer Learning for Cross-Lingual Text Classification [50.675552118811]
Cross-lingual text classification is typically built on large-scale, multilingual language models (LMs) pretrained on a variety of languages of interest.
We propose revisiting the classic "translate-and-test" pipeline to neatly separate the translation and classification stages.
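A minimal sketch of this two-stage idea, with translate() and classify() as hypothetical stand-ins rather than T3L's actual components:

# Sketch of a translate-and-test pipeline: the translation and
# classification stages stay separate and independently swappable.
# translate() and classify() below are hypothetical stubs.
from typing import Callable, List

def translate_and_test(
    texts: List[str],
    translate: Callable[[str], str],   # any MT system into the classifier's language
    classify: Callable[[str], str],    # e.g. an English-only classifier
) -> List[str]:
    return [classify(translate(t)) for t in texts]

# Usage with trivial stand-ins:
labels = translate_and_test(
    ["bu film juda yaxshi"],                    # hypothetical Uzbek input
    translate=lambda t: "this movie is great",  # stub MT stage
    classify=lambda t: "positive" if "great" in t else "negative",
)
print(labels)  # ['positive']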
arXiv Detail & Related papers (2023-06-08T07:33:22Z)
- Improving Multilingual Neural Machine Translation System for Indic Languages [0.0]
We propose a multilingual neural machine translation (MNMT) system to address the issues related to low-resource language translation.
A state-of-the-art transformer architecture is used to realize the proposed model.
Experiments on a substantial amount of data show that it outperforms conventional models.
arXiv Detail & Related papers (2022-09-27T09:51:56Z)
- Towards the Next 1000 Languages in Multilingual Machine Translation: Exploring the Synergy Between Supervised and Self-Supervised Learning [48.15259834021655]
We present a pragmatic approach towards building a multilingual machine translation model that covers hundreds of languages.
We use a mixture of supervised and self-supervised objectives, depending on the data availability for different language pairs.
We demonstrate that the synergy between these two training paradigms enables the model to produce high-quality translations in the zero-resource setting.
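A hedged sketch of that per-pair objective choice: pairs with parallel data get a supervised translation loss, while the rest fall back to a self-supervised objective on monolingual text. The two loss functions are hypothetical stubs:

# Sketch: pick the training objective per language pair by data availability.
# supervised_step() and self_supervised_step() are hypothetical stubs.
import random

def supervised_step(model, src_batch, tgt_batch):
    """Cross-entropy translation loss on a parallel batch."""
    ...

def self_supervised_step(model, mono_batch):
    """Denoising objective (e.g. span masking) on monolingual text."""
    ...

def training_step(model, pair, parallel_data, monolingual_data):
    if pair in parallel_data:                  # supervised signal exists
        src, tgt = parallel_data[pair].sample()
        return supervised_step(model, src, tgt)
    lang = random.choice(pair)                 # fall back to monolingual text
    return self_supervised_step(model, monolingual_data[lang].sample())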
arXiv Detail & Related papers (2022-01-09T23:36:44Z)
- Evaluating Multiway Multilingual NMT in the Turkic Languages [11.605271847666005]
We present an evaluation of state-of-the-art approaches to training and evaluating machine translation systems in 22 languages from the Turkic language family.
We train 26 bilingual baselines as well as a multi-way neural MT (MNMT) model using the corpus and perform an extensive analysis using automatic metrics as well as human evaluations.
We find that the MNMT model outperforms almost all bilingual baselines on the out-of-domain test sets, and that fine-tuning the model on the downstream task of a single language pair also yields substantial performance gains.
arXiv Detail & Related papers (2021-09-13T19:01:07Z)
- Majority Voting with Bidirectional Pre-translation For Bitext Retrieval [2.580271290008534]
A popular approach has been to mine so-called "pseudo-parallel" sentences from paired documents in two languages.
In this paper, we outline some problems with current methods, propose computationally economical solutions to those problems, and demonstrate success with novel methods.
We make the code and data used for our experiments publicly available.
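As a simplified sketch of the agreement idea behind such mining (mutual best-match filtering rather than the paper's full majority-voting-with-pre-translation recipe; retrieve() is a hypothetical stub for any sentence-similarity search):

# Sketch: keep only candidate pairs on which both retrieval directions
# agree, a simple stand-in for voting-based filtering.
# retrieve(query, pool) is a hypothetical stub returning the best match.
def mine_bitext(src_sents, tgt_sents, retrieve):
    fwd = {s: retrieve(s, tgt_sents) for s in src_sents}  # src -> best tgt
    bwd = {t: retrieve(t, src_sents) for t in tgt_sents}  # tgt -> best src
    return [(s, t) for s, t in fwd.items() if bwd.get(t) == s]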
arXiv Detail & Related papers (2021-03-10T22:24:01Z)
- Improving Multilingual Neural Machine Translation For Low-Resource Languages: French-, English-Vietnamese [4.103253352106816]
This paper proposes two simple strategies to address the rare word issue in multilingual MT systems for two low-resource language pairs: French-Vietnamese and English-Vietnamese.
We show improvements of up to +1.62 and +2.54 BLEU points over the bilingual baseline systems for the two language pairs, respectively.
arXiv Detail & Related papers (2020-12-16T04:43:43Z)
- Leveraging Monolingual Data with Self-Supervision for Multilingual Neural Machine Translation [54.52971020087777]
Using monolingual data significantly boosts the translation quality of low-resource languages in multilingual models.
Self-supervision improves zero-shot translation quality in multilingual models.
We get up to 33 BLEU on ro-en translation without any parallel data or back-translation.
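The self-supervised signal in such setups is commonly a denoising objective over monolingual text; below is a minimal sketch of MASS/BART-style span masking (the 35% mask ratio is illustrative, not the paper's recipe):

# Sketch: corrupt a sentence by masking a contiguous span; the model is
# trained to reconstruct the original from the corrupted input.
import random

def mask_span(tokens, mask_token="<mask>", ratio=0.35):
    n = max(1, int(len(tokens) * ratio))
    start = random.randrange(0, len(tokens) - n + 1)
    corrupted = tokens[:start] + [mask_token] * n + tokens[start + n:]
    return corrupted, tokens  # (model input, reconstruction target)

src, tgt = mask_span("nu am văzut niciodată un film atât de bun".split())
print(src)  # e.g. ['nu', 'am', '<mask>', '<mask>', '<mask>', 'un', ...]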
arXiv Detail & Related papers (2020-05-11T00:20:33Z)
- Knowledge Distillation for Multilingual Unsupervised Neural Machine Translation [61.88012735215636]
Unsupervised neural machine translation (UNMT) has recently achieved remarkable results for several language pairs.
However, existing UNMT systems can only translate between a single language pair and cannot produce translation results for multiple language pairs at the same time.
In this paper, we empirically introduce a simple method to translate between thirteen languages using a single encoder and a single decoder.
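A shared encoder/decoder is usually steered across pairs by tagging each source sentence with the desired target language (the trick popularized by Google's multilingual NMT); a minimal sketch of that preprocessing step, with hypothetical tag names:

# Sketch: prepend a target-language tag so one model serves many pairs.
# The <2xx> tag format is a common convention, assumed here.
def tag_source(sentence: str, target_lang: str) -> str:
    return f"<2{target_lang}> {sentence}"

print(tag_source("how are you?", "tr"))  # <2tr> how are you?
print(tag_source("how are you?", "kk"))  # <2kk> how are you?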
arXiv Detail & Related papers (2020-04-21T17:26:16Z)
- Pre-training via Leveraging Assisting Languages and Data Selection for Neural Machine Translation [49.51278300110449]
We propose exploiting monolingual corpora of other languages to compensate for the scarcity of monolingual corpora in the languages of interest.
A case study of low-resource Japanese-English neural machine translation (NMT) reveals that leveraging large Chinese and French monolingual corpora can help overcome the shortage of Japanese and English monolingual corpora.
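Data selection of this kind is often done by scoring assisting-language sentences with some in-domain model and keeping the top fraction; a hedged sketch with a hypothetical scorer:

# Sketch: keep the assisting-language sentences (e.g. Chinese, French)
# that score highest for the target setup. score() is a hypothetical
# stub, e.g. negative perplexity under an in-domain language model.
def select_top(sentences, score, keep_fraction=0.2):
    ranked = sorted(sentences, key=score, reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_fraction))]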
arXiv Detail & Related papers (2020-01-23T02:47:39Z)