Evaluating Multiway Multilingual NMT in the Turkic Languages
- URL: http://arxiv.org/abs/2109.06262v1
- Date: Mon, 13 Sep 2021 19:01:07 GMT
- Title: Evaluating Multiway Multilingual NMT in the Turkic Languages
- Authors: Jamshidbek Mirzakhalov, Anoop Babu, Aigiz Kunafin, Ahsan Wahab, Behzod
Moydinboyev, Sardana Ivanova, Mokhiyakhon Uzokova, Shaxnoza Pulatova, Duygu
Ataman, Julia Kreutzer, Francis Tyers, Orhan Firat, John Licato, Sriram
Chellappan
- Abstract summary: We present an evaluation of state-of-the-art approaches to training and evaluating machine translation systems in 22 languages from the Turkic language family.
We train 26 bilingual baselines as well as a multi-way neural MT (MNMT) model using the corpus and perform an extensive analysis using automatic metrics and human evaluations.
We find that the MNMT model outperforms almost all bilingual baselines on the out-of-domain test sets, and that finetuning the model on the downstream task of a single pair also yields a large performance boost.
- Score: 11.605271847666005
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the increasing number of large and comprehensive machine translation
(MT) systems, evaluation of these methods in various languages has been
restrained by the lack of high-quality parallel corpora as well as engagement
with the people who speak these languages. In this study, we present an
evaluation of state-of-the-art approaches to training and evaluating MT systems
in 22 languages from the Turkic language family, most of which are extremely
under-explored. First, we adopt the TIL Corpus with a few key improvements to
the training and the evaluation sets. Then, we train 26 bilingual baselines as
well as a multi-way neural MT (MNMT) model using the corpus and perform an
extensive analysis using automatic metrics and human evaluations. We
find that the MNMT model outperforms almost all bilingual baselines on the
out-of-domain test sets, and that finetuning the model on the downstream task of a
single pair also yields a large performance boost in both low- and
high-resource scenarios. Our careful analysis of evaluation criteria for MT
models in Turkic languages also points to the necessity for further research in
this direction. We release the corpus splits, test sets as well as models to
the public.
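To make the single-pair finetuning step above concrete, here is a minimal sketch of adapting a publicly available multilingual NMT checkpoint to one Turkic pair with the Hugging Face transformers API. The facebook/m2m100_418M checkpoint, the toy Turkish-Uzbek sentence pair, and the hyperparameters are stand-ins for illustration only; they are not the paper's MNMT model or its training setup, and the `text_target` argument requires a reasonably recent transformers release.

```python
import torch
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# Stand-in checkpoint and toy parallel data (Turkish -> Uzbek); replace with a real corpus split.
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer.src_lang, tokenizer.tgt_lang = "tr", "uz"

pairs = [("Kitap masanın üstünde.", "Kitob stol ustida.")]
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(3):
    for src, tgt in pairs:
        # Tokenize source and target; with text_target set, the batch includes `labels`.
        batch = tokenizer(src, text_target=tgt, return_tensors="pt")
        loss = model(**batch).loss  # cross-entropy over target tokens
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Translate with the adapted model, forcing Uzbek as the target language.
model.eval()
inputs = tokenizer("Hava bugün çok güzel.", return_tensors="pt")
out = model.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id("uz"), max_length=40)
print(tokenizer.batch_decode(out, skip_special_tokens=True))
```

In practice the toy pair would be replaced by a corpus split and the loop by a proper Trainer or fairseq run; the sketch only shows the shape of finetuning a multilingual checkpoint on a single direction.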
Related papers
- Revisiting Machine Translation for Cross-lingual Classification [91.43729067874503]
Most research in this area focuses on the multilingual models rather than the machine translation component.
We show that, by using a stronger MT system and mitigating the mismatch between training on original text and running inference on machine translated text, translate-test can do substantially better than previously assumed.
arXiv Detail & Related papers (2023-05-23T16:56:10Z)
- Evaluating and Improving the Coreference Capabilities of Machine Translation Models [30.60934078720647]
Machine translation requires a wide range of linguistic capabilities.
Current end-to-end models are expected to acquire these capabilities implicitly by observing aligned sentences in bilingual corpora.
arXiv Detail & Related papers (2023-02-16T18:16:09Z)
- High-resource Language-specific Training for Multilingual Neural Machine Translation [109.31892935605192]
We propose the multilingual translation model with the high-resource language-specific training (HLT-MT) to alleviate the negative interference.
Specifically, we first train the multilingual model only with the high-resource pairs and select the language-specific modules at the top of the decoder.
HLT-MT is further trained on all available corpora to transfer knowledge from high-resource languages to low-resource languages.
arXiv Detail & Related papers (2022-07-11T14:33:13Z)
- Analyzing the Mono- and Cross-Lingual Pretraining Dynamics of Multilingual Language Models [73.11488464916668]
This study investigates the dynamics of the multilingual pretraining process.
We probe checkpoints taken from throughout XLM-R pretraining, using a suite of linguistic tasks.
Our analysis shows that the model achieves high in-language performance early on, with lower-level linguistic skills acquired before more complex ones.
arXiv Detail & Related papers (2022-05-24T03:35:00Z)
- A Large-Scale Study of Machine Translation in the Turkic Languages [7.3458368273762815]
Recent advances in neural machine translation (NMT) have pushed the quality of machine translation systems to the point where they are becoming widely adopted to build competitive systems.
However, there is still a large number of languages that are yet to reap the benefits of NMT.
This paper provides the first large-scale case study of the practical application of MT in the Turkic language family.
arXiv Detail & Related papers (2021-09-09T23:56:30Z)
- Improving the Lexical Ability of Pretrained Language Models for Unsupervised Neural Machine Translation [127.81351683335143]
Cross-lingual pretraining requires models to align the lexical- and high-level representations of the two languages.
Previous research has shown that translation quality suffers when these representations are not sufficiently aligned.
In this paper, we enhance the bilingual masked language model pretraining with lexical-level information by using type-level cross-lingual subword embeddings.
arXiv Detail & Related papers (2021-03-18T21:17:58Z)
- MENYO-20k: A Multi-domain English-Yorùbá Corpus for Machine Translation and Domain Adaptation [1.4553698107056112]
We present MENYO-20k, the first multi-domain parallel corpus for the low-resource Yorùbá--English (yo--en) language pair with standardized train-test splits for benchmarking.
Major gains of $+9.9$ and $+8.6$ BLEU (en2yo) are achieved in comparison to Facebook's M2M-100 and Google's multilingual NMT, respectively.
arXiv Detail & Related papers (2021-03-15T18:52:32Z)
- Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information [72.2412707779571]
mRASP is an approach to pre-train a universal multilingual neural machine translation model.
We carry out experiments on 42 translation directions across a diverse range of settings, including low-, medium-, and rich-resource languages, as well as transfer to exotic language pairs.
arXiv Detail & Related papers (2020-10-07T03:57:54Z)
- COMET: A Neural Framework for MT Evaluation [8.736370689844682]
We present COMET, a neural framework for training multilingual machine translation evaluation models.
Our framework exploits information from both the source input and a target-language reference translation to more accurately predict MT quality; a short usage sketch follows this list.
Our models achieve new state-of-the-art performance on the WMT 2019 Metrics shared task and demonstrate robustness to high-performing systems.
arXiv Detail & Related papers (2020-09-18T18:54:15Z) - Balancing Training for Multilingual Neural Machine Translation [130.54253367251738]
Multilingual machine translation (MT) models can translate to and from multiple languages.
Standard practice is to up-sample less resourced languages to increase representation.
We propose a method that instead automatically learns how to weight training data through a data scorer.
arXiv Detail & Related papers (2020-04-14T18:23:28Z)
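Since the COMET entry above describes a trained evaluation model rather than a string-matching metric, a brief usage sketch may help. It assumes the unbabel-comet package; the checkpoint name Unbabel/wmt22-comet-da is only an example (it postdates the 2020 paper), so consult the COMET documentation for the identifier that matches your installed release.

```python
# pip install unbabel-comet
from comet import download_model, load_from_checkpoint

# Example checkpoint name; different comet releases expose different model identifiers.
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

# COMET scores each segment from the source, the MT hypothesis, and a reference.
data = [
    {
        "src": "Kitap masanın üstünde.",
        "mt": "The book is on the table.",
        "ref": "The book is on the table.",
    }
]

# In recent comet releases, predict() returns an object with .scores and .system_score.
output = model.predict(data, batch_size=8, gpus=0)
print(output.scores)        # per-segment quality estimates
print(output.system_score)  # corpus-level average
```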