MENYO-20k: A Multi-domain English-Yorùbá Corpus for Machine
Translation and Domain Adaptation
- URL: http://arxiv.org/abs/2103.08647v1
- Date: Mon, 15 Mar 2021 18:52:32 GMT
- Title: MENYO-20k: A Multi-domain English-Yorùbá Corpus for Machine
Translation and Domain Adaptation
- Authors: David I. Adelani, Dana Ruiter, Jesujoba O. Alabi, Damilola Adebonojo,
Adesina Ayeni, Mofe Adeyemi, Ayodele Awokoya, Cristina España-Bonet
- Abstract summary: We present MENYO-20k, the first multi-domain parallel corpus for the low-resource Yorùbá--English (yo--en) language pair with standardized train-test splits for benchmarking.
A major gain of BLEU +9.9 and +8.6 (en2yo) is achieved in comparison to Facebook's M2M-100 and Google multilingual NMT, respectively.
- Score: 1.4553698107056112
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Massively multilingual machine translation (MT) has shown impressive
capabilities, including zero and few-shot translation between low-resource
language pairs. However, these models are often evaluated on high-resource
languages with the assumption that they generalize to low-resource ones. The
difficulty of evaluating MT models on low-resource pairs is often due to the
lack of standardized evaluation datasets. In this paper, we present MENYO-20k,
the first multi-domain parallel corpus for the low-resource Yorùbá--English
(yo--en) language pair with standardized train-test splits for benchmarking. We
provide several neural MT (NMT) benchmarks on this dataset and compare to the
performance of popular pre-trained (massively multilingual) MT models, showing
that, in almost all cases, our simple benchmarks outperform the pre-trained MT
models. A major gain of BLEU +9.9 and +8.6 (en2yo) is achieved in comparison
to Facebook's M2M-100 and Google's multilingual NMT, respectively, when we use
MENYO-20k to fine-tune generic models.
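As a rough illustration of the fine-tuning setup described in the abstract, the sketch below fine-tunes the public facebook/m2m100_418M checkpoint on an English-Yorùbá parallel file with Hugging Face Transformers. The file name, hyperparameters, and training loop are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch (not the paper's exact recipe): fine-tune the public
# facebook/m2m100_418M checkpoint on English->Yoruba parallel text such as the
# MENYO-20k training split. File name and hyperparameters are assumptions.
import torch
from torch.utils.data import DataLoader
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model_name = "facebook/m2m100_418M"
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

tokenizer.src_lang = "en"  # M2M-100 language code for English
tokenizer.tgt_lang = "yo"  # M2M-100 language code for Yoruba

def load_pairs(path):
    """Read tab-separated English<TAB>Yoruba sentence pairs (assumed format)."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n").split("\t") for line in f if "\t" in line]

pairs = load_pairs("menyo20k_train.tsv")  # hypothetical file name

def collate(batch):
    src = [en for en, yo in batch]
    tgt = [yo for en, yo in batch]
    enc = tokenizer(src, text_target=tgt, padding=True, truncation=True,
                    max_length=128, return_tensors="pt")
    enc["labels"][enc["labels"] == tokenizer.pad_token_id] = -100  # mask padding
    return enc

loader = DataLoader(pairs, batch_size=8, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for _ in range(3):  # epochs
    for batch in loader:
        loss = model(**batch).loss  # cross-entropy on Yoruba target tokens
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Inference: force the decoder to start with the Yoruba language token.
model.eval()
inputs = tokenizer("Good morning", return_tensors="pt")
out = model.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id("yo"))
print(tokenizer.batch_decode(out, skip_special_tokens=True))
```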
Related papers
- Towards Zero-Shot Multimodal Machine Translation [64.9141931372384]
We propose a method to bypass the need for fully supervised data to train multimodal machine translation systems.
Our method, called ZeroMMT, adapts a strong text-only machine translation (MT) model by training it on a mixture of two objectives.
To prove that our method generalizes to languages with no fully supervised training data available, we extend the CoMMuTE evaluation dataset to three new languages: Arabic, Russian and Chinese.
arXiv Detail & Related papers (2024-07-18T15:20:31Z)
- Machine Translation for Ge'ez Language [0.0]
Machine translation for low-resource languages such as Ge'ez faces challenges such as out-of-vocabulary words, domain mismatches, and lack of labeled training data.
We develop a multilingual neural machine translation (MNMT) model based on language relatedness.
We also experiment with using GPT-3.5, a state-of-the-art LLM, for few-shot translation with fuzzy matches.
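The few-shot-with-fuzzy-matches idea above can be sketched as follows: retrieve the training pairs whose source side is most similar to the input sentence and place them in the prompt as examples. The similarity measure (difflib), prompt wording, and model name are assumptions for illustration, not the paper's exact setup.

```python
# Sketch of few-shot LLM translation with fuzzy matches: retrieve the most
# similar pairs from a translation memory and use them as in-context examples.
# Model name, prompt format, and memory contents are placeholder assumptions.
import difflib
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

translation_memory = [
    # (source, target) pairs from a parallel corpus; contents are placeholders
    ("Hello, how are you?", "<target translation 1>"),
    ("The weather is nice today.", "<target translation 2>"),
]

def fuzzy_matches(sentence, memory, k=3):
    """Return the k memory entries whose source side is most similar."""
    scored = [(difflib.SequenceMatcher(None, sentence, src).ratio(), src, tgt)
              for src, tgt in memory]
    return sorted(scored, reverse=True)[:k]

def translate(sentence, src_lang="English", tgt_lang="Ge'ez"):
    examples = fuzzy_matches(sentence, translation_memory)
    shots = "\n".join(f"{src_lang}: {s}\n{tgt_lang}: {t}" for _, s, t in examples)
    prompt = (f"Translate from {src_lang} to {tgt_lang}.\n\n"
              f"{shots}\n\n{src_lang}: {sentence}\n{tgt_lang}:")
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

print(translate("Good morning."))
```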
arXiv Detail & Related papers (2023-11-24T14:55:23Z)
- Distilling Efficient Language-Specific Models for Cross-Lingual Transfer [75.32131584449786]
Massively multilingual Transformers (MMTs) are widely used for cross-lingual transfer learning.
MMTs' language coverage makes them unnecessarily expensive to deploy in terms of model size, inference time, energy, and hardware cost.
We propose to extract compressed, language-specific models from MMTs which retain the capacity of the original MMTs for cross-lingual transfer.
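The paper's specific extraction procedure is not spelled out in the summary above; as a generic, hedged sketch, distilling a smaller language-specific student from an MMT teacher can combine a temperature-scaled KL term on the teacher's soft predictions with the usual label loss.

```python
# Generic distillation sketch (not necessarily the paper's exact method):
# train a smaller student on the teacher MMT's soft predictions plus the
# standard cross-entropy loss on the gold labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5, pad_id=-100):
    """Weighted sum of temperature-scaled KL (student vs. teacher) and CE."""
    t = temperature
    kl = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=pad_id,
    )
    return alpha * kl + (1.0 - alpha) * ce
```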
arXiv Detail & Related papers (2023-06-02T17:31:52Z)
- Robust Domain Adaptation for Pre-trained Multilingual Neural Machine Translation Models [0.0]
We propose a fine-tuning procedure for the generic mNMT model that combines embedding freezing and an adversarial loss.
Experiments demonstrate that the procedure improves performance on specialized data with minimal loss of initial performance on the generic domain for all language pairs.
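A minimal sketch of the embedding-freezing half of this recipe, assuming a Hugging Face seq2seq checkpoint; the adversarial loss is omitted and the model name is a placeholder.

```python
# Sketch of embedding freezing during domain-adaptive fine-tuning: keep the
# (multilingual) embedding matrices fixed while the rest of the network adapts
# to in-domain data. Adversarial term omitted; checkpoint name is illustrative.
import torch
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/m2m100_418M")

# Freeze input (and, if untied, output) embeddings.
for param in model.get_input_embeddings().parameters():
    param.requires_grad = False
if model.get_output_embeddings() is not None:
    for param in model.get_output_embeddings().parameters():
        param.requires_grad = False

# Only non-frozen parameters are handed to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-5)
```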
arXiv Detail & Related papers (2022-10-26T18:47:45Z)
- SMaLL-100: Introducing Shallow Multilingual Machine Translation Model for Low-Resource Languages [102.50127671423752]
We introduce SMaLL-100, a distilled version of the M2M-100 (12B) machine translation model covering 100 languages.
We train SMaLL-100 with uniform sampling across all language pairs and therefore focus on preserving the performance of low-resource languages.
Our model achieves comparable results to M2M-100 (1.2B), while being 3.6x smaller and 4.3x faster at inference.
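A minimal sketch of the uniform-sampling idea mentioned above, with placeholder corpora: every training step picks a language pair with equal probability regardless of corpus size, so low-resource pairs are not drowned out by high-resource ones.

```python
# Minimal sketch of uniform sampling over language pairs. The corpora below
# are placeholders; in practice each entry would hold a full parallel corpus.
import random

# language pair -> list of (source_sentence, target_sentence) examples
corpora = {
    ("en", "yo"): [("hello", "<yo sentence>")],
    ("en", "fr"): [("hello", "bonjour")] * 1000,  # much larger corpus
}

pairs = list(corpora)

def sample_batch(batch_size=8):
    pair = random.choice(pairs)            # uniform over pairs, not examples
    data = corpora[pair]
    batch = [random.choice(data) for _ in range(batch_size)]
    return pair, batch

pair, batch = sample_batch()
```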
arXiv Detail & Related papers (2022-10-20T22:32:29Z)
- Evaluating Multiway Multilingual NMT in the Turkic Languages [11.605271847666005]
We present an evaluation of state-of-the-art approaches to training and evaluating machine translation systems in 22 languages from the Turkic language family.
We train 26 bilingual baselines as well as a multi-way neural MT (MNMT) model using the corpus and perform an extensive analysis using automatic metrics as well as human evaluations.
We find that the MNMT model outperforms almost all bilingual baselines on the out-of-domain test sets, and fine-tuning the model on the downstream task of a single pair also results in a large performance boost.
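For the automatic-metric side of such an evaluation, a common setup is corpus-level BLEU and chrF with the sacrebleu library; the exact metrics used in the paper may differ, and the sentences below are placeholders.

```python
# Sketch of automatic MT evaluation with sacreBLEU: corpus-level BLEU and chrF
# over system outputs and references. Example sentences are placeholders.
import sacrebleu

hypotheses = ["the cat sat on the mat", "good morning everyone"]
references = [["the cat is on the mat", "good morning , everyone"]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}  chrF = {chrf.score:.1f}")
```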
arXiv Detail & Related papers (2021-09-13T19:01:07Z)
- Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information [72.2412707779571]
mRASP is an approach to pre-train a universal multilingual neural machine translation model.
We carry out experiments on 42 translation directions across a diverse setting, including low-, medium-, and rich-resource languages, as well as transfer to exotic language pairs.
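The summary does not spell out how alignment information is leveraged during pre-training; one hedged sketch of the general idea is random aligned substitution, where source words are occasionally replaced by dictionary translations so that aligned words across languages share contexts. The dictionary and substitution rate below are placeholders.

```python
# Hedged sketch of random aligned substitution during pre-training: each
# source token is replaced with a dictionary translation in another language
# with some probability. Dictionary entries and the rate p are placeholders.
import random

# word -> possible translations in other languages (placeholder entries)
dictionary = {
    "hello": ["bonjour", "hola"],
    "world": ["monde", "mundo"],
}

def random_aligned_substitution(tokens, p=0.3):
    """Replace each token with a random dictionary translation with probability p."""
    out = []
    for tok in tokens:
        if tok in dictionary and random.random() < p:
            out.append(random.choice(dictionary[tok]))
        else:
            out.append(tok)
    return out

print(random_aligned_substitution("hello world".split()))
```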
arXiv Detail & Related papers (2020-10-07T03:57:54Z)
- Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation [81.7786241489002]
Massively multilingual models for neural machine translation (NMT) are theoretically attractive, but often underperform bilingual models and deliver poor zero-shot translations.
We argue that multilingual NMT requires stronger modeling capacity to support language pairs with varying typological characteristics.
We propose random online backtranslation to enforce the translation of unseen training language pairs.
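A hedged sketch of random online backtranslation as described above: for a target sentence, pick a random other language, back-translate the sentence into it with the current model, and train on the resulting synthetic pair. The translate_fn argument stands in for the model's own decoding step; the stub and language inventory are placeholders.

```python
# Hedged sketch of random online backtranslation for unseen (zero-shot)
# directions. `translate_fn` represents the current NMT model's decoding step.
import random

languages = ["de", "fr", "yo", "zh"]  # placeholder language inventory

def robt_example(target_sentence, target_lang, translate_fn):
    """Create one synthetic training pair (synthetic source -> original target)."""
    intermediate = random.choice([l for l in languages if l != target_lang])
    synthetic_source = translate_fn(target_sentence, src=target_lang,
                                    tgt=intermediate)
    return (synthetic_source, intermediate), (target_sentence, target_lang)

# Stub translator for illustration only; in practice this is the NMT model.
demo_translate = lambda text, src, tgt: f"<{tgt} translation of: {text}>"
print(robt_example("guten Morgen", "de", demo_translate))
```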
arXiv Detail & Related papers (2020-04-24T17:21:32Z)
- Towards Making the Most of Context in Neural Machine Translation [112.9845226123306]
We argue that previous research did not make clear use of the global context.
We propose a new document-level NMT framework that deliberately models the local context of each sentence.
arXiv Detail & Related papers (2020-02-19T03:30:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.