Bilex Rx: Lexical Data Augmentation for Massively Multilingual Machine
Translation
- URL: http://arxiv.org/abs/2303.15265v1
- Date: Mon, 27 Mar 2023 14:54:43 GMT
- Title: Bilex Rx: Lexical Data Augmentation for Massively Multilingual Machine
Translation
- Authors: Alex Jones, Isaac Caswell, Ishank Saxena, Orhan Firat
- Abstract summary: This work explores a cheap and abundant resource to combat this problem: bilingual lexica.
We test the efficacy of bilingual lexica in a real-world set-up, on 200-language translation models trained on web-crawled text.
We present several findings: (1) using lexical data augmentation, we demonstrate sizable performance gains for unsupervised translation; (2) we compare several families of data augmentation, demonstrating that they yield similar improvements; and (3) we demonstrate the importance of carefully curated lexica over larger, noisier ones.
- Score: 33.6064740446337
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neural machine translation (NMT) has progressed rapidly over the past several
years, and modern models are able to achieve relatively high quality using only
monolingual text data, an approach dubbed Unsupervised Machine Translation
(UNMT). However, these models still struggle in a variety of ways, including
aspects of translation that for a human are the easiest - for instance,
correctly translating common nouns. This work explores a cheap and abundant
resource to combat this problem: bilingual lexica. We test the efficacy of
bilingual lexica in a real-world set-up, on 200-language translation models
trained on web-crawled text. We present several findings: (1) using lexical
data augmentation, we demonstrate sizable performance gains for unsupervised
translation; (2) we compare several families of data augmentation,
demonstrating that they yield similar improvements, and can be combined for
even greater improvements; (3) we demonstrate the importance of carefully
curated lexica over larger, noisier ones, especially with larger models; and
(4) we compare the efficacy of multilingual lexicon data versus
human-translated parallel data. Finally, we open-source GATITOS (available at
https://github.com/google-research/url-nlp/tree/main/gatitos), a new
multilingual lexicon for 26 low-resource languages, which had the highest
performance among lexica in our experiments.
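The abstract does not spell out the augmentation recipes themselves, but one widely used flavor of lexical data augmentation is codeswitching: replacing source-side words with translations drawn from a bilingual lexicon so that target-language lexical items appear in context during training. The sketch below illustrates that general idea only; the tab-separated lexicon layout, function names, and substitution probability are assumptions made for illustration, not the paper's exact method or the GATITOS file format.

```python
import random
from pathlib import Path


def load_lexicon(path: str) -> dict[str, list[str]]:
    """Load a bilingual lexicon from a file of `source<TAB>target` lines.

    NOTE: the tab-separated layout is an assumption for this sketch; check the
    GATITOS repository for the actual file format before reusing this.
    """
    lexicon: dict[str, list[str]] = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        parts = line.strip().split("\t")
        if len(parts) < 2:
            continue
        src, tgt = parts[0], parts[1]
        lexicon.setdefault(src.lower(), []).append(tgt)
    return lexicon


def codeswitch(sentence: str, lexicon: dict[str, list[str]], p: float = 0.3) -> str:
    """Replace lexicon words with a random target translation, with probability p.

    The resulting noisy, codeswitched sentences can be mixed into MT training
    data so the model sees target-language lexical items in source context.
    """
    out = []
    for token in sentence.split():
        candidates = lexicon.get(token.lower())
        if candidates and random.random() < p:
            out.append(random.choice(candidates))
        else:
            out.append(token)
    return " ".join(out)


if __name__ == "__main__":
    # Toy in-memory lexicon (English -> Spanish) instead of a real lexicon file.
    lex = {"dog": ["perro"], "house": ["casa"], "water": ["agua"]}
    print(codeswitch("the dog drank the water near the house", lex, p=1.0))
    # -> the perro drank the agua near the casa
```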
Related papers
- Cross-lingual Transfer or Machine Translation? On Data Augmentation for
Monolingual Semantic Textual Similarity [2.422759879602353]
Cross-lingual transfer of Wikipedia data improves performance on monolingual STS.
We find that the Wikipedia domain outperforms the NLI domain for these languages, in contrast to prior studies that focused on NLI as training data.
arXiv Detail & Related papers (2024-03-08T12:28:15Z) - A Morphologically-Aware Dictionary-based Data Augmentation Technique for
Machine Translation of Under-Represented Languages [31.18983138590214]
We propose strategies to synthesize parallel data relying on morpho-syntactic information and using bilingual lexicons.
Our methodology adheres to a realistic scenario backed by the small parallel seed data.
It is linguistically informed, as it aims to create augmented data that is more likely to be grammatically correct.
arXiv Detail & Related papers (2024-02-02T22:25:44Z) - XNLI 2.0: Improving XNLI dataset and performance on Cross Lingual
Understanding (XLU) [0.0]
We focus on improving the original XNLI dataset by re-translating the MNLI dataset in all of the 14 different languages present in XNLI.
We also perform experiments by training models in all 15 languages and analyzing their performance on the task of natural language inference.
arXiv Detail & Related papers (2023-01-16T17:24:57Z) - Dict-NMT: Bilingual Dictionary based NMT for Extremely Low Resource
Languages [1.8787713898828164]
We present a detailed analysis of the effects of the quality of dictionaries, training dataset size, language family, etc., on the translation quality.
Results on multiple low-resource test languages show a clear advantage of our bilingual dictionary-based method over the baselines.
arXiv Detail & Related papers (2022-06-09T12:03:29Z) - Back-translation for Large-Scale Multilingual Machine Translation [2.8747398859585376]
This paper aims to build a single multilingual translation system under the hypothesis that a universal cross-language representation leads to better multilingual translation performance.
We extend the exploration of different back-translation methods from bilingual translation to multilingual translation.
Surprisingly, smaller vocabularies perform better, and the extensive monolingual English data offers a modest improvement.
arXiv Detail & Related papers (2021-09-17T18:33:15Z) - Cross-lingual Machine Reading Comprehension with Language Branch
Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from the multiple language branch models into a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z) - Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-centric models brings gains of more than 10 BLEU when directly translating between non-English directions, while performing competitively with the best single systems of WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z) - Pre-training Multilingual Neural Machine Translation by Leveraging
Alignment Information [72.2412707779571]
mRASP is an approach to pre-train a universal multilingual neural machine translation model.
We carry out experiments on 42 translation directions across diverse settings, including low-, medium-, and rich-resource pairs, as well as transfer to exotic language pairs.
arXiv Detail & Related papers (2020-10-07T03:57:54Z) - Improving Massively Multilingual Neural Machine Translation and
Zero-Shot Translation [81.7786241489002]
Massively multilingual models for neural machine translation (NMT) are theoretically attractive, but often underperform bilingual models and deliver poor zero-shot translations.
We argue that multilingual NMT requires stronger modeling capacity to support language pairs with varying typological characteristics.
We propose random online backtranslation to enforce the translation of unseen training language pairs; a minimal sketch of this idea appears after this list.
arXiv Detail & Related papers (2020-04-24T17:21:32Z) - Balancing Training for Multilingual Neural Machine Translation [130.54253367251738]
Multilingual machine translation (MT) models can translate to/from multiple languages.
Standard practice is to up-sample less resourced languages to increase representation.
We propose a method that instead automatically learns how to weight training data through a data scorer.
arXiv Detail & Related papers (2020-04-14T18:23:28Z) - Translation Artifacts in Cross-lingual Transfer Learning [51.66536640084888]
We show that machine translation can introduce subtle artifacts that have a notable impact on existing cross-lingual models.
In natural language inference, translating the premise and the hypothesis independently can reduce the lexical overlap between them.
We also improve the state-of-the-art in XNLI for the translate-test and zero-shot approaches by 4.3 and 2.8 points, respectively.
arXiv Detail & Related papers (2020-04-09T17:54:30Z)
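The random online backtranslation idea referenced in the zero-shot translation entry above can be sketched as follows: for a sampled subset of training examples, the current model back-translates the target sentence into a randomly chosen language, producing a synthetic pair in a direction that may have no supervised data. The `translate` callable, the sampling scheme, and all hyperparameters below are assumptions for illustration, not the original implementation.

```python
import random
from typing import Callable, Iterable

# Hypothetical model interface: translate(text, src_lang, tgt_lang) -> str.
TranslateFn = Callable[[str, str, str], str]


def random_online_backtranslation(
    batch: Iterable[tuple[str, str, str, str]],  # (src_text, src_lang, tgt_text, tgt_lang)
    languages: list[str],
    translate: TranslateFn,
    prob: float = 0.5,
) -> list[tuple[str, str, str, str]]:
    """Augment a batch with synthetic pairs for directions lacking supervision.

    For a random subset of examples, the target sentence is back-translated by
    the current model into a randomly chosen language, yielding a synthetic
    (pivot -> target) pair in a direction that may have no parallel data.
    Sampling strategy and hyperparameters here are illustrative only.
    """
    augmented: list[tuple[str, str, str, str]] = []
    for src_text, src_lang, tgt_text, tgt_lang in batch:
        augmented.append((src_text, src_lang, tgt_text, tgt_lang))
        other_langs = [lang for lang in languages if lang != tgt_lang]
        if other_langs and random.random() < prob:
            pivot = random.choice(other_langs)
            synthetic_src = translate(tgt_text, tgt_lang, pivot)
            augmented.append((synthetic_src, pivot, tgt_text, tgt_lang))
    return augmented
```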