Facebook AI's WMT20 News Translation Task Submission
- URL: http://arxiv.org/abs/2011.08298v1
- Date: Mon, 16 Nov 2020 21:49:00 GMT
- Title: Facebook AI's WMT20 News Translation Task Submission
- Authors: Peng-Jen Chen, Ann Lee, Changhan Wang, Naman Goyal, Angela Fan, Mary
Williamson, Jiatao Gu
- Abstract summary: This paper describes Facebook AI's submission to WMT20 shared news translation task.
We focus on the low resource setting and participate in two language pairs, Tamil <-> English and Inuktitut <-> English.
We approach the low resource problem using two main strategies: leveraging all available data and adapting the system to the target news domain.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper describes Facebook AI's submission to WMT20 shared news
translation task. We focus on the low resource setting and participate in two
language pairs, Tamil <-> English and Inuktitut <-> English, where there are
limited out-of-domain bitext and monolingual data. We approach the low resource
problem using two main strategies, leveraging all available data and adapting
the system to the target news domain. We explore techniques that leverage
bitext and monolingual data from all languages, such as self-supervised model
pretraining, multilingual models, data augmentation, and reranking. To better
adapt the translation system to the test domain, we explore dataset tagging and
fine-tuning on in-domain data. We observe that different techniques provide
varied improvements based on the available data of the language pair. Based on
these findings, we integrate these techniques into one training pipeline. For
En->Ta, we explore an unconstrained setup with additional Tamil bitext and
monolingual data and show that further improvement can be obtained. On the test
set, our best submitted systems achieve 21.5 and 13.7 BLEU for Ta->En and
En->Ta respectively, and 27.9 and 13.0 for Iu->En and En->Iu respectively.
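The domain-adaptation side of the pipeline above relies on dataset tagging: marking each training sentence with its data source so the model can be steered toward the news domain at test time. A minimal sketch of this convention follows; the tag format and helper names are illustrative, not the paper's actual implementation.

```python
# Dataset tagging sketch: prepend a reserved provenance token to each
# source sentence so the model conditions on data origin during training
# and can be pointed at the news domain at inference time.
def tag_sentence(src: str, domain: str) -> str:
    """Prepend a domain token such as <news> or <crawl> to a source sentence."""
    return f"<{domain}> {src}"

def tag_corpus(pairs, domain):
    """Tag the source side of every (source, target) pair in a bitext corpus."""
    return [(tag_sentence(s, domain), t) for s, t in pairs]
```

At inference, prepending the same `<news>` token to test inputs selects the news-domain behavior learned during training.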
Related papers
- Ensemble Transfer Learning for Multilingual Coreference Resolution
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Speech-to-Speech Translation For A Real-world Unwritten Language
We study speech-to-speech translation (S2ST) that translates speech from one language into another language.
We present an end-to-end solution from training data collection, modeling choices to benchmark dataset release.
arXiv Detail & Related papers (2022-11-11T20:21:38Z)
- Netmarble AI Center's WMT21 Automatic Post-Editing Shared Task Submission
This paper describes Netmarble's submission to WMT21 Automatic Post-Editing (APE) Shared Task for the English-German language pair.
Facebook FAIR's WMT19 news translation model was chosen as the large, powerful pre-trained network.
For better performance, we leverage external translations as augmented machine translation (MT) data during post-training and fine-tuning.
arXiv Detail & Related papers (2021-09-14T08:21:18Z)
- Facebook AI WMT21 News Translation Task Submission
We describe Facebook's multilingual model submission to the WMT2021 shared task on news translation.
We participate in 14 language directions: English to and from Czech, German, Hausa, Icelandic, Japanese, Russian, and Chinese.
We utilize data from all available sources to create high quality bilingual and multilingual baselines.
arXiv Detail & Related papers (2021-08-06T18:26:38Z)
- AUGVIC: Exploiting BiText Vicinity for Low-Resource NMT
AUGVIC is a novel data augmentation framework for low-resource NMT.
It exploits the vicinal samples of the given bitext without using any extra monolingual data explicitly.
We show that AUGVIC helps to attenuate the discrepancies between relevant and distant-domain monolingual data in traditional back-translation.
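The back-translation baseline that AUGVIC is measured against can be sketched briefly: a reverse-direction model translates target-language monolingual text back into the source language, and the synthetic pairs are mixed into the real bitext. This is a generic sketch, not AUGVIC itself; `reverse_translate` is a hypothetical stand-in for any target-to-source NMT model.

```python
# Back-translation sketch: synthesize (source, target) training pairs
# from target-side monolingual text using a reverse-direction model.
def back_translate(monolingual_targets, reverse_translate):
    """Create synthetic (source, target) pairs from target-side text."""
    return [(reverse_translate(t), t) for t in monolingual_targets]

def augment(bitext, synthetic):
    """Concatenate real and synthetic bitext for training."""
    return list(bitext) + list(synthetic)
```

AUGVIC's contribution is to sidestep the domain mismatch this baseline suffers when the monolingual text comes from a distant domain, by generating vicinal samples from the bitext itself.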
arXiv Detail & Related papers (2021-06-09T15:29:18Z)
- Multilingual Neural Semantic Parsing for Low-Resourced Languages
We introduce a new multilingual semantic parsing dataset in English, Italian and Japanese.
We show that joint multilingual training with pretrained encoders substantially outperforms our baselines on the TOP dataset.
We find that a semantic parser trained only on English data achieves a zero-shot performance of 44.9% exact-match accuracy on Italian sentences.
arXiv Detail & Related papers (2021-06-07T09:53:02Z)
- MCL@IITK at SemEval-2021 Task 2: Multilingual and Cross-lingual Word-in-Context Disambiguation using Augmented Data, Signals, and Transformers
We present our approach for solving the SemEval 2021 Task 2: Multilingual and Cross-lingual Word-in-Context Disambiguation (MCL-WiC)
The goal is to detect whether a given word common to both sentences evokes the same meaning in each.
We submit systems for both settings: Multilingual and Cross-Lingual.
arXiv Detail & Related papers (2021-04-04T08:49:28Z)
- Beyond English-Centric Multilingual Machine Translation
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-centric models brings gains of more than 10 BLEU when translating directly between non-English directions, while performing competitively with the best single systems from WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z)
- Unsupervised Bitext Mining and Translation via Self-trained Contextual Embeddings
We describe an unsupervised method to create pseudo-parallel corpora for machine translation (MT) from unaligned text.
We use multilingual BERT to create source and target sentence embeddings for nearest-neighbor search and adapt the model via self-training.
We validate our technique by extracting parallel sentence pairs on the BUCC 2017 bitext mining task and observe up to a 24.5 point increase (absolute) in F1 scores over previous unsupervised methods.
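The core mining step described above can be sketched as nearest-neighbor search over sentence embeddings. In this simplified version, `embed` (here replaced by precomputed vectors) stands in for multilingual BERT, and pairing uses plain cosine similarity with a threshold; real systems typically add margin-based scoring on top.

```python
# Nearest-neighbor bitext mining sketch: pair each source sentence with
# its cosine-nearest target sentence when similarity clears a threshold.
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def mine_pairs(src_vecs, tgt_vecs, threshold=0.9):
    """Return (src_index, tgt_index, score) for sources whose nearest
    target clears the similarity threshold."""
    pairs = []
    for i, s in enumerate(src_vecs):
        sims = [cosine(s, t) for t in tgt_vecs]
        j = max(range(len(sims)), key=sims.__getitem__)
        if sims[j] >= threshold:
            pairs.append((i, j, sims[j]))
    return pairs
```

The mined pairs then serve as pseudo-parallel training data, and self-training on them improves the embeddings for another round of mining.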
arXiv Detail & Related papers (2020-10-15T14:04:03Z)
- Balancing Training for Multilingual Neural Machine Translation
Multilingual machine translation (MT) models can translate to/from multiple languages.
Standard practice is to up-sample less resourced languages to increase representation.
We propose a method that instead automatically learns how to weight training data through a data scorer.
arXiv Detail & Related papers (2020-04-14T18:23:28Z)
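The standard up-sampling practice that the learned data scorer replaces is usually implemented as temperature-based sampling: each language's sampling probability is proportional to its corpus size raised to 1/T, so T > 1 flattens the distribution toward low-resource languages. A minimal sketch, with an illustrative default temperature:

```python
# Temperature-based sampling sketch: map per-language corpus sizes to
# sampling probabilities proportional to size**(1/T). T = 1 reproduces
# proportional sampling; larger T up-samples low-resource languages.
def sampling_probs(sizes, temperature=5.0):
    """Return a probability per language, summing to 1."""
    weights = [n ** (1.0 / temperature) for n in sizes]
    total = sum(weights)
    return [w / total for w in weights]
```

The paper's data scorer learns these weights from the data instead of fixing a single global temperature.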