Unsupervised Bitext Mining and Translation via Self-trained Contextual
Embeddings
- URL: http://arxiv.org/abs/2010.07761v1
- Date: Thu, 15 Oct 2020 14:04:03 GMT
- Title: Unsupervised Bitext Mining and Translation via Self-trained Contextual
Embeddings
- Authors: Phillip Keung, Julian Salazar, Yichao Lu, Noah A. Smith
- Abstract summary: We describe an unsupervised method to create pseudo-parallel corpora for machine translation (MT) from unaligned text.
We use multilingual BERT to create source and target sentence embeddings for nearest-neighbor search and adapt the model via self-training.
We validate our technique by extracting parallel sentence pairs on the BUCC 2017 bitext mining task and observe up to a 24.5 point increase (absolute) in F1 scores over previous unsupervised methods.
- Score: 51.47607125262885
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We describe an unsupervised method to create pseudo-parallel corpora for
machine translation (MT) from unaligned text. We use multilingual BERT to
create source and target sentence embeddings for nearest-neighbor search and
adapt the model via self-training. We validate our technique by extracting
parallel sentence pairs on the BUCC 2017 bitext mining task and observe up to a
24.5 point increase (absolute) in F1 scores over previous unsupervised methods.
We then improve an XLM-based unsupervised neural MT system pre-trained on
Wikipedia by supplementing it with pseudo-parallel text mined from the same
corpus, boosting unsupervised translation performance by up to 3.5 BLEU on the
WMT'14 French-English and WMT'16 German-English tasks and outperforming the
previous state-of-the-art. Finally, we enrich the IWSLT'15 English-Vietnamese
corpus with pseudo-parallel Wikipedia sentence pairs, yielding a 1.2 BLEU
improvement on the low-resource MT task. We demonstrate that unsupervised
bitext mining is an effective way of augmenting MT datasets and complements
existing techniques like initializing with pre-trained contextual embeddings.
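As a rough illustration of the mining step described above, the sketch below embeds source and target sentences with multilingual BERT and pairs each source sentence with its nearest target neighbor by cosine similarity. It is a minimal sketch, not the authors' implementation: the pooling strategy, the plain similarity threshold, and the checkpoint name are assumptions, and the paper's full method additionally adapts the encoder via self-training on its own mined pairs and applies a more careful scoring and filtering scheme.

```python
# Minimal sketch of mBERT-based bitext mining via nearest-neighbor search.
# Assumptions (not from the paper): mean-pooled final-layer embeddings,
# the bert-base-multilingual-cased checkpoint, and a fixed cosine threshold.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

def embed(sentences, batch_size=32):
    """Mean-pool the final hidden states into one L2-normalized vector per sentence."""
    vecs = []
    for i in range(0, len(sentences), batch_size):
        batch = tokenizer(sentences[i:i + batch_size], padding=True,
                          truncation=True, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**batch).last_hidden_state          # (B, T, H)
        mask = batch["attention_mask"].unsqueeze(-1).float()   # (B, T, 1)
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # masked mean over tokens
        vecs.append(torch.nn.functional.normalize(pooled, dim=-1))
    return torch.cat(vecs)

def mine_pairs(src_sents, tgt_sents, threshold=0.9):
    """Keep (src, tgt, score) triples whose cosine similarity clears the threshold."""
    src_emb, tgt_emb = embed(src_sents), embed(tgt_sents)
    sims = src_emb @ tgt_emb.T                                 # cosine similarity matrix
    best_scores, best_idx = sims.max(dim=1)                    # nearest target per source
    return [(src_sents[i], tgt_sents[j.item()], s.item())
            for i, (j, s) in enumerate(zip(best_idx, best_scores))
            if s.item() >= threshold]

# Toy usage; in practice the pools are large comparable corpora (e.g., Wikipedia).
pairs = mine_pairs(["The cat sat on the mat."],
                   ["Le chat était assis sur le tapis.", "Il pleut aujourd'hui."])
print(pairs)
```

In the paper, pairs mined this way are treated as pseudo-labels: the model is adapted on them via self-training, and the resulting pseudo-parallel corpus is used to augment MT training data.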
Related papers
- Confidence Based Bidirectional Global Context Aware Training Framework
for Neural Machine Translation [74.99653288574892]
We propose a Confidence Based Bidirectional Global Context Aware (CBBGCA) training framework for neural machine translation (NMT).
Our proposed CBBGCA training framework significantly improves the NMT model by +1.02, +1.30 and +0.57 BLEU scores on three large-scale translation datasets.
arXiv Detail & Related papers (2022-02-28T10:24:22Z)
- BitextEdit: Automatic Bitext Editing for Improved Low-Resource Machine Translation [53.55009917938002]
We propose to refine the mined bitexts via automatic editing.
Experiments demonstrate that our approach successfully improves the quality of CCMatrix mined bitext for 5 low-resource language pairs and 10 translation directions by up to 8 BLEU points.
arXiv Detail & Related papers (2021-11-12T16:00:39Z)
- Exploiting Curriculum Learning in Unsupervised Neural Machine Translation [28.75229367700697]
We propose a curriculum learning method to gradually utilize pseudo bi-texts based on their quality from multiple granularities.
Experimental results on WMT 14 En-Fr, WMT 16 En-De, WMT 16 En-Ro, and LDC En-Zh translation tasks demonstrate that the proposed method achieves consistent improvements with faster convergence speed.
arXiv Detail & Related papers (2021-09-23T07:18:06Z)
- Phrase-level Active Learning for Neural Machine Translation [107.28450614074002]
We propose an active learning setting where we can spend a given budget on translating in-domain data.
We select both full sentences and individual phrases from unlabelled data in the new domain for routing to human translators.
In a German-English translation task, our active learning approach achieves consistent improvements over uncertainty-based sentence selection methods.
arXiv Detail & Related papers (2021-06-21T19:20:42Z)
- Self-supervised and Supervised Joint Training for Resource-rich Machine Translation [30.502625878505732]
Self-supervised pre-training of text representations has been successfully applied to low-resource Neural Machine Translation (NMT).
We propose a joint training approach, $F$-XEnDec, to combine self-supervised and supervised learning to optimize NMT models.
arXiv Detail & Related papers (2021-06-08T02:35:40Z)
- SJTU-NICT's Supervised and Unsupervised Neural Machine Translation Systems for the WMT20 News Translation Task [111.91077204077817]
We participated in four translation directions of three language pairs: English-Chinese, English-Polish, and German-Upper Sorbian.
Based on different conditions of language pairs, we have experimented with diverse neural machine translation (NMT) techniques.
In our submissions, the primary systems won first place in the English-to-Chinese, Polish-to-English, and German-to-Upper-Sorbian translation directions.
arXiv Detail & Related papers (2020-10-11T00:40:05Z)
- Multilingual Denoising Pre-training for Neural Machine Translation [132.66750663226287]
mBART is a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora.
mBART is one of the first methods for pre-training a complete sequence-to-sequence model.
arXiv Detail & Related papers (2020-01-22T18:59:17Z)
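mBART checkpoints are distributed through the Hugging Face transformers library. The snippet below is a minimal usage sketch, assuming the publicly released facebook/mbart-large-50-many-to-many-mmt fine-tuned checkpoint and illustrative language codes; it shows inference with a pre-trained multilingual seq2seq model and is not tied to any specific experiment in the papers listed here.

```python
# Minimal sketch: translating with a fine-tuned mBART multilingual seq2seq model.
# The checkpoint name and language codes are assumptions for illustration only.
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

model_name = "facebook/mbart-large-50-many-to-many-mmt"
model = MBartForConditionalGeneration.from_pretrained(model_name)
tokenizer = MBart50TokenizerFast.from_pretrained(model_name)

tokenizer.src_lang = "fr_XX"  # tell the tokenizer the source language
inputs = tokenizer("Le chat est assis sur le tapis.", return_tensors="pt")
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"],  # force English output
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```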
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.