Unsupervised Bitext Mining and Translation via Self-trained Contextual
Embeddings
- URL: http://arxiv.org/abs/2010.07761v1
- Date: Thu, 15 Oct 2020 14:04:03 GMT
- Title: Unsupervised Bitext Mining and Translation via Self-trained Contextual
Embeddings
- Authors: Phillip Keung, Julian Salazar, Yichao Lu, Noah A. Smith
- Abstract summary: We describe an unsupervised method to create pseudo-parallel corpora for machine translation (MT) from unaligned text.
We use multilingual BERT to create source and target sentence embeddings for nearest-neighbor search and adapt the model via self-training.
We validate our technique by extracting parallel sentence pairs on the BUCC 2017 bitext mining task and observe up to a 24.5 point increase (absolute) in F1 scores over previous unsupervised methods.
- Score: 51.47607125262885
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We describe an unsupervised method to create pseudo-parallel corpora for
machine translation (MT) from unaligned text. We use multilingual BERT to
create source and target sentence embeddings for nearest-neighbor search and
adapt the model via self-training. We validate our technique by extracting
parallel sentence pairs on the BUCC 2017 bitext mining task and observe up to a
24.5 point increase (absolute) in F1 scores over previous unsupervised methods.
We then improve an XLM-based unsupervised neural MT system pre-trained on
Wikipedia by supplementing it with pseudo-parallel text mined from the same
corpus, boosting unsupervised translation performance by up to 3.5 BLEU on the
WMT'14 French-English and WMT'16 German-English tasks and outperforming the
previous state-of-the-art. Finally, we enrich the IWSLT'15 English-Vietnamese
corpus with pseudo-parallel Wikipedia sentence pairs, yielding a 1.2 BLEU
improvement on the low-resource MT task. We demonstrate that unsupervised
bitext mining is an effective way of augmenting MT datasets and complements
existing techniques like initializing with pre-trained contextual embeddings.
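As a rough illustration of the mining step described above, the sketch below embeds source and target sentences with multilingual BERT and pairs each source sentence with its nearest target neighbor by cosine similarity. It is a minimal sketch, not the authors' implementation: the pooling strategy, the plain similarity threshold, and the checkpoint name are assumptions, and the paper's full method additionally adapts the encoder via self-training on its own mined pairs and applies a more careful scoring and filtering scheme.

```python
# Minimal sketch of mBERT-based bitext mining via nearest-neighbor search.
# Assumptions (not from the paper): mean-pooled final-layer embeddings,
# the bert-base-multilingual-cased checkpoint, and a fixed cosine threshold.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

def embed(sentences, batch_size=32):
    """Mean-pool the final hidden states into one L2-normalized vector per sentence."""
    vecs = []
    for i in range(0, len(sentences), batch_size):
        batch = tokenizer(sentences[i:i + batch_size], padding=True,
                          truncation=True, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**batch).last_hidden_state          # (B, T, H)
        mask = batch["attention_mask"].unsqueeze(-1).float()   # (B, T, 1)
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # masked mean over tokens
        vecs.append(torch.nn.functional.normalize(pooled, dim=-1))
    return torch.cat(vecs)

def mine_pairs(src_sents, tgt_sents, threshold=0.9):
    """Keep (src, tgt, score) triples whose cosine similarity clears the threshold."""
    src_emb, tgt_emb = embed(src_sents), embed(tgt_sents)
    sims = src_emb @ tgt_emb.T                                 # cosine similarity matrix
    best_scores, best_idx = sims.max(dim=1)                    # nearest target per source
    return [(src_sents[i], tgt_sents[j.item()], s.item())
            for i, (j, s) in enumerate(zip(best_idx, best_scores))
            if s.item() >= threshold]

# Toy usage; in practice the pools are large comparable corpora (e.g., Wikipedia).
pairs = mine_pairs(["The cat sat on the mat."],
                   ["Le chat était assis sur le tapis.", "Il pleut aujourd'hui."])
print(pairs)
```

In the paper, pairs mined this way are treated as pseudo-labels: the model is adapted on them via self-training, and the resulting pseudo-parallel corpus is used to augment MT training data.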
Related papers
- Confidence Based Bidirectional Global Context Aware Training Framework
for Neural Machine Translation [74.99653288574892]
We propose a Confidence Based Bidirectional Global Context Aware (CBBGCA) training framework for neural machine translation (NMT).
Our proposed CBBGCA training framework significantly improves the NMT model by +1.02, +1.30 and +0.57 BLEU scores on three large-scale translation datasets.
arXiv Detail & Related papers (2022-02-28T10:24:22Z)
- BitextEdit: Automatic Bitext Editing for Improved Low-Resource Machine Translation [53.55009917938002]
We propose to refine the mined bitexts via automatic editing.
Experiments demonstrate that our approach successfully improves the quality of CCMatrix mined bitext for 5 low-resource language pairs and 10 translation directions by up to 8 BLEU points.
arXiv Detail & Related papers (2021-11-12T16:00:39Z)
- Exploiting Curriculum Learning in Unsupervised Neural Machine Translation [28.75229367700697]
We propose a curriculum learning method to gradually utilize pseudo bi-texts based on their quality from multiple granularities.
Experimental results on WMT 14 En-Fr, WMT 16 En-De, WMT 16 En-Ro, and LDC En-Zh translation tasks demonstrate that the proposed method achieves consistent improvements with faster convergence speed.
arXiv Detail & Related papers (2021-09-23T07:18:06Z)
- Phrase-level Active Learning for Neural Machine Translation [107.28450614074002]
We propose an active learning setting where we can spend a given budget on translating in-domain data.
We select both full sentences and individual phrases from unlabelled data in the new domain for routing to human translators.
In a German-English translation task, our active learning approach achieves consistent improvements over uncertainty-based sentence selection methods.
arXiv Detail & Related papers (2021-06-21T19:20:42Z)
- Self-supervised and Supervised Joint Training for Resource-rich Machine Translation [30.502625878505732]
Self-supervised pre-training of text representations has been successfully applied to low-resource Neural Machine Translation (NMT).
We propose a joint training approach, $F$-XEnDec, to combine self-supervised and supervised learning to optimize NMT models.
arXiv Detail & Related papers (2021-06-08T02:35:40Z)
- SJTU-NICT's Supervised and Unsupervised Neural Machine Translation Systems for the WMT20 News Translation Task [111.91077204077817]
We participated in four translation directions of three language pairs: English-Chinese, English-Polish, and German-Upper Sorbian.
Based on different conditions of language pairs, we have experimented with diverse neural machine translation (NMT) techniques.
In our submissions, the primary systems won first place in the English-to-Chinese, Polish-to-English, and German-to-Upper-Sorbian translation directions.
arXiv Detail & Related papers (2020-10-11T00:40:05Z)
- Multilingual Denoising Pre-training for Neural Machine Translation [132.66750663226287]
mBART is a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora.
mBART is one of the first methods for pre-training a complete sequence-to-sequence model.
arXiv Detail & Related papers (2020-01-22T18:59:17Z)
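mBART checkpoints are distributed through the Hugging Face transformers library. The snippet below is a minimal usage sketch, assuming the publicly released facebook/mbart-large-50-many-to-many-mmt fine-tuned checkpoint and illustrative language codes; it shows inference with a pre-trained multilingual seq2seq model and is not tied to any specific experiment in the papers listed here.

```python
# Minimal sketch: translating with a fine-tuned mBART multilingual seq2seq model.
# The checkpoint name and language codes are assumptions for illustration only.
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

model_name = "facebook/mbart-large-50-many-to-many-mmt"
model = MBartForConditionalGeneration.from_pretrained(model_name)
tokenizer = MBart50TokenizerFast.from_pretrained(model_name)

tokenizer.src_lang = "fr_XX"  # tell the tokenizer the source language
inputs = tokenizer("Le chat est assis sur le tapis.", return_tensors="pt")
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"],  # force English output
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```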
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.