BitextEdit: Automatic Bitext Editing for Improved Low-Resource Machine
Translation
- URL: http://arxiv.org/abs/2111.06787v1
- Date: Fri, 12 Nov 2021 16:00:39 GMT
- Title: BitextEdit: Automatic Bitext Editing for Improved Low-Resource Machine
Translation
- Authors: Eleftheria Briakou, Sida I. Wang, Luke Zettlemoyer, Marjan
Ghazvininejad
- Abstract summary: We propose to refine the mined bitexts via automatic editing.
Experiments demonstrate that our approach successfully improves the quality of CCMatrix mined bitext for 5 low-resource language pairs and 10 translation directions by up to 8 BLEU points.
- Score: 53.55009917938002
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mined bitexts can contain imperfect translations that yield unreliable
training signals for Neural Machine Translation (NMT). While filtering such
pairs out is known to improve final model quality, we argue that it is
suboptimal in low-resource conditions where even mined data can be limited. In
our work, we propose instead to refine the mined bitexts via automatic
editing: given a sentence in a language xf and a possibly imperfect
translation of it xe, our model generates a revised version xf' or xe' that
yields a more equivalent translation pair (i.e., <xf, xe'> or <xf', xe>). We
use a simple editing strategy: (1) mine potentially imperfect translations
for each sentence in a given bitext, and (2) learn a model that reconstructs the
original translations and translates, in a multi-task fashion. Experiments
demonstrate that our approach successfully improves the quality of CCMatrix
mined bitext for 5 low-resource language pairs and 10 translation directions by
up to ~8 BLEU points, in most cases improving upon a competitive
back-translation baseline.
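The two-step strategy from the abstract can be sketched as training-data construction: each source sentence is paired with a potentially imperfect mined translation, and the model is trained both to reconstruct the original (cleaner) translation and to translate from scratch. This is only an illustrative sketch; the function names (`make_examples`, `mine_noisy`) are hypothetical and not from the paper:

```python
def make_examples(bitext, mine_noisy):
    """Build multi-task training examples from a mined bitext.

    bitext: list of (xf, xe) sentence pairs.
    mine_noisy: function mapping xf to a possibly imperfect translation.
    """
    examples = []
    for xf, xe in bitext:
        xe_noisy = mine_noisy(xf)
        # Editing task: revise the imperfect pair toward the original xe.
        examples.append({"task": "edit", "src": (xf, xe_noisy), "tgt": xe})
        # Translation task: a standard NMT objective on the same pair.
        examples.append({"task": "translate", "src": xf, "tgt": xe})
    return examples
```

Each bitext pair thus yields one editing example and one translation example, which a single sequence-to-sequence model can consume in a multi-task fashion.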
Related papers
- Advancing Translation Preference Modeling with RLHF: A Step Towards
Cost-Effective Solution [57.42593422091653]
We explore leveraging reinforcement learning with human feedback to improve translation quality.
A reward model with strong language capabilities can more sensitively learn the subtle differences in translation quality.
arXiv Detail & Related papers (2024-02-18T09:51:49Z)
- Can Synthetic Translations Improve Bitext Quality? [28.910206570036593]
This work explores how synthetic translations can be used to revise potentially imperfect reference translations in mined bitext.
We find that synthetic samples can improve bitext quality without any additional bilingual supervision when they replace the originals.
arXiv Detail & Related papers (2022-03-15T04:36:29Z)
- DEEP: DEnoising Entity Pre-training for Neural Machine Translation [123.6686940355937]
It has been shown that machine translation models usually generate poor translations for named entities that are infrequent in the training corpus.
We propose DEEP, a DEnoising Entity Pre-training method that leverages large amounts of monolingual data and a knowledge base to improve named entity translation accuracy within sentences.
arXiv Detail & Related papers (2021-11-14T17:28:09Z)
- Revisiting Context Choices for Context-aware Machine Translation [0.7741539072749042]
We show that multi-source transformer models improve machine translation over standard transformer-base models.
We also show that even though randomly shuffling in-domain context can also improve over baselines, the correct context further improves translation quality.
arXiv Detail & Related papers (2021-09-07T11:03:34Z)
- Active Learning for Massively Parallel Translation of Constrained Text into Low Resource Languages [26.822210580244885]
We translate a closed text that is known in advance and available in many languages into a new and severely low resource language.
We compare the portion-based approach, which optimizes coherence of the text locally, with the random sampling approach, which increases coverage of the text globally.
We propose an algorithm for human and machine to work together seamlessly to translate a closed text into a severely low resource language.
arXiv Detail & Related papers (2021-08-16T14:49:50Z)
- ChrEnTranslate: Cherokee-English Machine Translation Demo with Quality Estimation and Corrective Feedback [70.5469946314539]
ChrEnTranslate is an online machine translation demonstration system for translation between English and Cherokee, an endangered language.
It supports both statistical and neural translation models and provides quality estimation to inform users of reliability.
arXiv Detail & Related papers (2021-07-30T17:58:54Z)
- Unsupervised Bitext Mining and Translation via Self-trained Contextual Embeddings [51.47607125262885]
We describe an unsupervised method to create pseudo-parallel corpora for machine translation (MT) from unaligned text.
We use multilingual BERT to create source and target sentence embeddings for nearest-neighbor search and adapt the model via self-training.
We validate our technique by extracting parallel sentence pairs on the BUCC 2017 bitext mining task and observe up to a 24.5 point increase (absolute) in F1 scores over previous unsupervised methods.
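The nearest-neighbor mining step described in this entry can be sketched as follows. The sentence embeddings are assumed to be precomputed (e.g., by multilingual BERT); the function names and the similarity threshold are illustrative, not from the paper:

```python
import math

def cosine(u, v):
    # Cosine similarity between two dense sentence embeddings.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def mine_pairs(src_vecs, tgt_vecs, threshold=0.8):
    """For each source embedding, keep its nearest target neighbor
    if the similarity clears a fixed threshold (illustrative value)."""
    pairs = []
    for i, sv in enumerate(src_vecs):
        sims = [cosine(sv, tv) for tv in tgt_vecs]
        j = max(range(len(sims)), key=sims.__getitem__)
        if sims[j] >= threshold:
            pairs.append((i, j))
    return pairs
```

Production bitext-mining systems replace the exhaustive loop with approximate nearest-neighbor search and use margin-based rather than raw cosine scores, but the retrieval idea is the same.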
arXiv Detail & Related papers (2020-10-15T14:04:03Z)
- Incorporating Bilingual Dictionaries for Low Resource Semi-Supervised Neural Machine Translation [5.958653653305609]
We incorporate widely available bilingual dictionaries that yield word-by-word translations to generate synthetic sentences.
This automatically expands the vocabulary of the model while maintaining high quality content.
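The word-by-word synthesis idea in this entry reduces to a simple dictionary gloss. This is a minimal illustration under the assumption that out-of-vocabulary words are copied through; the names `gloss` and `lexicon` are hypothetical:

```python
def gloss(sentence, lexicon):
    """Word-by-word translation via a bilingual dictionary;
    out-of-vocabulary words are copied through unchanged."""
    return " ".join(lexicon.get(word, word) for word in sentence.split())
```

Sentences synthesized this way are noisy but expand the model's vocabulary with dictionary-attested word translations.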
arXiv Detail & Related papers (2020-04-05T02:14:14Z)
- Learning Contextualized Sentence Representations for Document-Level Neural Machine Translation [59.191079800436114]
Document-level machine translation incorporates inter-sentential dependencies into the translation of a source sentence.
We propose a new framework to model cross-sentence dependencies by training neural machine translation (NMT) to predict both the target translation and surrounding sentences of a source sentence.
arXiv Detail & Related papers (2020-03-30T03:38:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.