Simplify-then-Translate: Automatic Preprocessing for Black-Box Machine
Translation
- URL: http://arxiv.org/abs/2005.11197v2
- Date: Wed, 27 May 2020 15:37:50 GMT
- Title: Simplify-then-Translate: Automatic Preprocessing for Black-Box Machine
Translation
- Authors: Sneha Mehta, Bahareh Azarnoush, Boris Chen, Avneesh Saluja, Vinith
Misra, Ballav Bihani, Ritwik Kumar
- Abstract summary: We introduce a method to improve black-box machine translation systems via automatic pre-processing (APP) using sentence simplification.
We first propose a method to automatically generate a large in-domain paraphrase corpus through back-translation with a black-box MT system.
We show that this preprocessing leads to better translation performance as compared to non-preprocessed source sentences.
- Score: 5.480070710278571
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Black-box machine translation systems have proven incredibly useful for a
variety of applications yet by design are hard to adapt, tune to a specific
domain, or build on top of. In this work, we introduce a method to improve such
systems via automatic pre-processing (APP) using sentence simplification. We
first propose a method to automatically generate a large in-domain paraphrase
corpus through back-translation with a black-box MT system, which is used to
train a paraphrase model that "simplifies" the original sentence to be more
conducive for translation. The model is used to preprocess source sentences of
multiple low-resource language pairs. We show that this preprocessing leads to
better translation performance as compared to non-preprocessed source
sentences. We further perform side-by-side human evaluation to verify that
translations of the simplified sentences are better than the original ones.
Finally, we provide some guidance on recommended language pairs for generating
the simplification model corpora by investigating the relationship between ease
of translation of a language pair (as measured by BLEU) and quality of the
resulting simplification model from back-translations of this language pair (as
measured by SARI), and tie this into the downstream task of low-resource
translation.
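The pipeline described in the abstract has two stages: build an in-domain paraphrase corpus by round-tripping sentences through the black-box MT system, then use the resulting simplification model to rewrite source sentences before translation. The sketch below is a minimal, hedged illustration of that flow; `translate`, `simplify`, and all other names are placeholders for the black-box MT API and the trained paraphrase model, not code from the paper.

```python
# Minimal sketch of the Simplify-then-Translate (APP) flow described above.
# All names are illustrative placeholders, not the paper's implementation.
from typing import Callable, List, Tuple

def build_paraphrase_corpus(
    sentences: List[str],
    translate: Callable[[List[str], str, str], List[str]],  # black-box MT API
    src_lang: str = "en",
    pivot_lang: str = "fr",
) -> List[Tuple[str, str]]:
    """Round-trip in-domain sentences through the black-box system.

    Back-translations tend to be simpler and more literal than the originals,
    so (original, back-translation) pairs can train a paraphrase model that
    "simplifies" new source sentences.
    """
    pivot = translate(sentences, src_lang, pivot_lang)   # source -> pivot
    back = translate(pivot, pivot_lang, src_lang)        # pivot -> source
    return list(zip(sentences, back))

def simplify_then_translate(
    source_sentences: List[str],
    simplify: Callable[[str], str],  # trained seq2seq paraphrase model
    translate: Callable[[List[str], str, str], List[str]],
    src_lang: str,
    tgt_lang: str,
) -> List[str]:
    """Preprocess (simplify) each source sentence, then translate it."""
    simplified = [simplify(s) for s in source_sentences]
    return translate(simplified, src_lang, tgt_lang)
```

The pairing of originals with their back-translations follows one plausible reading of the abstract; the paper's exact corpus construction and training details are not reproduced here.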
Related papers
- Contextual Refinement of Translations: Large Language Models for Sentence and Document-Level Post-Editing [12.843274390224853]
Large Language Models (LLMs) have demonstrated considerable success in various Natural Language Processing tasks.
We show that they have yet to attain state-of-the-art performance in Neural Machine Translation.
We propose adapting LLMs as Automatic Post-Editors (APE) rather than direct translators (a rough sketch of this setup follows below).
arXiv Detail & Related papers (2023-10-23T12:22:15Z)
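As a rough illustration of the post-editing setup summarized in the entry above, the snippet below prompts a language model to revise an existing MT hypothesis instead of translating from scratch. `complete` is a placeholder for whatever text-completion API is available; none of this is the paper's implementation.

```python
# Hypothetical sketch: using an LLM as an automatic post-editor (APE).
# `complete` stands in for any text-completion call; it is not a real API here.

def post_edit(source: str, mt_hypothesis: str, complete) -> str:
    prompt = (
        "Improve the following machine translation. Preserve the meaning of "
        "the source and fix fluency or adequacy errors.\n"
        f"Source: {source}\n"
        f"Machine translation: {mt_hypothesis}\n"
        "Improved translation:"
    )
    return complete(prompt).strip()
```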
- Decomposed Prompting for Machine Translation Between Related Languages using Large Language Models [55.35106713257871]
We introduce DecoMT, a novel few-shot prompting approach that decomposes the translation process into a sequence of word-chunk translations.
We show that DecoMT outperforms the strong few-shot prompting BLOOM model with an average improvement of 8 chrF++ points across the examined languages.
arXiv Detail & Related papers (2023-05-22T14:52:47Z)
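A very loose sketch of the chunk-wise prompting idea from the entry above: split the source into short word chunks, translate each with a few-shot prompt, and join the outputs. This omits DecoMT's incremental, context-aware refinement; `complete` and `few_shot_prefix` are placeholders.

```python
# Hedged sketch of decomposed (chunk-by-chunk) few-shot prompting.
# This is not the DecoMT code; it only conveys the decomposition idea.

def translate_by_chunks(source: str, few_shot_prefix: str, complete,
                        chunk_size: int = 4) -> str:
    words = source.split()
    chunks = [" ".join(words[i:i + chunk_size])
              for i in range(0, len(words), chunk_size)]
    outputs = []
    for chunk in chunks:
        prompt = f"{few_shot_prefix}\nSource chunk: {chunk}\nTranslation:"
        outputs.append(complete(prompt).strip())
    return " ".join(outputs)
```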
- BitextEdit: Automatic Bitext Editing for Improved Low-Resource Machine Translation [53.55009917938002]
We propose to refine the mined bitexts via automatic editing.
Experiments demonstrate that our approach successfully improves the quality of CCMatrix mined bitext for 5 low-resource language-pairs and 10 translation directions by up to 8 BLEU points.
arXiv Detail & Related papers (2021-11-12T16:00:39Z)
- Unsupervised Neural Machine Translation with Generative Language Models Only [19.74865387759671]
We show how to derive state-of-the-art unsupervised neural machine translation systems from generatively pre-trained language models.
Our method consists of three steps: few-shot amplification, distillation, and backtranslation.
arXiv Detail & Related papers (2021-10-11T17:35:34Z)
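The entry above names three stages; the outline below shows one way they could fit together. It is only a schematic under stated assumptions: `few_shot_translate` and `train_model` are placeholder callables, and the paper's actual procedure may differ.

```python
# Schematic of a three-stage unsupervised MT bootstrap (amplification,
# distillation, backtranslation). Placeholder callables throughout.

def bootstrap_unsupervised_mt(mono_src, mono_tgt, few_shot_translate, train_model):
    # 1) Few-shot amplification: prompt a generative LM with a few examples
    #    to produce noisy synthetic source->target translations.
    synthetic = [(s, few_shot_translate(s)) for s in mono_src]

    # 2) Distillation: train a reverse (target->source) model on the synthetic
    #    pairs; a forward student could be trained the same way.
    reverse = train_model([(t, s) for s, t in synthetic])

    # 3) Backtranslation: turn target-side monolingual text into synthetic
    #    sources with the reverse model, then train the forward model on
    #    the combined data.
    back = [(reverse.translate(t), t) for t in mono_tgt]
    return train_model(synthetic + back)
```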
- Phrase-level Active Learning for Neural Machine Translation [107.28450614074002]
We propose an active learning setting where we can spend a given budget on translating in-domain data.
We select both full sentences and individual phrases from unlabelled data in the new domain for routing to human translators.
In a German-English translation task, our active learning approach achieves consistent improvements over uncertainty-based sentence selection methods.
arXiv Detail & Related papers (2021-06-21T19:20:42Z)
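A small sketch of the budgeted-selection idea in the entry above: rank candidate sentences and phrases by estimated value per unit of annotation cost and pick greedily until the budget runs out. The scoring and cost functions are placeholders, not the paper's selection criteria.

```python
# Hedged sketch of budget-constrained selection of sentences and phrases
# for human translation. `score` and `cost` are placeholder functions.

def select_for_annotation(candidates, score, cost, budget):
    """candidates: a mix of full sentences and extracted phrases.
    score(c): estimated benefit of having c human-translated (e.g. uncertainty).
    cost(c): annotation cost (e.g. number of words)."""
    ranked = sorted(candidates, key=lambda c: score(c) / max(cost(c), 1),
                    reverse=True)
    chosen, spent = [], 0
    for c in ranked:
        if spent + cost(c) <= budget:
            chosen.append(c)
            spent += cost(c)
    return chosen
```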
- Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information [72.2412707779571]
mRASP is an approach to pre-train a universal multilingual neural machine translation model.
We carry out experiments on 42 translation directions across diverse settings, including low-, medium-, and rich-resource pairs, as well as transfer to exotic language pairs.
arXiv Detail & Related papers (2020-10-07T03:57:54Z)
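One way mRASP exploits alignment information is random aligned substitution: during pre-training, some source words are replaced by dictionary translations in other languages so that cross-lingual synonyms are drawn together in representation space. The snippet below is a simplified illustration of that substitution step; the dictionary, rate, and function name are illustrative, not the released code.

```python
import random

# Simplified illustration of random aligned substitution for multilingual
# pre-training. Dictionary contents and substitution rate are illustrative.

def aligned_substitute(tokens, bilingual_dict, rate=0.3, seed=None):
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        if tok in bilingual_dict and rng.random() < rate:
            out.append(rng.choice(bilingual_dict[tok]))  # swap in a translation
        else:
            out.append(tok)
    return out

# Example: aligned_substitute(["the", "cat", "sat"], {"cat": ["chat", "gato"]})
```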
- Semi-Supervised Text Simplification with Back-Translation and Asymmetric Denoising Autoencoders [37.949101113934226]
Text simplification (TS) rephrases long sentences into simplified variants while preserving inherent semantics.
This work investigates how to leverage large amounts of unpaired corpora for the TS task.
We propose asymmetric denoising methods for sentences of differing complexity.
arXiv Detail & Related papers (2020-04-30T11:19:04Z)
- Re-translation versus Streaming for Simultaneous Translation [14.800214853561823]
We study a problem in which revisions to the hypothesis beyond strictly appending words are permitted.
In this setting, we compare custom streaming approaches to re-translation.
We find re-translation to be as good or better than state-of-the-art streaming systems.
arXiv Detail & Related papers (2020-04-07T18:27:32Z)
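The re-translation strategy compared in the entry above is easy to illustrate: each time new source words arrive, the entire prefix is translated from scratch, so earlier output may be revised rather than only appended to. `translate` is a placeholder for any MT call.

```python
# Minimal sketch of re-translation for simultaneous MT: re-translate the
# growing source prefix at every step; the displayed hypothesis may change.

def retranslate_stream(source_tokens, translate, step=1):
    for i in range(step, len(source_tokens) + 1, step):
        prefix = " ".join(source_tokens[:i])
        yield translate(prefix)  # may revise previously shown output
```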
- Learning Contextualized Sentence Representations for Document-Level Neural Machine Translation [59.191079800436114]
Document-level machine translation incorporates inter-sentential dependencies into the translation of a source sentence.
We propose a new framework to model cross-sentence dependencies by training a neural machine translation (NMT) model to predict both the target translation and the surrounding sentences of a source sentence.
arXiv Detail & Related papers (2020-03-30T03:38:01Z)
- Multilingual Denoising Pre-training for Neural Machine Translation [132.66750663226287]
mBART is a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora.
mBART is one of the first methods for pre-training a complete sequence-to-sequence model.
arXiv Detail & Related papers (2020-01-22T18:59:17Z)
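mBART's denoising objective corrupts monolingual text and trains the model to reconstruct the original; the corruption combines masking of contiguous spans (text infilling) with sentence permutation. The function below is a simplified illustration of that noising step, with illustrative parameters, and is not the mBART code.

```python
import random

# Simplified mBART-style noising: replace contiguous spans with a single
# mask token and permute sentence order; a seq2seq model is then trained
# to reconstruct the original text. Parameters are illustrative.

def mbart_style_noise(sentences, mask_token="<mask>", span_len=3,
                      mask_prob=0.35, seed=0):
    rng = random.Random(seed)
    noised = []
    for sent in sentences:
        tokens = sent.split()
        out, i = [], 0
        while i < len(tokens):
            if rng.random() < mask_prob:
                out.append(mask_token)   # one mask covers a whole span
                i += span_len
            else:
                out.append(tokens[i])
                i += 1
        noised.append(" ".join(out))
    rng.shuffle(noised)                  # permute the sentence order
    return " ".join(noised)
```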