Translating away Translationese without Parallel Data
- URL: http://arxiv.org/abs/2310.18830v1
- Date: Sat, 28 Oct 2023 22:11:25 GMT
- Title: Translating away Translationese without Parallel Data
- Authors: Rricha Jalota, Koel Dutta Chowdhury, Cristina España-Bonet, Josef van Genabith
- Abstract summary: Translated texts exhibit systematic linguistic differences compared to original texts in the same language.
In this paper, we explore a novel approach to reduce translationese in translated texts: translation-based style transfer.
We show how we can eliminate the need for parallel validation data by combining the self-supervised loss with an unsupervised loss.
- Score: 14.423809260672877
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Translated texts exhibit systematic linguistic differences compared to
original texts in the same language, and these differences are referred to as
translationese. Translationese has effects on various cross-lingual natural
language processing tasks, potentially leading to biased results. In this
paper, we explore a novel approach to reduce translationese in translated
texts: translation-based style transfer. As there are no parallel
human-translated and original data in the same language, we use a
self-supervised approach that can learn from comparable (rather than parallel)
mono-lingual original and translated data. However, even this self-supervised
approach requires some parallel data for validation. We show how we can
eliminate the need for parallel validation data by combining the
self-supervised loss with an unsupervised loss. This unsupervised loss
leverages the original language model loss over the style-transferred output
and a semantic similarity loss between the input and style-transferred output.
We evaluate our approach in terms of original vs. translationese binary
classification in addition to measuring content preservation and target-style
fluency. The results show that our approach is able to reduce translationese
classifier accuracy to the level of a random classifier after style transfer
while adequately preserving the content and fluency in the target original
style.
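To make the validation criterion concrete, below is a minimal sketch of how such an unsupervised model-selection score could be computed: a language-model loss over the style-transferred output (using an LM that stands in for one trained on original, non-translated text) plus a semantic-similarity penalty between input and output. The specific models (GPT-2, a MiniLM sentence encoder) and the equal weighting are illustrative assumptions, not the authors' setup.

```python
# Hedged sketch of an unsupervised validation score: (i) loss of a language
# model representing *original*-style text, scored on the style-transferred
# output, plus (ii) a semantic-similarity penalty between input and output.
# Model choices and the weighting are illustrative assumptions only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sentence_transformers import SentenceTransformer, util

lm_tok = AutoTokenizer.from_pretrained("gpt2")            # assumption: stand-in for an
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()  # LM trained on original text
encoder = SentenceTransformer("all-MiniLM-L6-v2")         # assumption: generic encoder

def unsupervised_score(source: str, transferred: str,
                       alpha: float = 1.0, beta: float = 1.0) -> float:
    """Lower is better: fluent in the original style and close in meaning to the input."""
    # (i) original-style LM loss over the style-transferred output
    ids = lm_tok(transferred, return_tensors="pt").input_ids
    with torch.no_grad():
        lm_loss = lm(ids, labels=ids).loss.item()  # mean token cross-entropy

    # (ii) semantic similarity between input and output (penalise meaning drift)
    emb = encoder.encode([source, transferred], convert_to_tensor=True)
    sim_loss = 1.0 - util.cos_sim(emb[0], emb[1]).item()

    return alpha * lm_loss + beta * sim_loss

# Usage idea: select the checkpoint whose outputs minimise this score on a
# held-out set of translated sentences, avoiding parallel validation data.
```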
Related papers
- Original or Translated? On the Use of Parallel Data for Translation Quality Estimation [81.27850245734015]
We demonstrate a significant gap between parallel data and real QE data.
Parallel data is indiscriminate in this respect: translationese may occur on either the source or the target side.
We find that using the source-original part of parallel corpus consistently outperforms its target-original counterpart.
arXiv Detail & Related papers (2022-12-20T14:06:45Z)
- Principled Paraphrase Generation with Parallel Corpora [52.78059089341062]
We formalize the implicit similarity function induced by round-trip Machine Translation.
We show that it is susceptible to non-paraphrase pairs sharing a single ambiguous translation.
We design an alternative similarity metric that mitigates this issue.
arXiv Detail & Related papers (2022-05-24T17:22:42Z)
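As a rough illustration of the issue formalized above (notation ours, not necessarily the paper's): round-trip machine translation implicitly scores a candidate paraphrase by marginalising over pivot translations, so a single shared ambiguous translation can inflate the score of a non-paraphrase pair.

```latex
% Notation ours, not necessarily the paper's: round-trip translation implicitly
% scores a candidate paraphrase y for a source sentence x by marginalising over
% pivot translations z,
\[
  \mathrm{sim}_{\mathrm{RT}}(x, y) \;\propto\; \sum_{z} P(z \mid x)\, P(y \mid z),
\]
% so if x and y are not paraphrases but share a single ambiguous translation
% z^{*}, both factors can still be large at z = z^{*} and the pair looks similar.
```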
- Towards Debiasing Translation Artifacts [15.991970288297443]
We propose a novel approach to reducing translationese by extending an established bias-removal technique.
We use the Iterative Null-space Projection (INLP) algorithm, and show by measuring classification accuracy before and after debiasing, that translationese is reduced at both sentence and word level.
To the best of our knowledge, this is the first study to debias translationese as represented in latent embedding space.
arXiv Detail & Related papers (2022-05-16T21:46:51Z)
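For reference, here is a compact sketch of Iterative Null-space Projection in the spirit described above; it is a generic NumPy/scikit-learn version with a simplified projection update, not the paper's implementation.

```python
# Hedged sketch of Iterative Null-space Projection (INLP) for removing a linearly
# decodable property (here: translated vs. original) from sentence embeddings.
# Simplified: the original INLP uses a projection onto the intersection of null
# spaces; here the per-iteration projections are simply composed.
import numpy as np
from sklearn.linear_model import LogisticRegression

def inlp(X: np.ndarray, y: np.ndarray, n_iters: int = 20) -> np.ndarray:
    """Return a projection P such that X @ P hides the label y from linear probes."""
    d = X.shape[1]
    P = np.eye(d)
    X_proj = X.copy()
    for _ in range(n_iters):
        clf = LogisticRegression(max_iter=1000).fit(X_proj, y)
        w = clf.coef_ / np.linalg.norm(clf.coef_)   # direction used by the probe, shape (1, d)
        P_null = np.eye(d) - w.T @ w                # project out that direction
        P = P @ P_null                              # compose with previous projections
        X_proj = X_proj @ P_null
    return P

# Debiased embeddings: X_debiased = X @ inlp(X, y). A freshly trained
# translationese probe on X_debiased should drop toward chance accuracy.
```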
- Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation [49.916963624249355]
A UNMT model is trained on pseudo-parallel data whose source side is machine-translated, but is applied to natural source sentences at inference.
The source discrepancy between training and inference hinders the translation performance of UNMT models.
We propose an online self-training approach that simultaneously uses pseudo-parallel data with a natural source and a translated target to mimic the inference scenario.
arXiv Detail & Related papers (2022-03-16T04:50:27Z)
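A hedged sketch of the online self-training idea described above, written against a hypothetical interface of translation and scoring callables rather than any specific UNMT codebase; the equal weighting of the two terms is an assumption.

```python
# Hedged sketch: combine the usual back-translation term (machine-translated
# source, natural target) with an online self-training term (natural source,
# model-translated target) so training-time inputs resemble inference-time inputs.
from typing import Callable, List

def online_self_training_step(
    translate_s2t: Callable[[List[str]], List[str]],   # current model, source -> target
    translate_t2s: Callable[[List[str]], List[str]],   # current model, target -> source
    nll_s2t: Callable[[List[str], List[str]], float],  # source -> target training loss
    src_batch: List[str],
    tgt_batch: List[str],
) -> float:
    # Back-translation term: translated source, natural target.
    bt_src = translate_t2s(tgt_batch)
    loss_bt = nll_s2t(bt_src, tgt_batch)

    # Online self-training term: natural source, translated target,
    # mimicking the inference scenario.
    st_tgt = translate_s2t(src_batch)
    loss_st = nll_s2t(src_batch, st_tgt)

    return loss_bt + loss_st  # equal weighting is an assumption
```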
- As Little as Possible, as Much as Necessary: Detecting Over- and Undertranslations with Contrastive Conditioning [42.46681912294797]
We propose a method for detecting superfluous words in neural machine translation.
We compare the likelihood of a full sequence under a translation model to the likelihood of its parts, given the corresponding source or target sequence.
This makes it possible to pinpoint superfluous words in the translation and untranslated words in the source, even in the absence of a reference translation.
arXiv Detail & Related papers (2022-03-03T18:59:02Z)
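One way such a likelihood comparison could look in code; the span scoring, length normalisation, and interface below are illustrative assumptions rather than the paper's exact contrastive-conditioning setup.

```python
# Hedged sketch: score how much more likely a translation becomes, given the
# source, once a candidate span is removed. Large positive values suggest the
# span is superfluous (overtranslation).
from typing import Callable, List

def superfluous_span_score(
    log_prob: Callable[[str, str], float],  # log P(target | source) under an NMT model
    source: str,
    translation_tokens: List[str],
    span: slice,
) -> float:
    full = " ".join(translation_tokens)
    reduced_tokens = translation_tokens[: span.start] + translation_tokens[span.stop :]
    reduced = " ".join(reduced_tokens)
    # Length-normalise so shorter sequences are not trivially favoured (assumption).
    full_score = log_prob(source, full) / max(len(translation_tokens), 1)
    reduced_score = log_prob(source, reduced) / max(len(reduced_tokens), 1)
    return reduced_score - full_score

# Untranslated source words can be scored symmetrically with a reverse-direction
# model, i.e. log P(source | translation), removing candidate source spans instead.
```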
- Oolong: Investigating What Makes Transfer Learning Hard with Controlled Studies [21.350999136803843]
We systematically transform the language of the GLUE benchmark, altering one axis of crosslingual variation at a time.
We find that models can largely recover from syntactic-style shifts, but cannot recover from vocabulary misalignment.
Our experiments provide insights into the factors of cross-lingual transfer that researchers should most focus on when designing language transfer scenarios.
arXiv Detail & Related papers (2022-02-24T19:00:39Z)
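To illustrate the "one axis at a time" idea, here are two toy transformations (not the paper's actual ones): a word-order permutation standing in for a syntactic-style shift, and a deterministic token remapping standing in for vocabulary misalignment.

```python
# Hedged illustration of controlled, single-axis transformations of a benchmark;
# the paper's actual transformations may differ.
import hashlib

def syntactic_shift(sentence: str) -> str:
    """Toy syntactic-style shift: permute word order while keeping the vocabulary."""
    return " ".join(reversed(sentence.split()))

def vocabulary_misalignment(sentence: str) -> str:
    """Toy vocabulary shift: map every word to an opaque token, keeping word order."""
    remap = lambda w: "w" + hashlib.md5(w.lower().encode()).hexdigest()[:6]
    return " ".join(remap(w) for w in sentence.split())

# Applying exactly one transformation to the benchmark isolates which axis of
# crosslingual variation a pretrained model can or cannot recover from.
```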
- Improving Multilingual Translation by Representation and Gradient Regularization [82.42760103045083]
We propose a joint approach to regularize NMT models at both representation-level and gradient-level.
Our results demonstrate that our approach is highly effective in both reducing off-target translation occurrences and improving zero-shot translation performance.
arXiv Detail & Related papers (2021-09-10T10:52:21Z)
- On the Language Coverage Bias for Neural Machine Translation [81.81456880770762]
Language coverage bias is important for neural machine translation (NMT) because the target-original training data is not well exploited in current practice.
By carefully designing experiments, we provide comprehensive analyses of the language coverage bias in the training data.
We propose two simple and effective approaches to alleviate the language coverage bias problem.
arXiv Detail & Related papers (2021-06-07T01:55:34Z)
- A Probabilistic Formulation of Unsupervised Text Style Transfer [128.80213211598752]
We present a deep generative model for unsupervised text style transfer that unifies previously proposed non-generative techniques.
By hypothesizing a parallel latent sequence that generates each observed sequence, our model learns to transform sequences from one domain to another in a completely unsupervised fashion.
arXiv Detail & Related papers (2020-02-10T16:20:49Z)
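Roughly, such a model can be written as follows (notation ours, details simplified): each observed sentence in one domain is assumed to be generated from an unobserved parallel sentence in the other domain, with a language-model prior over the latent and a sequence-to-sequence transduction likelihood.

```latex
% Notation ours; a simplified view of the generative story for one direction:
\[
  p(x) \;=\; \sum_{z} p_{\mathrm{LM}_{D_2}}(z)\; p_{\theta}(x \mid z),
  \qquad x \in D_1,
\]
% where p_{LM_{D_2}} is a language-model prior over the latent parallel sentence z
% in domain D_2 and p_theta(x | z) is a sequence-to-sequence transduction model;
% an amortised posterior q_phi(z | x) doubles as the D_1 -> D_2 transfer model.
```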