Robust Neural Machine Translation: Modeling Orthographic and
Interpunctual Variation
- URL: http://arxiv.org/abs/2009.05460v2
- Date: Mon, 14 Sep 2020 11:16:38 GMT
- Title: Robust Neural Machine Translation: Modeling Orthographic and
Interpunctual Variation
- Authors: Toms Bergmanis, Artūrs Stafanovičs, Mārcis Pinnis
- Abstract summary: We propose a simple generative noise model to generate adversarial examples of ten different types.
We show that, when tested on noisy data, systems trained using adversarial examples perform almost as well as when translating clean data.
- Score: 3.3194866396158
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Neural machine translation systems typically are trained on curated corpora
and break when faced with non-standard orthography or punctuation. Resilience
to spelling mistakes and typos, however, is crucial as machine translation
systems are used to translate texts of informal origins, such as chat
conversations, social media posts and web pages. We propose a simple generative
noise model to generate adversarial examples of ten different types. We use
these to augment machine translation systems' training data and show that, when
tested on noisy data, systems trained using adversarial examples perform almost
as well as when translating clean data, while baseline systems' performance
drops by 2-3 BLEU points. To measure the robustness and noise invariance of
machine translation systems' outputs, we use the average translation edit rate
between the translation of the original sentence and its noised variants. Using
this measure, we show that systems trained on adversarial examples on average
yield 50% consistency improvements when compared to baselines trained on clean
data.
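The abstract's two key ingredients, noise-based data augmentation and a noise-invariance (consistency) measure, can be illustrated with a short sketch. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: `add_noise` applies a few hypothetical character-level perturbations (only an assumed subset of the paper's ten noise types, whose exact definitions are not listed here), and `consistency` approximates the robustness measure by averaging a word-level edit rate between the translation of the clean sentence and the translations of its noised variants (true translation edit rate also counts block shifts, which this sketch omits). The `translate` callable stands in for any MT system.

```python
import random

def add_noise(sentence: str, p: float = 0.1) -> str:
    """Inject simple character-level noise into a sentence.

    Hypothetical subset of noise types (adjacent-character swap, character
    deletion, lowercasing), applied per word with probability p. The paper's
    actual noise model covers ten types not reproduced here.
    """
    noisy_words = []
    for word in sentence.split():
        if len(word) > 3 and random.random() < p:
            op = random.choice(["swap", "drop", "lower"])
            i = random.randrange(len(word) - 1)
            if op == "swap":
                word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
            elif op == "drop":
                word = word[:i] + word[i + 1:]
            else:
                word = word.lower()
        noisy_words.append(word)
    return " ".join(noisy_words)

def word_edit_rate(hyp: str, ref: str) -> float:
    """Word-level edit distance divided by reference length.

    A rough stand-in for TER: insertions, deletions, and substitutions are
    counted, but block shifts are not.
    """
    h, r = hyp.split(), ref.split()
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i
    for j in range(len(r) + 1):
        d[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    return d[len(h)][len(r)] / max(len(r), 1)

def consistency(translate, source: str, n_variants: int = 10) -> float:
    """Average edit rate between the translation of the clean source and the
    translations of its noised variants; lower means more noise-invariant."""
    clean_translation = translate(source)
    noised = [add_noise(source) for _ in range(n_variants)]
    rates = [word_edit_rate(translate(s), clean_translation) for s in noised]
    return sum(rates) / len(rates)
```

As a usage example, `consistency(my_model.translate, "Hello there, how are you?")` (with `my_model` being any hypothetical MT wrapper exposing a `translate` function) returns 0.0 for a system whose output is unchanged by the injected noise and grows as the outputs diverge.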
Related papers
- How to Learn in a Noisy World? Self-Correcting the Real-World Data Noise on Machine Translation [10.739338438716965]
We study the impact of real-world hard-to-detect misalignment noise on machine translation.
Observing that the model's self-knowledge becomes increasingly reliable at distinguishing misaligned from clean data at the token level, we propose a self-correction approach.
Our method proves effective for real-world noisy web-mined datasets across eight translation tasks.
arXiv Detail & Related papers (2024-07-02T12:15:15Z)
- Quality Estimation of Machine Translated Texts based on Direct Evidence from Training Data [0.0]
We show that the parallel corpus used as training data for training the MT system holds direct clues for estimating the quality of translations produced by the MT system.
Our experiments show that this simple and direct method holds promise for quality estimation of translations produced by any purely data-driven machine translation system.
arXiv Detail & Related papers (2023-06-27T11:52:28Z)
- How sensitive are translation systems to extra contexts? Mitigating gender bias in Neural Machine Translation models through relevant contexts [11.684346035745975]
A growing number of studies highlight the inherent gender bias that Neural Machine Translation models incorporate during training.
We investigate whether these models can be instructed to fix their bias during inference using targeted, guided instructions as contexts.
We observe large improvements in reducing the gender bias in translations, across three popular test suites.
arXiv Detail & Related papers (2022-05-22T06:31:54Z)
- How Robust is Neural Machine Translation to Language Imbalance in Multilingual Tokenizer Training? [86.48323488619629]
We analyze how translation performance changes as the data ratios among languages vary in the tokenizer training corpus.
We find that while relatively better performance is often observed when languages are more equally sampled, downstream performance is more robust to language imbalance than commonly expected.
arXiv Detail & Related papers (2022-04-29T17:50:36Z)
- DEEP: DEnoising Entity Pre-training for Neural Machine Translation [123.6686940355937]
It has been shown that machine translation models usually generate poor translations for named entities that are infrequent in the training corpus.
We propose DEEP, a DEnoising Entity Pre-training method that leverages large amounts of monolingual data and a knowledge base to improve named entity translation accuracy within sentences.
arXiv Detail & Related papers (2021-11-14T17:28:09Z)
- On the Language Coverage Bias for Neural Machine Translation [81.81456880770762]
Language coverage bias is important for neural machine translation (NMT) because the target-original training data is not well exploited in current practice.
By carefully designing experiments, we provide comprehensive analyses of the language coverage bias in the training data.
We propose two simple and effective approaches to alleviate the language coverage bias problem.
arXiv Detail & Related papers (2021-06-07T01:55:34Z)
- Improving Translation Robustness with Visual Cues and Error Correction [58.97421756225425]
We introduce the idea of visual context to improve translation robustness against noisy texts.
We also propose a novel error correction training regime by treating error correction as an auxiliary task.
arXiv Detail & Related papers (2021-03-12T15:31:34Z)
- How Context Affects Language Models' Factual Predictions [134.29166998377187]
We integrate information from a retrieval system with a pre-trained language model in a purely unsupervised way.
We report that augmenting pre-trained language models in this way dramatically improves performance and that the resulting system, despite being unsupervised, is competitive with a supervised machine reading baseline.
arXiv Detail & Related papers (2020-05-10T09:28:12Z)
- Can Your Context-Aware MT System Pass the DiP Benchmark Tests? : Evaluation Benchmarks for Discourse Phenomena in Machine Translation [7.993547048820065]
We introduce first-of-their-kind MT benchmark datasets that aim to track and hail improvements across four main discourse phenomena.
Surprisingly, we find that existing context-aware models do not improve discourse-related translations consistently across languages and phenomena.
arXiv Detail & Related papers (2020-04-30T07:15:36Z)
- Robust Unsupervised Neural Machine Translation with Adversarial Denoising Training [66.39561682517741]
Unsupervised neural machine translation (UNMT) has attracted great interest in the machine translation community.
The main advantage of UNMT lies in how easily the large amounts of training text it requires can be collected.
In this paper, we are the first to explicitly take noisy data into consideration to improve the robustness of UNMT-based systems.
arXiv Detail & Related papers (2020-02-28T05:17:55Z)