Sentence Boundary Augmentation For Neural Machine Translation Robustness
- URL: http://arxiv.org/abs/2010.11132v1
- Date: Wed, 21 Oct 2020 16:44:48 GMT
- Title: Sentence Boundary Augmentation For Neural Machine Translation Robustness
- Authors: Daniel Li, Te I, Naveen Arivazhagan, Colin Cherry, Dirk Padfield
- Abstract summary: We show that sentence boundary segmentation has the largest impact on quality, and we develop a simple data augmentation strategy to improve segmentation robustness.
We show that sentence boundary segmentation has the largest impact on quality, and we develop a simple data augmentation strategy to improve segmentation robustness.
- Score: 11.290581889247983
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural Machine Translation (NMT) models have demonstrated strong state of the
art performance on translation tasks where well-formed training and evaluation
data are provided, but they remain sensitive to inputs that include errors of
various types. Specifically, in the context of long-form speech translation
systems, where the input transcripts come from Automatic Speech Recognition
(ASR), the NMT models have to handle errors including phoneme substitutions,
grammatical structure, and sentence boundaries, all of which pose challenges to
NMT robustness. Through in-depth error analysis, we show that sentence boundary
segmentation has the largest impact on quality, and we develop a simple data
augmentation strategy to improve segmentation robustness.
Related papers
- LANDeRMT: Detecting and Routing Language-Aware Neurons for Selectively Finetuning LLMs to Machine Translation [43.26446958873554]
Large language models (LLMs) have shown promising results in multilingual translation even with limited bilingual supervision.
Recent advancements in large language models (LLMs) have shown promising results in multilingual translation even with limited bilingual supervision.
LandeRMT is a framework that selectively finetunes LLMs to textbfMachine textbfTranslation with diverse translation training data.
arXiv Detail & Related papers (2024-09-29T02:39:42Z) - A Data Selection Approach for Enhancing Low Resource Machine Translation Using Cross-Lingual Sentence Representations [0.4499833362998489]
This study focuses on the case of English-Marathi language pairs, where existing datasets are notably noisy.
To mitigate the impact of data quality issues, we propose a data filtering approach based on cross-lingual sentence representations.
Results demonstrate a significant improvement in translation quality over the baseline post-filtering with IndicSBERT.
arXiv Detail & Related papers (2024-09-04T13:49:45Z) - Code-Switching with Word Senses for Pretraining in Neural Machine
Translation [107.23743153715799]
We introduce Word Sense Pretraining for Neural Machine Translation (WSP-NMT)
WSP-NMT is an end-to-end approach for pretraining multilingual NMT models leveraging word sense-specific information from Knowledge Bases.
Our experiments show significant improvements in overall translation quality.
arXiv Detail & Related papers (2023-10-21T16:13:01Z) - SALTED: A Framework for SAlient Long-Tail Translation Error Detection [17.914521288548844]
We introduce SALTED, a specifications-based framework for behavioral testing of machine translation models.
At the core of our approach is the development of high-precision detectors that flag errors between a source sentence and a system output.
We demonstrate that such detectors could be used not just to identify salient long-tail errors in MT systems, but also for higher-recall filtering of the training data.
arXiv Detail & Related papers (2022-05-20T06:45:07Z) - When Does Translation Require Context? A Data-driven, Multilingual
Exploration [71.43817945875433]
proper handling of discourse significantly contributes to the quality of machine translation (MT)
Recent works in context-aware MT attempt to target a small set of discourse phenomena during evaluation.
We develop the Multilingual Discourse-Aware benchmark, a series of taggers that identify and evaluate model performance on discourse phenomena.
arXiv Detail & Related papers (2021-09-15T17:29:30Z) - PheMT: A Phenomenon-wise Dataset for Machine Translation Robustness on
User-Generated Contents [40.25277134147149]
We present a new dataset, PheMT, for evaluating the robustness of MT systems against specific linguistic phenomena in Japanese-English translation.
Our experiments with the created dataset revealed that not only our in-house models but even widely used off-the-shelf systems are greatly disturbed by the presence of certain phenomena.
arXiv Detail & Related papers (2020-11-04T04:44:47Z) - On Long-Tailed Phenomena in Neural Machine Translation [50.65273145888896]
State-of-the-art Neural Machine Translation (NMT) models struggle with generating low-frequency tokens.
We propose a new loss function, the Anti-Focal loss, to better adapt model training to the structural dependencies of conditional text generation.
We show the efficacy of the proposed technique on a number of Machine Translation (MT) datasets, demonstrating that it leads to significant gains over cross-entropy.
arXiv Detail & Related papers (2020-10-10T07:00:57Z) - It's Easier to Translate out of English than into it: Measuring Neural
Translation Difficulty by Cross-Mutual Information [90.35685796083563]
Cross-mutual information (XMI) is an asymmetric information-theoretic metric of machine translation difficulty.
XMI exploits the probabilistic nature of most neural machine translation models.
We present the first systematic and controlled study of cross-lingual translation difficulties using modern neural translation systems.
arXiv Detail & Related papers (2020-05-05T17:38:48Z) - On the Inference Calibration of Neural Machine Translation [54.48932804996506]
We study the correlation between calibration and translation performance and linguistic properties of miscalibration.
We propose a new graduated label smoothing method that can improve both inference calibration and translation performance.
arXiv Detail & Related papers (2020-05-03T02:03:56Z) - Robust Unsupervised Neural Machine Translation with Adversarial
Denoising Training [66.39561682517741]
Unsupervised neural machine translation (UNMT) has attracted great interest in the machine translation community.
The main advantage of the UNMT lies in its easy collection of required large training text sentences.
In this paper, we first time explicitly take the noisy data into consideration to improve the robustness of the UNMT based systems.
arXiv Detail & Related papers (2020-02-28T05:17:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.