Noisy UGC Translation at the Character Level: Revisiting Open-Vocabulary
Capabilities and Robustness of Char-Based Models
- URL: http://arxiv.org/abs/2110.12552v1
- Date: Sun, 24 Oct 2021 23:25:54 GMT
- Title: Noisy UGC Translation at the Character Level: Revisiting Open-Vocabulary
Capabilities and Robustness of Char-Based Models
- Authors: José Carlos Rosales Núñez, Guillaume Wisniewski, Djamé Seddah
- Abstract summary: This work explores the capacities of character-based Neural Machine Translation to translate noisy User-Generated Content (UGC).
We first study the detrimental impact on translation performance of various user-generated content phenomena on a small annotated dataset.
We show that such models are indeed incapable of handling unknown letters, which leads to catastrophic translation failure once such characters are encountered.
- Score: 6.123324869194193
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work explores the capacities of character-based Neural Machine
Translation to translate noisy User-Generated Content (UGC) with a strong focus
on exploring the limits of such approaches to handle productive UGC phenomena,
which, almost by definition, cannot be seen at training time. Within a strict
zero-shot scenario, we first study the detrimental impact on translation
performance of various user-generated content phenomena on a small annotated
dataset we developed, and then show that such models are indeed incapable of
handling unknown letters, which leads to catastrophic translation failure once
such characters are encountered. We further confirm this behavior with a
simple, yet insightful, copy task experiment and highlight the importance of
reducing the vocabulary size hyper-parameter to increase the robustness of
character-based models for machine translation.
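The failure mode on unknown letters, and the copy-task intuition, can be made concrete with a minimal sketch; the toy training sentences, vocabulary, and noisy input below are invented for illustration and are not the authors' setup:

```python
# Minimal illustration of the open-vocabulary failure mode: a character-level
# vocabulary built only from training data maps any character unseen at
# training time to a single <unk> id, so one novel symbol corrupts the input.

TRAIN_SENTENCES = ["bonjour le monde", "salut tout le monde"]

# Build the character inventory from training data only (zero-shot setting).
vocab = {"<unk>": 0, "<s>": 1, "</s>": 2}
for sent in TRAIN_SENTENCES:
    for ch in sent:
        vocab.setdefault(ch, len(vocab))

def encode(sentence: str) -> list[int]:
    """Map each character to its id; unseen characters collapse to <unk>."""
    return [vocab.get(ch, vocab["<unk>"]) for ch in sentence]

# A UGC-style input containing a character never seen at training time.
noisy = "bonjour le m☺nde"
ids = encode(noisy)
print(ids.count(vocab["<unk>"]))  # 1: the smiley has no id of its own

# A copy task makes the problem visible: a model trained to reproduce its
# input can, at best, emit <unk> where the novel character stood, so exact
# copies of such inputs are impossible.
```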
Related papers
- Contextual Spelling Correction with Language Model for Low-resource Setting [0.0]
A small-scale word-based transformer LM is trained to provide the SC model with contextual understanding.
The probability of an error occurring (the error model) is extracted from the corpus.
The combination of the LM and the error model is used to develop the SC model through the well-known noisy channel framework.
arXiv Detail & Related papers (2024-04-28T05:29:35Z)
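The noisy channel framework mentioned above selects the correction c maximizing P(c) * P(w | c). A minimal sketch with invented toy probabilities (not the paper's models):

```python
# Noisy-channel spelling correction: pick the candidate c maximizing
# P(c) * P(w | c), where P(c) comes from a language model and P(w | c)
# from an error model estimated on the corpus. All numbers are toy values.

lm_prob = {"hello": 0.6, "hallo": 0.1, "hullo": 0.05}  # P(c)
error_prob = {                                         # P(w | c)
    ("helo", "hello"): 0.20,
    ("helo", "hallo"): 0.05,
    ("helo", "hullo"): 0.10,
}

def correct(word: str) -> str:
    """Return the candidate with the highest channel score."""
    candidates = [c for (w, c) in error_prob if w == word]
    return max(candidates, key=lambda c: lm_prob[c] * error_prob[(word, c)])

print(correct("helo"))  # "hello": 0.6 * 0.20 beats the alternatives
```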
- An Analysis of BPE Vocabulary Trimming in Neural Machine Translation [56.383793805299234]
Vocabulary trimming is a postprocessing step that replaces rare subwords with their component subwords.
We show that vocabulary trimming fails to improve performance and is even prone to incurring heavy degradation.
arXiv Detail & Related papers (2024-03-30T15:29:49Z)
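As a rough illustration of the trimming step described above, here is a sketch in which subwords below an assumed frequency threshold fall back to their component characters; the threshold and the character-fallback rule are simplifying assumptions:

```python
# Vocabulary trimming: subwords rarer than a threshold are removed from the
# vocabulary and re-expressed through their components (here, the simplest
# components: single characters, which BPE vocabularies always retain).

subword_freq = {"trans": 500, "lation": 400, "xqz": 7}  # toy counts
THRESHOLD = 10

trimmed_vocab = {sw for sw, f in subword_freq.items() if f >= THRESHOLD}
trimmed_vocab |= set("abcdefghijklmnopqrstuvwxyz")  # character fallback

def retokenize(subword: str) -> list[str]:
    """Keep frequent subwords; decompose trimmed ones into characters."""
    if subword in trimmed_vocab:
        return [subword]
    return list(subword)

print(retokenize("trans"))  # ['trans']
print(retokenize("xqz"))    # ['x', 'q', 'z']
```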
- Improving Translation Robustness with Visual Cues and Error Correction [58.97421756225425]
We introduce the idea of visual context to improve translation robustness against noisy texts.
We also propose a novel error correction training regime by treating error correction as an auxiliary task.
arXiv Detail & Related papers (2021-03-12T15:31:34Z)
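Treating error correction as an auxiliary task typically amounts to a weighted multi-task objective; the sketch below shows only that combination, with placeholder loss values and a hypothetical weight, not the paper's architecture:

```python
# Multi-task training with error correction as an auxiliary task: a shared
# encoder feeds a translation head and a correction head, and the total
# loss is a weighted sum. The loss values here are stand-ins.

LAMBDA = 0.5  # hypothetical weight of the auxiliary objective

def total_loss(translation_loss: float, correction_loss: float) -> float:
    """Combine the main MT loss with the auxiliary correction loss."""
    return translation_loss + LAMBDA * correction_loss

# One 'training step' on toy numbers:
print(total_loss(translation_loss=2.31, correction_loss=0.87))  # 2.745
```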
- GTAE: Graph-Transformer based Auto-Encoders for Linguistic-Constrained Text Style Transfer [119.70961704127157]
Non-parallel text style transfer has attracted increasing research interest in recent years.
Current approaches still lack the ability to preserve the content and even the logic of the original sentences.
We propose Graph-Transformer based Auto-Encoders (GTAE), which model a sentence as a linguistic graph and perform feature extraction and style transfer at the graph level.
arXiv Detail & Related papers (2021-02-01T11:08:45Z)
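The "sentence as a linguistic graph" idea can be sketched with tokens as nodes and syntactic relations as edges; the hand-written dependency links and adjacency encoding below are illustrative, not GTAE's actual construction:

```python
# Representing a sentence as a linguistic graph: tokens become nodes and
# syntactic relations become edges, so style transfer can operate on node
# features while the graph topology pins down content and logic.
# The dependency edges below are written by hand for illustration.

tokens = ["the", "movie", "was", "great"]
edges = [(1, 0), (2, 1), (2, 3)]  # head -> dependent, as index pairs

# Adjacency matrix a graph encoder (e.g., a graph transformer) would consume.
n = len(tokens)
adj = [[0] * n for _ in range(n)]
for head, dep in edges:
    adj[head][dep] = adj[dep][head] = 1  # undirected for simplicity

for row in adj:
    print(row)
```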
- Sentence Boundary Augmentation For Neural Machine Translation Robustness [11.290581889247983]
We show that sentence boundary segmentation has the largest impact on quality, and we develop a simple data augmentation strategy to improve segmentation robustness.
arXiv Detail & Related papers (2020-10-21T16:44:48Z)
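One plausible reading of such a boundary augmentation strategy, sketched with an invented merge probability (the paper's exact operations may differ):

```python
import random

# Boundary augmentation sketch: simulate imperfect sentence segmentation by
# randomly merging neighbouring training sentences, so the MT model becomes
# robust to segmenter errors at test time. The merge probability is a guess.

def augment_boundaries(sentences: list[str], p_merge: float) -> list[str]:
    """Randomly join adjacent sentences to imitate missed boundaries."""
    out, buffer = [], ""
    for sent in sentences:
        buffer = f"{buffer} {sent}".strip()
        if random.random() >= p_merge:  # close the current segment
            out.append(buffer)
            buffer = ""
    if buffer:
        out.append(buffer)
    return out

random.seed(0)
print(augment_boundaries(["i saw it .", "it was fun .", "really ."], 0.8))
# ['i saw it .', 'it was fun . really .']
```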
- Word Shape Matters: Robust Machine Translation with Visual Embedding [78.96234298075389]
We introduce a new encoding of the input symbols for character-level NLP models.
It encodes the shape of each character through images depicting the letters when printed.
We name this new strategy visual embedding, and it is expected to improve the robustness of NLP models.
arXiv Detail & Related papers (2020-10-20T04:08:03Z)
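A rough sketch of the visual-embedding idea, assuming Pillow is available: each character is rendered to a small bitmap whose pixels serve as its embedding, so visually similar characters receive similar vectors. The bitmap size and font are arbitrary choices for this illustration:

```python
from PIL import Image, ImageDraw

# Visual embedding sketch: encode each character by the image of its printed
# glyph, so e.g. 'o' and '0' end up with nearby vectors and a noisy character
# swap is less catastrophic than with arbitrary learned ids.

def visual_embedding(ch: str, size: int = 16) -> list[float]:
    """Render one character and return its flattened grayscale pixels."""
    img = Image.new("L", (size, size), color=0)
    ImageDraw.Draw(img).text((2, 2), ch, fill=255)  # default bitmap font
    return [p / 255.0 for p in img.getdata()]

emb_o, emb_zero = visual_embedding("o"), visual_embedding("0")
overlap = sum(a * b for a, b in zip(emb_o, emb_zero))
print(f"pixel overlap between 'o' and '0': {overlap:.2f}")
```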
- Exemplar-Controllable Paraphrasing and Translation using Bitext [57.92051459102902]
We adapt models from prior work to learn solely from bilingual text (bitext).
Our single proposed model can perform four tasks: controlled paraphrase generation in both languages and controlled machine translation in both language directions.
arXiv Detail & Related papers (2020-10-12T17:02:50Z)
- On Long-Tailed Phenomena in Neural Machine Translation [50.65273145888896]
State-of-the-art Neural Machine Translation (NMT) models struggle to generate low-frequency tokens.
We propose a new loss function, the Anti-Focal loss, to better adapt model training to the structural dependencies of conditional text generation.
We show the efficacy of the proposed technique on a number of Machine Translation (MT) datasets, demonstrating that it leads to significant gains over cross-entropy.
arXiv Detail & Related papers (2020-10-10T07:00:57Z)
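The exact Anti-Focal formula is defined in that paper; the sketch below uses one common reading, -(1 + p)^gamma * log(p), which reverses focal loss's emphasis on low-confidence tokens, and should be checked against the original:

```python
import math

# Focal-style reweightings of cross-entropy. Standard focal loss,
# -(1 - p)^gamma * log(p), up-weights low-confidence tokens; the Anti-Focal
# loss reverses that emphasis. The form below is our reading, not a quote.

GAMMA = 1.0  # illustrative value

def cross_entropy(p: float) -> float:
    return -math.log(p)

def anti_focal(p: float, gamma: float = GAMMA) -> float:
    return -((1.0 + p) ** gamma) * math.log(p)

# Relative to cross-entropy, confident predictions are weighted up, so
# low-confidence (often low-frequency) tokens dominate the loss less.
for p in (0.05, 0.5, 0.95):
    print(f"p={p:.2f}  CE={cross_entropy(p):.3f}  AF={anti_focal(p):.3f}")
```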
- Denoising Large-Scale Image Captioning from Alt-text Data using Content Selection Models [25.86785379429413]
We show that selecting content words as skeletons helps generate improved and denoised captions.
We also show that the predicted English skeletons can be further leveraged cross-lingually to generate non-English captions.
We also show that skeleton-based prediction allows for better control of certain caption properties, such as length, content, and gender expression.
arXiv Detail & Related papers (2020-09-10T23:31:38Z)
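As a toy version of the content-selection step, a fixed stopword filter can stand in for the paper's learned selection model:

```python
# Skeleton extraction sketch: keep content words from noisy alt-text and use
# them as the plan a caption generator conditions on. A fixed stopword list
# stands in for the learned content-selection model of the paper.

STOPWORDS = {"a", "an", "the", "of", "in", "on", "at", "is", "my", "with"}

def skeleton(alt_text: str) -> list[str]:
    """Select content words to serve as the caption skeleton."""
    return [w for w in alt_text.lower().split() if w not in STOPWORDS]

noisy_alt = "my photo of a brown dog playing in the park"
print(skeleton(noisy_alt))  # ['photo', 'brown', 'dog', 'playing', 'park']

# Downstream, a captioning model would generate a fluent sentence constrained
# by this skeleton; its length and content can be steered by editing the
# skeleton before generation.
```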