Modeling Homophone Noise for Robust Neural Machine Translation
- URL: http://arxiv.org/abs/2012.08396v1
- Date: Tue, 15 Dec 2020 16:12:04 GMT
- Title: Modeling Homophone Noise for Robust Neural Machine Translation
- Authors: Wenjie Qin, Xiang Li, Yuhui Sun, Deyi Xiong, Jianwei Cui, Bin Wang
- Abstract summary: The framework consists of a homophone noise detector and a syllable-aware NMT model to address homophone errors.
The detector identifies potential homophone errors in a textual sentence and converts them into syllables to form a mixed sequence that is then fed into the syllable-aware NMT.
- Score: 23.022527815382862
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose a robust neural machine translation (NMT)
framework. The framework consists of a homophone noise detector and a
syllable-aware NMT model to address homophone errors. The detector identifies potential
homophone errors in a textual sentence and converts them into syllables to form
a mixed sequence that is then fed into the syllable-aware NMT. Extensive
experiments on Chinese->English translation demonstrate that our proposed
method not only significantly outperforms baselines on noisy test sets with
homophone noise, but also achieves a substantial improvement on clean text.
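A minimal sketch of the mixed word/syllable input construction described above, assuming a stub detector and the pypinyin library for syllabification (both illustrative choices; the paper's detector is a trained model):

from pypinyin import lazy_pinyin  # pip install pypinyin

def detect_homophone_errors(tokens):
    # Hypothetical stand-in for the trained detector: flag tokens from a
    # toy blacklist. "在坐" is a homophone error for "在座" (both "zai zuo").
    suspicious = {"在坐"}
    return {i for i, tok in enumerate(tokens) if tok in suspicious}

def to_mixed_sequence(tokens):
    # Replace suspected homophone errors with their pinyin syllables; the
    # resulting mixed sequence is what a syllable-aware NMT model consumes.
    flagged = detect_homophone_errors(tokens)
    mixed = []
    for i, tok in enumerate(tokens):
        mixed.extend(lazy_pinyin(tok) if i in flagged else [tok])
    return mixed

print(to_mixed_sequence(["各位", "在坐", "的", "朋友"]))
# -> ['各位', 'zai', 'zuo', '的', '朋友']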
Related papers
- Large Language Models are Efficient Learners of Noise-Robust Speech Recognition [65.95847272465124]
Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR).
In this work, we extend the benchmark to noisy conditions and investigate whether we can teach LLMs to perform denoising for GER.
Experiments on various latest LLMs demonstrate our approach achieves a new breakthrough with up to 53.9% correction improvement in terms of word error rate.
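As a rough illustration of GER, the recognizer's N-best hypotheses can be packed into a prompt from which an LLM infers the true transcription; the prompt wording and hypotheses below are invented for illustration, not taken from the paper:

def build_ger_prompt(nbest_hypotheses):
    # Pack the ASR N-best list into a single correction prompt.
    lines = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(nbest_hypotheses))
    return ("Below are N-best hypotheses from a speech recognizer for one "
            "noisy utterance. Infer the most likely true transcription.\n"
            f"{lines}\nTranscription:")

nbest = ["the whether is nice today",
         "the weather is nice today",
         "the weather is mice to day"]
print(build_ger_prompt(nbest))  # then send to an instruction-tuned LLM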
arXiv Detail & Related papers (2024-01-19T01:29:27Z) - Learning Homographic Disambiguation Representation for Neural Machine Translation [20.242134720005467]
Homographs, words with the same spelling but different meanings, remain challenging in Neural Machine Translation (NMT).
We propose a novel approach to tackle issues of NMT in the latent space.
We first train an encoder (aka "homographic-encoder") to learn universal sentence representations in a natural language inference (NLI) task.
We further fine-tune the encoder using homograph-based synsets from WordNet, enabling it to learn word-set representations from sentences.
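A loose sketch of the synset side of this idea: scoring a homograph's sentence context against its WordNet sense glosses with an off-the-shelf sentence encoder (the model name and the cosine scoring are assumptions; the paper's actual method fine-tunes the encoder on synset supervision):

# Requires: pip install nltk sentence-transformers; nltk.download("wordnet")
from nltk.corpus import wordnet as wn
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # an NLI-style encoder

def rank_senses(sentence, homograph):
    # Rank WordNet senses of the homograph by similarity to the sentence.
    senses = [(s.name(), s.definition()) for s in wn.synsets(homograph)]
    sent_emb = encoder.encode([sentence], convert_to_tensor=True)      # (1, d)
    gloss_embs = encoder.encode([g for _, g in senses], convert_to_tensor=True)
    scores = util.cos_sim(sent_emb, gloss_embs)[0]                     # (n,)
    return sorted(zip([n for n, _ in senses], scores.tolist()),
                  key=lambda x: -x[1])

print(rank_senses("She deposited cash at the bank.", "bank")[:3])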
arXiv Detail & Related papers (2023-04-12T13:42:59Z) - READIN: A Chinese Multi-Task Benchmark with Realistic and Diverse Input Noises [87.70001456418504]
We construct READIN: a Chinese multi-task benchmark with REalistic And Diverse Input Noises.
READIN contains four diverse tasks and requests annotators to re-enter the original test data with two commonly used Chinese input methods: Pinyin input and speech input.
We experiment with a series of strong pretrained language models as well as robust training methods, and we find that these models often suffer significant performance drops on READIN.
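For a concrete sense of what a "significant performance drop" means, the robustness gap is typically reported as the relative drop from clean to noisy inputs; the scores below are made up:

def relative_drop(clean_score, noisy_score):
    # Relative performance drop when moving from clean to noisy test input.
    return (clean_score - noisy_score) / clean_score

print(f"{relative_drop(0.85, 0.70):.1%}")  # -> 17.6%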
arXiv Detail & Related papers (2023-02-14T20:14:39Z) - Frequency-Aware Contrastive Learning for Neural Machine Translation [24.336356651877388]
Low-frequency word prediction remains a challenge in modern neural machine translation (NMT) systems.
Inspired by the observation that low-frequency words form a more compact embedding space, we tackle this challenge from a representation learning perspective.
We propose a frequency-aware token-level contrastive learning method, in which the hidden state of each decoding step is pushed away from the counterparts of other target words.
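A simplified sketch of such a token-level contrastive objective: each decoder state is pulled toward its gold token embedding and pushed away from all other target embeddings, with a schematic frequency-based weight standing in for the paper's frequency-aware design (not its exact formulation):

import torch
import torch.nn.functional as F

def token_contrastive_loss(hidden, tgt_ids, tgt_embed, token_freq, tau=0.1):
    # hidden:     (N, d) decoder states for N target positions in a batch
    # tgt_ids:    (N,)   gold token ids at those positions
    # tgt_embed:  (V, d) target-side embedding table
    # token_freq: (V,)   corpus frequency of each token
    h = F.normalize(hidden, dim=-1)
    e = F.normalize(tgt_embed, dim=-1)
    logits = h @ e.t() / tau                          # (N, V) similarities
    ce = F.cross_entropy(logits, tgt_ids, reduction="none")
    # Schematic weighting: rarer tokens receive a larger loss weight.
    weight = 1.0 / torch.log(token_freq[tgt_ids].float() + 2.0)
    return (weight * ce).mean()

# Toy usage with random tensors:
V, d, N = 100, 16, 8
loss = token_contrastive_loss(torch.randn(N, d), torch.randint(0, V, (N,)),
                              torch.randn(V, d), torch.randint(1, 1000, (V,)))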
arXiv Detail & Related papers (2021-12-29T10:10:10Z) - Integrated Semantic and Phonetic Post-correction for Chinese Speech Recognition [1.2914521751805657]
We propose a novel approach to collectively exploit the contextualized representation and the phonetic information between the error and its replacement candidates to reduce the error rate of Chinese ASR.
Our experimental results on real-world speech recognition show that our proposed method achieves an evidently lower error rate than the baseline model.
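In the same spirit, a minimal rescoring sketch that interpolates a contextual LM score with pinyin similarity between the error and each replacement candidate (the interpolation weight and the pypinyin/difflib choices are assumptions; a real system would plug in a masked-LM scorer):

from difflib import SequenceMatcher
from pypinyin import lazy_pinyin

def phonetic_similarity(a, b):
    # Similarity between the pinyin renderings of two Chinese strings.
    pa, pb = " ".join(lazy_pinyin(a)), " ".join(lazy_pinyin(b))
    return SequenceMatcher(None, pa, pb).ratio()

def rescore(sentence, error, candidates, lm_score, alpha=0.5):
    # Combine contextual plausibility with phonetic closeness to the error.
    scored = [(c, alpha * lm_score(sentence.replace(error, c))
                  + (1 - alpha) * phonetic_similarity(error, c))
              for c in candidates]
    return max(scored, key=lambda x: x[1])

# Dummy LM scorer for illustration; "工做" is a homophone error for "工作".
print(rescore("他在工做", "工做", ["工作", "功罪"], lm_score=lambda s: 0.0))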
arXiv Detail & Related papers (2021-11-16T11:55:27Z) - DEEP: DEnoising Entity Pre-training for Neural Machine Translation [123.6686940355937]
It has been shown that machine translation models usually generate poor translations for named entities that are infrequent in the training corpus.
We propose DEEP, a DEnoising Entity Pre-training method that leverages large amounts of monolingual data and a knowledge base to improve named entity translation accuracy within sentences.
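A schematic sketch of building such denoising pairs: entities in monolingual text are swapped for same-type entities drawn from a knowledge base, and the model is pre-trained to recover the original sentence (the toy knowledge base and span format are invented for illustration):

import random

KB = {"PERSON": ["Marie Curie", "Alan Turing", "Ada Lovelace"]}

def corrupt_entities(sentence, entity_spans):
    # entity_spans: list of (surface_form, entity_type) found in the sentence.
    noisy = sentence
    for surface, etype in entity_spans:
        others = [e for e in KB.get(etype, []) if e != surface]
        if others:
            noisy = noisy.replace(surface, random.choice(others))
    return noisy, sentence  # (noisy input, clean reconstruction target)

print(corrupt_entities("Alan Turing was born in London.",
                       [("Alan Turing", "PERSON")]))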
arXiv Detail & Related papers (2021-11-14T17:28:09Z) - Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
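A minimal sketch of an LSTM language model over such units (phonemes or syllables); the vocabulary size and dimensions are placeholders:

import torch
import torch.nn as nn
import torch.nn.functional as F

class UnitLM(nn.Module):
    def __init__(self, n_units, d_emb=128, d_hid=256):
        super().__init__()
        self.embed = nn.Embedding(n_units, d_emb)
        self.lstm = nn.LSTM(d_emb, d_hid, batch_first=True)
        self.out = nn.Linear(d_hid, n_units)

    def forward(self, unit_ids):             # (B, T) phoneme/syllable ids
        h, _ = self.lstm(self.embed(unit_ids))
        return self.out(h)                   # (B, T, n_units) logits

model = UnitLM(n_units=60)                   # e.g. ~60 phoneme types
ids = torch.randint(0, 60, (2, 10))
logits = model(ids)
# Next-unit prediction loss, as in any autoregressive LM.
loss = F.cross_entropy(logits[:, :-1].reshape(-1, 60), ids[:, 1:].reshape(-1))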
arXiv Detail & Related papers (2021-10-31T22:48:30Z) - Addressing the Vulnerability of NMT in Input Perturbations [10.103375853643547]
We improve the robustness of NMT models by reducing the effect of noisy words through a Context-Enhanced Reconstruction (CER) approach.
CER trains the model to resist noise in two steps: (1) a perturbation step that breaks the naturalness of the input sequence with made-up words, as sketched below; (2) a reconstruction step that defends against noise propagation by generating better and more robust contextual representations.
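A minimal sketch of the perturbation step, assuming made-up words are produced by shuffling interior letters (the noising scheme and rate are illustrative):

import random

def make_up_word(word):
    # Produce a plausible-looking non-word by shuffling interior letters.
    if len(word) <= 3:
        return word[::-1]
    mid = list(word[1:-1])
    random.shuffle(mid)
    return word[0] + "".join(mid) + word[-1]

def perturb(tokens, p=0.15):
    # Step (1): break naturalness; step (2) would train the model to
    # reconstruct the clean sequence alongside the translation objective.
    noisy = [make_up_word(t) if random.random() < p else t for t in tokens]
    return noisy, tokens  # (perturbed input, clean reconstruction target)

print(perturb("the committee approved the proposal".split(), p=0.4))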
arXiv Detail & Related papers (2021-04-20T07:52:58Z) - Improving Translation Robustness with Visual Cues and Error Correction [58.97421756225425]
We introduce the idea of visual context to improve translation robustness against noisy texts.
We also propose a novel error correction training regime by treating error correction as an auxiliary task.
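Schematically, treating error correction as an auxiliary task means the training objective interpolates the translation loss with a source-side correction loss; the weight below is an assumed hyperparameter:

def joint_loss(translation_loss, correction_loss, lam=0.5):
    # Translate the noisy source AND reconstruct its clean version, so the
    # model learns to see through input errors.
    return translation_loss + lam * correction_loss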
arXiv Detail & Related papers (2021-03-12T15:31:34Z) - Robust Unsupervised Neural Machine Translation with Adversarial Denoising Training [66.39561682517741]
Unsupervised neural machine translation (UNMT) has attracted great interest in the machine translation community.
The main advantage of UNMT lies in the ease of collecting the large amounts of monolingual training text it requires.
In this paper, we explicitly take noisy data into consideration for the first time to improve the robustness of UNMT-based systems.
arXiv Detail & Related papers (2020-02-28T05:17:55Z)