Ask Language Model to Clean Your Noisy Translation Data
- URL: http://arxiv.org/abs/2310.13469v3
- Date: Tue, 24 Oct 2023 16:14:55 GMT
- Title: Ask Language Model to Clean Your Noisy Translation Data
- Authors: Quinten Bolding, Baohao Liao, Brandon James Denis, Jun Luo, Christof
Monz
- Abstract summary: We focus on cleaning the noise from the target sentences in MTNT, making it more suitable as a benchmark for noise evaluation.
We show that large language models (LLMs) can effectively rephrase slang, jargon, and profanities.
Experiments on C-MTNT showcased its effectiveness in evaluating the robustness of NMT models.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer models have demonstrated remarkable performance in neural machine
translation (NMT). However, their vulnerability to noisy input poses a
significant challenge in practical implementation, where generating clean
output from noisy input is crucial. The MTNT dataset is widely used as a
benchmark for evaluating the robustness of NMT models against noisy input.
Nevertheless, its utility is limited due to the presence of noise in both the
source and target sentences. To address this limitation, we focus on cleaning
the noise from the target sentences in MTNT, making it more suitable as a
benchmark for noise evaluation. Leveraging the capabilities of large language
models (LLMs), we observe their impressive abilities in noise removal. For
example, they can remove emojis while considering their semantic meaning.
Additionally, we show that LLMs can effectively rephrase slang, jargon, and
profanities. The resulting datasets, called C-MTNT, exhibit significantly less
noise in the target sentences while preserving the semantic integrity of the
original sentences. Our human and GPT-4 evaluations also lead to a consistent
conclusion that LLMs perform well on this task. Lastly, experiments on C-MTNT
showcased its effectiveness in evaluating the robustness of NMT models,
highlighting the potential of advanced language models for data cleaning and
emphasizing C-MTNT as a valuable resource.
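The cleaning procedure described in the abstract amounts to prompting an LLM to rewrite each noisy target sentence while keeping its meaning intact. Below is a minimal sketch of that idea, assuming a generic chat-style LLM endpoint; `call_llm`, the prompt wording, and the example sentence are illustrative placeholders rather than the exact setup used to build C-MTNT.

```python
# Minimal sketch of LLM-based target-side cleaning in the spirit of C-MTNT.
# `call_llm` is a hypothetical placeholder for any chat/completion API, and
# the prompt wording is illustrative, not the one used in the paper.

CLEANING_PROMPT = (
    "Rewrite the following sentence as clean, natural text. Remove or "
    "verbalize emojis according to their meaning, rephrase slang, jargon, "
    "and profanities, and fix obvious typos, while preserving the original "
    "meaning as closely as possible.\n\nSentence: {sentence}\nCleaned sentence:"
)


def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around an LLM chat/completion endpoint."""
    raise NotImplementedError("plug in your preferred LLM client here")


def clean_target_sentence(noisy_target: str) -> str:
    """Ask the LLM for a cleaned version of a noisy target sentence."""
    return call_llm(CLEANING_PROMPT.format(sentence=noisy_target)).strip()


if __name__ == "__main__":
    example = "omg this translation is sooo bad \U0001F621 fix it pls"
    print(clean_target_sentence(example))
```

In the paper the cleaned outputs are then checked by human and GPT-4 evaluation for semantic preservation; a sketch like this only covers the rewriting step.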
Related papers
- Large Language Models are Efficient Learners of Noise-Robust Speech
Recognition [65.95847272465124]
Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR).
In this work, we extend the benchmark to noisy conditions and investigate if we can teach LLMs to perform denoising for GER.
Experiments on various latest LLMs demonstrate our approach achieves a new breakthrough with up to 53.9% correction improvement in terms of word error rate.
arXiv Detail & Related papers (2024-01-19T01:29:27Z)
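As a rough illustration of the generative error correction idea in the entry above, the sketch below feeds an N-best list of ASR hypotheses to an LLM and asks for a single corrected transcript; `call_llm` and the prompt wording are hypothetical stand-ins, not the cited method's actual interface.

```python
# Illustrative sketch of LLM-based generative error correction (GER) for ASR:
# the model receives the N-best hypotheses and produces a corrected transcript.
# `call_llm` and the prompt wording are hypothetical placeholders.

GER_PROMPT = (
    "The following are candidate transcriptions of the same utterance, "
    "produced by a speech recognizer under noisy conditions. Output the "
    "single most likely correct transcription.\n\n{hypotheses}\n\n"
    "Corrected transcription:"
)


def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around an LLM chat/completion endpoint."""
    raise NotImplementedError("plug in your preferred LLM client here")


def generative_error_correction(nbest: list[str]) -> str:
    """Fuse an N-best ASR hypothesis list into one corrected transcript."""
    numbered = "\n".join(f"{i + 1}. {hyp}" for i, hyp in enumerate(nbest))
    return call_llm(GER_PROMPT.format(hypotheses=numbered)).strip()
```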
- Simultaneous Machine Translation with Large Language Models [51.470478122113356]
We investigate the possibility of applying Large Language Models to SimulMT tasks.
We conducted experiments using the Llama2-7b-chat model on nine different languages from the MuST-C dataset.
The results show that the LLM outperforms dedicated MT models in terms of BLEU and LAAL metrics.
arXiv Detail & Related papers (2023-09-13T04:06:47Z)
- Cross-Lingual Cross-Modal Retrieval with Noise-Robust Learning [25.230786853723203]
We propose a noise-robust cross-lingual cross-modal retrieval method for low-resource languages.
We use Machine Translation to construct pseudo-parallel sentence pairs for low-resource languages.
We introduce a multi-view self-distillation method to learn noise-robust target-language representations.
arXiv Detail & Related papers (2022-08-26T09:32:24Z)
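The noise-robust representation learning in the retrieval entry above centers on multi-view self-distillation. The PyTorch sketch below shows one plausible form of such a loss, where two noisy views of each caption (e.g. two machine-translated variants) are each pulled toward their averaged prediction over in-batch images; the encoders, view construction, and temperature are assumptions, not the authors' implementation.

```python
# Rough sketch of a multi-view self-distillation objective for noise-robust
# target-language representations.  Embeddings are assumed to come from
# separately encoded noisy views of the same caption and their paired images.

import torch
import torch.nn.functional as F


def multi_view_self_distillation(view1_emb: torch.Tensor,
                                 view2_emb: torch.Tensor,
                                 image_emb: torch.Tensor,
                                 tau: float = 0.05) -> torch.Tensor:
    """view*_emb: (B, D) text embeddings of two noisy views of each caption;
    image_emb: (B, D) embeddings of the paired images."""
    view1_emb = F.normalize(view1_emb, dim=-1)
    view2_emb = F.normalize(view2_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    # Similarity distributions of each view over the in-batch images.
    logp1 = F.log_softmax(view1_emb @ image_emb.t() / tau, dim=-1)
    logp2 = F.log_softmax(view2_emb @ image_emb.t() / tau, dim=-1)
    # Soft teacher: the detached average of both views' predictions.
    teacher = 0.5 * (logp1.exp() + logp2.exp()).detach()
    # Distill each view toward the shared teacher distribution.
    return (F.kl_div(logp1, teacher, reduction="batchmean")
            + F.kl_div(logp2, teacher, reduction="batchmean"))
```

A term like this would typically sit alongside a standard contrastive retrieval loss on the pseudo-parallel pairs built with machine translation.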
- An Exploration of Prompt Tuning on Generative Spoken Language Model for Speech Processing Tasks [112.1942546460814]
We report the first exploration of the prompt tuning paradigm for speech processing tasks based on the Generative Spoken Language Model (GSLM).
Experiment results show that the prompt tuning technique achieves competitive performance in speech classification tasks with fewer trainable parameters than fine-tuning specialized downstream models.
arXiv Detail & Related papers (2022-03-31T03:26:55Z)
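For the prompt-tuning exploration above, the core mechanism is a small set of trainable prompt vectors prepended to a frozen model's input embeddings. The sketch below illustrates that mechanism generically; the backbone interface, prompt length, and initialization are assumptions, not GSLM-specific code.

```python
# Generic sketch of prompt tuning: trainable prompt vectors are prepended to
# the input embeddings of a frozen backbone, and only the prompt is updated.

import torch
import torch.nn as nn


class PromptTunedModel(nn.Module):
    def __init__(self, backbone: nn.Module, embed_dim: int, prompt_len: int = 20):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():      # freeze the pretrained model
            p.requires_grad = False
        # Small trainable prompt, the only parameters updated during tuning.
        self.prompt = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)

    def forward(self, inputs_embeds: torch.Tensor) -> torch.Tensor:
        # inputs_embeds: (batch, seq_len, embed_dim) from the frozen embedder.
        batch = inputs_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return self.backbone(torch.cat([prompt, inputs_embeds], dim=1))
```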
- Can NMT Understand Me? Towards Perturbation-based Evaluation of NMT Models for Code Generation [1.7616042687330642]
A key step to validate the robustness of the NMT models is to evaluate their performance on adversarial inputs.
In this work, we identify a set of perturbations and metrics tailored for the robustness assessment of such models.
We present a preliminary experimental evaluation, showing what type of perturbations affect the model the most.
arXiv Detail & Related papers (2022-03-29T08:01:39Z)
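The perturbation-based evaluation described in the entry above can be pictured as: perturb each input, re-run the model, and measure how often (or how much) the output changes. The sketch below uses two toy perturbations and a hypothetical `translate` function; the cited paper defines its own perturbation set and metrics.

```python
# Illustrative sketch of a perturbation-based robustness check for an NMT /
# code-generation model.  The perturbations and `translate` are hypothetical
# stand-ins for the paper's own perturbation set and model interface.

import random


def swap_adjacent_chars(text: str) -> str:
    """Typo-style perturbation: swap two adjacent characters at random."""
    if len(text) < 2:
        return text
    i = random.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def drop_random_word(text: str) -> str:
    """Delete one word at random from the input."""
    words = text.split()
    if len(words) < 2:
        return text
    del words[random.randrange(len(words))]
    return " ".join(words)


def robustness_report(translate, inputs, perturbations):
    """Fraction of inputs whose output changes under each perturbation."""
    report = {}
    for name, perturb in perturbations.items():
        changed = sum(translate(x) != translate(perturb(x)) for x in inputs)
        report[name] = changed / len(inputs)
    return report
```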
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
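For the sub-word-unit speech LM above, the backbone is an LSTM language model over linguistic units such as phonemes or syllables. A minimal version of such a model is sketched below; vocabulary size, dimensions, and layer count are illustrative assumptions.

```python
# Minimal sketch of an LSTM language model over linguistic units such as
# phonemes or syllables.  All sizes are illustrative assumptions.

import torch
import torch.nn as nn


class UnitLSTMLM(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 256, hidden: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, unit_ids: torch.Tensor) -> torch.Tensor:
        # unit_ids: (batch, seq_len) integer ids of phonemes or syllables.
        h, _ = self.lstm(self.embed(unit_ids))
        return self.out(h)  # next-unit logits at every position
```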
- Detecting Hallucinated Content in Conditional Neural Sequence Generation [165.68948078624499]
We propose a task to predict whether each token in the output sequence is hallucinated (not contained in the input).
We also introduce a method for learning to detect hallucinations using pretrained language models fine-tuned on synthetic data.
arXiv Detail & Related papers (2020-11-05T00:18:53Z)
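One straightforward way to realize the token-level hallucination task above is binary token classification with a pretrained encoder over the (source, output) pair. The sketch below assumes an XLM-R backbone via Hugging Face Transformers; the checkpoint choice and label convention are assumptions, and the freshly added classification head must first be fine-tuned (e.g. on synthetic hallucinations, as in the entry above) before its predictions are meaningful.

```python
# Sketch of hallucination detection as binary token classification over a
# (source, output) pair.  The backbone and label convention are assumptions;
# the classification head is randomly initialized until fine-tuned.

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_NAME = "xlm-roberta-base"  # assumed backbone

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME, num_labels=2)


def hallucination_labels(source: str, output: str):
    """Predict, per token, whether it is hallucinated (1) or supported (0)."""
    # Encode source and output as a sentence pair so the model can compare them.
    enc = tokenizer(source, output, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**enc).logits              # (1, seq_len, 2)
    preds = logits.argmax(dim=-1).squeeze(0)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    # In practice only the output-side tokens would be scored.
    return list(zip(tokens, preds.tolist()))
```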
- PheMT: A Phenomenon-wise Dataset for Machine Translation Robustness on User-Generated Contents [40.25277134147149]
We present a new dataset, PheMT, for evaluating the robustness of MT systems against specific linguistic phenomena in Japanese-English translation.
Our experiments with the created dataset revealed that not only our in-house models but even widely used off-the-shelf systems are greatly disturbed by the presence of certain phenomena.
arXiv Detail & Related papers (2020-11-04T04:44:47Z)
- Robust Unsupervised Neural Machine Translation with Adversarial Denoising Training [66.39561682517741]
Unsupervised neural machine translation (UNMT) has attracted great interest in the machine translation community.
The main advantage of UNMT lies in how easily the large amounts of training text it requires can be collected.
In this paper, we explicitly take noisy data into consideration for the first time to improve the robustness of UNMT-based systems.
arXiv Detail & Related papers (2020-02-28T05:17:55Z)
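For the robust UNMT entry above, the basic ingredient is corrupting source sentences with synthetic noise during training so the system learns to map noisy input to clean output. The toy sketch below shows such a corruption function; the noise types and rates are illustrative, not the cited paper's configuration.

```python
# Toy sketch of synthetic noise injection for robustness training: source
# sentences are corrupted (word drops and adjacent swaps) while targets stay
# clean, so the model learns to translate noisy input into clean output.

import random


def add_noise(sentence: str, drop_prob: float = 0.1, swap_prob: float = 0.1) -> str:
    words = sentence.split()
    # Randomly drop words, but always keep at least one.
    kept = [w for w in words if random.random() > drop_prob]
    words = kept if kept else words[:1]
    # Randomly swap adjacent words.
    for i in range(len(words) - 1):
        if random.random() < swap_prob:
            words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)


# During training, corrupted pairs (add_noise(src), tgt) would be mixed with
# clean (src, tgt) pairs so the model sees both conditions.
```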
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.