Related papers: How to Learn in a Noisy World? Self-Correcting the Real-World Data Noise on Machine Translation

URL: http://arxiv.org/abs/2407.02208v1
Date: Tue, 2 Jul 2024 12:15:15 GMT
Title: How to Learn in a Noisy World? Self-Correcting the Real-World Data Noise on Machine Translation
Authors: Yan Meng, Di Wu, Christof Monz
Abstract summary: We study the impact of real-world hard-to-detect misalignment noise on machine translation. By observing the increasing reliability of the model's self-knowledge for distinguishing misaligned and clean data at the token-level, we propose a self-correction approach. Our method proves effective for real-world noisy web-mined datasets across eight translation tasks.
Score: 10.739338438716965
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The massive amounts of web-mined parallel data contain large amounts of noise. Semantic misalignment, as the primary source of the noise, poses a challenge for training machine translation systems. In this paper, we first study the impact of real-world hard-to-detect misalignment noise by proposing a process to simulate the realistic misalignment controlled by semantic similarity. After quantitatively analyzing the impact of simulated misalignment on machine translation, we show the limited effectiveness of widely used pre-filters to improve the translation performance, underscoring the necessity of more fine-grained ways to handle data noise. By observing the increasing reliability of the model's self-knowledge for distinguishing misaligned and clean data at the token-level, we propose a self-correction approach which leverages the model's prediction distribution to revise the training supervision from the ground-truth data over training time. Through comprehensive experiments, we show that our self-correction method not only improves translation performance in the presence of simulated misalignment noise but also proves effective for real-world noisy web-mined datasets across eight translation tasks.
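As a rough illustration of the self-correction idea described in the abstract, the sketch below blends the one-hot ground-truth target with the model's own prediction distribution, giving the model more weight as training progresses. The linear schedule, the `max_weight` cap, and the function names are assumptions made for illustration; the abstract does not specify the exact formulation used in the paper.

```python
import math


def corrected_target(gold_index, model_probs, step, total_steps, max_weight=0.5):
    """Blend a one-hot gold target with the model's predicted distribution.

    Assumption: the weight on the model's prediction grows linearly with
    training progress, up to max_weight. This schedule is hypothetical.
    """
    if not 0 <= gold_index < len(model_probs):
        raise ValueError("gold_index is outside the vocabulary")
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Fraction of training completed, clamped to [0, 1].
    progress = min(max(step / total_steps, 0.0), 1.0)
    lam = max_weight * progress
    return [
        (1.0 - lam) * (1.0 if i == gold_index else 0.0) + lam * p
        for i, p in enumerate(model_probs)
    ]


def cross_entropy(target, model_probs, eps=1e-12):
    """Cross-entropy of the model's distribution against a soft target."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, model_probs))


# Toy example: the gold token (index 0) may be misaligned, and the model
# prefers index 1. Late in training, the revised target shifts toward it.
probs = [0.1, 0.7, 0.2]
early = corrected_target(0, probs, step=0, total_steps=100)
late = corrected_target(0, probs, step=100, total_steps=100)
print(early, cross_entropy(early, probs))
print(late, cross_entropy(late, probs))
```

At step 0 the target is the unmodified one-hot vector, so training starts from ordinary supervision; the revised target only departs from the ground truth once the model's predictions are assumed to be more trustworthy.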

Related papers

Effective Noise-aware Data Simulation for Domain-adaptive Speech Enhancement Leveraging Dynamic Stochastic Perturbation [25.410770364140856]
Cross-domain speech enhancement (SE) is often faced with severe challenges due to the scarcity of noise and background information in an unseen target domain. This study puts forward a novel data simulation method to address this issue, leveraging noise-extractive techniques and generative adversarial networks (GANs). We introduce the notion of dynamic perturbation, which can inject controlled perturbations into the noise embeddings during inference.
arXiv Detail & Related papers (2024-09-03T02:29:01Z)
Learning with Noisy Foundation Models [95.50968225050012]
This paper is the first work to comprehensively understand and analyze the nature of noise in pre-training datasets. We propose a tuning method (NMTune) to affine the feature space to mitigate the malignant effect of noise and improve generalization.
arXiv Detail & Related papers (2024-03-11T16:22:41Z)
Noisy Pair Corrector for Dense Retrieval [59.312376423104055]
We propose a novel approach called Noisy Pair Corrector (NPC). NPC consists of a detection module and a correction module. We conduct experiments on text-retrieval benchmarks Natural Question and TriviaQA, code-search benchmarks StaQC and SO-DS.
arXiv Detail & Related papers (2023-11-07T08:27:14Z)
Understanding and Mitigating the Label Noise in Pre-training on Downstream Tasks [91.15120211190519]
This paper aims to understand the nature of noise in pre-training datasets and to mitigate its impact on downstream tasks. We propose a light-weight black-box tuning method (NMTune) to affine the feature space to mitigate the malignant effect of noise.
arXiv Detail & Related papers (2023-09-29T06:18:15Z)
Unsupervised Noise adaptation using Data Simulation [21.866522173387715]
We propose a generative adversarial network based method to efficiently learn a converse clean-to-noisy transformation. Experimental results show that our method effectively mitigates the domain mismatch between training and test sets.
arXiv Detail & Related papers (2023-02-23T12:57:20Z)
NLIP: Noise-robust Language-Image Pre-training [95.13287735264937]
We propose a principled Noise-robust Language-Image Pre-training framework (NLIP) to stabilize pre-training via two schemes: noise-harmonization and noise-completion. Our NLIP can alleviate the common noise effects during image-text pre-training in a more efficient way.
arXiv Detail & Related papers (2022-12-14T08:19:30Z)
Learning from Noisy Labels with Coarse-to-Fine Sample Credibility Modeling [22.62790706276081]
Training deep neural networks (DNNs) with noisy labels is practically challenging. Previous efforts tend to handle part or all of the data in a unified denoising flow. We propose a coarse-to-fine robust learning method called CREMA to handle noisy data in a divide-and-conquer manner.
arXiv Detail & Related papers (2022-08-23T02:06:38Z)
On the Language Coverage Bias for Neural Machine Translation [81.81456880770762]
Language coverage bias is important for neural machine translation (NMT) because the target-original training data is not well exploited in current practice. By carefully designing experiments, we provide comprehensive analyses of the language coverage bias in the training data. We propose two simple and effective approaches to alleviate the language coverage bias problem.
arXiv Detail & Related papers (2021-06-07T01:55:34Z)
Bridging the Gap Between Clean Data Training and Real-World Inference for Spoken Language Understanding [76.89426311082927]
Existing models are trained on clean data, which causes a gap between clean data training and real-world inference. We propose a method from the perspective of domain adaptation, by which both high- and low-quality samples are embedded into a similar vector space. Experiments on the widely used Snips dataset and a large-scale in-house dataset (10 million training examples) demonstrate that this method not only outperforms the baseline models on a real-world (noisy) corpus but also enhances robustness, that is, it produces high-quality results in a noisy environment.
arXiv Detail & Related papers (2021-04-13T17:54:33Z)
A Self-Refinement Strategy for Noise Reduction in Grammatical Error Correction [54.569707226277735]
Existing approaches for grammatical error correction (GEC) rely on supervised learning with manually created GEC datasets. There is a non-negligible amount of "noise" where errors were inappropriately edited or left uncorrected. We propose a self-refinement method where the key idea is to denoise these datasets by leveraging the prediction consistency of existing models.
arXiv Detail & Related papers (2020-10-07T04:45:09Z)
Robust Neural Machine Translation: Modeling Orthographic and Interpunctual Variation [3.3194866396158]
We propose a simple generative noise model to generate adversarial examples of ten different types. We show that, when tested on noisy data, systems trained using adversarial examples perform almost as well as when translating clean data.
arXiv Detail & Related papers (2020-09-11T14:12:54Z)
Robust Unsupervised Neural Machine Translation with Adversarial Denoising Training [66.39561682517741]
Unsupervised neural machine translation (UNMT) has attracted great interest in the machine translation community. The main advantage of UNMT is that the large amount of training text it requires is easy to collect. In this paper, we explicitly take noisy data into consideration for the first time to improve the robustness of UNMT-based systems.
arXiv Detail & Related papers (2020-02-28T05:17:55Z)

This list is automatically generated from the titles and abstracts of the papers on this site.