Mitigating Catastrophic Forgetting in Scheduled Sampling with Elastic
Weight Consolidation in Neural Machine Translation
- URL: http://arxiv.org/abs/2109.06308v1
- Date: Mon, 13 Sep 2021 20:37:58 GMT
- Title: Mitigating Catastrophic Forgetting in Scheduled Sampling with Elastic
Weight Consolidation in Neural Machine Translation
- Authors: Michalis Korakakis, Andreas Vlachos
- Abstract summary: Autoregressive models trained with maximum likelihood estimation suffer from exposure bias.
We propose using Elastic Weight Consolidation as a trade-off between mitigating exposure bias and retaining output quality.
Experiments on two IWSLT'14 translation tasks demonstrate that our approach alleviates catastrophic forgetting and significantly improves BLEU.
- Score: 15.581515781839656
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite strong performance in many sequence-to-sequence tasks, autoregressive
models trained with maximum likelihood estimation suffer from exposure bias,
i.e. a discrepancy between the ground-truth prefixes used during training and
the model-generated prefixes used at inference time. Scheduled sampling is a
simple and often empirically successful approach which addresses this issue by
incorporating model-generated prefixes into the training process. However, it
has been argued that it is an inconsistent training objective leading to models
ignoring the prefixes altogether. In this paper, we conduct systematic
experiments and find that it ameliorates exposure bias by increasing model
reliance on the input sequence. We also observe that as a side-effect, it
worsens performance when the model-generated prefix is correct, a form of
catastrophic forgetting. We propose using Elastic Weight Consolidation as
a trade-off between mitigating exposure bias and retaining output quality.
Experiments on two IWSLT'14 translation tasks demonstrate that our approach
alleviates catastrophic forgetting and significantly improves BLEU compared to
standard scheduled sampling.
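As a rough illustration, here is a minimal PyTorch-style sketch of one scheduled-sampling training step with an EWC penalty added to the loss. The `model(src, tgt_in)` signature, the inverse-sigmoid decay constant `k`, and the `fisher`/`anchor` dictionaries (a diagonal Fisher estimate and a frozen copy of the teacher-forced parameters) are illustrative assumptions, not the authors' exact implementation.

```python
import math
import torch
import torch.nn.functional as F

def ewc_penalty(model, fisher, anchor):
    # Quadratic penalty sum_i F_i * (theta_i - theta*_i)^2 anchored at the
    # parameters of the teacher-forced (MLE) model.
    total = 0.0
    for name, p in model.named_parameters():
        total = total + (fisher[name] * (p - anchor[name]) ** 2).sum()
    return total

def scheduled_sampling_step(model, src, tgt, step, k=1000.0,
                            ewc_lambda=1.0, fisher=None, anchor=None,
                            pad_id=0):
    # Inverse-sigmoid decay of the teacher-forcing probability.
    eps = k / (k + math.exp(step / k))

    # First pass (no gradient): greedy model predictions from gold prefixes.
    with torch.no_grad():
        preds = model(src, tgt[:, :-1]).argmax(-1)       # (B, T-1)

    # Mix gold and model-generated prefixes token by token: position t is
    # either the gold token or the model's prediction for that position.
    gold_in = tgt[:, :-1]
    model_in = torch.cat([gold_in[:, :1], preds[:, :-1]], dim=1)
    use_gold = torch.rand_like(gold_in, dtype=torch.float) < eps
    mixed_in = torch.where(use_gold, gold_in, model_in)

    # Second pass: standard NLL on the mixed prefixes.
    logits = model(src, mixed_in)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tgt[:, 1:].reshape(-1), ignore_index=pad_id)

    # EWC keeps parameters close to the MLE solution, trading off
    # exposure-bias mitigation against output quality.
    if fisher is not None:
        loss = loss + ewc_lambda * ewc_penalty(model, fisher, anchor)
    return loss
```

With `ewc_lambda = 0` this reduces to standard scheduled sampling; larger values anchor the model more strongly to the maximum-likelihood solution.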
Related papers
- Post-Hoc Reversal: Are We Selecting Models Prematurely? [13.910702424593797]
We show a phenomenon that we call post-hoc reversal, where performance trends are reversed after applying post-hoc transforms.
Preliminary analyses suggest that these transforms induce reversal by suppressing the influence of mislabeled examples.
We propose post-hoc selection, a simple technique whereby post-hoc metrics inform model development decisions.
arXiv Detail & Related papers (2024-04-11T14:58:19Z)
- Data Attribution for Diffusion Models: Timestep-induced Bias in Influence Estimation [53.27596811146316]
Unlike the instantaneous input-output relationships of earlier attribution settings, diffusion models operate over a sequence of timesteps.
We present Diffusion-TracIn, which incorporates these temporal dynamics, and observe that samples' loss gradient norms are highly dependent on the timestep.
We introduce Diffusion-ReTrac as a re-normalized adaptation that enables retrieval of training samples more targeted to the test sample of interest (a sketch follows this entry).
arXiv Detail & Related papers (2024-01-17T07:58:18Z)
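As context for the entry above: a TracIn-style influence score is the dot product between a training example's and a test example's loss gradients, and the summary's re-normalization counters timestep-driven norm inflation. A hypothetical Python sketch (the exact Diffusion-ReTrac estimator may differ):

```python
import torch

def influence_score(train_grads, test_grads, renormalize=True, eps=1e-12):
    # TracIn-style score: dot product between the training example's and
    # the test example's loss gradients (lists of parameter gradients).
    score = sum((gt * gq).sum() for gt, gq in zip(train_grads, test_grads))
    if renormalize:
        # ReTrac-style correction (assumed form): divide by the training
        # gradient's norm so timesteps with inflated gradients do not
        # dominate the attribution.
        norm = torch.sqrt(sum((gt ** 2).sum() for gt in train_grads))
        score = score / (norm + eps)
    return score
```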
- Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation [63.180725016463974]
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious to collect in practice.
We introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM).
arXiv Detail & Related papers (2023-12-27T09:03:43Z)
- One More Step: A Versatile Plug-and-Play Module for Rectifying Diffusion Schedule Flaws and Enhancing Low-Frequency Controls [77.42510898755037]
One More Step (OMS) is a compact network that incorporates an additional simple yet effective step during inference.
OMS elevates image fidelity and harmonizes the dichotomy between training and inference, while preserving original model parameters.
Once trained, various pre-trained diffusion models with the same latent domain can share the same OMS module.
arXiv Detail & Related papers (2023-11-27T12:02:42Z)
- Dynamic Scheduled Sampling with Imitation Loss for Neural Text Generation [10.306522595622651]
We introduce Dynamic Scheduled Sampling with Imitation Loss (DySI), which maintains the schedule based solely on training-time accuracy.
DySI achieves notable improvements on standard machine translation benchmarks and significantly improves the robustness of other text generation models (a sketch of an accuracy-driven schedule follows this entry).
arXiv Detail & Related papers (2023-01-31T16:41:06Z)
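A minimal sketch of an accuracy-driven schedule in the spirit of the DySI summary above; the EMA smoothing and the direct use of accuracy as the mixing probability are assumptions, not the paper's exact rule:

```python
def update_schedule(ema_acc, batch_acc, beta=0.98):
    # Smooth the batch-level token accuracy with an exponential moving
    # average; use it directly as the probability of feeding the decoder
    # its own predictions, so training stays near teacher forcing while
    # accuracy is low and shifts toward model prefixes as accuracy rises.
    ema_acc = beta * ema_acc + (1.0 - beta) * batch_acc
    return ema_acc, ema_acc  # (new EMA, P(use model-generated prefix))
```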
- Debiased Fine-Tuning for Vision-language Models by Prompt Regularization [50.41984119504716]
We present a new paradigm for fine-tuning large-scale vision pre-trained models on downstream tasks, dubbed Prompt Regularization (ProReg).
ProReg uses the predictions obtained by prompting the pretrained model to regularize fine-tuning.
We show the consistently strong performance of ProReg compared with conventional fine-tuning, zero-shot prompting, prompt tuning, and other state-of-the-art methods (a sketch of such a regularizer follows this entry).
arXiv Detail & Related papers (2023-01-29T11:53:55Z)
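Reading the ProReg summary above, one natural form is a KL regularizer toward the frozen model's prompted predictions; the following sketch assumes that form, with `alpha` and `tau` as illustrative hyperparameters:

```python
import torch.nn.functional as F

def proreg_loss(ft_logits, prompted_logits, labels, alpha=0.5, tau=1.0):
    # Task loss on the downstream labels.
    ce = F.cross_entropy(ft_logits, labels)
    # KL term pulling the fine-tuned predictions toward the frozen
    # pretrained model's zero-shot prompted predictions.
    kl = F.kl_div(F.log_softmax(ft_logits / tau, dim=-1),
                  F.softmax(prompted_logits / tau, dim=-1),
                  reduction="batchmean")
    return ce + alpha * kl
```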
- Input Perturbation Reduces Exposure Bias in Diffusion Models [41.483581603727444]
We show that a long sampling chain leads to an error accumulation phenomenon, similar to the exposure bias problem in autoregressive text generation.
We propose a very simple but effective training regularization, which consists of perturbing the ground-truth samples to simulate inference-time prediction errors.
We empirically show that, without affecting recall or precision, the proposed input perturbation leads to a significant improvement in sample quality (a sketch follows this entry).
arXiv Detail & Related papers (2023-01-27T13:34:54Z)
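The regularization described above has a compact form: corrupt the ground truth with slightly perturbed noise while still regressing against the clean noise. A sketch for a DDPM-style trainer, with `gamma` and the tensor layout as assumptions:

```python
import torch

def perturbed_noising(x0, t, alphas_cumprod, gamma=0.1):
    # Standard DDPM forward noising, except the injected noise is itself
    # slightly perturbed to mimic inference-time prediction error.
    eps = torch.randn_like(x0)
    eps_perturbed = eps + gamma * torch.randn_like(x0)
    a = alphas_cumprod[t].view(-1, *([1] * (x0.dim() - 1)))
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * eps_perturbed
    return x_t, eps  # the denoiser is still trained to predict clean eps
```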
- Self-Normalized Importance Sampling for Neural Language Modeling [97.96857871187052]
In this work, we propose self-normalized importance sampling. Compared to our previous work, the criteria considered here are self-normalized, so no further correction step is needed.
We show that our proposed self-normalized importance sampling is competitive in both research-oriented and production-oriented automatic speech recognition tasks (a sketch of the estimator follows this entry).
arXiv Detail & Related papers (2021-11-11T16:57:53Z)
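For reference, a generic self-normalized importance-sampling estimator looks as follows; the weights are normalized by their own sum, so the softmax partition function cancels. The paper's exact training criterion may differ:

```python
import torch

def snis_expectation(sampled_logits, proposal_logprobs, f_values):
    # Unnormalized log importance weights log p~(x) - log q(x); the softmax
    # over the samples performs the self-normalization, so the
    # full-vocabulary partition function never has to be computed and no
    # separate correction step is required.
    log_w = sampled_logits - proposal_logprobs
    w = torch.softmax(log_w, dim=0)
    return (w * f_values).sum()
```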
- Improving Maximum Likelihood Training for Text Generation with Density Ratio Estimation [51.091890311312085]
We propose a new training scheme for auto-regressive sequence generative models that is effective and stable when operating over the large sample spaces encountered in text generation.
Our method stably outperforms Maximum Likelihood Estimation and other state-of-the-art sequence generative models in terms of both quality and diversity.
arXiv Detail & Related papers (2020-07-12T15:31:24Z)
- Bayesian Sampling Bias Correction: Training with the Right Loss Function [0.0]
We derive a family of loss functions to train models in the presence of sampling bias.
Examples are when the prevalence of a pathology differs from its sampling rate in the training dataset, or when a machine learning practitioner rebalances their training dataset (a sketch of such a reweighted loss follows this entry).
arXiv Detail & Related papers (2020-06-24T15:10:43Z)
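For the prevalence example above, one member of such a family reweights each example's loss by the ratio of a class's population prevalence to its sampling rate in the training set; a sketch under that assumption:

```python
import torch
import torch.nn.functional as F

def bias_corrected_loss(logits, labels, prevalence, sample_rate):
    # Reweight each example by how under- or over-represented its class is
    # in the training set relative to the target population, so the
    # expected training loss matches the deployment distribution.
    weights = (prevalence / sample_rate)[labels]
    per_example = F.cross_entropy(logits, labels, reduction="none")
    return (weights * per_example).mean()
```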