Dynamic Scheduled Sampling with Imitation Loss for Neural Text
Generation
- URL: http://arxiv.org/abs/2301.13753v1
- Date: Tue, 31 Jan 2023 16:41:06 GMT
- Title: Dynamic Scheduled Sampling with Imitation Loss for Neural Text
Generation
- Authors: Xiang Lin, Prathyusha Jwalapuram and Shafiq Joty
- Abstract summary: We introduce Dynamic Scheduled Sampling with Imitation Loss (DySI), which maintains the schedule based solely on the training time accuracy.
DySI achieves notable improvements on standard machine translation benchmarks, and significantly improves the robustness of other text generation models.
- Score: 10.306522595622651
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: State-of-the-art neural text generation models are typically trained to
maximize the likelihood of each token in the ground-truth sequence conditioned
on the previous target tokens. However, during inference, the model needs to
make a prediction conditioned on the tokens generated by itself. This
train-test discrepancy is referred to as exposure bias. Scheduled sampling is a
curriculum learning strategy that gradually exposes the model to its own
predictions during training to mitigate this bias. Most of the proposed
approaches design a scheduler based on training steps, which generally requires
careful tuning depending on the training setup. In this work, we introduce
Dynamic Scheduled Sampling with Imitation Loss (DySI), which maintains the
schedule based solely on the training time accuracy, while enhancing the
curriculum learning by introducing an imitation loss, which attempts to make
the behavior of the decoder indistinguishable from the behavior of a
teacher-forced decoder. DySI is universally applicable across training setups
with minimal tuning. Extensive experiments and analysis show that DySI not only
achieves notable improvements on standard machine translation benchmarks, but
also significantly improves the robustness of other text generation models.
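The two ideas in the abstract, a sampling schedule driven by training-time accuracy rather than step count, and an imitation term that pulls the mixed-prefix decoder toward the teacher-forced decoder, can be made concrete with a minimal, hypothetical PyTorch-style sketch. The `model(src, decoder_input)` interface, the KL-based imitation term, and all thresholds below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def dysi_step(model, src, tgt, mix_prob):
    """One training step: accuracy-driven scheduled sampling + imitation loss.

    `model(src, decoder_input)` returning (B, T, V) logits is an assumed
    interface; thresholds and increments below are illustrative only.
    """
    dec_in = tgt[:, :-1]          # gold decoder input (starts with BOS)
    labels = tgt[:, 1:]           # gold targets

    # Pass 1: teacher forcing on the ground-truth prefix.
    tf_logits = model(src, dec_in)                 # (B, T, V)
    tf_preds = tf_logits.argmax(dim=-1)            # model's own predictions

    # Align first-pass predictions with decoder-input positions
    # (the prediction at position t estimates the token fed at t + 1).
    pred_as_input = torch.cat([dec_in[:, :1], tf_preds[:, :-1]], dim=1)

    # Mix gold and predicted tokens; never replace the BOS position.
    replace = torch.rand(dec_in.shape, device=dec_in.device) < mix_prob
    replace[:, 0] = False
    mixed_in = torch.where(replace, pred_as_input, dec_in)

    # Pass 2: decode conditioned on the partly self-generated prefix.
    mix_logits = model(src, mixed_in)

    # Standard cross-entropy on the second pass.
    nll = F.cross_entropy(mix_logits.transpose(1, 2), labels)

    # Imitation term (a simple KL stand-in): make the mixed-prefix decoder's
    # distributions hard to distinguish from the teacher-forced decoder's.
    imitation = F.kl_div(
        F.log_softmax(mix_logits, dim=-1),
        F.softmax(tf_logits.detach(), dim=-1),
        reduction="batchmean",
    )

    # Dynamic schedule: raise the mixing probability only once the
    # training-time accuracy under teacher forcing clears a threshold.
    acc = (tf_preds == labels).float().mean().item()
    if acc > 0.7:                                  # illustrative threshold
        mix_prob = min(mix_prob + 0.01, 0.5)       # illustrative increment/cap

    return nll + imitation, mix_prob
```

The point of the sketch is the scheduling signal: `mix_prob` moves only when the model is actually predicting tokens correctly, so the curriculum adapts to the training setup instead of relying on a hand-tuned step-based schedule.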
Related papers
- Learning with Noisy Foundation Models [95.50968225050012]
This paper is the first work to comprehensively understand and analyze the nature of noise in pre-training datasets.
We propose a tuning method (NMTune) to affine the feature space to mitigate the malignant effect of noise and improve generalization.
arXiv Detail & Related papers (2024-03-11T16:22:41Z)
- Understanding and Mitigating the Label Noise in Pre-training on Downstream Tasks [91.15120211190519]
This paper aims to understand the nature of noise in pre-training datasets and to mitigate its impact on downstream tasks.
We propose a light-weight black-box tuning method (NMTune) to affine the feature space to mitigate the malignant effect of noise.
arXiv Detail & Related papers (2023-09-29T06:18:15Z)
- Training dynamic models using early exits for automatic speech recognition on resource-constrained devices [15.879328412777008]
Early-exit architectures enable the development of dynamic models capable of adapting their size and architecture to varying levels of computational resources and ASR performance demands.
We show that early-exit models trained from scratch not only preserve performance when using fewer encoder layers but also exhibit enhanced task accuracy compared to single-exit or pre-trained models.
Results provide insights into the training dynamics of early-exit architectures for ASR models.
arXiv Detail & Related papers (2023-09-18T07:45:16Z)
- Debiased Fine-Tuning for Vision-language Models by Prompt Regularization [50.41984119504716]
We present a new paradigm for fine-tuning large-scale vision pre-trained models on downstream tasks, dubbed Prompt Regularization (ProReg).
ProReg uses predictions obtained by prompting the pretrained model to regularize fine-tuning.
We show the consistently strong performance of ProReg compared with conventional fine-tuning, zero-shot prompt, prompt tuning, and other state-of-the-art methods.
arXiv Detail & Related papers (2023-01-29T11:53:55Z)
- Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models [46.24479693469042]
This paper shows that 1) pre-training loss cannot fully explain downstream performance and 2) flatness of the model is well-correlated with downstream performance where pre-training loss is not.
arXiv Detail & Related papers (2022-10-25T17:45:36Z)
- Semi-Supervised Learning Based on Reference Model for Low-resource TTS [32.731900584216724]
We propose a semi-supervised learning method for neural TTS in which labeled target data is limited.
Experimental results show that our proposed semi-supervised learning scheme with limited target data significantly improves the voice quality for test data to achieve naturalness and robustness in speech synthesis.
arXiv Detail & Related papers (2022-10-25T07:48:07Z)
- Self-Distillation for Further Pre-training of Transformers [83.84227016847096]
We propose self-distillation as a regularization for a further pre-training stage.
We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification tasks.
arXiv Detail & Related papers (2022-09-30T02:25:12Z)
- Effective and Efficient Training for Sequential Recommendation using Recency Sampling [91.02268704681124]
We propose a novel Recency-based Sampling of Sequences training objective.
We show that models enhanced with our method can achieve performance exceeding or very close to the state-of-the-art BERT4Rec.
arXiv Detail & Related papers (2022-07-06T13:06:31Z)
- Mitigating Catastrophic Forgetting in Scheduled Sampling with Elastic Weight Consolidation in Neural Machine Translation [15.581515781839656]
Autoregressive models trained with maximum likelihood estimation suffer from exposure bias.
We propose using Elastic Weight Consolidation as a trade-off between mitigating exposure bias and retaining output quality (see the EWC penalty sketch after this list).
Experiments on two IWSLT'14 translation tasks demonstrate that our approach alleviates catastrophic forgetting and significantly improves BLEU.
arXiv Detail & Related papers (2021-09-13T20:37:58Z)
- Scheduled Sampling Based on Decoding Steps for Neural Machine Translation [25.406119773503786]
We propose scheduled sampling methods based on decoding steps, increasing the selection chance of predicted tokens as the decoding step grows (see the decoding-step schedule sketch after this list).
Our approaches significantly outperform the Transformer baseline and vanilla scheduled sampling on three large-scale WMT tasks.
arXiv Detail & Related papers (2021-08-30T02:41:42Z)
- Self-Damaging Contrastive Learning [92.34124578823977]
Unlabeled data in reality is commonly imbalanced and shows a long-tail distribution.
This paper proposes a principled framework called Self-Damaging Contrastive Learning to automatically balance the representation learning without knowing the classes.
Our experiments show that SDCLR significantly improves not only overall accuracies but also balancedness.
arXiv Detail & Related papers (2021-06-06T00:04:49Z)
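As referenced from the Elastic Weight Consolidation entry above, a minimal sketch of how an EWC penalty can serve as the trade-off term when fine-tuning with scheduled sampling; the Fisher-weighted quadratic form is standard EWC, but the interface and the weight `lam` are illustrative assumptions, not that paper's exact recipe.

```python
import torch

def ewc_penalty(model, ref_params, fisher, lam=1.0):
    """Quadratic penalty keeping parameters close to a reference model.

    `ref_params` and `fisher` map parameter names to tensors captured from
    the reference (e.g. teacher-forced) model; `lam` is an illustrative weight.
    """
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for name, p in model.named_parameters():
        if name in fisher:
            penalty = penalty + (fisher[name] * (p - ref_params[name]) ** 2).sum()
    return lam * penalty

# Usage sketch: total loss = scheduled-sampling loss + EWC penalty,
# trading off exposure-bias mitigation against retaining output quality.
# loss = ss_loss + ewc_penalty(model, ref_params, fisher, lam=0.5)
```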
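And for the decoding-step-based scheduled sampling entry, a toy schedule of the kind described, where the chance of feeding the gold token decays as the decoding step grows; the inverse-sigmoid form and the constant `k` are illustrative choices, not that paper's exact schedule.

```python
import math

def gold_token_prob(step: int, k: float = 10.0) -> float:
    """Probability of feeding the gold token at decoding step `step`.

    Decays as the decoding step grows, so later positions are more likely
    to be conditioned on model predictions, mirroring the error
    accumulation that happens at inference time.
    """
    return k / (k + math.exp(step / k))

# Early steps mostly use gold tokens, later steps mostly use predictions.
for t in (0, 10, 30, 60):
    print(t, round(gold_token_prob(t), 3))
```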