Scheduled Sampling Based on Decoding Steps for Neural Machine
Translation
- URL: http://arxiv.org/abs/2108.12963v2
- Date: Tue, 31 Aug 2021 06:21:16 GMT
- Title: Scheduled Sampling Based on Decoding Steps for Neural Machine
Translation
- Authors: Yijin Liu, Fandong Meng, Yufeng Chen, Jinan Xu and Jie Zhou
- Abstract summary: We propose scheduled sampling methods based on decoding steps, increasing the selection chance of predicted tokens as decoding steps grow.
Our approaches significantly outperform the Transformer baseline and vanilla scheduled sampling on three large-scale WMT tasks.
- Score: 25.406119773503786
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scheduled sampling is widely used to mitigate the exposure bias problem for
neural machine translation. Its core motivation is to simulate the inference
scenario during training by replacing ground-truth tokens with predicted tokens,
thus bridging the gap between training and inference. However, vanilla
scheduled sampling is based solely on training steps and treats all decoding
steps equally. That is, it simulates an inference scenario with uniform error
rates, which contradicts real inference, where later decoding steps usually
have higher error rates due to error accumulation. To alleviate this
discrepancy, we propose scheduled sampling methods based on decoding steps,
increasing the selection chance of predicted tokens as decoding steps grow.
Consequently, we can simulate the inference scenario more realistically during
training and thus better bridge the gap between training and inference.
Moreover, we investigate scheduled sampling based on both training
steps and decoding steps for further improvements. Experimentally, our
approaches significantly outperform the Transformer baseline and vanilla
scheduled sampling on three large-scale WMT tasks. Our approaches also
generalize well to the text summarization task on two popular benchmarks.
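To make the schedule concrete, below is a minimal sketch of decoding-step-based scheduled sampling. It is not the authors' implementation: the exponential decay form, the constants k_dec and k_train, and the helper names ground_truth_prob and mix_tokens are illustrative assumptions; the abstract only states that the chance of selecting predicted tokens grows with the decoding step and can additionally depend on the training step.

```python
import random

def ground_truth_prob(decoding_step, training_step=None,
                      k_dec=0.99, k_train=0.9999):
    """Probability of keeping the ground-truth token at a given decoding step.

    Assumed schedule: the chance of feeding the gold token decays
    exponentially with the decoding step, so later positions see more
    model predictions, mimicking error accumulation at inference time.
    Optionally combined with a vanilla training-step-based decay.
    """
    p = k_dec ** decoding_step            # decoding-step-based decay
    if training_step is not None:
        p *= k_train ** training_step     # optional training-step-based decay
    return p

def mix_tokens(gold_tokens, predicted_tokens, training_step=None):
    """Build the decoder input by mixing gold and predicted tokens."""
    mixed = []
    for t, (gold, pred) in enumerate(zip(gold_tokens, predicted_tokens)):
        if random.random() < ground_truth_prob(t, training_step):
            mixed.append(gold)   # keep the ground-truth token
        else:
            mixed.append(pred)   # expose the model to its own prediction
    return mixed
```

In batched Transformer training the predicted tokens would typically come from a first decoding pass over the same batch; the sketch abstracts that away and only illustrates how the mixing probability varies with position.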
Related papers
- Speeding Up Speech Synthesis In Diffusion Models By Reducing Data Distribution Recovery Steps Via Content Transfer [3.2634122554914002]
Diffusion-based vocoders have been criticised for being slow due to the many steps required during sampling.
We propose a setup where the targets are the different outputs of forward process time steps.
We show through extensive evaluation that the proposed technique generates high-fidelity speech in competitive time.
arXiv Detail & Related papers (2023-09-18T10:35:27Z) - Can Diffusion Model Achieve Better Performance in Text Generation?
Bridging the Gap between Training and Inference! [14.979893207094221]
Diffusion models have been successfully adapted to text generation tasks by mapping the discrete text into the continuous space.
There exist nonnegligible gaps between training and inference, owing to the absence of the forward process during inference.
We propose two simple yet effective methods to bridge the gaps mentioned above, named Distance Penalty and Adaptive Decay Sampling.
arXiv Detail & Related papers (2023-05-08T05:32:22Z) - Dynamic Scheduled Sampling with Imitation Loss for Neural Text
Generation [10.306522595622651]
We introduce Dynamic Scheduled Sampling with Imitation Loss (DySI), which maintains the schedule based solely on training-time accuracy.
DySI achieves notable improvements on standard machine translation benchmarks, and significantly improves the robustness of other text generation models.
arXiv Detail & Related papers (2023-01-31T16:41:06Z) - Dense Unsupervised Learning for Video Segmentation [49.46930315961636]
We present a novel approach to unsupervised learning for video object segmentation (VOS).
Unlike previous work, our formulation allows learning dense feature representations directly in a fully convolutional regime.
Our approach exceeds the segmentation accuracy of previous work despite using significantly less training data and compute power.
arXiv Detail & Related papers (2021-11-11T15:15:11Z) - Uniform Sampling over Episode Difficulty [55.067544082168624]
We propose a method to approximate episode sampling distributions based on their difficulty.
As the proposed sampling method is algorithm-agnostic, we can leverage these insights to improve few-shot learning accuracies.
arXiv Detail & Related papers (2021-08-03T17:58:54Z) - Bi-Granularity Contrastive Learning for Post-Training in Few-Shot Scene [10.822477939237459]
We propose contrastive masked language modeling (CMLM) for post-training to integrate both token-level and sequence-level contrastive learning.
CMLM surpasses several recent post-training methods in few-shot settings without the need for data augmentation.
arXiv Detail & Related papers (2021-06-04T08:17:48Z) - On Sampling-Based Training Criteria for Neural Language Modeling [97.35284042981675]
We consider Monte Carlo sampling, importance sampling, a novel method we call compensated partial summation, and noise contrastive estimation.
We show that all these sampling methods can perform equally well, as long as we correct for the intended class posterior probabilities.
Experimental results in language modeling and automatic speech recognition on Switchboard and LibriSpeech support our claim.
arXiv Detail & Related papers (2021-04-21T12:55:52Z) - Cross-Thought for Sentence Encoder Pre-training [89.32270059777025]
Cross-Thought is a novel approach to pre-training sequence encoders.
We train a Transformer-based sequence encoder over a large set of short sequences.
Experiments on question answering and textual entailment tasks demonstrate that our pre-trained encoder can outperform state-of-the-art encoders.
arXiv Detail & Related papers (2020-10-07T21:02:41Z) - Bridging the Gap Between Training and Inference for Spatio-Temporal
Forecasting [16.06369357595426]
We propose a novel curriculum-learning-based strategy named Temporal Progressive Growing Sampling to bridge the gap between training and inference for spatio-temporal sequence forecasting.
Experimental results demonstrate that our proposed method better models long term dependencies and outperforms baseline approaches on two competitive datasets.
arXiv Detail & Related papers (2020-05-19T10:14:43Z) - Pre-training Is (Almost) All You Need: An Application to Commonsense
Reasoning [61.32992639292889]
Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks.
We introduce a new scoring method that casts a plausibility ranking task in a full-text format.
We show that our method provides a much more stable training phase across random restarts.
arXiv Detail & Related papers (2020-04-29T10:54:40Z) - Subset Sampling For Progressive Neural Network Learning [106.12874293597754]
Progressive Neural Network Learning is a class of algorithms that incrementally construct the network's topology and optimize its parameters based on the training data.
We propose to speed up this process by exploiting subsets of training data at each incremental training step.
Experimental results in object, scene and face recognition problems demonstrate that the proposed approach speeds up the optimization procedure considerably.
arXiv Detail & Related papers (2020-02-17T18:57:33Z)