Confidence-Aware Scheduled Sampling for Neural Machine Translation
- URL: http://arxiv.org/abs/2107.10427v1
- Date: Thu, 22 Jul 2021 02:49:04 GMT
- Title: Confidence-Aware Scheduled Sampling for Neural Machine Translation
- Authors: Yijin Liu, Fandong Meng, Yufeng Chen, Jinan Xu and Jie Zhou
- Abstract summary: We propose confidence-aware scheduled sampling for neural machine translation.
We quantify real-time model competence by the confidence of model predictions.
Our approach significantly outperforms the Transformer and vanilla scheduled sampling on both translation quality and convergence speed.
- Score: 25.406119773503786
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scheduled sampling is an effective method to alleviate the exposure bias problem in neural machine translation. It simulates the inference scenario by randomly replacing ground-truth target input tokens with predicted ones during training. Despite its success, its schedule strategies are based solely on training steps and ignore real-time model competence, which limits its potential performance and convergence speed. To address this issue, we
propose confidence-aware scheduled sampling. Specifically, we quantify
real-time model competence by the confidence of model predictions, based on
which we design fine-grained schedule strategies. In this way, the model is exposed to predicted tokens at high-confidence positions while still receiving ground-truth tokens at low-confidence positions. Moreover, we observe that vanilla scheduled sampling tends to degenerate into the original teacher-forcing mode, since most predicted tokens are identical to the ground-truth tokens. Therefore, under the above confidence-aware strategy, we further expose noisy tokens (e.g., wordy or incorrectly ordered ones) instead of predicted ones
for high-confidence token positions. We evaluate our approach on the
Transformer and conduct experiments on large-scale WMT 2014 English-German, WMT
2014 English-French, and WMT 2019 Chinese-English. Results show that our
approach significantly outperforms the Transformer and vanilla scheduled
sampling on both translation quality and convergence speed.
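The strategy described above amounts to a two-pass training step: a first decoding pass yields per-position predictions and confidences, which then decide whether the second pass is fed the prediction, the ground truth, or a noisy token at each position. Below is a minimal PyTorch sketch of that input-mixing step; the helper name, the fixed confidence threshold, and the `noisy_inputs` argument are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def confidence_aware_inputs(logits, gold_inputs, threshold=0.5,
                            noise_prob=0.0, noisy_inputs=None):
    """Build decoder inputs for the second training pass (hypothetical helper).

    logits:      (batch, seq_len, vocab) first-pass decoder outputs
    gold_inputs: (batch, seq_len) ground-truth target input tokens
    """
    probs = F.softmax(logits, dim=-1)
    confidence, predictions = probs.max(dim=-1)  # per-position model confidence
    # High-confidence positions are exposed to the model's own predictions;
    # low-confidence positions keep the ground truth.
    mixed = torch.where(confidence > threshold, predictions, gold_inputs)
    # Optionally swap some high-confidence positions for noisy tokens
    # (e.g., repeated or reordered ones) so that training does not
    # degenerate into plain teacher forcing.
    if noisy_inputs is not None and noise_prob > 0:
        use_noise = (confidence > threshold) & (torch.rand_like(confidence) < noise_prob)
        mixed = torch.where(use_noise, noisy_inputs, mixed)
    return mixed
```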
Related papers
- Semformer: Transformer Language Models with Semantic Planning [18.750863564495006]
Next-token prediction serves as the dominant component in current neural language models.
We introduce Semformer, a novel method of training a Transformer language model that explicitly models the semantic planning of the response.
arXiv Detail & Related papers (2024-09-17T12:54:34Z)
- SMURF-THP: Score Matching-based UnceRtainty quantiFication for Transformer Hawkes Process [76.98721879039559]
We propose SMURF-THP, a score-based method for learning the Transformer Hawkes process and quantifying prediction uncertainty.
Specifically, SMURF-THP learns the score function of events' arrival time based on a score-matching objective.
We conduct extensive experiments in both event type prediction and uncertainty quantification of arrival time.
arXiv Detail & Related papers (2023-10-25T03:33:45Z)
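SMURF-THP's score-matching idea can be illustrated generically: perturb observed arrival times with Gaussian noise and train a score network to match the score of the perturbation kernel. The sketch below is a standard denoising score-matching objective under that assumption, not the paper's exact loss; `score_fn` and `sigma` are placeholders.

```python
import torch

def denoising_score_matching_loss(score_fn, arrival_times, sigma=0.1):
    """Generic denoising score-matching loss over event arrival times.

    score_fn:      a network mapping perturbed times to estimated scores
    arrival_times: (batch, num_events) observed event times
    """
    noise = torch.randn_like(arrival_times) * sigma
    perturbed = arrival_times + noise
    # Score of the Gaussian perturbation kernel N(t, sigma^2) at t + noise.
    target_score = -noise / sigma ** 2
    return ((score_fn(perturbed) - target_score) ** 2).mean()
```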
- Making Pre-trained Language Models both Task-solvers and Self-calibrators [52.98858650625623]
Pre-trained language models (PLMs) serve as backbones for various real-world systems.
Previous work shows that introducing an extra calibration task can mitigate poorly calibrated confidence estimates.
We propose a training algorithm, LM-TOAST, to tackle these challenges.
arXiv Detail & Related papers (2023-07-21T02:51:41Z)
- How to Estimate Model Transferability of Pre-Trained Speech Models? [84.11085139766108]
"Score-based assessment" framework for estimating transferability of pre-trained speech models.
We leverage upon two representation theories, Bayesian likelihood estimation and optimal transport, to generate rank scores for the PSM candidates.
Our framework efficiently computes transferability scores without actual fine-tuning of candidate models or layers.
arXiv Detail & Related papers (2023-06-01T04:52:26Z)
- CTC-based Non-autoregressive Speech Translation [51.37920141751813]
We investigate the potential of connectionist temporal classification for non-autoregressive speech translation.
We develop a model consisting of two encoders that are guided by CTC to predict the source and target texts.
Experiments on the MuST-C benchmarks show that our NAST model achieves an average BLEU score of 29.5 with a speed-up of 5.67×.
arXiv Detail & Related papers (2023-05-27T03:54:09Z)
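As a minimal, self-contained illustration of the CTC supervision used above (random tensors stand in for real encoder outputs; the shapes and vocabulary size are arbitrary):

```python
import torch
import torch.nn as nn

# CTC loss over per-frame vocabulary log-probabilities, as produced by an encoder.
vocab_size, blank_id = 1000, 0
ctc = nn.CTCLoss(blank=blank_id, zero_infinity=True)

T, B, L = 50, 4, 12                                    # frames, batch, target length
log_probs = torch.randn(T, B, vocab_size).log_softmax(dim=-1)
targets = torch.randint(1, vocab_size, (B, L))         # token ids, excluding blank
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), L, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
```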
- Dynamic Scheduled Sampling with Imitation Loss for Neural Text Generation [10.306522595622651]
We introduce Dynamic Scheduled Sampling with Imitation Loss (DySI), which maintains the schedule based solely on training-time accuracy.
DySI achieves notable improvements on standard machine translation benchmarks, and significantly improves the robustness of other text generation models.
arXiv Detail & Related papers (2023-01-31T16:41:06Z)
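An accuracy-driven schedule of this kind can be sketched in a few lines; the mapping from batch accuracy to sampling probability below is an illustrative assumption, not DySI's exact rule.

```python
import torch

def sampling_prob_from_accuracy(logits, gold_targets, pad_id=0):
    """Derive the scheduled-sampling probability from training accuracy.

    logits:       (batch, seq_len, vocab) decoder outputs under teacher forcing
    gold_targets: (batch, seq_len) ground-truth target tokens
    """
    predictions = logits.argmax(dim=-1)
    mask = gold_targets.ne(pad_id)                     # ignore padding positions
    accuracy = (predictions.eq(gold_targets) & mask).sum() / mask.sum()
    # The more accurate the model already is, the more often it is fed
    # its own predictions instead of the ground truth.
    return accuracy.item()
```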
- Learning Confidence for Transformer-based Neural Machine Translation [38.679505127679846]
We propose an unsupervised confidence estimate learned jointly with the training of the neural machine translation (NMT) model.
We interpret confidence as the number of hints the NMT model needs to make a correct prediction, where more hints indicate lower confidence.
We demonstrate that our learned confidence estimate achieves high accuracy on extensive sentence/word-level quality estimation tasks.
arXiv Detail & Related papers (2022-03-22T01:51:58Z)
- How Should Pre-Trained Language Models Be Fine-Tuned Towards Adversarial Robustness? [121.57551065856164]
We propose Robust Informative Fine-Tuning (RIFT) as a novel adversarial fine-tuning method from an information-theoretical perspective.
RIFT encourages an objective model to retain the features learned from the pre-trained model throughout the entire fine-tuning process.
Experimental results show that RIFT consistently outperforms state-of-the-art methods on two popular NLP tasks.
arXiv Detail & Related papers (2021-12-22T05:04:41Z)
- Token Drop mechanism for Neural Machine Translation [12.666468105300002]
We propose Token Drop to improve generalization and avoid overfitting for the NMT model.
Similar to word dropout, except that dropped tokens are replaced with a special token instead of having their embeddings set to zero.
arXiv Detail & Related papers (2020-10-21T14:02:27Z)
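The mechanism is simple enough to sketch directly; a hypothetical PyTorch helper (the drop rate, padding handling, and special-token id are assumptions):

```python
import torch

def token_drop(tokens, drop_token_id, drop_prob=0.15, pad_id=0):
    """Randomly replace input tokens with a special token.

    Unlike word dropout, which zeroes out word embeddings, dropped
    positions are overwritten with `drop_token_id`.
    """
    drop_mask = (torch.rand(tokens.shape) < drop_prob) & tokens.ne(pad_id)
    return tokens.masked_fill(drop_mask, drop_token_id)
```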
- Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)
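A confidence-weighted prototype refinement of this kind might look as follows; the blending rule and the source of the confidence weights are illustrative, not the paper's exact meta-learned update.

```python
import torch

def refine_prototypes(prototypes, query_feats, confidence):
    """Refine class prototypes with confidence-weighted query features.

    prototypes:  (C, D) initial per-class prototypes from the support set
    query_feats: (Q, D) embeddings of unlabeled query examples
    confidence:  (Q, C) per-query, per-class weights (e.g., meta-learned)
    """
    weighted_sum = confidence.t() @ query_feats            # (C, D)
    weight_totals = confidence.sum(dim=0).unsqueeze(1)     # (C, 1)
    # Blend each support prototype (weight 1) with its weighted queries.
    return (prototypes + weighted_sum) / (1.0 + weight_totals)
```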