Confidence-Aware Scheduled Sampling for Neural Machine Translation
- URL: http://arxiv.org/abs/2107.10427v1
- Date: Thu, 22 Jul 2021 02:49:04 GMT
- Title: Confidence-Aware Scheduled Sampling for Neural Machine Translation
- Authors: Yijin Liu, Fandong Meng, Yufeng Chen, Jinan Xu and Jie Zhou
- Abstract summary: We propose confidence-aware scheduled sampling for neural machine translation.
We quantify real-time model competence by the confidence of model predictions.
Our approach significantly outperforms the Transformer and vanilla scheduled sampling on both translation quality and convergence speed.
- Score: 25.406119773503786
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scheduled sampling is an effective method to alleviate the exposure bias problem in neural machine translation. It simulates the inference scenario by randomly replacing ground-truth target input tokens with predicted ones during training. Despite its success, its schedule strategies are based solely on training steps and ignore real-time model competence, which limits its potential performance and convergence speed. To address this issue, we
propose confidence-aware scheduled sampling. Specifically, we quantify
real-time model competence by the confidence of model predictions, based on
which we design fine-grained schedule strategies. In this way, the model is exposed to predicted tokens at high-confidence positions while still receiving ground-truth tokens at low-confidence positions. Moreover, we observe that vanilla scheduled sampling tends to degenerate into the original teacher-forcing mode, since most predicted tokens are identical to the ground-truth tokens. Therefore, under the above confidence-aware strategy, we further expose noisy tokens (e.g., wordy or incorrectly ordered ones) instead of predicted ones
for high-confidence token positions. We evaluate our approach on the
Transformer and conduct experiments on large-scale WMT 2014 English-German, WMT
2014 English-French, and WMT 2019 Chinese-English. Results show that our
approach significantly outperforms the Transformer and vanilla scheduled
sampling on both translation quality and convergence speed.
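The strategy described above amounts to a two-pass training step: a first decoding pass yields per-position predictions and confidences, which then decide whether the second pass is fed the prediction, the ground truth, or a noisy token at each position. Below is a minimal PyTorch sketch of that input-mixing step; the helper name, the fixed confidence threshold, and the `noisy_inputs` argument are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def confidence_aware_inputs(logits, gold_inputs, threshold=0.5,
                            noise_prob=0.0, noisy_inputs=None):
    """Build decoder inputs for the second training pass (hypothetical helper).

    logits:      (batch, seq_len, vocab) first-pass decoder outputs
    gold_inputs: (batch, seq_len) ground-truth target input tokens
    """
    probs = F.softmax(logits, dim=-1)
    confidence, predictions = probs.max(dim=-1)  # per-position model confidence
    # High-confidence positions are exposed to the model's own predictions;
    # low-confidence positions keep the ground truth.
    mixed = torch.where(confidence > threshold, predictions, gold_inputs)
    # Optionally swap some high-confidence positions for noisy tokens
    # (e.g., repeated or reordered ones) so that training does not
    # degenerate into plain teacher forcing.
    if noisy_inputs is not None and noise_prob > 0:
        use_noise = (confidence > threshold) & (torch.rand_like(confidence) < noise_prob)
        mixed = torch.where(use_noise, noisy_inputs, mixed)
    return mixed
```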
Related papers
- Semformer: Transformer Language Models with Semantic Planning [18.750863564495006]
Next-token prediction serves as the dominant component in current neural language models.
We introduce Semformer, a novel method of training a Transformer language model that explicitly models the semantic planning of the response.
arXiv Detail & Related papers (2024-09-17T12:54:34Z)
- SMURF-THP: Score Matching-based UnceRtainty quantiFication for Transformer Hawkes Process [76.98721879039559]
We propose SMURF-THP, a score-based method for learning the Transformer Hawkes process and quantifying prediction uncertainty.
Specifically, SMURF-THP learns the score function of events' arrival time based on a score-matching objective.
We conduct extensive experiments in both event type prediction and uncertainty quantification of arrival time.
arXiv Detail & Related papers (2023-10-25T03:33:45Z)
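SMURF-THP's score-matching idea can be illustrated generically: perturb observed arrival times with Gaussian noise and train a score network to match the score of the perturbation kernel. The sketch below is a standard denoising score-matching objective under that assumption, not the paper's exact loss; `score_fn` and `sigma` are placeholders.

```python
import torch

def denoising_score_matching_loss(score_fn, arrival_times, sigma=0.1):
    """Generic denoising score-matching loss over event arrival times.

    score_fn:      a network mapping perturbed times to estimated scores
    arrival_times: (batch, num_events) observed event times
    """
    noise = torch.randn_like(arrival_times) * sigma
    perturbed = arrival_times + noise
    # Score of the Gaussian perturbation kernel N(t, sigma^2) at t + noise.
    target_score = -noise / sigma ** 2
    return ((score_fn(perturbed) - target_score) ** 2).mean()
```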
- Making Pre-trained Language Models both Task-solvers and Self-calibrators [52.98858650625623]
Pre-trained language models (PLMs) serve as backbones for various real-world systems.
Previous work shows that introducing an extra calibration task can mitigate poorly calibrated confidence estimates.
We propose a training algorithm, LM-TOAST, to tackle these challenges.
arXiv Detail & Related papers (2023-07-21T02:51:41Z)
- How to Estimate Model Transferability of Pre-Trained Speech Models? [84.11085139766108]
"Score-based assessment" framework for estimating transferability of pre-trained speech models.
We leverage upon two representation theories, Bayesian likelihood estimation and optimal transport, to generate rank scores for the PSM candidates.
Our framework efficiently computes transferability scores without actual fine-tuning of candidate models or layers.
arXiv Detail & Related papers (2023-06-01T04:52:26Z)
- CTC-based Non-autoregressive Speech Translation [51.37920141751813]
We investigate the potential of connectionist temporal classification for non-autoregressive speech translation.
We develop a model consisting of two encoders that are guided by CTC to predict the source and target texts.
Experiments on the MuST-C benchmarks show that our NAST model achieves an average BLEU score of 29.5 with a speed-up of 5.67×.
arXiv Detail & Related papers (2023-05-27T03:54:09Z)
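As a minimal, self-contained illustration of the CTC supervision used above (random tensors stand in for real encoder outputs; the shapes and vocabulary size are arbitrary):

```python
import torch
import torch.nn as nn

# CTC loss over per-frame vocabulary log-probabilities, as produced by an encoder.
vocab_size, blank_id = 1000, 0
ctc = nn.CTCLoss(blank=blank_id, zero_infinity=True)

T, B, L = 50, 4, 12                                    # frames, batch, target length
log_probs = torch.randn(T, B, vocab_size).log_softmax(dim=-1)
targets = torch.randint(1, vocab_size, (B, L))         # token ids, excluding blank
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), L, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
```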
- Dynamic Scheduled Sampling with Imitation Loss for Neural Text Generation [10.306522595622651]
We introduce Dynamic Scheduled Sampling with Imitation Loss (DySI), which maintains the schedule based solely on training-time accuracy.
DySI achieves notable improvements on standard machine translation benchmarks, and significantly improves the robustness of other text generation models.
arXiv Detail & Related papers (2023-01-31T16:41:06Z)
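An accuracy-driven schedule of this kind can be sketched in a few lines; the mapping from batch accuracy to sampling probability below is an illustrative assumption, not DySI's exact rule.

```python
import torch

def sampling_prob_from_accuracy(logits, gold_targets, pad_id=0):
    """Derive the scheduled-sampling probability from training accuracy.

    logits:       (batch, seq_len, vocab) decoder outputs under teacher forcing
    gold_targets: (batch, seq_len) ground-truth target tokens
    """
    predictions = logits.argmax(dim=-1)
    mask = gold_targets.ne(pad_id)                     # ignore padding positions
    accuracy = (predictions.eq(gold_targets) & mask).sum() / mask.sum()
    # The more accurate the model already is, the more often it is fed
    # its own predictions instead of the ground truth.
    return accuracy.item()
```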
- Learning Confidence for Transformer-based Neural Machine Translation [38.679505127679846]
We propose an unsupervised confidence estimate learned jointly with the training of the neural machine translation (NMT) model.
We interpret confidence as the number of hints the NMT model needs to make a correct prediction, where more hints indicate lower confidence.
We demonstrate that our learned confidence estimate achieves high accuracy on extensive sentence/word-level quality estimation tasks.
arXiv Detail & Related papers (2022-03-22T01:51:58Z)
- How Should Pre-Trained Language Models Be Fine-Tuned Towards Adversarial Robustness? [121.57551065856164]
We propose Robust Informative Fine-Tuning (RIFT) as a novel adversarial fine-tuning method from an information-theoretical perspective.
RIFT encourages an objective model to retain the features learned from the pre-trained model throughout the entire fine-tuning process.
Experimental results show that RIFT consistently outperforms state-of-the-art methods on two popular NLP tasks.
arXiv Detail & Related papers (2021-12-22T05:04:41Z)
- Token Drop mechanism for Neural Machine Translation [12.666468105300002]
We propose Token Drop to improve generalization and avoid overfitting for the NMT model.
Similar to word dropout, except that dropped tokens are replaced with a special token instead of having their embeddings set to zero.
arXiv Detail & Related papers (2020-10-21T14:02:27Z)
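The mechanism is simple enough to sketch directly; a hypothetical PyTorch helper (the drop rate, padding handling, and special-token id are assumptions):

```python
import torch

def token_drop(tokens, drop_token_id, drop_prob=0.15, pad_id=0):
    """Randomly replace input tokens with a special token.

    Unlike word dropout, which zeroes out word embeddings, dropped
    positions are overwritten with `drop_token_id`.
    """
    drop_mask = (torch.rand(tokens.shape) < drop_prob) & tokens.ne(pad_id)
    return tokens.masked_fill(drop_mask, drop_token_id)
```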
- Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)
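A confidence-weighted prototype refinement of this kind might look as follows; the blending rule and the source of the confidence weights are illustrative, not the paper's exact meta-learned update.

```python
import torch

def refine_prototypes(prototypes, query_feats, confidence):
    """Refine class prototypes with confidence-weighted query features.

    prototypes:  (C, D) initial per-class prototypes from the support set
    query_feats: (Q, D) embeddings of unlabeled query examples
    confidence:  (Q, C) per-query, per-class weights (e.g., meta-learned)
    """
    weighted_sum = confidence.t() @ query_feats            # (C, D)
    weight_totals = confidence.sum(dim=0).unsqueeze(1)     # (C, 1)
    # Blend each support prototype (weight 1) with its weighted queries.
    return (prototypes + weighted_sum) / (1.0 + weight_totals)
```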