Softmax Tempering for Training Neural Machine Translation Models
- URL: http://arxiv.org/abs/2009.09372v1
- Date: Sun, 20 Sep 2020 07:06:22 GMT
- Title: Softmax Tempering for Training Neural Machine Translation Models
- Authors: Raj Dabre and Atsushi Fujita
- Abstract summary: We propose to divide the logits by a temperature coefficient, prior to applying softmax, during training.
In experiments on 11 language pairs, we observed significant improvements in translation quality of up to 3.9 BLEU points.
We also study the impact of softmax tempering on multilingual NMT and recurrently stacked NMT.
- Score: 24.00130933505408
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural machine translation (NMT) models are typically trained using a softmax
cross-entropy loss where the softmax distribution is compared against smoothed
gold labels. In low-resource scenarios, NMT models tend to over-fit because the
softmax distribution quickly approaches the gold label distribution. To address
this issue, we propose to divide the logits by a temperature coefficient, prior
to applying softmax, during training. In our experiments on 11 language pairs
in the Asian Language Treebank dataset and the WMT 2019 English-to-German
translation task, we observed significant improvements in translation quality
of up to 3.9 BLEU points. Furthermore, softmax tempering makes greedy
search as good as beam search decoding in terms of translation quality,
enabling a 1.5 to 3.5 times speed-up. We also study the impact of softmax
tempering on multilingual NMT and recurrently stacked NMT, both of which aim to
reduce NMT model size through parameter sharing, thereby verifying the utility
of temperature in developing compact NMT models. Finally, an analysis of softmax
entropies and gradients reveals the impact of our method on the internal
behavior of NMT models.
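A minimal PyTorch sketch of the training objective described above: label-smoothed cross-entropy in which the logits are divided by a temperature coefficient before the softmax. The function name and the temperature and smoothing values are illustrative assumptions, not the paper's exact recipe.

import torch
import torch.nn.functional as F

def tempered_smoothed_cross_entropy(logits, gold, temperature=2.0, smoothing=0.1):
    # Softmax tempering: dividing the logits by a temperature > 1 flattens the
    # training-time distribution so it approaches the gold labels more slowly.
    # The temperature and smoothing values here are illustrative, not the paper's settings.
    vocab_size = logits.size(-1)
    log_probs = F.log_softmax(logits / temperature, dim=-1)

    # Smoothed gold labels: (1 - smoothing) on the reference token, the rest of
    # the probability mass spread uniformly over the remaining vocabulary.
    target = torch.full_like(log_probs, smoothing / (vocab_size - 1))
    target.scatter_(-1, gold.unsqueeze(-1), 1.0 - smoothing)

    # Cross-entropy between the smoothed labels and the tempered softmax.
    return -(target * log_probs).sum(dim=-1).mean()

At inference the temperature would be dropped and the logits used as-is; this is the setting in which the abstract reports greedy search matching beam search.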
Related papers
- Simultaneous Machine Translation with Large Language Models [51.470478122113356]
We investigate the possibility of applying Large Language Models to SimulMT tasks.
We conducted experiments using the Llama2-7b-chat model on nine different languages from the MuST-C dataset.
The results show that the LLM outperforms dedicated MT models in terms of the BLEU and LAAL metrics.
arXiv Detail & Related papers (2023-09-13T04:06:47Z) - Exploiting Language Relatedness in Machine Translation Through Domain Adaptation Techniques [3.257358540764261]
We present a novel approach that uses a scaled sentence similarity score, especially for related languages, based on a 5-gram KenLM language model.
Our approach yields gains of 2 BLEU points with the multi-domain approach, 3 BLEU points with fine-tuning for NMT, and 2 BLEU points with the iterative back-translation approach.
arXiv Detail & Related papers (2023-03-03T09:07:30Z) - Better Datastore, Better Translation: Generating Datastores from Pre-Trained Models for Nearest Neural Machine Translation [48.58899349349702]
Nearest Neighbor Machine Translation (kNN-MT) is a simple and effective method of augmenting neural machine translation (NMT) with a token-level nearest neighbor retrieval mechanism.
In this paper, we propose PRED, a framework that leverages Pre-trained models for Datastores in kNN-MT.
arXiv Detail & Related papers (2022-12-17T08:34:20Z) - Improving Simultaneous Machine Translation with Monolingual Data [94.1085601198393]
Simultaneous machine translation (SiMT) is usually done via sequence-level knowledge distillation (Seq-KD) from a full-sentence neural machine translation (NMT) model.
We propose to leverage monolingual data to improve SiMT by training a SiMT student on the combination of bilingual data and external monolingual data distilled by Seq-KD.
arXiv Detail & Related papers (2022-12-02T14:13:53Z) - Jam or Cream First? Modeling Ambiguity in Neural Machine Translation with SCONES [10.785577504399077]
We propose to replace the softmax activation with a multi-label classification layer that can model ambiguity more effectively.
We show that the multi-label output layer can still be trained on single reference training data using the SCONES loss function.
We demonstrate that SCONES can be used to train NMT models that assign the highest probability to adequate translations.
arXiv Detail & Related papers (2022-05-02T07:51:37Z) - Exploring Unsupervised Pretraining Objectives for Machine Translation [99.5441395624651]
Unsupervised cross-lingual pretraining has achieved strong results in neural machine translation (NMT).
Most approaches adapt masked-language modeling (MLM) to sequence-to-sequence architectures, by masking parts of the input and reconstructing them in the decoder.
We compare masking with alternative objectives that produce inputs resembling real (full) sentences, by reordering and replacing words based on their context.
arXiv Detail & Related papers (2021-06-10T10:18:23Z) - Reward Optimization for Neural Machine Translation with Learned Metrics [18.633477083783248]
We investigate whether it is beneficial to optimize neural machine translation (NMT) models with the state-of-the-art model-based metric, BLEURT.
Results show that the reward optimization with BLEURT is able to increase the metric scores by a large margin, in contrast to limited gain when training with smoothed BLEU.
arXiv Detail & Related papers (2021-04-15T15:53:31Z) - Translating the Unseen? Yorùbá → English MT in Low-Resource, Morphologically-Unmarked Settings [8.006185289499049]
Translating between languages where certain features are marked morphologically in one but absent or marked contextually in the other is an important test case for machine translation.
In this work, we perform a fine-grained analysis of how an SMT system compares with two NMT systems when translating bare nouns in Yorùbá into English.
arXiv Detail & Related papers (2021-03-07T01:24:09Z) - Improving Target-side Lexical Transfer in Multilingual Neural Machine Translation [104.10726545151043]
Multilingual data has been found to be more beneficial for NMT models that translate from a low-resource language (LRL) into a target language than for those that translate into the LRL.
Our experiments show that DecSDE leads to consistent gains of up to 1.8 BLEU on translation from English to four different languages.
arXiv Detail & Related papers (2020-10-04T19:42:40Z) - On the Inference Calibration of Neural Machine Translation [54.48932804996506]
We study the correlation between calibration and translation performance and linguistic properties of miscalibration.
We propose a new graduated label smoothing method that can improve both inference calibration and translation performance.
arXiv Detail & Related papers (2020-05-03T02:03:56Z)