Softmax Tempering for Training Neural Machine Translation Models
- URL: http://arxiv.org/abs/2009.09372v1
- Date: Sun, 20 Sep 2020 07:06:22 GMT
- Title: Softmax Tempering for Training Neural Machine Translation Models
- Authors: Raj Dabre and Atsushi Fujita
- Abstract summary: We propose to divide the logits by a temperature coefficient, prior to applying softmax, during training.
In experiments on 11 language pairs, we observed significant improvements in translation quality of up to 3.9 BLEU points.
We also study the impact of softmax tempering on multilingual NMT and recurrently stacked NMT.
- Score: 24.00130933505408
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural machine translation (NMT) models are typically trained using a softmax
cross-entropy loss where the softmax distribution is compared against smoothed
gold labels. In low-resource scenarios, NMT models tend to over-fit because the
softmax distribution quickly approaches the gold label distribution. To address
this issue, we propose to divide the logits by a temperature coefficient, prior
to applying softmax, during training. In our experiments on 11 language pairs
in the Asian Language Treebank dataset and the WMT 2019 English-to-German
translation task, we observed significant improvements in translation quality
of up to 3.9 BLEU points. Furthermore, softmax tempering makes greedy
search as good as beam search decoding in terms of translation quality,
enabling a 1.5 to 3.5 times speed-up. We also study the impact of softmax
tempering on multilingual NMT and recurrently stacked NMT, both of which aim to
reduce NMT model size through parameter sharing, thereby verifying the utility
of temperature in developing compact NMT models. Finally, an analysis of softmax
entropies and gradients reveals the impact of our method on the internal
behavior of NMT models.
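A minimal PyTorch sketch of the training objective described above: label-smoothed cross-entropy in which the logits are divided by a temperature coefficient before the softmax. The function name and the temperature and smoothing values are illustrative assumptions, not the paper's exact recipe.

import torch
import torch.nn.functional as F

def tempered_smoothed_cross_entropy(logits, gold, temperature=2.0, smoothing=0.1):
    # Softmax tempering: dividing the logits by a temperature > 1 flattens the
    # training-time distribution so it approaches the gold labels more slowly.
    # The temperature and smoothing values here are illustrative, not the paper's settings.
    vocab_size = logits.size(-1)
    log_probs = F.log_softmax(logits / temperature, dim=-1)

    # Smoothed gold labels: (1 - smoothing) on the reference token, the rest of
    # the probability mass spread uniformly over the remaining vocabulary.
    target = torch.full_like(log_probs, smoothing / (vocab_size - 1))
    target.scatter_(-1, gold.unsqueeze(-1), 1.0 - smoothing)

    # Cross-entropy between the smoothed labels and the tempered softmax.
    return -(target * log_probs).sum(dim=-1).mean()

At inference the temperature would be dropped and the logits used as-is; this is the setting in which the abstract reports greedy search matching beam search.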
Related papers
- Simultaneous Machine Translation with Large Language Models [51.470478122113356]
We investigate the possibility of applying Large Language Models to SimulMT tasks.
We conducted experiments using the Llama2-7b-chat model on nine different languages from the MuST-C dataset.
The results show that the LLM outperforms dedicated MT models in terms of the BLEU and LAAL metrics.
arXiv Detail & Related papers (2023-09-13T04:06:47Z) - Exploiting Language Relatedness in Machine Translation Through Domain Adaptation Techniques [3.257358540764261]
We present a novel approach that uses a scaled sentence similarity score, especially for related languages, based on a 5-gram KenLM language model.
Our approach yields gains of 2 BLEU points with the multi-domain approach, 3 BLEU points with fine-tuning for NMT, and 2 BLEU points with the iterative back-translation approach.
arXiv Detail & Related papers (2023-03-03T09:07:30Z) - Better Datastore, Better Translation: Generating Datastores from Pre-Trained Models for Nearest Neural Machine Translation [48.58899349349702]
Nearest Neighbor Machine Translation (kNN-MT) is a simple and effective method of augmenting neural machine translation (NMT) with a token-level nearest neighbor retrieval mechanism.
In this paper, we propose PRED, a framework that leverages Pre-trained models for Datastores in kNN-MT.
arXiv Detail & Related papers (2022-12-17T08:34:20Z) - Improving Simultaneous Machine Translation with Monolingual Data [94.1085601198393]
Simultaneous machine translation (SiMT) is usually done via sequence-level knowledge distillation (Seq-KD) from a full-sentence neural machine translation (NMT) model.
We propose to leverage monolingual data to improve SiMT by training a SiMT student on the combination of bilingual data and external monolingual data distilled by Seq-KD.
arXiv Detail & Related papers (2022-12-02T14:13:53Z) - Jam or Cream First? Modeling Ambiguity in Neural Machine Translation with SCONES [10.785577504399077]
We propose to replace the softmax activation with a multi-label classification layer that can model ambiguity more effectively.
We show that the multi-label output layer can still be trained on single reference training data using the SCONES loss function.
We demonstrate that SCONES can be used to train NMT models that assign the highest probability to adequate translations.
arXiv Detail & Related papers (2022-05-02T07:51:37Z) - Exploring Unsupervised Pretraining Objectives for Machine Translation [99.5441395624651]
Unsupervised cross-lingual pretraining has achieved strong results in neural machine translation (NMT).
Most approaches adapt masked-language modeling (MLM) to sequence-to-sequence architectures, by masking parts of the input and reconstructing them in the decoder.
We compare masking with alternative objectives that produce inputs resembling real (full) sentences, by reordering and replacing words based on their context.
arXiv Detail & Related papers (2021-06-10T10:18:23Z) - Reward Optimization for Neural Machine Translation with Learned Metrics [18.633477083783248]
We investigate whether it is beneficial to optimize neural machine translation (NMT) models with the state-of-the-art model-based metric, BLEURT.
Results show that the reward optimization with BLEURT is able to increase the metric scores by a large margin, in contrast to limited gain when training with smoothed BLEU.
arXiv Detail & Related papers (2021-04-15T15:53:31Z) - Translating the Unseen? Yorùbá → English MT in Low-Resource, Morphologically-Unmarked Settings [8.006185289499049]
Translating between languages where certain features are marked morphologically in one but absent or marked contextually in the other is an important test case for machine translation.
In this work, we perform a fine-grained analysis of how an SMT system compares with two NMT systems when translating bare nouns in Yorùbá into English.
arXiv Detail & Related papers (2021-03-07T01:24:09Z) - Improving Target-side Lexical Transfer in Multilingual Neural Machine Translation [104.10726545151043]
Multilingual data has been found to be more beneficial for NMT models that translate from a low-resource language (LRL) into a target language than for those that translate into the LRL.
Our experiments show that DecSDE leads to consistent gains of up to 1.8 BLEU on translation from English to four different languages.
arXiv Detail & Related papers (2020-10-04T19:42:40Z) - On the Inference Calibration of Neural Machine Translation [54.48932804996506]
We study the correlation between calibration and translation performance and linguistic properties of miscalibration.
We propose a new graduated label smoothing method that can improve both inference calibration and translation performance.
arXiv Detail & Related papers (2020-05-03T02:03:56Z)