Jam or Cream First? Modeling Ambiguity in Neural Machine Translation with SCONES
- URL: http://arxiv.org/abs/2205.00704v1
- Date: Mon, 2 May 2022 07:51:37 GMT
- Title: Jam or Cream First? Modeling Ambiguity in Neural Machine Translation with SCONES
- Authors: Felix Stahlberg and Shankar Kumar
- Abstract summary: We propose to replace the softmax activation with a multi-label classification layer that can model ambiguity more effectively.
We show that the multi-label output layer can still be trained on single reference training data using the SCONES loss function.
We demonstrate that SCONES can be used to train NMT models that assign the highest probability to adequate translations.
- Score: 10.785577504399077
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The softmax layer in neural machine translation is designed to model the
distribution over mutually exclusive tokens. Machine translation, however, is
intrinsically uncertain: the same source sentence can have multiple
semantically equivalent translations. Therefore, we propose to replace the
softmax activation with a multi-label classification layer that can model
ambiguity more effectively. We call our loss function Single-label Contrastive
Objective for Non-Exclusive Sequences (SCONES). We show that the multi-label
output layer can still be trained on single reference training data using the
SCONES loss function. SCONES yields consistent BLEU score gains across six
translation directions, particularly for medium-resource language pairs and
small beam sizes. By using smaller beam sizes we can speed up inference by a
factor of 3.9x and still match or improve the BLEU score obtained using
softmax. Furthermore, we demonstrate that SCONES can be used to train NMT
models that assign the highest probability to adequate translations, thus
mitigating the "beam search curse". Additional experiments on synthetic
language pairs with varying levels of uncertainty suggest that the improvements
from SCONES can be attributed to better handling of ambiguity.
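The abstract describes replacing the softmax with a sigmoid-based, multi-label output layer trained from a single reference per sentence. Below is a minimal sketch of what such a single-label contrastive objective could look like in PyTorch; the function name scones_loss and the negative-term weight alpha are illustrative assumptions, not the authors' exact formulation.
```python
import torch
import torch.nn.functional as F

def scones_loss(logits: torch.Tensor, targets: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Sketch of a single-label contrastive loss over non-exclusive (sigmoid) token scores.

    logits:  (batch, seq_len, vocab) unnormalized scores from the output projection
    targets: (batch, seq_len) reference token ids (one reference translation per sentence)
    alpha:   assumed weight balancing the negative-label term against the positive one
    """
    # Positive term: push the sigmoid score of the observed reference token towards 1.
    pos_logits = logits.gather(-1, targets.unsqueeze(-1)).squeeze(-1)    # (batch, seq_len)
    pos_loss = -F.logsigmoid(pos_logits)

    # Negative term: push the sigmoid scores of all other vocabulary items towards 0.
    # log(1 - sigmoid(x)) == logsigmoid(-x), computed in a numerically stable way.
    neg_loss = -F.logsigmoid(-logits)
    neg_loss = neg_loss.scatter(-1, targets.unsqueeze(-1), 0.0).sum(-1)  # drop the reference token

    return (pos_loss + alpha * neg_loss).mean()
```
Because the per-token sigmoid scores are not normalized against each other, an adequate alternative translation does not have to compete with the reference for probability mass, which is presumably how the layer models ambiguity.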
Related papers
- Can the Variation of Model Weights be used as a Criterion for Self-Paced Multilingual NMT? [7.330978520551704]
Many-to-one neural machine translation systems improve over one-to-one systems when training data is scarce.
In this paper, we design and test a novel algorithm for selecting the language of minibatches when training such systems.
arXiv Detail & Related papers (2024-10-05T12:52:51Z)
- Towards Effective Disambiguation for Machine Translation with Large Language Models [65.80775710657672]
We study the capabilities of large language models to translate "ambiguous sentences".
Experiments show that our methods can match or outperform state-of-the-art systems such as DeepL and NLLB in four out of five language directions.
arXiv Detail & Related papers (2023-09-20T22:22:52Z)
- Focus on the Target's Vocabulary: Masked Label Smoothing for Machine Translation [25.781293857729864]
Masked Label Smoothing (MLS) is a new mechanism that masks the soft label probability of source-side words to zero (a sketch of this masking appears after this list).
Our experiments show that MLS consistently yields improvement over original label smoothing on different datasets.
arXiv Detail & Related papers (2022-03-06T07:01:39Z)
- Anticipation-free Training for Simultaneous Translation [70.85761141178597]
Simultaneous translation (SimulMT) speeds up the translation process by starting to translate before the source sentence is completely available.
Existing methods increase latency or introduce adaptive read-write policies for SimulMT models to handle local reordering and improve translation quality.
We propose a new framework that decomposes the translation process into the monotonic translation step and the reordering step.
arXiv Detail & Related papers (2022-01-30T16:29:37Z)
- Improving Multilingual Translation by Representation and Gradient Regularization [82.42760103045083]
We propose a joint approach to regularize NMT models at both representation-level and gradient-level.
Our results demonstrate that our approach is highly effective in both reducing off-target translation occurrences and improving zero-shot translation performance.
arXiv Detail & Related papers (2021-09-10T10:52:21Z)
- Modelling Latent Translations for Cross-Lingual Transfer [47.61502999819699]
We propose a new technique that integrates both steps of the traditional pipeline (translation and classification) into a single model.
We evaluate our novel latent translation-based model on a series of multilingual NLU tasks.
We report gains for both zero-shot and few-shot learning setups, up to 2.7 accuracy points on average.
arXiv Detail & Related papers (2021-07-23T17:11:27Z)
- Exploring Unsupervised Pretraining Objectives for Machine Translation [99.5441395624651]
Unsupervised cross-lingual pretraining has achieved strong results in neural machine translation (NMT).
Most approaches adapt masked-language modeling (MLM) to sequence-to-sequence architectures, by masking parts of the input and reconstructing them in the decoder.
We compare masking with alternative objectives that produce inputs resembling real (full) sentences, by reordering and replacing words based on their context.
arXiv Detail & Related papers (2021-06-10T10:18:23Z)
- Sparse Attention with Linear Units [60.399814410157425]
We introduce a novel, simple method for achieving sparsity in attention: we replace the softmax activation with a ReLU (a sketch of this substitution appears after this list).
Our model, which we call Rectified Linear Attention (ReLA), is easy to implement and more efficient than previously proposed sparse attention mechanisms.
Our analysis shows that ReLA delivers high sparsity rate and head diversity, and the induced cross attention achieves better accuracy with respect to source-target word alignment.
arXiv Detail & Related papers (2021-04-14T17:52:38Z)
- Smoothing and Shrinking the Sparse Seq2Seq Search Space [2.1828601975620257]
We show that entmax-based models effectively solve the "cat got your tongue" problem.
We also generalize label smoothing to the broader family of Fenchel-Young losses.
Our resulting label-smoothed entmax loss models set a new state of the art on multilingual grapheme-to-phoneme conversion.
arXiv Detail & Related papers (2021-03-18T14:45:38Z)
- It's Easier to Translate out of English than into it: Measuring Neural Translation Difficulty by Cross-Mutual Information [90.35685796083563]
Cross-mutual information (XMI) is an asymmetric information-theoretic metric of machine translation difficulty (a sketch of one reading of XMI appears after this list).
XMI exploits the probabilistic nature of most neural machine translation models.
We present the first systematic and controlled study of cross-lingual translation difficulties using modern neural translation systems.
arXiv Detail & Related papers (2020-05-05T17:38:48Z)
- Multi-layer Representation Fusion for Neural Machine Translation [38.12309528346962]
We propose a multi-layer representation fusion (MLRF) approach to fusing stacked layers.
In particular, we design three fusion functions to learn a better representation from the stack.
The result is a new state of the art in German-English translation.
arXiv Detail & Related papers (2020-02-16T23:53:07Z)
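For the Masked Label Smoothing entry above, a rough sketch of the described masking, under the reading that the smoothing mass is simply redistributed away from source-side tokens, is given below; the helper masked_label_smoothing and the way the source-side vocabulary is passed in are assumptions, not the paper's implementation.
```python
import torch

def masked_label_smoothing(targets: torch.Tensor, vocab_size: int,
                           source_token_ids: torch.Tensor, epsilon: float = 0.1) -> torch.Tensor:
    """Sketch: smoothed target distributions whose smoothing mass skips source-side tokens.

    targets:          (batch, seq_len) gold target token ids
    source_token_ids: 1-D tensor of token ids that appear on the source side (assumed given)
    """
    batch, seq_len = targets.shape
    mask = torch.ones(batch, seq_len, vocab_size)
    mask[:, :, source_token_ids] = 0.0                          # source-side words get zero soft label
    mask.scatter_(-1, targets.unsqueeze(-1), 0.0)               # the gold token is handled separately
    smooth = epsilon * mask / mask.sum(-1, keepdim=True)        # spread epsilon over the remaining tokens
    smooth.scatter_(-1, targets.unsqueeze(-1), 1.0 - epsilon)   # gold token keeps the main mass
    return smooth
```
A standard label-smoothed cross-entropy can then be computed against these distributions in place of the usual uniform smoothing.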
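For the Sparse Attention with Linear Units entry, the stated substitution (ReLU in place of softmax) can be sketched as follows; the original model also applies extra normalization to keep training stable, which this bare-bones version omits, and the function name rectified_linear_attention is an assumption.
```python
import math
import torch

def rectified_linear_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Sketch of attention with the softmax replaced by a ReLU, giving exactly-zero (sparse) weights.

    q: (batch, tgt_len, d)    k, v: (batch, src_len, d)
    """
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.size(-1))  # (batch, tgt_len, src_len)
    weights = torch.relu(scores)       # negative scores become exact zeros, so many weights vanish
    return torch.matmul(weights, v)    # unnormalized weighted sum of values (see note above)
```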
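For the cross-mutual information entry, one common reading of the metric (an assumption here, not a quote from that paper) is the gap between the cross-entropy a target-side language model assigns to the references and the cross-entropy the translation model assigns given the source:
```python
def cross_mutual_information(lm_cross_entropy: float, mt_cross_entropy: float) -> float:
    """Sketch: XMI(S -> T) as the drop in target cross-entropy obtained by conditioning on the source.

    lm_cross_entropy: average cross-entropy (e.g. bits per token) of a target-side language model
    mt_cross_entropy: average cross-entropy of the translation model given the source sentences
    """
    # Larger values mean the source helps more; the metric is asymmetric because swapping the
    # translation direction changes both models.
    return lm_cross_entropy - mt_cross_entropy
```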
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.