Token-level Adaptive Training for Neural Machine Translation
- URL: http://arxiv.org/abs/2010.04380v1
- Date: Fri, 9 Oct 2020 05:55:05 GMT
- Title: Token-level Adaptive Training for Neural Machine Translation
- Authors: Shuhao Gu, Jinchao Zhang, Fandong Meng, Yang Feng, Wanying Xie, Jie
Zhou, Dong Yu
- Abstract summary: There exists a token imbalance phenomenon in natural language as different tokens appear with different frequencies.
The vanilla NMT model usually adopts a trivial equal-weighted objective for target tokens regardless of their frequencies.
Low-frequency tokens may carry critical semantic information that will affect the translation quality once they are neglected.
- Score: 84.69646428587548
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There is a token imbalance phenomenon in natural language: different tokens appear with different frequencies, which leads to different learning difficulties for tokens in Neural Machine Translation (NMT). The vanilla NMT model usually adopts a trivial equal-weighted objective for target tokens regardless of their frequencies, and it tends to generate more high-frequency tokens and fewer low-frequency tokens than the gold token distribution. However, low-frequency tokens may carry critical semantic information, and neglecting them degrades translation quality. In this paper, we explore target token-level adaptive objectives based on token frequencies, assigning an appropriate weight to each target token during training. The aim is to give meaningful but relatively low-frequency words larger weights in the objective so that the model pays more attention to them. Our method yields consistent improvements in translation quality on ZH-EN, EN-RO, and EN-DE translation tasks, especially on sentences that contain more low-frequency tokens, where we obtain BLEU gains of 1.68, 1.02, and 0.52 over the baseline, respectively. Further analyses show that our method can
also improve the lexical diversity of translation.
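As a rough illustration of the idea, the PyTorch sketch below down-weights frequent target tokens and up-weights rare ones inside a weighted negative log-likelihood. The 1/log(count + e) heuristic and the function names are placeholders chosen for this example, not the paper's actual frequency-based weighting functions.

```python
# Hedged sketch: frequency-based token-level adaptive weighting for NMT training.
# The weighting heuristic below is an illustrative stand-in for "rarer target
# tokens get larger weights"; it is not the paper's exact formulation.
import math
from collections import Counter

import torch


def build_token_weights(target_corpus, vocab_size, pad_id):
    """Assign each target-vocabulary id a weight that decreases with its corpus frequency."""
    freq = Counter(tok for sent in target_corpus for tok in sent)
    weights = torch.tensor(
        [1.0 / math.log(freq.get(i, 0) + math.e) for i in range(vocab_size)]
    )
    weights[pad_id] = 0.0                              # padding positions contribute nothing
    weights = weights * (vocab_size / weights.sum())   # keep the average weight near 1
    return weights


def adaptive_nll_loss(log_probs, target, token_weights):
    """Token-level weighted negative log-likelihood.

    log_probs:     (batch, seq_len, vocab) decoder log-probabilities
    target:        (batch, seq_len) gold target token ids
    token_weights: (vocab,) precomputed per-type weights
    """
    nll = -log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1)   # (batch, seq_len)
    w = token_weights.to(log_probs.device)[target]                  # per-token weights
    return (w * nll).sum() / w.sum().clamp(min=1e-8)
```

In training, a loss of this shape would replace the usual uniformly weighted cross-entropy over the decoder output.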
Related papers
- LBPE: Long-token-first Tokenization to Improve Large Language Models [26.3619552256488]
Long tokens, rich in semantic information, have fewer occurrences in tokenized datasets compared to short tokens.
We propose LBPE, which prioritizes long tokens during the encoding process.
Experiments across diverse language modeling tasks demonstrate that LBPE consistently outperforms the original BPE (a simplified sketch of the long-token-first idea appears below).
arXiv Detail & Related papers (2024-11-08T12:03:36Z)
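The snippet above does not spell out the LBPE algorithm; as a hedged illustration of the long-token-first intuition, the sketch below segments a word by always preferring the longest vocabulary entry that matches at the current position. Function and variable names are invented for this example.

```python
# Hedged sketch of the "long-token-first" idea: when encoding, prefer the longest
# vocabulary entry matching at the current position instead of replaying BPE merges
# in learned order. A simplification, not the paper's exact algorithm.
def encode_long_token_first(word, vocab, unk="<unk>"):
    tokens, i = [], 0
    while i < len(word):
        # Try the longest possible substring first, shrinking until we find a vocab hit.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:  # not even a single character is in the vocabulary
            tokens.append(unk)
            i += 1
    return tokens


# Example: a long, semantically rich token wins over its shorter pieces.
vocab = {"trans", "lation", "translation", "t", "r", "a", "n", "s", "l", "i", "o"}
print(encode_long_token_first("translation", vocab))  # ['translation']
```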
- The Fair Language Model Paradox [19.439996884827448]
Large Language Models (LLMs) are widely deployed in real-world applications, yet little is known about their training dynamics at the token level.
We show that as weight decay increases, low-frequency tokens are disproportionately depreciated.
This is particularly concerning, as these neglected low-frequency tokens represent the vast majority of the token distribution in most languages.
arXiv Detail & Related papers (2024-10-15T18:47:12Z)
- Scaffold-BPE: Enhancing Byte Pair Encoding for Large Language Models with Simple and Effective Scaffold Token Removal [58.29382184006158]
We propose Scaffold-BPE, which incorporates a dynamic scaffold token removal mechanism by parameter-free, computation-light, and easy-to-implement modifications to the original BPE method.
In extensive experiments on language modeling and machine translation, Scaffold-BPE consistently outperforms the original BPE.
arXiv Detail & Related papers (2024-04-27T07:12:07Z)
- Token-Label Alignment for Vision Transformers [93.58540411138164]
Data mixing strategies (e.g., CutMix) have been shown to greatly improve the performance of convolutional neural networks (CNNs).
We identify a token fluctuation phenomenon that has suppressed the potential of data mixing strategies.
We propose a token-label alignment (TL-Align) method to trace the correspondence between transformed tokens and the original tokens to maintain a label for each token.
arXiv Detail & Related papers (2022-10-12T17:54:32Z)
- A Simple Contrastive Learning Objective for Alleviating Neural Text Degeneration [56.64703901898937]
We propose a new contrastive token learning objective that inherits the advantages of cross-entropy and unlikelihood training.
Comprehensive experiments on language modeling and open-domain dialogue generation tasks show that the proposed contrastive token objective yields less repetitive text (a hedged sketch of such an objective appears below).
arXiv Detail & Related papers (2022-05-05T08:50:50Z)
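As a hedged sketch of what a contrastive token objective can look like, the snippet below contrasts the gold token at each step against the tokens already emitted earlier in the same sentence (a common source of repetition). The choice of negatives and the restricted-softmax form are illustrative assumptions, not the paper's exact loss.

```python
# Hedged sketch of a contrastive token objective: at each decoding step, push the
# gold token's score above the scores of previously generated tokens.
import torch


def contrastive_token_loss(logits, target):
    """logits: (seq_len, vocab); target: (seq_len,) gold token ids for one sentence."""
    losses = []
    for t in range(1, target.size(0)):
        gold = target[t]
        negatives = torch.unique(target[:t])       # tokens already seen in this sentence
        negatives = negatives[negatives != gold]    # the gold token is not its own negative
        if negatives.numel() == 0:
            continue
        scores = torch.cat([logits[t, gold].view(1), logits[t, negatives]])
        # Softmax restricted to {gold} plus negatives: the gold token should win.
        losses.append(-torch.log_softmax(scores, dim=0)[0])
    return torch.stack(losses).mean() if losses else logits.new_zeros(())
```

In practice a term like this would typically be added to the standard cross-entropy loss rather than replace it.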
- How Robust is Neural Machine Translation to Language Imbalance in Multilingual Tokenizer Training? [86.48323488619629]
We analyze how translation performance changes as the data ratios among languages vary in the tokenizer training corpus.
We find that while performance is often somewhat better when languages are more equally sampled, downstream performance is more robust to language imbalance than commonly expected.
arXiv Detail & Related papers (2022-04-29T17:50:36Z)
- Frequency-Aware Contrastive Learning for Neural Machine Translation [24.336356651877388]
Low-frequency word prediction remains a challenge in modern neural machine translation (NMT) systems.
Inspired by the observation that low-frequency words form a more compact embedding space, we tackle this challenge from a representation learning perspective.
We propose a frequency-aware token-level contrastive learning method, in which the hidden state of each decoding step is pushed away from the counterparts of other target words (see the sketch below).
arXiv Detail & Related papers (2021-12-29T10:10:10Z)
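The following is one hedged reading of that idea: decoder hidden states whose gold tokens differ are pushed apart (their positive cosine similarity is penalized), with a heavier penalty when the tokens involved are rare. The rarity weighting and the exact form of the penalty are assumptions for illustration, not the paper's formulation.

```python
# Hedged sketch of a frequency-aware, token-level contrastive penalty.
import math

import torch
import torch.nn.functional as F


def freq_aware_contrastive(hidden, target, token_freq, eps=1e-8):
    """hidden: (n, d) decoder states; target: (n,) gold ids; token_freq: (vocab,) counts."""
    h = F.normalize(hidden, dim=-1)
    sim = h @ h.t()                                               # (n, n) cosine similarities
    diff_token = (target.unsqueeze(0) != target.unsqueeze(1)).float()
    rarity = 1.0 / torch.log(token_freq[target].float() + math.e)  # rarer -> larger weight
    pair_w = rarity.unsqueeze(0) * rarity.unsqueeze(1)
    # Penalize similarity only between states whose gold tokens differ.
    penalty = sim.clamp(min=0.0) * pair_w * diff_token
    return penalty.sum() / (diff_token.sum() + eps)
```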
- Exploring Unsupervised Pretraining Objectives for Machine Translation [99.5441395624651]
Unsupervised cross-lingual pretraining has achieved strong results in neural machine translation (NMT).
Most approaches adapt masked-language modeling (MLM) to sequence-to-sequence architectures, by masking parts of the input and reconstructing them in the decoder.
We compare masking with alternative objectives that produce inputs resembling real (full) sentences, by reordering and replacing words based on their context.
arXiv Detail & Related papers (2021-06-10T10:18:23Z)
- Bilingual Mutual Information Based Adaptive Training for Neural Machine Translation [38.83163343372786]
We propose a novel bilingual mutual information (BMI) based adaptive objective, which measures the learning difficulty for each target token from the perspective of bilingualism.
Experimental results on WMT14 English-to-German and WMT19 Chinese-to-English demonstrate the superiority of our approach compared with the Transformer baseline and previous token-level adaptive training approaches (a hedged sketch follows below).
arXiv Detail & Related papers (2021-05-26T12:54:24Z)
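The snippet below is a hedged sketch of the general recipe: estimate, from sentence-level co-occurrence counts in the parallel corpus, how informative the source words are about each target token, and use that score to weight the token's loss. The average-PMI estimator and the floor value are illustrative simplifications; the paper's BMI definition and weighting scheme may differ.

```python
# Hedged sketch of bilingual-information-style weighting from co-occurrence counts.
import math
from collections import Counter


def build_bmi_stats(parallel_corpus):
    """Count sentence-level (co-)occurrences from (source_tokens, target_tokens) pairs."""
    n, src_c, tgt_c, pair_c = 0, Counter(), Counter(), Counter()
    for src, tgt in parallel_corpus:
        n += 1
        src_set, tgt_set = set(src), set(tgt)
        src_c.update(src_set)
        tgt_c.update(tgt_set)
        pair_c.update((s, t) for s in src_set for t in tgt_set)
    return n, src_c, tgt_c, pair_c


def target_token_weights(src, tgt, stats, floor=0.1):
    """Weight each target token by its average PMI with the source words."""
    n, src_c, tgt_c, pair_c = stats
    weights = []
    for t in tgt:
        pmis = [
            math.log(pair_c[(s, t)] * n / (src_c[s] * tgt_c[t]))
            for s in set(src)
            if pair_c[(s, t)] > 0
        ]
        weights.append(max(floor, sum(pmis) / len(pmis)) if pmis else floor)
    return weights  # multiply each token's NLL by its weight during training
```

The resulting per-token weights would then multiply the per-token negative log-likelihood, analogously to the frequency-based weighting sketched earlier for the main paper.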
- Token Drop mechanism for Neural Machine Translation [12.666468105300002]
We propose Token Drop to improve generalization and avoid overfitting for the NMT model.
It is similar to word dropout, except that dropped tokens are replaced with a special token instead of having their embeddings zeroed out (a minimal sketch follows below).
arXiv Detail & Related papers (2020-10-21T14:02:27Z)
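A minimal sketch of the replacement step described above, assuming a PyTorch batch of token ids and an extra drop-placeholder id in the vocabulary; the drop rate and names are illustrative choices.

```python
# Hedged sketch of Token Drop: randomly replace some input tokens with a special
# placeholder id rather than zeroing their embeddings as word dropout does.
import torch


def token_drop(tokens, drop_id, pad_id, p=0.15):
    """tokens: (batch, seq_len) LongTensor of token ids."""
    mask = (torch.rand_like(tokens, dtype=torch.float) < p) & (tokens != pad_id)
    return torch.where(mask, torch.full_like(tokens, drop_id), tokens)
```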
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.