On the Copying Behaviors of Pre-Training for Neural Machine Translation
- URL: http://arxiv.org/abs/2107.08212v1
- Date: Sat, 17 Jul 2021 10:02:30 GMT
- Title: On the Copying Behaviors of Pre-Training for Neural Machine Translation
- Authors: Xuebo Liu, Longyue Wang, Derek F. Wong, Liang Ding, Lidia S. Chao,
Shuming Shi, Zhaopeng Tu
- Abstract summary: Previous studies have shown that initializing neural machine translation (NMT) models with pre-trained language models (LMs) can speed up model training and boost model performance.
In this work, we identify a critical side-effect of pre-training for NMT, which is due to the discrepancy between the training objectives of LM-based pre-training and NMT.
We propose a simple and effective method named copying penalty to control the copying behaviors in decoding.
- Score: 63.914940899327966
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Previous studies have shown that initializing neural machine translation
(NMT) models with pre-trained language models (LMs) can speed up model training
and boost model performance. In this work, we identify a critical
side-effect of pre-training for NMT, which is due to the discrepancy between
the training objectives of LM-based pre-training and NMT. Since the LM
objective learns to reconstruct a few source tokens and copy most of them, the
pre-training initialization would affect the copying behaviors of NMT models.
We provide a quantitative analysis of copying behaviors by introducing a metric
called copying ratio, which empirically shows that pre-training based NMT
models have a larger copying ratio than the standard one. In response to this
problem, we propose a simple and effective method named copying penalty to
control the copying behaviors in decoding. Extensive experiments on both
in-domain and out-of-domain benchmarks show that the copying penalty method
consistently improves translation performance by controlling copying behaviors
for pre-training based NMT models. Source code is freely available at
https://github.com/SunbowLiu/CopyingPenalty.
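The abstract does not spell out how the copying ratio or the copying penalty are defined, so the following is only a minimal sketch of the idea. It assumes the copying ratio is the fraction of generated target tokens that also occur verbatim in the source sentence, and that the copying penalty subtracts a tunable constant from the scores of source tokens at each decoding step. The names `copying_ratio` and `apply_copying_penalty` and the exact penalty form are illustrative assumptions, not the released implementation.
```python
# Minimal sketch (assumptions, not the released CopyingPenalty code):
# - copying ratio  = fraction of generated target tokens that also occur in the source
# - copying penalty = a scalar subtracted from the log-probabilities of source tokens
#   at every decoding step, discouraging verbatim copying.

from typing import List
import torch


def copying_ratio(source_tokens: List[str], hypothesis_tokens: List[str]) -> float:
    """Fraction of hypothesis tokens that are verbatim copies of source tokens."""
    if not hypothesis_tokens:
        return 0.0
    source_set = set(source_tokens)
    copied = sum(1 for tok in hypothesis_tokens if tok in source_set)
    return copied / len(hypothesis_tokens)


def apply_copying_penalty(log_probs: torch.Tensor,
                          source_ids: torch.Tensor,
                          penalty: float = 1.0) -> torch.Tensor:
    """Subtract `penalty` from the log-probability of every vocabulary item
    that occurs in the source sentence, before the next decoding step.

    log_probs:  (vocab_size,) log-probabilities for the next target token
    source_ids: (src_len,)    vocabulary ids of the source sentence
    """
    penalized = log_probs.clone()
    penalized[source_ids.unique()] -= penalty
    return penalized


if __name__ == "__main__":
    src = "der Hund schläft".split()
    hyp = "der Hund sleeps".split()
    print(f"copying ratio: {copying_ratio(src, hyp):.2f}")  # 0.67

    log_probs = torch.log_softmax(torch.randn(32000), dim=-1)
    src_ids = torch.tensor([11, 42, 7])
    penalized = apply_copying_penalty(log_probs, src_ids, penalty=1.0)
```
In practice such a penalty would sit inside the beam-search scorer of an NMT toolkit, with the penalty weight tuned on a validation set; the repository linked above contains the authors' actual formulation.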
Related papers
- Language Models "Grok" to Copy [36.50007948478452]
We examine the pre-training dynamics of language models, focusing on their ability to copy text from preceding context.
We propose a novel perspective that Transformer-based language models develop copying abilities similarly to grokking.
We contend that the connection between grokking and context copying can provide valuable insights for more effective language model training.
arXiv Detail & Related papers (2024-09-14T03:11:00Z)
- A Scalable and Efficient Iterative Method for Copying Machine Learning Classifiers [0.802904964931021]
This paper introduces a novel sequential approach that significantly reduces the amount of computational resources needed to train or maintain a copy of a machine learning model.
The effectiveness of the sequential approach is demonstrated through experiments with synthetic and real-world datasets, showing significant reductions in time and resources, while maintaining or improving accuracy.
arXiv Detail & Related papers (2023-02-06T10:07:41Z)
- Masked Autoencoders As The Unified Learners For Pre-Trained Sentence Representation [77.47617360812023]
We extend the recently proposed MAE style pre-training strategy, RetroMAE, to support a wide variety of sentence representation tasks.
The first stage performs RetroMAE over generic corpora, like Wikipedia, BookCorpus, etc., from which the base model is learned.
The second stage takes place on domain-specific data, e.g., MS MARCO and NLI, where the base model is continually trained with RetroMAE and contrastive learning.
arXiv Detail & Related papers (2022-07-30T14:34:55Z)
- End-to-End Training for Back-Translation with Categorical Reparameterization Trick [0.0]
Back-translation is an effective semi-supervised learning framework in neural machine translation (NMT).
A pre-trained NMT model translates monolingual sentences and generates synthetic bilingual sentence pairs for training the other NMT model.
The discrete property of translated sentences prevents gradient information from flowing between the two NMT models (an illustrative sketch of the reparameterization trick appears after this list).
arXiv Detail & Related papers (2022-02-17T06:31:03Z)
- Language Modeling, Lexical Translation, Reordering: The Training Process of NMT through the Lens of Classical SMT [64.1841519527504]
Neural machine translation uses a single neural network to model the entire translation process.
Despite neural machine translation being the de-facto standard, it is still not clear how NMT models acquire different competences over the course of training.
arXiv Detail & Related papers (2021-09-03T09:38:50Z)
- Exploring Unsupervised Pretraining Objectives for Machine Translation [99.5441395624651]
Unsupervised cross-lingual pretraining has achieved strong results in neural machine translation (NMT).
Most approaches adapt masked-language modeling (MLM) to sequence-to-sequence architectures, by masking parts of the input and reconstructing them in the decoder.
We compare masking with alternative objectives that produce inputs resembling real (full) sentences, by reordering and replacing words based on their context.
arXiv Detail & Related papers (2021-06-10T10:18:23Z)
- Is BERT a Cross-Disciplinary Knowledge Learner? A Surprising Finding of Pre-trained Models' Transferability [74.11825654535895]
We investigate whether the power of the models pre-trained on text data, such as BERT, can be transferred to general token sequence classification applications.
We find that even on non-text data, the models pre-trained on text converge faster than randomly initialized models.
arXiv Detail & Related papers (2021-03-12T09:19:14Z)
- LogME: Practical Assessment of Pre-trained Models for Transfer Learning [80.24059713295165]
The Logarithm of Maximum Evidence (LogME) can be used to assess pre-trained models for transfer learning.
Compared to brute-force fine-tuning, LogME brings over $3000\times$ speedup in wall-clock time.
arXiv Detail & Related papers (2021-02-22T13:58:11Z)
- Reinforced Curriculum Learning on Pre-trained Neural Machine Translation Models [20.976165305749777]
We learn a curriculum for improving a pre-trained NMT model by re-selecting influential data samples from the original training set.
We propose a data selection framework based on Deterministic Actor-Critic, in which a critic network predicts the expected change of model performance.
arXiv Detail & Related papers (2020-04-13T03:40:44Z)
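The back-translation entry above notes that the discreteness of translated sentences blocks gradient flow between the two NMT models, which a categorical reparameterization trick addresses. As a rough illustration under that assumption (not necessarily that paper's exact formulation), the sketch below uses the straight-through Gumbel-Softmax estimator: the forward pass emits one-hot token choices while gradients flow through the soft relaxation, so the second model's loss can reach the first model's parameters.
```python
import torch
import torch.nn.functional as F


def straight_through_gumbel_softmax(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Sample near-one-hot token distributions from `logits` while keeping a
    differentiable path for backpropagation.

    logits:  (batch, seq_len, vocab_size) decoder scores of the first NMT model
    returns: one-hot in the forward pass, soft Gumbel-Softmax sample in backward
    """
    # hard=True gives the straight-through estimator:
    # forward = one-hot, backward = gradient of the soft relaxation.
    return F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)


if __name__ == "__main__":
    batch, seq_len, vocab = 2, 5, 100
    logits = torch.randn(batch, seq_len, vocab, requires_grad=True)
    one_hot = straight_through_gumbel_softmax(logits, tau=0.5)

    # Differentiable "embedding lookup" for the second NMT model:
    # a matrix product with the embedding table instead of an index lookup.
    embedding = torch.nn.Embedding(vocab, 32)
    embedded = one_hot @ embedding.weight   # (batch, seq_len, 32)
    embedded.sum().backward()               # gradients reach `logits`
    print(logits.grad is not None)          # True
```
Feeding the one-hot matrix through a matrix product with the embedding table, rather than a discrete index lookup, is what keeps the path between the two models differentiable.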