Guiding Teacher Forcing with Seer Forcing for Neural Machine Translation
- URL: http://arxiv.org/abs/2106.06751v1
- Date: Sat, 12 Jun 2021 11:38:40 GMT
- Title: Guiding Teacher Forcing with Seer Forcing for Neural Machine Translation
- Authors: Yang Feng, Shuhao Gu, Dengji Guo, Zhengxin Yang, Chenze Shao
- Abstract summary: We introduce another decoder, called the seer decoder, into the encoder-decoder framework during training.
We force the conventional decoder to simulate the behaviors of the seer decoder via knowledge distillation.
Experiments show our method significantly outperforms competitive baselines and achieves greater improvements on larger data sets.
- Score: 11.570746514243117
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Although teacher forcing has become the main training paradigm for neural
machine translation, it usually makes predictions conditioned only on past
information and hence lacks global planning for the future. To address this
problem, we introduce another decoder, called the seer decoder, into the
encoder-decoder framework during training, which incorporates future information
into target predictions. Meanwhile, we force the conventional decoder to simulate
the behaviors of the seer decoder via knowledge distillation. In this way, at
test time the conventional decoder can perform like the seer decoder without its
presence. Experiment results on the Chinese-English, English-German and
English-Romanian translation tasks show that our method significantly outperforms
competitive baselines and achieves greater improvements on larger data sets.
Besides, the experiments also show that knowledge distillation is the best way
to transfer knowledge from the seer decoder to the conventional decoder,
compared with adversarial learning and L2 regularization.
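A minimal sketch of the training objective described above, assuming a standard
PyTorch setup: the conventional (student) decoder is trained with the usual
teacher-forcing cross-entropy plus a knowledge-distillation term that pulls its
per-token distributions toward those of the seer decoder. The function name,
the `alpha` weight, and the tensor shapes are illustrative assumptions, not the
paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def seer_distillation_loss(student_logits, seer_logits, target_ids,
                           alpha=0.5, pad_id=0):
    """Cross-entropy on gold targets plus KL distillation from the seer decoder.

    student_logits, seer_logits: [batch, tgt_len, vocab]
    target_ids:                  [batch, tgt_len]
    """
    vocab = student_logits.size(-1)
    # Standard teacher-forcing cross-entropy for the conventional decoder.
    ce = F.cross_entropy(student_logits.reshape(-1, vocab),
                         target_ids.reshape(-1),
                         ignore_index=pad_id)
    # Distillation: per-position KL between the seer (teacher) distribution
    # and the conventional (student) distribution.
    kl = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  F.softmax(seer_logits.detach(), dim=-1),
                  reduction="none").sum(-1)
    mask = target_ids.ne(pad_id).float()
    kl = (kl * mask).sum() / mask.sum()
    return (1 - alpha) * ce + alpha * kl
```

Only the conventional decoder is kept at test time, so inference cost is
unchanged; the seer decoder exists solely to supply teacher distributions
during training.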
Related papers
- Think Twice before Driving: Towards Scalable Decoders for End-to-End Autonomous Driving [74.28510044056706]
Existing methods usually adopt the decoupled encoder-decoder paradigm.
In this work, we aim to alleviate the problem with two principles.
We first predict a coarse-grained future position and action based on the encoder features.
Then, conditioned on the position and action, the future scene is imagined to check the ramifications of driving accordingly.
arXiv Detail & Related papers (2023-05-10T15:22:02Z)
- Decoder-Only or Encoder-Decoder? Interpreting Language Model as a Regularized Encoder-Decoder [75.03283861464365]
The seq2seq task aims at generating the target sequence based on the given input source sequence.
Traditionally, most seq2seq tasks are resolved by an encoder that encodes the source sequence and a decoder that generates the target text.
Recently, a number of new approaches have emerged that apply decoder-only language models directly to the seq2seq task.
arXiv Detail & Related papers (2023-04-08T15:44:29Z)
- Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data [145.95460945321253]
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C reduces the word error rate (WER) by a relative 19.2% over the method without decoder pre-training.
arXiv Detail & Related papers (2022-03-31T15:33:56Z)
- Look Backward and Forward: Self-Knowledge Distillation with Bidirectional Decoder for Neural Machine Translation [9.279287354043289]
Self-Knowledge Distillation with Bidirectional Decoder for Neural Machine Translation (SBD-NMT)
We deploy a backward decoder, which acts as an effective regularization method for the forward decoder.
Experiments show that our method is significantly better than the strong Transformer baselines on multiple machine translation data sets.
arXiv Detail & Related papers (2022-03-10T09:21:28Z)
- Adversarial Neural Networks for Error Correcting Codes [76.70040964453638]
We introduce a general framework to boost the performance and applicability of machine learning (ML) models.
We propose to combine ML decoders with a competing discriminator network that tries to distinguish between codewords and noisy words.
Our framework is game-theoretic and motivated by generative adversarial networks (GANs).
arXiv Detail & Related papers (2021-12-21T19:14:44Z)
- DeltaLM: Encoder-Decoder Pre-training for Language Generation and Translation by Augmenting Pretrained Multilingual Encoders [92.90543340071007]
We introduce DeltaLM, a pretrained multilingual encoder-decoder model.
Specifically, we augment the pretrained multilingual encoder with a decoder and pre-train it in a self-supervised way.
Experiments show that DeltaLM outperforms various strong baselines on both natural language generation and translation tasks.
arXiv Detail & Related papers (2021-06-25T16:12:10Z)
- Stacked Acoustic-and-Textual Encoding: Integrating the Pre-trained Models into Speech Translation Encoders [30.160261563657947]
Speech-to-translation data is scarce; pre-training is promising in end-to-end Speech Translation.
We propose a Stacked Acoustic-and-Textual (SATE) method for speech translation.
Our encoder begins with processing the acoustic sequence as usual, but later behaves more like an MT encoder for a global representation of the input sequence.
arXiv Detail & Related papers (2021-05-12T16:09:53Z)
- Bi-Decoder Augmented Network for Neural Machine Translation [108.3931242633331]
We propose a novel Bi-Decoder Augmented Network (BiDAN) for the neural machine translation task.
Since each decoder transforms the representations of the input text into its corresponding language, jointly training with two target ends can give the shared encoder the potential to produce a language-independent semantic space; a minimal sketch of this shared-encoder, dual-decoder setup follows this list.
arXiv Detail & Related papers (2020-01-14T02:05:14Z)
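As a toy illustration of the bi-decoder idea above: one shared encoder feeds two
language-specific decoders, and the model is trained on the sum of both
translation losses. The class and argument names are hypothetical and only
sketch the joint-training setup, not the BiDAN architecture itself.

```python
import torch.nn as nn

class BiDecoderSketch(nn.Module):
    """Toy shared-encoder, dual-decoder model: one encoding of the source is
    reused by two decoders, each targeting a different language."""

    def __init__(self, encoder, decoder_a, decoder_b):
        super().__init__()
        self.encoder = encoder      # shared across both target ends
        self.decoder_a = decoder_a  # e.g. source -> language A, returns a loss
        self.decoder_b = decoder_b  # e.g. source -> language B, returns a loss

    def forward(self, src_ids, tgt_a_ids, tgt_b_ids):
        memory = self.encoder(src_ids)  # one representation serves both decoders
        loss_a = self.decoder_a(memory, tgt_a_ids)
        loss_b = self.decoder_b(memory, tgt_b_ids)
        # Joint training nudges the shared encoder toward representations that
        # work for both target languages.
        return loss_a + loss_b
```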