Look Backward and Forward: Self-Knowledge Distillation with
Bidirectional Decoder for Neural Machine Translation
- URL: http://arxiv.org/abs/2203.05248v2
- Date: Fri, 11 Mar 2022 01:57:26 GMT
- Title: Look Backward and Forward: Self-Knowledge Distillation with
Bidirectional Decoder for Neural Machine Translation
- Authors: Xuanwei Zhang and Libin Shen and Disheng Pan and Liang Wang and Yanjun
Miao
- Abstract summary: Self-Knowledge Distillation with Bidirectional Decoder for Neural Machine Translation (SBD-NMT).
We deploy a backward decoder which can act as an effective regularization method for the forward decoder.
Experiments show that our method is significantly better than strong Transformer baselines on multiple machine translation datasets.
- Score: 9.279287354043289
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural Machine Translation (NMT) models are usually trained with a
unidirectional decoder, which corresponds to optimizing one-step-ahead
prediction. However, this kind of unidirectional decoding framework may tend to
focus on local structure rather than global coherence. To alleviate this
problem, we propose a novel method, Self-Knowledge Distillation with
Bidirectional Decoder for Neural Machine Translation (SBD-NMT). We deploy a
backward decoder which can act as an effective regularization method for the
forward decoder. By leveraging the backward decoder's information about the
longer-term future, distilling knowledge learned in the backward decoder can
encourage auto-regressive NMT models to plan ahead. Experiments show that our
method is significantly better than strong Transformer baselines on multiple
machine translation datasets.
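To make the training objective concrete, here is a minimal sketch of how a backward decoder can regularize a forward decoder through self-knowledge distillation. It assumes a standard PyTorch Transformer stack, a shared source/target vocabulary, simple sequence reversal for the right-to-left direction, and hand-chosen `alpha`/`tau` weights; the exact position alignment and loss weighting used by SBD-NMT may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BidirectionalDistillationNMT(nn.Module):
    """Shared encoder, a forward (L2R) decoder, a backward (R2L) decoder, and
    a distillation term that pushes the forward decoder toward the backward
    decoder's future-aware distributions (hypothetical sketch, not the
    authors' implementation)."""

    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6,
                 alpha=0.5, tau=1.0):
        super().__init__()
        # One embedding table shared by source and target for brevity.
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.fwd_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.bwd_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.proj = nn.Linear(d_model, vocab_size)
        self.alpha = alpha  # assumed weight of the distillation term
        self.tau = tau      # assumed distillation temperature

    def _decode(self, decoder, tgt_in, memory):
        # Standard causal (autoregressive) mask for teacher forcing.
        length = tgt_in.size(1)
        causal = torch.triu(
            torch.full((length, length), float("-inf"), device=tgt_in.device),
            diagonal=1)
        hidden = decoder(self.embed(tgt_in), memory, tgt_mask=causal)
        return self.proj(hidden)  # (batch, length, vocab)

    def forward(self, src, tgt_in, tgt_out, tgt_in_rev, tgt_out_rev):
        # tgt_in / tgt_out: usual left-to-right teacher-forcing pair.
        # tgt_in_rev / tgt_out_rev: the same sentences prepared right-to-left
        # by the data pipeline (padding handling omitted for brevity).
        memory = self.encoder(self.embed(src))
        fwd_logits = self._decode(self.fwd_decoder, tgt_in, memory)
        bwd_logits = self._decode(self.bwd_decoder, tgt_in_rev, memory)

        ce_fwd = F.cross_entropy(fwd_logits.transpose(1, 2), tgt_out)
        ce_bwd = F.cross_entropy(bwd_logits.transpose(1, 2), tgt_out_rev)

        # Self-distillation: re-reverse the backward decoder's outputs along
        # the time axis so position t lines up with the forward decoder's
        # prediction for the same target token, then match distributions.
        teacher = F.softmax(bwd_logits.flip(1).detach() / self.tau, dim=-1)
        student = F.log_softmax(fwd_logits / self.tau, dim=-1)
        kd = F.kl_div(student, teacher, reduction="batchmean") * self.tau ** 2

        return ce_fwd + ce_bwd + self.alpha * kd
```

At inference time only the forward decoder would be used; in this reading, the backward decoder serves purely as a training-time teacher that regularizes the forward decoder toward globally coherent continuations.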
Related papers
- Think Twice before Driving: Towards Scalable Decoders for End-to-End
Autonomous Driving [74.28510044056706]
Existing methods usually adopt the decoupled encoder-decoder paradigm.
In this work, we aim to alleviate the problem by two principles.
We first predict a coarse-grained future position and action based on the encoder features.
Then, conditioned on the position and action, the future scene is imagined to check the ramification if we drive accordingly.
arXiv Detail & Related papers (2023-05-10T15:22:02Z)
- Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data [145.95460945321253]
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C reduces the word error rate (WER) by a relative 19.2% over the method without decoder pre-training.
arXiv Detail & Related papers (2022-03-31T15:33:56Z)
- Back from the future: bidirectional CTC decoding using future information in speech recognition [3.386091225912298]
We propose a simple but effective method to decode the output of the Connectionist Temporal Classification (CTC) model using a bi-directional neural language model.
The proposed method based on bi-directional beam search takes advantage of the CTC greedy decoding output to represent the noisy future information.
arXiv Detail & Related papers (2021-10-07T10:42:02Z)
- DeltaLM: Encoder-Decoder Pre-training for Language Generation and Translation by Augmenting Pretrained Multilingual Encoders [92.90543340071007]
We introduce DeltaLM, a pretrained multilingual encoder-decoder model.
Specifically, we augment the pretrained multilingual encoder with a decoder and pre-train it in a self-supervised way.
Experiments show that DeltaLM outperforms various strong baselines on both natural language generation and translation tasks.
arXiv Detail & Related papers (2021-06-25T16:12:10Z)
- Guiding Teacher Forcing with Seer Forcing for Neural Machine Translation [11.570746514243117]
We introduce another decoder, called seer decoder, into the encoder-decoder framework during training.
We force the conventional decoder to simulate the behaviors of the seer decoder via knowledge distillation.
Experiments show our method significantly outperforms competitive baselines and achieves greater improvements on larger datasets.
arXiv Detail & Related papers (2021-06-12T11:38:40Z)
- Cross-Thought for Sentence Encoder Pre-training [89.32270059777025]
Cross-Thought is a novel approach to pre-training a sequence encoder.
We train a Transformer-based sequence encoder over a large set of short sequences.
Experiments on question answering and textual entailment tasks demonstrate that our pre-trained encoder can outperform state-of-the-art encoders.
arXiv Detail & Related papers (2020-10-07T21:02:41Z)
- On the Sub-Layer Functionalities of Transformer Decoder [74.83087937309266]
We study how Transformer-based decoders leverage information from the source and target languages.
Based on these insights, we demonstrate that the residual feed-forward module in each Transformer decoder layer can be dropped with minimal loss of performance.
arXiv Detail & Related papers (2020-10-06T11:50:54Z)
- Bi-Decoder Augmented Network for Neural Machine Translation [108.3931242633331]
We propose a novel Bi-Decoder Augmented Network (BiDAN) for the neural machine translation task.
Since each decoder transforms the representations of the input text into its corresponding language, jointly training with two target ends gives the shared encoder the potential to produce a language-independent semantic space (a minimal joint-loss sketch follows this list).
arXiv Detail & Related papers (2020-01-14T02:05:14Z)
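As a rough illustration of the shared-encoder, two-decoder training described for BiDAN above, the fragment below simply sums the cross-entropy losses of two decoders that translate the same source into two different target ends. The function name, the equal loss weighting, and the decoder call signatures are assumptions of this sketch, not the authors' implementation.

```python
import torch.nn.functional as F


def bi_decoder_step(encoder, decoder_a, decoder_b,
                    src, tgt_a_in, tgt_a_out, tgt_b_in, tgt_b_out):
    """One joint training step: a shared encoder feeds two language-specific
    decoders, so both target ends back-propagate into the same encoder."""
    memory = encoder(src)                   # shared source representation
    logits_a = decoder_a(tgt_a_in, memory)  # (batch, len_a, vocab_a)
    logits_b = decoder_b(tgt_b_in, memory)  # (batch, len_b, vocab_b)
    loss_a = F.cross_entropy(logits_a.transpose(1, 2), tgt_a_out)
    loss_b = F.cross_entropy(logits_b.transpose(1, 2), tgt_b_out)
    # Equal weighting is an assumption; the joint gradient signal is what
    # pushes the encoder toward a more language-independent semantic space.
    return loss_a + loss_b
```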
This list is automatically generated from the titles and abstracts of the papers on this site.