DeltaLM: Encoder-Decoder Pre-training for Language Generation and
Translation by Augmenting Pretrained Multilingual Encoders
- URL: http://arxiv.org/abs/2106.13736v1
- Date: Fri, 25 Jun 2021 16:12:10 GMT
- Authors: Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, Alexandre Muzio,
Saksham Singhal, Hany Hassan Awadalla, Xia Song, Furu Wei
- Abstract summary: We introduce DeltaLM, a pretrained multilingual encoder-decoder model.
Specifically, we augment the pretrained multilingual encoder with a decoder and pre-train it in a self-supervised way.
Experiments show that DeltaLM outperforms various strong baselines on both natural language generation and translation tasks.
- Score: 92.90543340071007
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While pretrained encoders have achieved success in various natural language
understanding (NLU) tasks, there is a gap between these pretrained encoders and
natural language generation (NLG). NLG tasks are often based on the
encoder-decoder framework, where the pretrained encoders can only benefit part
of it. To reduce this gap, we introduce DeltaLM, a pretrained multilingual
encoder-decoder model that regards the decoder as the task layer of
off-the-shelf pretrained encoders. Specifically, we augment the pretrained
multilingual encoder with a decoder and pre-train it in a self-supervised way.
To take advantage of both the large-scale monolingual data and bilingual data,
we adopt the span corruption and translation span corruption as the
pre-training tasks. Experiments show that DeltaLM outperforms various strong
baselines on both natural language generation and translation tasks, including
machine translation, abstractive text summarization, data-to-text, and question
generation.
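
The two pre-training objectives can be made concrete with a short sketch. The Python snippet below is an illustrative approximation rather than the authors' implementation; the sentinel-token format, masking ratio, and span length are assumptions in the style of T5-like span corruption. Span corruption masks random spans of a monolingual sequence and trains the decoder to reconstruct them, while translation span corruption applies the same corruption to a concatenated translation pair so that bilingual data is used as well.

```python
import random

def span_corrupt(tokens, mask_ratio=0.15, mean_span_len=3, seed=None):
    """Replace random spans with sentinel tokens (T5-style, illustrative only).

    Returns (corrupted_source, reconstruction_target).
    """
    rng = random.Random(seed)
    budget = max(1, int(len(tokens) * mask_ratio))  # how many tokens to mask in total
    source, target = [], []
    i, sentinel_id, masked = 0, 0, 0
    while i < len(tokens):
        if masked < budget and rng.random() < mask_ratio:
            span = max(1, min(mean_span_len, len(tokens) - i))
            sentinel = f"<extra_id_{sentinel_id}>"
            source.append(sentinel)              # span is hidden on the encoder side
            target.append(sentinel)
            target.extend(tokens[i:i + span])    # and reconstructed on the decoder side
            sentinel_id += 1
            masked += span
            i += span
        else:
            source.append(tokens[i])
            i += 1
    return source, target

def translation_span_corrupt(src_tokens, tgt_tokens, **kwargs):
    """Apply span corruption to the concatenated translation pair, so masked
    spans may come from either language (illustrative sketch)."""
    return span_corrupt(src_tokens + ["</s>"] + tgt_tokens, **kwargs)

src = "DeltaLM augments a pretrained multilingual encoder with a decoder".split()
tgt = "DeltaLM erweitert einen vortrainierten mehrsprachigen Encoder um einen Decoder".split()
print(span_corrupt(src, seed=0))
print(translation_span_corrupt(src, tgt, seed=0))
```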
Related papers
- Decoder-Only or Encoder-Decoder? Interpreting Language Model as a
Regularized Encoder-Decoder [75.03283861464365]
The seq2seq task aims to generate a target sequence from a given source sequence.
Traditionally, most seq2seq tasks are solved with an encoder that encodes the source sequence and a decoder that generates the target text.
Recently, a number of new approaches apply decoder-only language models directly to the seq2seq task.
arXiv Detail & Related papers (2023-04-08T15:44:29Z)
- Is Encoder-Decoder Redundant for Neural Machine Translation? [44.37101354412253]
The encoder-decoder architecture is still the de facto neural network architecture for state-of-the-art models.
In this work, we experiment with bilingual translation, translation with additional target monolingual data, and multilingual translation.
This alternative approach performs on par with the baseline encoder-decoder Transformer, suggesting that an encoder-decoder architecture might be redundant for neural machine translation.
arXiv Detail & Related papers (2022-10-21T08:33:55Z)
- Multilingual Neural Machine Translation with Deep Encoder and Multiple Shallow Decoders [77.2101943305862]
We propose a deep encoder with multiple shallow decoders (DEMSD) where each shallow decoder is responsible for a disjoint subset of target languages.
A DEMSD model with 2-layer decoders obtains a 1.8x speedup on average compared to a standard Transformer model, with no drop in translation quality.
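
As a rough picture of the deep-encoder/shallow-decoders idea, the sketch below uses standard PyTorch modules; the layer counts, language groups, and class names are illustrative assumptions, not the paper's configuration. The shared deep encoder runs once per source sentence, and the target language selects one of several 2-layer decoders.

```python
import torch
import torch.nn as nn

class DEMSDSketch(nn.Module):
    """Deep shared encoder plus several shallow decoders, one per language group."""

    def __init__(self, vocab=32000, d_model=512, enc_layers=12, dec_layers=2,
                 lang_groups=({"de", "fr"}, {"zh", "ja"})):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=enc_layers)
        # One shallow decoder per disjoint subset of target languages.
        self.decoders = nn.ModuleList(
            nn.TransformerDecoder(
                nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
                num_layers=dec_layers)
            for _ in lang_groups)
        self.lang2dec = {lang: i for i, group in enumerate(lang_groups) for lang in group}
        self.proj = nn.Linear(d_model, vocab)

    def forward(self, src_ids, tgt_ids, tgt_lang):
        memory = self.encoder(self.embed(src_ids))        # deep shared encoding
        decoder = self.decoders[self.lang2dec[tgt_lang]]   # route to a shallow decoder
        return self.proj(decoder(self.embed(tgt_ids), memory))

model = DEMSDSketch()
logits = model(torch.randint(0, 32000, (2, 7)), torch.randint(0, 32000, (2, 5)), "de")
print(logits.shape)  # torch.Size([2, 5, 32000])
```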
arXiv Detail & Related papers (2022-06-05T01:15:04Z)
- CPT: A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Generation [38.02741711554989]
Chinese Pre-trained Unbalanced Transformer (CPT) is designed for both natural language understanding (NLU) and natural language generation (NLG) tasks.
CPT consists of three parts: a shared encoder, an understanding decoder, and a generation decoder.
With a partially shared architecture and multi-task pre-training, CPT can learn task-specific knowledge for both NLU and NLG with its two decoders.
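
The three-part design can be sketched structurally as follows. This is a hedged approximation with made-up sizes and module names, not the released CPT code: a shared encoder feeds either a shallow understanding branch with a classification head or a shallow autoregressive generation branch with a language-model head, depending on the task.

```python
import torch
import torch.nn as nn

class CPTSketch(nn.Module):
    """Shared encoder with an understanding branch and a generation branch (illustrative)."""

    def __init__(self, vocab=21128, d_model=768, n_labels=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.shared_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True), num_layers=10)
        # Understanding branch: shallow bidirectional stack plus a classifier head.
        self.u_decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True), num_layers=2)
        self.classifier = nn.Linear(d_model, n_labels)
        # Generation branch: shallow autoregressive decoder with a language-model head.
        self.g_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=12, batch_first=True), num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab)

    def forward(self, src_ids, task, tgt_ids=None):
        memory = self.shared_encoder(self.embed(src_ids))
        if task == "nlu":
            return self.classifier(self.u_decoder(memory)[:, 0])  # first-token pooling
        return self.lm_head(self.g_decoder(self.embed(tgt_ids), memory))

model = CPTSketch()
print(model(torch.randint(0, 21128, (2, 8)), "nlu").shape)   # torch.Size([2, 2])
print(model(torch.randint(0, 21128, (2, 8)), "nlg",
            torch.randint(0, 21128, (2, 4))).shape)           # torch.Size([2, 4, 21128])
```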
arXiv Detail & Related papers (2021-09-13T06:25:45Z)
- Guiding Teacher Forcing with Seer Forcing for Neural Machine Translation [11.570746514243117]
We introduce another decoder, called the seer decoder, into the encoder-decoder framework during training.
We force the conventional decoder to simulate the behaviors of the seer decoder via knowledge distillation.
Experiments show that our method significantly outperforms competitive baselines and achieves larger improvements on bigger datasets.
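
The distillation from the seer decoder to the conventional decoder can be written as one combined loss term. The function below is an illustrative sketch; the tensor names, KL formulation, and mixing weight are assumptions rather than the paper's exact objective. The conventional decoder is trained on the references while also being pulled toward the seer decoder's output distribution.

```python
import torch
import torch.nn.functional as F

def seer_distillation_loss(student_logits, seer_logits, target_ids,
                           alpha=0.5, temperature=1.0, pad_id=0):
    """Cross-entropy on references plus KL toward the (detached) seer decoder.

    student_logits, seer_logits: (batch, tgt_len, vocab); target_ids: (batch, tgt_len).
    """
    vocab = student_logits.size(-1)
    ce = F.cross_entropy(student_logits.view(-1, vocab), target_ids.view(-1),
                         ignore_index=pad_id)
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(seer_logits.detach() / temperature, dim=-1),  # seer acts as teacher
        reduction="batchmean")
    return (1 - alpha) * ce + alpha * kd

# Toy usage with random tensors.
student, seer = torch.randn(2, 5, 100), torch.randn(2, 5, 100)
refs = torch.randint(1, 100, (2, 5))
print(seer_distillation_loss(student, seer, refs))
```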
arXiv Detail & Related papers (2021-06-12T11:38:40Z)
- Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network [99.03895740754402]
We propose a two-stream decoupled encoder-decoder design, consisting of a decoupled cross-modal encoder and decoder.
As an alternative, we propose a primary scheduled sampling strategy that mitigates the discrepancy between training and inference by pretraining the encoder-decoder in a two-pass manner.
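
Scheduled sampling, in general, mixes ground-truth tokens with the model's own predictions during training to reduce the mismatch with inference; the two-pass idea mentioned above can be sketched as below. The `decoder` callable and the sampling probability are hypothetical placeholders rather than the paper's cross-modal setup.

```python
import torch

def two_pass_scheduled_sampling(decoder, memory, gold_ids, sample_prob=0.25):
    """Pass 1 decodes with gold inputs; pass 2 mixes in pass-1 predictions.

    decoder(input_ids, memory) -> logits of shape (batch, tgt_len, vocab).
    """
    with torch.no_grad():
        first_pass_logits = decoder(gold_ids, memory)
        predicted = first_pass_logits.argmax(dim=-1)
    # Randomly replace a fraction of gold inputs with the model's own predictions.
    replace = torch.rand(gold_ids.shape, device=gold_ids.device) < sample_prob
    mixed_ids = torch.where(replace, predicted, gold_ids)
    return decoder(mixed_ids, memory)  # second pass, used for the training loss

# Toy usage with a stand-in decoder that ignores the encoder memory.
toy_decoder = lambda ids, mem: torch.randn(ids.size(0), ids.size(1), 50)
print(two_pass_scheduled_sampling(toy_decoder, None, torch.randint(0, 50, (2, 6))).shape)
```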
arXiv Detail & Related papers (2021-01-27T17:36:57Z)
- On the Sub-Layer Functionalities of Transformer Decoder [74.83087937309266]
We study how Transformer-based decoders leverage information from the source and target languages.
Based on these insights, we demonstrate that the residual feed-forward module in each Transformer decoder layer can be dropped with minimal loss of performance.
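
The observation that the residual feed-forward module can be dropped is easy to picture with a stripped-down decoder layer. The PyTorch module below is an illustrative sketch rather than the paper's code: only the masked self-attention and cross-attention sub-layers remain, each with its residual connection and layer normalization.

```python
import torch
import torch.nn as nn

class DecoderLayerNoFFN(nn.Module):
    """Transformer decoder layer with the feed-forward sub-layer removed (illustrative)."""

    def __init__(self, d_model=512, nhead=8, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, tgt, memory, tgt_mask=None):
        # Masked self-attention over the target prefix, with residual + norm.
        sa, _ = self.self_attn(tgt, tgt, tgt, attn_mask=tgt_mask)
        tgt = self.norm1(tgt + self.dropout(sa))
        # Cross-attention over the encoder output, with residual + norm.
        ca, _ = self.cross_attn(tgt, memory, memory)
        # The position-wise feed-forward block is intentionally omitted.
        return self.norm2(tgt + self.dropout(ca))

layer = DecoderLayerNoFFN()
out = layer(torch.randn(2, 5, 512), torch.randn(2, 7, 512))
print(out.shape)  # torch.Size([2, 5, 512])
```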
arXiv Detail & Related papers (2020-10-06T11:50:54Z)
- Bi-Decoder Augmented Network for Neural Machine Translation [108.3931242633331]
We propose a novel Bi-Decoder Augmented Network (BiDAN) for the neural machine translation task.
Since each decoder transforms the representations of the input text into its corresponding language, joint training with two target ends encourages the shared encoder to produce a language-independent semantic space.
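
One way to read the bi-decoder training is as a sum of two sequence losses over a shared encoder. The sketch below is a hedged illustration with hypothetical encoder/decoder callables, not the BiDAN implementation: one decoder generates the target language, the other reconstructs the source, and the combined gradient pushes the shared encoder toward a language-independent representation.

```python
import torch.nn.functional as F

def bidan_joint_loss(encoder, dec_tgt, dec_src, src_ids, tgt_ids, pad_id=0):
    """Joint loss over two decoders that share one encoder (illustrative).

    encoder(src_ids) -> memory; each decoder(prefix_ids, memory) -> (batch, len, vocab).
    dec_tgt generates the target language, dec_src reconstructs the source language.
    """
    memory = encoder(src_ids)  # shared representation fed to both decoders

    def ce(logits, labels):
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               labels.reshape(-1), ignore_index=pad_id)

    # Teacher forcing: predict position t+1 from the prefix up to position t.
    loss_translate = ce(dec_tgt(tgt_ids[:, :-1], memory), tgt_ids[:, 1:])
    loss_reconstruct = ce(dec_src(src_ids[:, :-1], memory), src_ids[:, 1:])
    return loss_translate + loss_reconstruct
```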
arXiv Detail & Related papers (2020-01-14T02:05:14Z)