Decoder-Only or Encoder-Decoder? Interpreting Language Model as a
Regularized Encoder-Decoder
- URL: http://arxiv.org/abs/2304.04052v1
- Date: Sat, 8 Apr 2023 15:44:29 GMT
- Title: Decoder-Only or Encoder-Decoder? Interpreting Language Model as a
Regularized Encoder-Decoder
- Authors: Zihao Fu, Wai Lam, Qian Yu, Anthony Man-Cho So, Shengding Hu, Zhiyuan
Liu, Nigel Collier
- Abstract summary: The seq2seq task aims at generating the target sequence based on the given input source sequence.
Traditionally, most seq2seq tasks are solved by an encoder that encodes the source sequence and a decoder that generates the target text.
Recently, a number of new approaches have emerged that apply decoder-only language models directly to the seq2seq task.
- Score: 75.03283861464365
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The sequence-to-sequence (seq2seq) task aims at generating the
target sequence based on the given input source sequence. Traditionally, most
seq2seq tasks are solved by the encoder-decoder framework, which requires an
encoder to encode the source sequence and a decoder to generate the target
text. Recently, a number of new approaches have emerged that apply
decoder-only language models directly to the seq2seq task. Despite the
significant advances in applying language models to the seq2seq task, there is
still a lack of thorough analysis of the effectiveness of the decoder-only
language model architecture. This paper aims to address this gap by conducting
a detailed comparison between the encoder-decoder architecture and the
decoder-only language model framework through the analysis of a regularized
encoder-decoder structure. This structure is designed to replicate all
behaviors of the classical decoder-only language model, but it has an encoder
and a decoder, making it easier to compare with the classical encoder-decoder
structure. Based on this analysis, we unveil the attention degeneration
problem in the language model: as the number of generation steps grows, less
and less attention is focused on the source sequence. To give a quantitative
understanding of this problem, we conduct a theoretical sensitivity analysis
of the attention output with respect to the source input. Grounded in our
analysis, we propose a novel partial attention language model to solve the
attention degeneration problem. Experimental results on machine translation,
summarization, and data-to-text generation tasks support our analysis and
demonstrate the effectiveness of the proposed model.
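To make the attention degeneration claim concrete, the following is a minimal illustrative sketch (Python/NumPy), not the paper's formulation: under the simplifying assumption of random, roughly comparable attention logits, the share of attention mass that the newest decoding step can place on the source shrinks as the generated target grows, while a hypothetical partial-attention-style mixture that reserves part of the mass for a source-only distribution keeps that share bounded away from zero.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
src_len = 32

for tgt_len in (1, 16, 64, 256):
    # Attention of the newest target position over everything it can see:
    # the source tokens plus the previously generated target tokens.
    logits = rng.normal(size=src_len + tgt_len)
    attn = softmax(logits)
    print(f"target length {tgt_len:4d}: "
          f"source attention mass = {attn[:src_len].sum():.3f}")

# Hypothetical partial-attention-style remedy (a simplification, not the
# paper's exact model): reserve half of the attention mass for a distribution
# computed over the source only, so the source share cannot vanish.
for tgt_len in (1, 16, 64, 256):
    logits = rng.normal(size=src_len + tgt_len)
    mixed = 0.5 * softmax(logits)
    mixed[:src_len] += 0.5 * softmax(logits[:src_len])
    print(f"target length {tgt_len:4d}: "
          f"partial-attention source mass = {mixed[:src_len].sum():.3f}")
```

The 50/50 mixing rule and the random-logit assumption are illustrative choices only; the paper's partial attention language model is defined by its own architecture and sensitivity analysis.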
Related papers
- Exploring Automatic Evaluation Methods based on a Decoder-based LLM for
Text Generation [16.78350863261211]
This paper compares various methods, including tuning encoder-based models and using large language models, under equal conditions.
Experimental results show that compared to the tuned encoder-based models, the tuned decoder-based models perform poorly.
It is also revealed that in-context learning of very large decoder-based models such as ChatGPT makes it difficult to identify fine-grained semantic differences.
arXiv Detail & Related papers (2023-10-17T06:53:00Z)
- DecoderLens: Layerwise Interpretation of Encoder-Decoder Transformers [6.405360669408265]
We propose a simple, new method to analyze encoder-decoder Transformers: DecoderLens.
Inspired by the LogitLens (for decoder-only Transformers), this method allows the decoder to cross-attend to representations from intermediate encoder layers.
We report results from the DecoderLens applied to models trained on question answering, logical reasoning, speech recognition and machine translation.
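As a rough illustration of this idea (toy PyTorch modules with random inputs, not the authors' implementation), the sketch below runs an encoder layer by layer and lets a small decoder cross-attend to each intermediate encoder output in turn:

```python
import torch
import torch.nn as nn

d_model, n_heads, n_layers = 64, 4, 4
enc_layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
     for _ in range(n_layers)])
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), num_layers=2)

src = torch.randn(1, 10, d_model)   # stand-in for embedded source tokens
tgt = torch.randn(1, 5, d_model)    # stand-in for embedded target prefix

hidden = src
for i, layer in enumerate(enc_layers):
    hidden = layer(hidden)
    # Let the decoder cross-attend to the i-th intermediate encoder output
    # instead of only the final one, then inspect what it would produce.
    out = decoder(tgt, memory=hidden)
    print(f"decoded from encoder layer {i}: output shape {tuple(out.shape)}")
```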
arXiv Detail & Related papers (2023-10-05T17:04:59Z)
- Is Encoder-Decoder Redundant for Neural Machine Translation? [44.37101354412253]
The encoder-decoder architecture is still the de facto neural network architecture for state-of-the-art models.
In this work, we experiment with bilingual translation, translation with additional target monolingual data, and multilingual translation.
This alternative approach performs on par with the baseline encoder-decoder Transformer, suggesting that an encoder-decoder architecture might be redundant for neural machine translation.
arXiv Detail & Related papers (2022-10-21T08:33:55Z)
- Diffsound: Discrete Diffusion Model for Text-to-sound Generation [78.4128796899781]
We propose a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a decoder, and a vocoder.
The framework first uses the decoder, with the help of the VQ-VAE, to map the text features extracted by the text encoder to a mel-spectrogram, and then uses the vocoder to transform the generated mel-spectrogram into a waveform.
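The data flow can be pictured with the stub-based sketch below, in which every component is a hypothetical placeholder (random arrays standing in for the text encoder, token decoder, VQ-VAE decoder, and vocoder) used only to show the ordering of the stages and the shapes passed between them:

```python
import numpy as np

def text_encoder(text):            # text -> feature sequence (stub)
    return np.random.randn(len(text.split()), 64)

def token_decoder(text_feats):     # text features -> discrete mel-token ids (stub)
    return np.random.randint(0, 256, size=100)

def vqvae_decode(token_ids):       # token ids -> mel-spectrogram frames (stub)
    return np.random.randn(len(token_ids), 80)

def vocoder(mel):                  # mel-spectrogram -> waveform samples (stub)
    return np.random.randn(mel.shape[0] * 256)

mel = vqvae_decode(token_decoder(text_encoder("a dog barks twice")))
audio = vocoder(mel)
print(mel.shape, audio.shape)      # e.g. (100, 80) frames -> 25600 samples
```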
arXiv Detail & Related papers (2022-07-20T15:41:47Z)
- ED2LM: Encoder-Decoder to Language Model for Faster Document Re-ranking Inference [70.36083572306839]
This paper proposes a new training and inference paradigm for re-ranking.
We finetune a pretrained encoder-decoder model in the form of document-to-query generation.
We show that this encoder-decoder architecture can be decomposed into a decoder-only language model during inference.
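A minimal sketch of the practical consequence, with toy PyTorch modules and under simplifying assumptions (it shows only the cache-the-document-and-score-queries aspect, not the paper's actual decoder-only decomposition):

```python
import torch
import torch.nn as nn

d_model, vocab = 64, 1000
embed = nn.Embedding(vocab, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, 4, batch_first=True), num_layers=2)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, 4, batch_first=True), num_layers=2)
lm_head = nn.Linear(d_model, vocab)

doc = torch.randint(0, vocab, (1, 50))
memory = encoder(embed(doc))          # computed once per document, offline

def query_score(query_ids):
    """Log-likelihood of the query tokens given the cached document memory."""
    hidden = decoder(embed(query_ids[:, :-1]), memory)   # teacher-forced prefix
    logp = torch.log_softmax(lm_head(hidden), dim=-1)
    return logp.gather(-1, query_ids[:, 1:, None]).sum().item()

query = torch.randint(0, vocab, (1, 8))
print(query_score(query))             # higher = better match under this toy model
```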
arXiv Detail & Related papers (2022-04-25T06:26:29Z)
- Sparsity and Sentence Structure in Encoder-Decoder Attention of Summarization Systems [38.672160430296536]
Transformer models have achieved state-of-the-art results in a wide range of NLP tasks including summarization.
Previous work has focused on one important bottleneck, the quadratic self-attention mechanism in the encoder.
This work focuses on the transformer's encoder-decoder attention mechanism.
arXiv Detail & Related papers (2021-09-08T19:32:42Z)
- Cross-Thought for Sentence Encoder Pre-training [89.32270059777025]
Cross-Thought is a novel approach to pre-training a sequence encoder.
We train a Transformer-based sequence encoder over a large set of short sequences.
Experiments on question answering and textual entailment tasks demonstrate that our pre-trained encoder can outperform state-of-the-art encoders.
arXiv Detail & Related papers (2020-10-07T21:02:41Z)
- Rethinking and Improving Natural Language Generation with Layer-Wise Multi-View Decoding [59.48857453699463]
In sequence-to-sequence learning, the decoder relies on the attention mechanism to efficiently extract information from the encoder.
Recent work has proposed to use representations from different encoder layers for diversified levels of information.
We propose layer-wise multi-view decoding, where each decoder layer attends to the representations from the last encoder layer, which serve as a global view, supplemented by those from other encoder layers for a stereoscopic view of the source sequences.
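A simplified sketch of this idea with toy PyTorch modules: each decoder layer cross-attends to the last encoder layer's output concatenated with one intermediate encoder layer's output as its supplementary view. The concatenation rule and the layer pairing are assumptions made for illustration; the paper's actual combination may differ.

```python
import torch
import torch.nn as nn

d_model, n_heads, n_layers = 64, 4, 3
enc_layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
     for _ in range(n_layers)])
dec_layers = nn.ModuleList(
    [nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
     for _ in range(n_layers)])

src = torch.randn(1, 12, d_model)   # embedded source tokens (toy)
tgt = torch.randn(1, 6, d_model)    # embedded target prefix (toy)

# Collect every encoder layer's output.
enc_outputs, hidden = [], src
for layer in enc_layers:
    hidden = layer(hidden)
    enc_outputs.append(hidden)
global_view = enc_outputs[-1]       # last encoder layer: the global view

# Decoder layer i cross-attends to the global view concatenated with the
# i-th encoder layer's output, its supplementary view.
out = tgt
for i, layer in enumerate(dec_layers):
    memory = torch.cat([global_view, enc_outputs[i]], dim=1)
    out = layer(out, memory)
print(out.shape)
```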
arXiv Detail & Related papers (2020-05-16T20:00:39Z)
- Bi-Decoder Augmented Network for Neural Machine Translation [108.3931242633331]
We propose a novel Bi-Decoder Augmented Network (BiDAN) for the neural machine translation task.
Since each decoder transforms the representations of the input text into its corresponding language, joint training with two target ends gives the shared encoder the potential to produce a language-independent semantic space.
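A minimal sketch of the shared-encoder, two-decoder setup with toy PyTorch modules (module sizes, vocabularies, and data are placeholders, and the unweighted loss sum is an assumption): one encoder feeds two decoders with different target languages, and the summed loss trains the shared encoder from both target ends.

```python
import torch
import torch.nn as nn

d_model, vocab_a, vocab_b = 64, 1000, 1200
embed_a = nn.Embedding(vocab_a, d_model)          # source / language-A embeddings
embed_b = nn.Embedding(vocab_b, d_model)          # language-B embeddings
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, 4, batch_first=True), num_layers=2)
dec_a = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, 4, batch_first=True), num_layers=2)
dec_b = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, 4, batch_first=True), num_layers=2)
head_a, head_b = nn.Linear(d_model, vocab_a), nn.Linear(d_model, vocab_b)
ce = nn.CrossEntropyLoss()

src = torch.randint(0, vocab_a, (1, 10))
tgt_a = torch.randint(0, vocab_a, (1, 9))         # language-A target end
tgt_b = torch.randint(0, vocab_b, (1, 11))        # language-B target end

memory = encoder(embed_a(src))                    # shared encoder output
loss_a = ce(head_a(dec_a(embed_a(tgt_a), memory)).transpose(1, 2), tgt_a)
loss_b = ce(head_b(dec_b(embed_b(tgt_b), memory)).transpose(1, 2), tgt_b)
(loss_a + loss_b).backward()                      # both ends update the shared encoder
print(loss_a.item(), loss_b.item())
```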
arXiv Detail & Related papers (2020-01-14T02:05:14Z)