On the Sub-Layer Functionalities of Transformer Decoder
- URL: http://arxiv.org/abs/2010.02648v1
- Date: Tue, 6 Oct 2020 11:50:54 GMT
- Title: On the Sub-Layer Functionalities of Transformer Decoder
- Authors: Yilin Yang, Longyue Wang, Shuming Shi, Prasad Tadepalli, Stefan Lee
and Zhaopeng Tu
- Abstract summary: We study how Transformer-based decoders leverage information from the source and target languages.
Based on these insights, we demonstrate that the residual feed-forward module in each Transformer decoder layer can be dropped with minimal loss of performance.
- Score: 74.83087937309266
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There have been significant efforts to interpret the encoder of
Transformer-based encoder-decoder architectures for neural machine translation
(NMT); meanwhile, the decoder remains largely unexamined despite its critical
role. During translation, the decoder must predict output tokens by considering
both the source-language text from the encoder and the target-language prefix
produced in previous steps. In this work, we study how Transformer-based
decoders leverage information from the source and target languages --
developing a universal probe task to assess how information is propagated
through each module of each decoder layer. We perform extensive experiments on
three major translation datasets (WMT En-De, En-Fr, and En-Zh). Our analysis
provides insight into when and where decoders leverage different sources. Based
on these insights, we demonstrate that the residual feed-forward module in each
Transformer decoder layer can be dropped with minimal loss of performance -- a
significant reduction in computation and number of parameters, and consequently
a significant boost to both training and inference speed.
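To make the finding concrete, here is a minimal PyTorch-style sketch of a Transformer decoder layer in which the residual feed-forward sub-layer can be switched off; the module structure, names, and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """Transformer decoder layer with an optional feed-forward sub-layer.

    Setting use_ffn=False drops the residual feed-forward module, mirroring the
    ablation described in the abstract (illustrative sketch, not the paper's code).
    """

    def __init__(self, d_model=512, n_heads=8, d_ff=2048, use_ffn=True):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.use_ffn = use_ffn
        if use_ffn:
            self.ffn = nn.Sequential(
                nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
            )
            self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, memory, tgt_mask=None):
        # Masked self-attention over the target-language prefix.
        attn_out, _ = self.self_attn(x, x, x, attn_mask=tgt_mask)
        x = self.norm1(x + attn_out)
        # Cross-attention over the encoder (source-language) representations.
        cross_out, _ = self.cross_attn(x, memory, memory)
        x = self.norm2(x + cross_out)
        # Residual feed-forward sub-layer; skipped entirely when use_ffn=False.
        if self.use_ffn:
            x = self.norm3(x + self.ffn(x))
        return x
```

In this sketch, dropping the feed-forward block removes the two d_model x d_ff projection matrices from every decoder layer, which is where the parameter and speed savings reported in the abstract would come from.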
Related papers
- DecoderLens: Layerwise Interpretation of Encoder-Decoder Transformers [6.405360669408265]
We propose a simple, new method to analyze encoder-decoder Transformers: DecoderLens.
Inspired by the LogitLens (for decoder-only Transformers), this method allows the decoder to cross-attend to the representations of intermediate encoder layers.
We report results from DecoderLens applied to models trained on question answering, logical reasoning, speech recognition, and machine translation.
arXiv Detail & Related papers (2023-10-05T17:04:59Z)
- Investigating Pre-trained Audio Encoders in the Low-Resource Condition [66.92823764664206]
We conduct a comprehensive set of experiments using a representative set of 3 state-of-the-art encoders (Wav2vec2, WavLM, Whisper) in the low-resource setting.
We provide various quantitative and qualitative analyses on task performance, convergence speed, and representational properties of the encoders.
arXiv Detail & Related papers (2023-05-28T14:15:19Z)
- Decoder-Only or Encoder-Decoder? Interpreting Language Model as a Regularized Encoder-Decoder [75.03283861464365]
The seq2seq task aims to generate a target sequence from a given source sequence.
Traditionally, most seq2seq tasks are addressed with an encoder that encodes the source sequence and a decoder that generates the target text.
Recently, a number of new approaches have emerged that apply decoder-only language models directly to the seq2seq task.
arXiv Detail & Related papers (2023-04-08T15:44:29Z)
- Transformer Meets DCFAM: A Novel Semantic Segmentation Scheme for Fine-Resolution Remote Sensing Images [6.171417925832851]
We introduce the Swin Transformer as the backbone to fully extract contextual information.
We also design a novel decoder, the densely connected feature aggregation module (DCFAM), to restore the resolution and generate the segmentation map.
arXiv Detail & Related papers (2021-04-25T11:34:22Z)
- Rethinking and Improving Natural Language Generation with Layer-Wise Multi-View Decoding [59.48857453699463]
In sequence-to-sequence learning, the decoder relies on the attention mechanism to efficiently extract information from the encoder.
Recent work has proposed to use representations from different encoder layers for diversified levels of information.
We propose layer-wise multi-view decoding: for each decoder layer, the representations from the last encoder layer serve as a global view, and those from the other encoder layers are supplemented to provide a stereoscopic view of the source sequence.
arXiv Detail & Related papers (2020-05-16T20:00:39Z)
- On Sparsifying Encoder Outputs in Sequence-to-Sequence Models [90.58793284654692]
We take Transformer as the testbed and introduce a layer of gates in-between the encoder and the decoder.
The gates are regularized using the expected value of the sparsity-inducing L0 penalty.
We investigate the effects of this sparsification on two machine translation and two summarization tasks.
arXiv Detail & Related papers (2020-04-24T16:57:52Z)
- Balancing Cost and Benefit with Tied-Multi Transformers [24.70761584719857]
In sequence-to-sequence modeling, the output of the last layer of the N-layer encoder is fed to the M-layer decoder, and the output of the last decoder layer is used to compute the loss.
Our method instead computes a single loss consisting of NxM losses, where each loss is computed from the output of one of the M decoder layers connected to one of the N encoder layers; a toy sketch of this objective appears after this list.
Such a model subsumes NxM models with different numbers of encoder and decoder layers, and can be used for decoding with fewer than the maximum number of encoder and decoder layers.
arXiv Detail & Related papers (2020-02-20T08:20:52Z)
- Bi-Decoder Augmented Network for Neural Machine Translation [108.3931242633331]
We propose a novel Bi-Decoder Augmented Network (BiDAN) for the neural machine translation task.
Since each decoder transforms the representations of the input text into its corresponding language, jointly training with two target ends encourages the shared encoder to produce a language-independent semantic space.
arXiv Detail & Related papers (2020-01-14T02:05:14Z)
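For the Tied-Multi Transformers entry above, the combined NxM training objective can be sketched as follows; encoder_layers, decoder_layers, to_logits, criterion, and the tensor arguments are hypothetical stand-ins used only to illustrate the objective, not the original implementation.

```python
import torch

def tied_multi_loss(encoder_layers, decoder_layers, to_logits, criterion,
                    src, tgt_in, tgt_out):
    """Single training loss summed over all N x M encoder/decoder depth pairs.

    encoder_layers / decoder_layers are lists of layer modules, to_logits is an
    output projection, and criterion is a token-level cross-entropy; all names
    here are hypothetical stand-ins used only to illustrate the objective.
    """
    # Run the encoder once, keeping the output of every encoder layer.
    enc_states, x = [], src
    for enc in encoder_layers:
        x = enc(x)
        enc_states.append(x)

    losses = []
    for memory in enc_states:               # encoder depth n = 1..N
        y = tgt_in
        for dec in decoder_layers:          # decoder depth m = 1..M
            y = dec(y, memory)
            # Each (n, m) pair contributes one loss term on that decoder output.
            logits = to_logits(y)           # (batch, tgt_len, vocab)
            losses.append(criterion(logits.transpose(1, 2), tgt_out))
    return torch.stack(losses).sum()        # the single combined N x M loss
```

Because all NxM depth combinations share parameters in this sketch, the same trained model can then be decoded with fewer than the maximum number of encoder or decoder layers, as the entry notes.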