Probing Word Translations in the Transformer and Trading Decoder for
Encoder Layers
- URL: http://arxiv.org/abs/2003.09586v2
- Date: Tue, 20 Apr 2021 00:31:13 GMT
- Title: Probing Word Translations in the Transformer and Trading Decoder for
Encoder Layers
- Authors: Hongfei Xu and Josef van Genabith and Qiuhui Liu and Deyi Xiong
- Abstract summary: The way word translation evolves in Transformer layers has not yet been investigated.
We show that translation already happens progressively in encoder layers and even in the input embeddings.
Our experiments show that we can increase decoding speed by up to a factor of 2.3 with small gains in translation quality, while an 18-4 deep encoder configuration boosts translation quality by +1.42 BLEU (En-De) at a speed-up of 1.4x.
- Score: 69.40942736249397
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to its effectiveness and performance, the Transformer translation model
has attracted wide attention, most recently in terms of probing-based
approaches. Previous work focuses on using or probing source linguistic
features in the encoder. To date, the way word translation evolves in
Transformer layers has not yet been investigated. Naively, one might assume
that encoder layers capture source information while decoder layers translate.
In this work, we show that this is not quite the case: translation already
happens progressively in encoder layers and even in the input embeddings. More
surprisingly, we find that some of the lower decoder layers do not actually do
that much decoding. We show all of this in terms of a probing approach where we
project representations of the layer analyzed to the final trained and frozen
classifier level of the Transformer decoder to measure word translation
accuracy. Our findings motivate and explain a Transformer configuration change:
if translation already happens in the encoder layers, perhaps we can increase
the number of encoder layers, while decreasing the number of decoder layers,
boosting decoding speed, without loss in translation quality? Our experiments
show that this is indeed the case: we can increase decoding speed by up to a factor of 2.3 with small gains in translation quality, while an 18-4 deep encoder configuration boosts translation quality by +1.42 BLEU (En-De) at a speed-up of 1.4x.
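The probing setup described in the abstract lends itself to a compact illustration. Below is a minimal sketch (not the authors' code) of the core measurement, assuming a trained PyTorch Transformer whose frozen output projection (`output_projection`) maps hidden states to target-vocabulary logits; the function and tensor names are illustrative.

```python
import torch

@torch.no_grad()
def word_translation_accuracy(hidden, output_projection, targets, pad_id=0):
    """Project hidden states of the analyzed layer through the trained,
    frozen classifier and count how often the argmax token already matches
    the reference translation.

    hidden:            [batch, tgt_len, d_model] states from the probed layer
    output_projection: frozen nn.Linear mapping d_model -> target vocab size
    targets:           [batch, tgt_len] reference target token ids
    """
    logits = output_projection(hidden)      # [batch, tgt_len, vocab]
    pred = logits.argmax(dim=-1)            # hard word choice per position
    mask = targets.ne(pad_id)               # ignore padding positions
    correct = (pred.eq(targets) & mask).sum()
    return (correct.float() / mask.sum().clamp(min=1)).item()
```

Repeating this measurement for the input embeddings and for every layer yields layer-wise word-translation curves; applying it to encoder layers additionally requires relating source positions to target words, which is beyond this sketch.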
Related papers
- DecoderLens: Layerwise Interpretation of Encoder-Decoder Transformers [6.405360669408265]
We propose a simple, new method to analyze encoder-decoder Transformers: DecoderLens.
Inspired by the LogitLens (for decoder-only Transformers), this method lets the decoder cross-attend to representations of intermediate encoder layers instead of only the final one (sketched below).
We report results from the DecoderLens applied to models trained on question answering, logical reasoning, speech recognition and machine translation.
arXiv Detail & Related papers (2023-10-05T17:04:59Z)
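For concreteness, here is a minimal sketch of the DecoderLens idea summarized above, built from stock torch.nn modules; the module sizes and names are illustrative, and a real use would load trained weights rather than random ones. The decoder cross-attends to the output of an intermediate encoder layer instead of the final one.

```python
import torch
import torch.nn as nn

d_model, nhead, num_layers = 512, 8, 6
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers)

def decoder_lens(src_emb, tgt_emb, probe_layer):
    """Stop the encoder after `probe_layer` layers and let the decoder
    cross-attend to that intermediate representation as its memory."""
    memory = src_emb
    for i, layer in enumerate(encoder.layers):
        memory = layer(memory)
        if i + 1 == probe_layer:
            break
    return decoder(tgt_emb, memory)

out = decoder_lens(torch.randn(2, 7, d_model),   # toy source-side states
                   torch.randn(2, 5, d_model),   # toy target-side states
                   probe_layer=3)
```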
- GTrans: Grouping and Fusing Transformer Layers for Neural Machine Translation [107.2752114891855]
The Transformer architecture, built by stacking encoder and decoder layers, has driven significant progress in neural machine translation.
We propose the Group-Transformer model (GTrans), which flexibly divides the multi-layer representations of both encoder and decoder into different groups and then fuses these group features to generate target words (a rough sketch follows this entry).
arXiv Detail & Related papers (2022-07-29T04:10:36Z)
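A rough sketch of the grouping-and-fusing idea in the GTrans summary above; the pooling and weighting scheme here is a simplification chosen for illustration, not the paper's exact formulation. Layer outputs are split into groups, each group is pooled, and the group features are fused with learned weights.

```python
import torch
import torch.nn as nn

class GroupFusion(nn.Module):
    def __init__(self, num_layers=6, num_groups=3):
        super().__init__()
        assert num_layers % num_groups == 0
        self.num_groups = num_groups
        self.group_size = num_layers // num_groups
        self.weights = nn.Parameter(torch.zeros(num_groups))   # one weight per group

    def forward(self, layer_outputs):
        """layer_outputs: list of [batch, len, d_model] tensors, one per layer."""
        stacked = torch.stack(layer_outputs, dim=0)             # [L, B, T, D]
        groups = stacked.view(self.num_groups, self.group_size, *stacked.shape[1:])
        pooled = groups.mean(dim=1)                             # [G, B, T, D]
        w = torch.softmax(self.weights, dim=0).view(-1, 1, 1, 1)
        return (w * pooled).sum(dim=0)                          # fused [B, T, D]

fuse = GroupFusion(num_layers=6, num_groups=3)
layer_outputs = [torch.randn(2, 7, 512) for _ in range(6)]
fused = fuse(layer_outputs)                                      # [2, 7, 512]
```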
- Multilingual Neural Machine Translation with Deep Encoder and Multiple Shallow Decoders [77.2101943305862]
We propose a deep encoder with multiple shallow decoders (DEMSD) where each shallow decoder is responsible for a disjoint subset of target languages.
The DEMSD model with 2-layer decoders obtains a 1.8x speedup on average over a standard Transformer with no drop in translation quality (sketched below).
arXiv Detail & Related papers (2022-06-05T01:15:04Z)
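The deep-encoder / multiple-shallow-decoders layout from the entry above can be sketched as follows; the language grouping, layer counts, and module names are made up for illustration. A single deep encoder is shared, and the target language selects one of several shallow decoders.

```python
import torch
import torch.nn as nn

d_model, nhead = 512, 8
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers=12)

def shallow_decoder(num_layers=2):
    return nn.TransformerDecoder(
        nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers)

# Each shallow decoder serves a disjoint subset of target languages.
decoders = nn.ModuleDict({
    "romance":  shallow_decoder(),   # e.g. fr, es, it
    "germanic": shallow_decoder(),   # e.g. de, nl, sv
})
lang2group = {"fr": "romance", "es": "romance", "de": "germanic"}

def translate_states(src_emb, tgt_emb, tgt_lang):
    memory = encoder(src_emb)                          # shared deep encoder
    return decoders[lang2group[tgt_lang]](tgt_emb, memory)

out = translate_states(torch.randn(2, 7, d_model),
                       torch.randn(2, 5, d_model), tgt_lang="de")
```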
- DeltaLM: Encoder-Decoder Pre-training for Language Generation and Translation by Augmenting Pretrained Multilingual Encoders [92.90543340071007]
We introduce DeltaLM, a pretrained multilingual encoder-decoder model.
Specifically, we augment the pretrained multilingual encoder with a decoder and pre-train it in a self-supervised way.
Experiments show that DeltaLM outperforms various strong baselines on both natural language generation and translation tasks.
arXiv Detail & Related papers (2021-06-25T16:12:10Z)
- On the Sub-Layer Functionalities of Transformer Decoder [74.83087937309266]
We study how Transformer-based decoders leverage information from the source and target languages.
Based on these insights, we demonstrate that the residual feed-forward module in each Transformer decoder layer can be dropped with minimal loss of performance (sketched below).
arXiv Detail & Related papers (2020-10-06T11:50:54Z)
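As a concrete illustration of the finding above, here is a minimal sketch of a decoder layer with the position-wise feed-forward sub-layer removed; the class is illustrative, not the paper's implementation, and keeps only residual self-attention and cross-attention.

```python
import torch
import torch.nn as nn

class FFNFreeDecoderLayer(nn.Module):
    """Self-attention + cross-attention only; the position-wise FFN is removed."""
    def __init__(self, d_model=512, nhead=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, tgt, memory, tgt_mask=None):
        x, _ = self.self_attn(tgt, tgt, tgt, attn_mask=tgt_mask)
        tgt = self.norm1(tgt + x)                     # residual self-attention
        x, _ = self.cross_attn(tgt, memory, memory)
        return self.norm2(tgt + x)                    # residual cross-attention, no FFN

layer = FFNFreeDecoderLayer()
out = layer(torch.randn(2, 5, 512), torch.randn(2, 7, 512))
```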
- On Sparsifying Encoder Outputs in Sequence-to-Sequence Models [90.58793284654692]
We take the Transformer as the testbed and introduce a layer of gates between the encoder and the decoder.
The gates are regularized using the expected value of the sparsity-inducing L0 penalty.
We investigate the effects of this sparsification on two machine translation and two summarization tasks (a simplified sketch of the gating follows this entry).
arXiv Detail & Related papers (2020-04-24T16:57:52Z)
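A simplified sketch of gating encoder outputs as in the entry above. The summary refers to regularizing with the expected value of an L0 penalty; this illustration instead uses plain sigmoid gates and the sum of gate values as a crude, differentiable stand-in for that term, so the class and names below are assumptions rather than the paper's method.

```python
import torch
import torch.nn as nn

class EncoderOutputGate(nn.Module):
    def __init__(self, d_model=512):
        super().__init__()
        self.scorer = nn.Linear(d_model, 1)   # per-position gate logit

    def forward(self, enc_out):
        """enc_out: [batch, src_len, d_model] encoder outputs."""
        gates = torch.sigmoid(self.scorer(enc_out))     # [B, S, 1], in (0, 1)
        sparsity_penalty = gates.sum()                   # rough stand-in for E[L0]
        return enc_out * gates, sparsity_penalty

gate = EncoderOutputGate()
gated, penalty = gate(torch.randn(2, 7, 512))
# total_loss = translation_loss + lambda_ * penalty   # lambda_ trades quality for sparsity
```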