Balancing Cost and Benefit with Tied-Multi Transformers
- URL: http://arxiv.org/abs/2002.08614v1
- Date: Thu, 20 Feb 2020 08:20:52 GMT
- Title: Balancing Cost and Benefit with Tied-Multi Transformers
- Authors: Raj Dabre, Raphael Rubino, Atsushi Fujita
- Abstract summary: In sequence-to-sequence modeling, the output of the last layer of the N-layer encoder is fed to the M-layer decoder, and the output of the last decoder layer is used to compute loss.
Our method computes a single loss consisting of NxM losses, where each loss is computed from the output of one of the M decoder layers connected to one of the N encoder layers.
Such a model subsumes NxM models with different numbers of encoder and decoder layers, and can be used for decoding with fewer than the maximum number of encoder and decoder layers.
- Score: 24.70761584719857
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We propose and evaluate a novel procedure for training multiple Transformers
with tied parameters, which compresses multiple models into one and enables a
dynamic choice of the number of encoder and decoder layers during decoding. In
sequence-to-sequence modeling, typically, the output of the last layer of the
N-layer encoder is fed to the M-layer decoder, and the output of the last
decoder layer is used to compute loss. Instead, our method computes a single
loss consisting of NxM losses, where each loss is computed from the output of
one of the M decoder layers connected to one of the N encoder layers. Such a
model subsumes NxM models with different numbers of encoder and decoder layers,
and can be used for decoding with fewer than the maximum number of encoder and
decoder layers. We then propose a mechanism to choose a priori the number of
encoder and decoder layers for faster decoding, and also explore recurrent
stacking of layers and knowledge distillation for model compression. We present
a cost-benefit analysis of applying the proposed approaches for neural machine
translation and show that they reduce decoding costs while preserving
translation quality.
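To make the tied-parameter objective concrete, here is a minimal sketch of the NxM loss in PyTorch-style pseudocode. The layer lists, the shared output projection, and the tensor shapes are assumptions for illustration, not the authors' implementation, and masks and embeddings are omitted.

```python
# Minimal sketch of the NxM tied loss (assumed PyTorch-style modules;
# masks and embeddings omitted; not the authors' code).
import torch.nn.functional as F

def tied_multi_loss(encoder_layers, decoder_layers, output_proj,
                    src_state, tgt_state_in, tgt_labels):
    """encoder_layers / decoder_layers: lists of N and M Transformer layers.
    output_proj: shared projection onto the target vocabulary."""
    total_loss = 0.0
    enc_state = src_state
    for enc_layer in encoder_layers:              # N encoder depths
        enc_state = enc_layer(enc_state)
        dec_state = tgt_state_in
        for dec_layer in decoder_layers:          # M decoder depths
            dec_state = dec_layer(dec_state, enc_state)
            logits = output_proj(dec_state)       # parameters shared across all exits
            total_loss = total_loss + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), tgt_labels.reshape(-1))
    return total_loss  # sum of NxM losses; one backward pass trains all sub-models
```

A single backward pass through this summed loss updates one set of tied parameters, which is what later allows decoding with any (n, m) combination of encoder and decoder depths.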
Related papers
- Efficient Encoder-Decoder Transformer Decoding for Decomposable Tasks [53.550782959908524]
We introduce a new configuration for encoder-decoder models that improves efficiency on structured output and decomposable tasks.
Our method, prompt-in-decoder (PiD), encodes the input once and decodes the output in parallel, boosting both training and inference efficiency.
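A rough sketch of the encode-once, decode-in-parallel idea under assumed interfaces (the encoder, decoder, and `greedy_decode` helper are hypothetical placeholders, not the PiD code):

```python
# Encode the shared input once, then decode K sub-task prompts as one batch
# against the shared memory (hypothetical interfaces).
import torch

@torch.no_grad()
def decode_decomposable(encoder, decoder, shared_input_ids, prompt_ids):
    # shared_input_ids: [1, S]  -- the document, encoded a single time
    # prompt_ids:       [K, P]  -- one prompt per sub-task, decoded in parallel
    memory = encoder(shared_input_ids)                   # [1, S, d]
    memory = memory.expand(prompt_ids.size(0), -1, -1)   # reuse for all K prompts
    return decoder.greedy_decode(prompt_ids, memory)     # assumed decoding helper
```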
arXiv Detail & Related papers (2024-03-19T19:27:23Z) - DEED: Dynamic Early Exit on Decoder for Accelerating Encoder-Decoder
Transformer Models [22.276574156358084]
We build a multi-exit encoder-decoder transformer model which is trained with deep supervision so that each of its decoder layers is capable of generating plausible predictions.
We show our approach can reduce overall inference latency by 30%-60% with comparable or even higher accuracy compared to baselines.
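As an illustration, a simplified sketch of a confidence-based early exit over multi-exit decoder layers (the threshold, shapes, and single-sequence assumption are illustrative, not the DEED implementation):

```python
# Run decoder layers one by one and stop as soon as the current exit's
# next-token prediction is confident enough (assumes batch size 1).
import torch.nn.functional as F

def early_exit_step(decoder_layers, output_proj, tgt_state, memory, threshold=0.9):
    for layer in decoder_layers:
        tgt_state = layer(tgt_state, memory)
        probs = F.softmax(output_proj(tgt_state[:, -1]), dim=-1)  # next-token dist.
        confidence, token = probs.max(dim=-1)
        if confidence.item() >= threshold:        # exit before the remaining layers
            return token, tgt_state
    return token, tgt_state                       # fell through: used all layers
```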
arXiv Detail & Related papers (2023-11-15T01:01:02Z) - Reducing Redundancy in the Bottleneck Representation of the Autoencoders [98.78384185493624]
Autoencoders are a type of unsupervised neural network, which can be used to solve various tasks.
We propose a scheme to explicitly penalize feature redundancies in the bottleneck representation.
We tested our approach across different tasks: dimensionality reduction using three different datasets, image compression using the MNIST dataset, and image denoising using Fashion-MNIST.
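One simple way to realize such a penalty is to discourage correlation between bottleneck features; the sketch below illustrates the idea and is not necessarily the paper's exact formulation:

```python
# Penalize off-diagonal correlation between bottleneck feature dimensions.
import torch

def redundancy_penalty(z):
    # z: [batch, d] bottleneck activations
    z = (z - z.mean(dim=0)) / (z.std(dim=0) + 1e-8)   # standardize each feature
    corr = (z.T @ z) / z.size(0)                       # [d, d] correlation matrix
    off_diag = corr - torch.diag(torch.diag(corr))
    return (off_diag ** 2).sum()                       # penalize correlated pairs

# total_loss = reconstruction_loss + lam * redundancy_penalty(bottleneck)
```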
arXiv Detail & Related papers (2022-02-09T18:48:02Z) - Dynamic Neural Representational Decoders for High-Resolution Semantic
Segmentation [98.05643473345474]
We propose a novel decoder, termed the dynamic neural representational decoder (NRD).
As each location on the encoder's output corresponds to a local patch of the semantic labels, in this work, we represent these local patches of labels with compact neural networks.
This neural representation enables our decoder to leverage the smoothness prior in the semantic label space, and thus makes our decoder more efficient.
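A very rough sketch of the general idea, with a hypernetwork emitting the weights of a tiny per-location MLP that maps patch coordinates to label logits (all sizes and the hypernetwork parameterization are hypothetical, not the NRD architecture):

```python
# For each encoder output location, generate a compact network that decodes
# the corresponding patch of semantic labels (illustrative only).
import torch
from torch import nn

class DynamicPatchDecoder(nn.Module):
    def __init__(self, enc_dim, hidden=16, patch=8, classes=21):
        super().__init__()
        self.hidden, self.patch, self.classes = hidden, patch, classes
        # Hypernetwork: predicts weights of a tiny MLP (2 -> hidden -> classes).
        n_params = 3 * hidden + hidden * classes + classes
        self.hyper = nn.Linear(enc_dim, n_params)
        ys, xs = torch.meshgrid(torch.linspace(0, 1, patch),
                                torch.linspace(0, 1, patch), indexing="ij")
        self.register_buffer("coords", torch.stack([xs, ys], dim=-1).view(-1, 2))

    def forward(self, feat):                  # feat: [enc_dim] for one location
        h, c = self.hidden, self.classes
        params = self.hyper(feat)
        w1, params = params[:2 * h].view(h, 2), params[2 * h:]
        b1, params = params[:h], params[h:]
        w2, b2 = params[:h * c].view(c, h), params[h * c:]
        # Tiny per-location MLP: patch coordinates -> per-pixel class logits.
        hid = torch.relu(self.coords @ w1.t() + b1)            # [patch*patch, hidden]
        return (hid @ w2.t() + b2).view(self.patch, self.patch, c)
```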
arXiv Detail & Related papers (2021-07-30T04:50:56Z) - HARP-Net: Hyper-Autoencoded Reconstruction Propagation for Scalable
Neural Audio Coding [25.51661602383911]
An autoencoder-based codec employs quantization to turn its bottleneck layer activation into bitstrings, which restricts the information that flows from the encoder to the decoder.
To circumvent this issue, we employ additional skip connections between the corresponding pairs of encoder-decoder layers.
We empirically verify that the proposed hyper-autoencoded architecture improves audio quality compared to an ordinary autoencoder baseline.
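A simplified sketch of mirrored encoder-decoder skip connections around a quantized bottleneck (the layer sizes and the rounding stand-in for quantization are assumptions, not the HARP-Net design):

```python
# Autoencoder whose decoder layers receive the mirrored encoder activations,
# so information can bypass the quantized bottleneck.
import torch
from torch import nn

class SkipAutoencoder(nn.Module):
    def __init__(self, dims=(512, 256, 128, 64)):
        super().__init__()
        self.enc = nn.ModuleList([nn.Linear(a, b) for a, b in zip(dims, dims[1:])])
        rdims = dims[::-1]
        self.dec = nn.ModuleList([nn.Linear(a, b) for a, b in zip(rdims, rdims[1:])])

    def forward(self, x):
        skips = []
        for layer in self.enc:
            x = torch.relu(layer(x))
            skips.append(x)
        x = torch.round(x)   # placeholder for bottleneck quantization (real codecs use soft/STE quantizers)
        for layer, skip in zip(self.dec, reversed(skips)):
            x = torch.relu(layer(x + skip))   # skip connection bypasses the bottleneck
        return x
```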
arXiv Detail & Related papers (2021-07-22T17:57:53Z) - On the Sub-Layer Functionalities of Transformer Decoder [74.83087937309266]
We study how Transformer-based decoders leverage information from the source and target languages.
Based on these insights, we demonstrate that the residual feed-forward module in each Transformer decoder layer can be dropped with minimal loss of performance.
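For illustration, a simplified decoder layer whose residual feed-forward sub-layer can be switched off, which is one way to probe this finding (not the paper's exact setup):

```python
# Transformer decoder layer with an optional residual feed-forward sub-layer.
from torch import nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model=512, nhead=8, use_ffn=True):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model)) if use_ffn else None
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(3)])

    def forward(self, tgt, memory, tgt_mask=None):
        x = self.norms[0](tgt + self.self_attn(tgt, tgt, tgt, attn_mask=tgt_mask)[0])
        x = self.norms[1](x + self.cross_attn(x, memory, memory)[0])
        if self.ffn is not None:               # residual FFN can be dropped
            x = self.norms[2](x + self.ffn(x))
        return x
```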
arXiv Detail & Related papers (2020-10-06T11:50:54Z) - Rethinking and Improving Natural Language Generation with Layer-Wise
Multi-View Decoding [59.48857453699463]
In sequence-to-sequence learning, the decoder relies on the attention mechanism to efficiently extract information from the encoder.
Recent work has proposed to use representations from different encoder layers for diversified levels of information.
We propose layer-wise multi-view decoding: for each decoder layer, the representations from the last encoder layer serve as a global view, and those from the other encoder layers are supplemented to provide a stereoscopic view of the source sequences.
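A sketch of one possible realization, where each decoder layer attends to the last encoder layer plus one additional encoder layer; the layer pairing and the concatenation-based combination are simplifying assumptions, not necessarily the paper's formulation:

```python
# Each decoder layer cross-attends to the global view (last encoder layer)
# combined with one auxiliary encoder layer's output.
import torch

def multi_view_decode(decoder_layers, enc_states, tgt_state):
    """enc_states: outputs of every encoder layer (the last one is the global view)."""
    global_view = enc_states[-1]
    for m, layer in enumerate(decoder_layers):
        aux_view = enc_states[m % len(enc_states)]          # assumed layer pairing
        memory = torch.cat([global_view, aux_view], dim=1)  # concat along source length
        tgt_state = layer(tgt_state, memory)
    return tgt_state
```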
arXiv Detail & Related papers (2020-05-16T20:00:39Z) - On Sparsifying Encoder Outputs in Sequence-to-Sequence Models [90.58793284654692]
We take Transformer as the testbed and introduce a layer of gates in-between the encoder and the decoder.
The gates are regularized using the expected value of the sparsity-inducing L0 penalty.
We investigate the effects of this sparsification on two machine translation and two summarization tasks.
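For reference, a sketch of such a gate using the standard hard-concrete relaxation of the L0 penalty (the model dimension and stretch parameters are assumptions; the paper's exact parameterization may differ):

```python
# Per-position gates on encoder outputs with a differentiable expected-L0 penalty.
import math
import torch
from torch import nn

class L0Gate(nn.Module):
    def __init__(self, d_model=512, beta=0.5, gamma=-0.1, zeta=1.1):
        super().__init__()
        self.score = nn.Linear(d_model, 1)    # one gate logit per source position
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def forward(self, enc_out):               # enc_out: [batch, src_len, d_model]
        logits = self.score(enc_out).squeeze(-1)                 # [batch, src_len]
        u = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)        # reparameterized noise
        s = torch.sigmoid((u.log() - (1 - u).log() + logits) / self.beta)
        z = (s * (self.zeta - self.gamma) + self.gamma).clamp(0.0, 1.0)  # gates in [0, 1]
        # Expected number of open gates -- the differentiable L0 penalty term.
        expected_l0 = torch.sigmoid(
            logits - self.beta * math.log(-self.gamma / self.zeta)).sum()
        return enc_out * z.unsqueeze(-1), expected_l0
```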
arXiv Detail & Related papers (2020-04-24T16:57:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.