DEED: Dynamic Early Exit on Decoder for Accelerating Encoder-Decoder
Transformer Models
- URL: http://arxiv.org/abs/2311.08623v1
- Date: Wed, 15 Nov 2023 01:01:02 GMT
- Title: DEED: Dynamic Early Exit on Decoder for Accelerating Encoder-Decoder
Transformer Models
- Authors: Peng Tang, Pengkai Zhu, Tian Li, Srikar Appalaraju, Vijay Mahadevan,
R. Manmatha
- Abstract summary: We build a multi-exit encoder-decoder transformer model which is trained with deep supervision so that each of its decoder layers is capable of generating plausible predictions.
We show our approach can reduce overall inference latency by 30%-60% with comparable or even higher accuracy compared to baselines.
- Score: 22.276574156358084
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Encoder-decoder transformer models have achieved great success on various
vision-language (VL) tasks, but they suffer from high inference latency.
Typically, the decoder takes up most of the latency because of the
auto-regressive decoding. To accelerate the inference, we propose an approach
of performing Dynamic Early Exit on Decoder (DEED). We build a multi-exit
encoder-decoder transformer model which is trained with deep supervision so
that each of its decoder layers is capable of generating plausible predictions.
In addition, we leverage simple yet practical techniques, including shared
generation head and adaptation modules, to keep accuracy when exiting at
shallow decoder layers. Based on the multi-exit model, we perform step-level
dynamic early exit during inference, where, at each decoding step, the model
may decide to use fewer decoder layers based on the confidence of the current
layer's prediction. Because different numbers of decoder layers may be used at
different decoding steps, we compute the deeper-layer decoder features of
previous decoding steps just-in-time, which ensures that the features from
different decoding steps are semantically aligned. We evaluate our approach with two
state-of-the-art encoder-decoder transformer models on various VL tasks. We
show our approach can reduce overall inference latency by 30%-60% with
comparable or even higher accuracy compared to baselines.
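For intuition, here is a minimal PyTorch sketch (not the authors' released code) of the step-level exit rule: every decoder layer feeds a shared generation head, and the loop over layers stops as soon as the softmax confidence at the current layer clears a threshold. The class and parameter names (MultiExitDecoder, shared_head, threshold=0.9) are illustrative assumptions, and the sketch simply re-runs the decoded prefix at each step rather than backfilling deeper-layer key/value caches just-in-time as the paper describes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiExitDecoder(nn.Module):
    """Toy multi-exit decoder: every layer can feed a shared generation head."""

    def __init__(self, d_model=256, n_layers=6, vocab_size=1000, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.layers = nn.ModuleList(
            [nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
             for _ in range(n_layers)]
        )
        # One generation head shared by all exits (helps shallow exits stay accurate).
        self.shared_head = nn.Linear(d_model, vocab_size)

    @torch.no_grad()
    def generate(self, memory, bos_id=1, eos_id=2, max_len=32, threshold=0.9):
        """Greedy decoding with step-level dynamic early exit.

        memory: encoder output of shape (1, src_len, d_model).
        """
        tokens = [bos_id]
        for _ in range(max_len):
            # This sketch re-runs the whole prefix each step instead of keeping
            # per-layer key/value caches; the paper instead recomputes deeper-layer
            # features of previous steps just-in-time when a deeper exit is needed.
            h = self.embed(torch.tensor([tokens]))
            mask = nn.Transformer.generate_square_subsequent_mask(len(tokens))
            probs = None
            for layer in self.layers:
                h = layer(h, memory, tgt_mask=mask)
                probs = F.softmax(self.shared_head(h[:, -1]), dim=-1)
                # Early exit: stop stacking decoder layers once the current
                # layer's prediction is confident enough.
                if probs.max().item() >= threshold:
                    break
            next_id = int(probs.argmax())
            tokens.append(next_id)
            if next_id == eos_id:
                break
        return tokens


# Toy usage with a random stand-in for the encoder output.
model = MultiExitDecoder().eval()
memory = torch.randn(1, 10, 256)
print(model.generate(memory, threshold=0.9))
```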
Related papers
- Efficient Encoder-Decoder Transformer Decoding for Decomposable Tasks [53.550782959908524]
We introduce a new configuration for encoder-decoder models that improves efficiency on structured output and decomposable tasks.
Our method, prompt-in-decoder (PiD), encodes the input once and decodes the output in parallel, boosting both training and inference efficiency.
arXiv Detail & Related papers (2024-03-19T19:27:23Z)
- Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition [20.052245837954175]
We propose an efficient and accurate streaming speech recognition model based on the FastConformer architecture.
We introduce an activation caching mechanism to enable the non-autoregressive encoder to operate autoregressively during inference.
A hybrid CTC/RNNT architecture uses a shared encoder with both a CTC decoder and an RNNT decoder to boost accuracy and save computation.
arXiv Detail & Related papers (2023-12-27T21:04:26Z)
- Faster Diffusion: Rethinking the Role of the Encoder for Diffusion Model Inference [95.42299246592756]
We study the UNet encoder and empirically analyze the encoder features.
We find that encoder features change minimally, whereas the decoder features exhibit substantial variations across different time-steps.
We validate our approach on other tasks: text-to-video, personalized generation and reference-guided generation.
arXiv Detail & Related papers (2023-12-15T08:46:43Z)
- NASH: A Simple Unified Framework of Structured Pruning for Accelerating Encoder-Decoder Language Models [29.468888611690346]
We propose a simple and effective framework, NASH, that narrows the encoder and shortens the decoder networks of encoder-decoder models.
Our findings highlight two insights: (1) the number of decoder layers is the dominant factor of inference speed, and (2) low sparsity in the pruned encoder network enhances generation quality.
arXiv Detail & Related papers (2023-10-16T04:27:36Z)
- Think Twice before Driving: Towards Scalable Decoders for End-to-End Autonomous Driving [74.28510044056706]
Existing methods usually adopt the decoupled encoder-decoder paradigm.
In this work, we aim to alleviate the problem by two principles.
We first predict a coarse-grained future position and action based on the encoder features.
Then, conditioned on the position and action, the future scene is imagined to check the ramifications of driving accordingly.
arXiv Detail & Related papers (2023-05-10T15:22:02Z)
- String-based Molecule Generation via Multi-decoder VAE [56.465033997245776]
We investigate the problem of string-based molecular generation via variational autoencoders (VAEs).
We propose a simple, yet effective idea to improve the performance of VAE for the task.
In our experiments, the proposed VAE model performs particularly well at generating samples from out-of-domain distributions.
arXiv Detail & Related papers (2022-08-23T03:56:30Z)
- Rethinking and Improving Natural Language Generation with Layer-Wise Multi-View Decoding [59.48857453699463]
In sequence-to-sequence learning, the decoder relies on the attention mechanism to efficiently extract information from the encoder.
Recent work has proposed to use representations from different encoder layers for diversified levels of information.
We propose layer-wise multi-view decoding: for each decoder layer, the representations from the last encoder layer serve as a global view, and those from the other encoder layers are supplemented to provide a stereoscopic view of the source sequences.
arXiv Detail & Related papers (2020-05-16T20:00:39Z)
- On Sparsifying Encoder Outputs in Sequence-to-Sequence Models [90.58793284654692]
We take Transformer as the testbed and introduce a layer of gates in-between the encoder and the decoder.
The gates are regularized using the expected value of the sparsity-inducing L0 penalty.
We investigate the effects of this sparsification on two machine translation and two summarization tasks.
arXiv Detail & Related papers (2020-04-24T16:57:52Z)
- Balancing Cost and Benefit with Tied-Multi Transformers [24.70761584719857]
In sequence-to-sequence modeling, the output of the last layer of the N-layer encoder is fed to the M-layer decoder, and the output of the last decoder layer is used to compute the loss.
Our method computes a single loss consisting of NxM losses, where each loss is computed from the output of one of the M decoder layers connected to one of the N encoder layers (a minimal sketch of this loss follows this entry).
Such a model subsumes NxM models with different numbers of encoder and decoder layers, and can be used for decoding with fewer than the maximum number of encoder and decoder layers.
arXiv Detail & Related papers (2020-02-20T08:20:52Z)
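For the NxM loss in the entry above, a hedged sketch assuming the model exposes per-layer encoder outputs and a decoder that can be run from any encoder depth; the helper names encoder_states, decoder_states_fn, and head are hypothetical stand-ins, not the paper's API.

```python
import torch
import torch.nn.functional as F


def tied_multi_loss(encoder_states, decoder_states_fn, head, targets):
    """Sum of NxM cross-entropy terms, one per (encoder depth, decoder depth) pair.

    encoder_states: list of N tensors (B, S, D), one per encoder layer.
    decoder_states_fn(enc): list of M tensors (B, T, D), one per decoder layer.
    head: shared projection to vocabulary logits, (B, T, D) -> (B, T, V).
    targets: (B, T) gold token ids.
    """
    total = torch.tensor(0.0)
    for enc in encoder_states:                 # N choices of encoder depth
        for dec in decoder_states_fn(enc):     # M choices of decoder depth
            logits = head(dec)
            total = total + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
            )
    return total


# Toy usage with random stand-ins for the model internals.
N, M, B, S, T, D, V = 2, 3, 1, 5, 4, 8, 11
enc_states = [torch.randn(B, S, D) for _ in range(N)]
dec_fn = lambda enc: [torch.randn(B, T, D) + enc.mean() for _ in range(M)]
head = torch.nn.Linear(D, V)
targets = torch.randint(0, V, (B, T))
print(tied_multi_loss(enc_states, dec_fn, head, targets))
```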