Efficient Encoder-Decoder Transformer Decoding for Decomposable Tasks
- URL: http://arxiv.org/abs/2403.13112v2
- Date: Thu, 23 May 2024 17:17:26 GMT
- Title: Efficient Encoder-Decoder Transformer Decoding for Decomposable Tasks
- Authors: Bo-Ru Lu, Nikita Haduong, Chien-Yu Lin, Hao Cheng, Noah A. Smith, Mari Ostendorf
- Abstract summary: We introduce a new configuration for encoder-decoder models that improves efficiency on structured output and decomposable tasks.
Our method, prompt-in-decoder (PiD), encodes the input once and decodes the output in parallel, boosting both training and inference efficiency.
- Score: 53.550782959908524
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Transformer-based NLP models are powerful but have high computational costs that limit deployment. Finetuned encoder-decoder models are popular in specialized domains and can outperform larger, more generalized decoder-only models such as GPT-4. We introduce a new configuration for encoder-decoder models that improves efficiency on structured output and decomposable tasks where multiple outputs are required for a single shared input. Our method, prompt-in-decoder (PiD), encodes the input once and decodes the output in parallel, boosting both training and inference efficiency by avoiding duplicate input encoding and by raising the operational intensity (the ratio of arithmetic operations to memory accesses) of the decoding process through a shared input key-value cache. We achieve a computation reduction that scales roughly with the number of subtasks, gaining up to a 4.6x speed-up over state-of-the-art models for dialogue state tracking, summarization, and question-answering tasks, with comparable or better performance.
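A minimal PyTorch sketch of the encode-once, decode-in-parallel idea described in the abstract (the module sizes, tensor shapes, and use of generic torch.nn building blocks are illustrative assumptions, not the paper's implementation). The shared input is encoded a single time and K subtask prompts are decoded against the same encoder states; the actual PiD method additionally shares the cross-attention key-value cache during incremental decoding, which this toy example does not reproduce.

```python
import torch
import torch.nn as nn

d_model, n_heads, n_layers = 256, 4, 2
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), n_layers)

# One shared input, e.g. an embedded dialogue history: (1, S, d_model).
shared_input = torch.randn(1, 128, d_model)

# K subtask prompts, e.g. one per dialogue-state slot: (K, T, d_model).
K, T = 8, 16
subtask_prompts = torch.randn(K, T, d_model)

# Encode the shared input exactly once.
memory = encoder(shared_input)                       # (1, S, d_model)

# Decode all K subtask prompts in parallel against the same encoder states.
# expand() creates a view, so the encoder output is not physically copied.
memory_k = memory.expand(K, -1, -1)                  # (K, S, d_model)
causal_mask = nn.Transformer.generate_square_subsequent_mask(T)
outputs = decoder(subtask_prompts, memory_k, tgt_mask=causal_mask)
print(outputs.shape)                                 # torch.Size([8, 16, 256])
```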
Related papers
- Accelerating Error Correction Code Transformers [56.75773430667148]
We introduce a novel acceleration method for transformer-based decoders.
We achieve a 90% compression ratio and reduce arithmetic operation energy consumption by at least 224 times on modern hardware.
arXiv Detail & Related papers (2024-10-08T11:07:55Z)
- Efficient Transformer Encoders for Mask2Former-style models [57.54752243522298]
ECO-M2F is a strategy to self-select the number of hidden layers in the encoder conditioned on the input image.
The proposed approach reduces expected encoder computational cost while maintaining performance.
It is flexible in architecture configurations, and can be extended beyond the segmentation task to object detection.
arXiv Detail & Related papers (2024-04-23T17:26:34Z)
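A hypothetical sketch of the input-conditioned depth selection idea in the ECO-M2F entry above, assuming a simple gating head that picks how many encoder layers to run (the class name, gate design, and dimensions are illustrative, not from the paper).

```python
import torch
import torch.nn as nn

class DepthGatedEncoder(nn.Module):
    """Runs only as many encoder layers as a small gate predicts for each input."""
    def __init__(self, d_model=256, n_heads=4, max_layers=6):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
             for _ in range(max_layers)])
        self.gate = nn.Linear(d_model, max_layers)   # one score per candidate depth

    def forward(self, x):                            # x: (1, S, d_model) input features
        depth = int(self.gate(x.mean(dim=1)).argmax(dim=-1).item()) + 1
        for layer in self.layers[:depth]:
            x = layer(x)
        return x, depth

enc = DepthGatedEncoder()
feats = torch.randn(1, 64, 256)                      # stand-in for image features
out, used = enc(feats)
print(out.shape, "layers used:", used)
```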
- Extreme Encoder Output Frame Rate Reduction: Improving Computational Latencies of Large End-to-End Models [59.57732929473519]
We apply multiple frame reduction layers in the encoder to compress encoder outputs into a small number of output frames.
We demonstrate that we can generate one encoder output frame for every 2.56 sec of input speech, without significantly affecting word error rate on a large-scale voice search task.
arXiv Detail & Related papers (2024-02-27T03:40:44Z)
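A minimal sketch of one possible frame reduction layer for the entry above, assuming a simple concatenate-and-project scheme (the module name, reduction rate, and dimensions are illustrative; the paper's exact layer design may differ).

```python
import torch
import torch.nn as nn

class FrameReduction(nn.Module):
    """Concatenates every `rate` adjacent frames and projects back to d_model,
    reducing the encoder output frame rate by a factor of `rate`."""
    def __init__(self, d_model=256, rate=4):
        super().__init__()
        self.rate = rate
        self.proj = nn.Linear(d_model * rate, d_model)

    def forward(self, x):                         # x: (B, T, d_model)
        B, T, D = x.shape
        pad = (-T) % self.rate                    # pad so T divides evenly
        x = nn.functional.pad(x, (0, 0, 0, pad))
        x = x.reshape(B, (T + pad) // self.rate, D * self.rate)
        return self.proj(x)

reduce = FrameReduction(rate=4)
frames = torch.randn(2, 100, 256)                 # 100 encoder output frames
print(reduce(frames).shape)                       # torch.Size([2, 25, 256])
```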
- DEED: Dynamic Early Exit on Decoder for Accelerating Encoder-Decoder Transformer Models [22.276574156358084]
We build a multi-exit encoder-decoder transformer model which is trained with deep supervision so that each of its decoder layers is capable of generating plausible predictions.
We show our approach can reduce overall inference latency by 30%-60% with comparable or even higher accuracy compared to baselines.
arXiv Detail & Related papers (2023-11-15T01:01:02Z)
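A sketch of the multi-exit decoder idea from the DEED entry above: each decoder layer gets its own output head (deep supervision at training time, not shown), and inference exits at the first sufficiently confident layer. The confidence-threshold exit rule, class names, and sizes here are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MultiExitDecoder(nn.Module):
    """Each decoder layer has its own output head; decoding stops at the first
    layer whose prediction confidence clears a threshold (early exit)."""
    def __init__(self, d_model=256, n_heads=4, n_layers=4, vocab=1000, threshold=0.9):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
             for _ in range(n_layers)])
        self.heads = nn.ModuleList([nn.Linear(d_model, vocab) for _ in range(n_layers)])
        self.threshold = threshold

    def forward(self, tgt, memory):
        x = tgt
        for layer, head in zip(self.layers, self.heads):
            x = layer(x, memory)
            probs = head(x[:, -1]).softmax(-1)        # prediction for the last position
            conf, token = probs.max(-1)
            if conf.item() >= self.threshold:         # confident enough: exit early
                return token, probs
        return token, probs                           # otherwise use the last layer

dec = MultiExitDecoder()
token, probs = dec(torch.randn(1, 5, 256), torch.randn(1, 32, 256))
print(token)
```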
- NASH: A Simple Unified Framework of Structured Pruning for Accelerating Encoder-Decoder Language Models [29.468888611690346]
We propose a simple and effective framework, NASH, that narrows the encoder and shortens the decoder networks of encoder-decoder models.
Our findings highlight two insights: (1) the number of decoder layers is the dominant factor of inference speed, and (2) low sparsity in the pruned encoder network enhances generation quality.
arXiv Detail & Related papers (2023-10-16T04:27:36Z)
- Efficient Controllable Multi-Task Architectures [85.76598445904374]
We propose a multi-task model consisting of a shared encoder and task-specific decoders where both encoder and decoder channel widths are slimmable.
Our key idea is to control the task importance by varying the capacities of task-specific decoders, while controlling the total computational cost.
This improves overall accuracy by allowing a stronger encoder for a given budget, increases control over computational cost, and delivers high-quality slimmed sub-architectures.
arXiv Detail & Related papers (2023-08-22T19:09:56Z)
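A small sketch of the slimmable-width idea in the entry above: a linear layer evaluated at a fraction of its output width, so that task-specific decoder heads can be widened or slimmed to reflect task importance under a compute budget. Class names, widths, and the two example task heads are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SlimmableLinear(nn.Module):
    """A linear layer that can be evaluated at a fraction of its output width."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.full = nn.Linear(in_dim, out_dim)

    def forward(self, x, width_ratio=1.0):
        k = max(1, int(self.full.out_features * width_ratio))
        # Use only the first k output channels of the full layer.
        return nn.functional.linear(x, self.full.weight[:k], self.full.bias[:k])

# Shared encoder features feed task-specific decoders at different widths:
# a more important task keeps a wide decoder, a less important one is slimmed.
feat = torch.randn(2, 512)
seg_head = SlimmableLinear(512, 256)
depth_head = SlimmableLinear(512, 256)
print(seg_head(feat, width_ratio=1.0).shape)     # torch.Size([2, 256])
print(depth_head(feat, width_ratio=0.25).shape)  # torch.Size([2, 64])
```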
- Hyperdecoders: Instance-specific decoders for multi-task NLP [9.244884318445413]
We investigate input-conditioned hypernetworks for multi-tasking in NLP.
We generate parameter-efficient adaptations for a decoder using a hypernetwork conditioned on the output of an encoder.
arXiv Detail & Related papers (2022-03-15T22:39:53Z)
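A sketch of the Hyperdecoders idea from the entry above: a hypernetwork maps the pooled encoder output to the weights of a small bottleneck adapter applied to decoder hidden states. The pooling choice, adapter shape, and class names are assumptions for illustration, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class HyperAdapter(nn.Module):
    """Generates per-input adapter weights from the encoder output and applies
    the resulting bottleneck adapter to decoder hidden states (residual)."""
    def __init__(self, d_model=256, bottleneck=16):
        super().__init__()
        self.d, self.b = d_model, bottleneck
        n_params = 2 * d_model * bottleneck + bottleneck + d_model
        self.hyper = nn.Linear(d_model, n_params)    # hypernetwork

    def forward(self, decoder_hidden, encoder_output):
        ctx = encoder_output.mean(dim=1)                             # (B, d)
        p = self.hyper(ctx)                                          # (B, n_params)
        d, b = self.d, self.b
        w_down = p[:, : d * b].view(-1, b, d)                        # (B, b, d)
        b_down = p[:, d * b : d * b + b]                             # (B, b)
        w_up = p[:, d * b + b : d * b + b + b * d].view(-1, d, b)    # (B, d, b)
        b_up = p[:, -d:]                                             # (B, d)
        h = torch.einsum("btd,bkd->btk", decoder_hidden, w_down) + b_down.unsqueeze(1)
        h = torch.relu(h)
        h = torch.einsum("btk,bdk->btd", h, w_up) + b_up.unsqueeze(1)
        return decoder_hidden + h

adapter = HyperAdapter()
out = adapter(torch.randn(2, 10, 256), torch.randn(2, 40, 256))
print(out.shape)   # torch.Size([2, 10, 256])
```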
- On Sparsifying Encoder Outputs in Sequence-to-Sequence Models [90.58793284654692]
We take Transformer as the testbed and introduce a layer of gates between the encoder and the decoder.
The gates are regularized using the expected value of the sparsity-inducing L0 penalty.
We investigate the effects of this sparsification on two machine translation and two summarization tasks.
arXiv Detail & Related papers (2020-04-24T16:57:52Z)
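A simplified sketch of the gating idea in the entry above: per-position gates between encoder and decoder, with the expected number of open gates added to the loss. This uses a plain sigmoid relaxation as a stand-in for the paper's hard-concrete L0 machinery; names, sizes, and the penalty weight are assumptions.

```python
import torch
import torch.nn as nn

class EncoderOutputGate(nn.Module):
    """Per-position gates on encoder outputs; the expected number of open gates
    acts as a sparsity penalty (simplified stand-in for an L0 relaxation)."""
    def __init__(self, d_model=256):
        super().__init__()
        self.score = nn.Linear(d_model, 1)

    def forward(self, memory):                      # memory: (B, S, d_model)
        p_open = torch.sigmoid(self.score(memory))  # (B, S, 1) gate-open probability
        gated = memory * p_open                     # sparsified encoder outputs
        expected_l0 = p_open.sum()                  # expected number of kept positions
        return gated, expected_l0

gate = EncoderOutputGate()
memory = torch.randn(2, 50, 256)
gated, l0 = gate(memory)
task_loss = gated.pow(2).mean()                     # placeholder for the real task loss
loss = task_loss + 1e-3 * l0                        # add the sparsity-inducing penalty
print(loss.item())
```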
- Balancing Cost and Benefit with Tied-Multi Transformers [24.70761584719857]
In sequence-to-sequence modeling, the output of the last layer of the N-layer encoder is fed to the M-layer decoder, and the output of the last decoder layer is used to compute the loss.
Our method instead computes a single loss consisting of NxM losses, where each loss is computed from the output of one of the M decoder layers connected to one of the N encoder layers.
Such a model subsumes NxM models with different numbers of encoder and decoder layers, and can be used for decoding with fewer than the maximum number of encoder and decoder layers.
arXiv Detail & Related papers (2020-02-20T08:20:52Z)
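A training-step sketch of the NxM loss described in the Tied-Multi Transformers entry above, assuming generic torch.nn encoder/decoder layers and pre-embedded inputs (dimensions, vocabulary size, and the dummy data are illustrative).

```python
import torch
import torch.nn as nn

d_model, n_heads, N, M, vocab = 256, 4, 3, 3, 1000
enc_layers = nn.ModuleList([nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
                            for _ in range(N)])
dec_layers = nn.ModuleList([nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
                            for _ in range(M)])
out_head = nn.Linear(d_model, vocab)
criterion = nn.CrossEntropyLoss()

src = torch.randn(2, 20, d_model)                  # embedded source
tgt = torch.randn(2, 10, d_model)                  # embedded (shifted) target
labels = torch.randint(0, vocab, (2, 10))
tgt_mask = nn.Transformer.generate_square_subsequent_mask(10)

# Keep the encoder output after each of the N encoder layers.
enc_states, x = [], src
for layer in enc_layers:
    x = layer(x)
    enc_states.append(x)

# One loss per (encoder depth n, decoder depth m); the total is their sum,
# so every shallower sub-model is trained alongside the full model.
total_loss = 0.0
for memory in enc_states:                          # n = 1..N
    y = tgt
    for dec in dec_layers:                         # m = 1..M
        y = dec(y, memory, tgt_mask=tgt_mask)
        logits = out_head(y)
        total_loss = total_loss + criterion(logits.reshape(-1, vocab), labels.reshape(-1))
print(total_loss)
```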
This list is automatically generated from the titles and abstracts of the papers on this site.