Accelerating Transformer Inference for Translation via Parallel Decoding
- URL: http://arxiv.org/abs/2305.10427v1
- Date: Wed, 17 May 2023 17:57:34 GMT
- Title: Accelerating Transformer Inference for Translation via Parallel Decoding
- Authors: Andrea Santilli, Silvio Severino, Emilian Postolache, Valentino
Maiorca, Michele Mancusi, Riccardo Marin, Emanuele Rodolà
- Abstract summary: Autoregressive decoding limits the efficiency of transformers for Machine Translation (MT)
We present three parallel decoding algorithms and test them on different languages and models.
- Score: 2.89306442817912
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Autoregressive decoding limits the efficiency of transformers for Machine
Translation (MT). The community has proposed specific network architectures and
learning-based methods to address this issue, but these are expensive, require
changes to the MT model, and trade translation quality for inference speed. In
this paper, we address the problem from the point of view of decoding
algorithms, a less explored but rather compelling direction. We reframe
standard greedy autoregressive decoding for MT as a parallel formulation that
leverages Jacobi and Gauss-Seidel fixed-point iteration methods for fast
inference. This formulation speeds up existing models without training or
modifications while retaining translation quality. We present three parallel
decoding algorithms and test them on different languages and models, showing
that the parallelization yields a speedup of up to 38% over standard
autoregressive decoding and nearly 2x when scaling the method on parallel
resources. Finally, we introduce a decoding dependency graph visualizer
(DDGviz) that lets us see how the model has learned the conditional dependence
between tokens and inspect the decoding procedure.
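As a rough illustration of the decoding-algorithm view taken in the abstract, the sketch below shows a Jacobi-style parallel greedy decoding loop. The `model.encode`/`model.decode` interface and the special token ids are placeholders rather than the paper's actual API; it is a minimal sketch of the idea, not the authors' implementation.

```python
import torch

def parallel_jacobi_greedy_decode(model, src_ids, max_len=64, bos_id=1, pad_id=0):
    """Jacobi-style parallel greedy decoding (illustrative sketch).

    Standard greedy decoding solves y_i = argmax p(y_i | y_<i, x) one token at
    a time. Here the whole target block is treated as a fixed-point problem:
    start from an arbitrary guess and update *every* position in parallel until
    no token changes. The fixed point equals the greedy autoregressive
    solution, so translation quality is unchanged; each sweep costs a single
    decoder forward pass and usually far fewer than max_len sweeps are needed.
    """
    device = src_ids.device
    memory = model.encode(src_ids)                       # placeholder encoder call
    y = torch.full((1, max_len), pad_id, dtype=torch.long, device=device)  # initial guess
    bos = torch.full((1, 1), bos_id, dtype=torch.long, device=device)
    for _ in range(max_len):                             # worst case matches greedy decoding
        dec_in = torch.cat([bos, y[:, :-1]], dim=1)      # position i-1 predicts token i
        logits = model.decode(dec_in, memory)            # ONE pass over all positions
        y_new = logits.argmax(dim=-1)                    # parallel (Jacobi) update of all tokens
        if torch.equal(y_new, y):                        # fixed point reached: stop early
            return y_new[0]
        y = y_new
    return y[0]
```

In practice one would also truncate at the first end-of-sequence token, and the paper's Gauss-Seidel and hybrid variants iterate over smaller contiguous blocks so that tokens fixed in earlier blocks are reused within a sweep.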
Related papers
- Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines [74.42485647685272]
We focus on Generative Masked Language Models (GMLMs)
We train a model to fit conditional probabilities of the data distribution via masking, which are subsequently used as inputs to a Markov Chain to draw samples from the model.
We adapt the T5 model for iteratively-refined parallel decoding, achieving 2-3x speedup in machine translation with minimal sacrifice in quality.
arXiv Detail & Related papers (2024-07-22T18:00:00Z)
- Let the Code LLM Edit Itself When You Edit the Code [50.46536185784169]
Positional Integrity Encoding (PIE)
PIE reduces computational overhead by over 85% compared to the standard full recomputation approach.
arXiv Detail & Related papers (2024-07-03T14:34:03Z)
- Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely "hidden transfer", which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z)
- Break the Sequential Dependency of LLM Inference Using Lookahead Decoding [27.87483106859749]
Lookahead decoding is an exact, parallel decoding algorithm for large language models (LLMs)
Our implementation can speed up autoregressive decoding by up to 1.8x on MT-bench and 4x with strong scaling on multiple GPUs in code completion tasks.
arXiv Detail & Related papers (2024-02-03T06:37:50Z)
- Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster [61.83949316226113]
FastCoT is a model-agnostic framework based on parallel decoding.
We show that FastCoT saves inference time by nearly 20% with only a negligible performance drop compared to the regular approach.
arXiv Detail & Related papers (2023-11-14T15:56:18Z)
- Fast Inference from Transformers via Speculative Decoding [3.950600027250452]
Inference from large autoregressive models like Transformers is slow - decoding K tokens takes K serial runs of the model.
In this work we introduce speculative decoding - an algorithm to sample from autoregressive models faster without any changes to the outputs, by computing several tokens in parallel (see the sketch after this list).
arXiv Detail & Related papers (2022-11-30T17:33:28Z)
- Glancing Transformer for Non-Autoregressive Neural Machine Translation [58.87258329683682]
We propose a method to learn word interdependency for single-pass parallel generation models.
With only single-pass parallel decoding, GLAT is able to generate high-quality translation with 8-15 times speedup.
arXiv Detail & Related papers (2020-08-18T13:04:03Z)
- Cascaded Text Generation with Markov Transformers [122.76100449018061]
Two dominant approaches to neural text generation are fully autoregressive models, using serial beam search decoding, and non-autoregressive models, using parallel decoding with no output dependencies.
This work proposes an autoregressive model with sub-linear parallel time generation. Noting that conditional random fields with bounded context can be decoded in parallel, we propose an efficient cascaded decoding approach for generating high-quality output.
This approach requires only a small modification from standard autoregressive training, while showing competitive accuracy/speed tradeoff compared to existing methods on five machine translation datasets.
arXiv Detail & Related papers (2020-06-01T17:52:15Z)
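Among the related entries above, speculative decoding ("Fast Inference from Transformers via Speculative Decoding") is the closest baseline in spirit: a block of tokens is proposed cheaply and verified with one forward pass of the full model. The sketch below shows only the greedy verification variant, under assumed `draft(ids)` and `target(ids)` callables that return next-token logits for every input position (batch size 1, non-empty prefix); the published algorithm additionally uses a probabilistic accept/reject rule so that sampled outputs match the target model's distribution exactly.

```python
import torch

def speculative_greedy_step(target, draft, prefix_ids, k=4):
    """One block of greedy speculative decoding (illustrative sketch).

    A small draft model proposes k tokens serially; the large target model
    then checks the whole block in a single forward pass, keeping the longest
    prefix of drafted tokens that matches its own greedy choices plus one
    corrected (or bonus) token. Greedy output is identical to running the
    target model alone, only faster whenever drafts are accepted.
    """
    # 1) Propose k tokens with the cheap draft model (serial, but fast).
    drafted = prefix_ids
    for _ in range(k):
        nxt = draft(drafted)[:, -1].argmax(dim=-1, keepdim=True)
        drafted = torch.cat([drafted, nxt], dim=1)

    # 2) Score every proposed position with ONE target forward pass.
    target_pred = target(drafted).argmax(dim=-1)       # greedy token for the next slot

    # 3) Keep drafted tokens while they agree with the target's own choices.
    n = prefix_ids.shape[1]
    out = prefix_ids
    for i in range(k):
        want = target_pred[:, n + i - 1 : n + i]       # target's choice for slot n + i
        out = torch.cat([out, want], dim=1)
        if drafted[0, n + i] != want[0, 0]:            # first disagreement: token corrected, stop
            break
    else:                                              # all k accepted: bonus token for free
        out = torch.cat([out, target_pred[:, -1:]], dim=1)
    return out
```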