PaSS: Parallel Speculative Sampling
- URL: http://arxiv.org/abs/2311.13581v1
- Date: Wed, 22 Nov 2023 18:37:27 GMT
- Title: PaSS: Parallel Speculative Sampling
- Authors: Giovanni Monea, Armand Joulin, Edouard Grave
- Abstract summary: Scaling the size of language models to tens of billions of parameters has led to impressive performance on a wide range of tasks.
At generation, these models are used auto-regressively, requiring a forward pass for each generated token, and thus reading the full set of parameters from memory.
We show promising performance (up to $30\%$ speed-up) while requiring only as few as $O(d_{emb})$ additional parameters.
- Score: 29.23180061749074
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scaling the size of language models to tens of billions of parameters has led
to impressive performance on a wide range of tasks. At generation, these models
are used auto-regressively, requiring a forward pass for each generated token,
and thus reading the full set of parameters from memory. This memory access
forms the primary bottleneck for generation and it worsens as the model size
increases. Moreover, executing a forward pass for multiple tokens in parallel
often takes nearly the same time as it does for just one token. These two
observations lead to the development of speculative sampling, where a second
smaller model is used to draft a few tokens, that are then validated or
rejected using a single forward pass of the large model. Unfortunately, this
method requires two models that share the same tokenizer and thus limits its
adoption. As an alternative, we propose to use parallel decoding as a way to
draft multiple tokens from a single model with no computational cost, nor the
need for a second model. Our approach only requires an additional input token
that marks the words that will be generated simultaneously. We show promising
performance (up to $30\%$ speed-up) while requiring only as few as $O(d_{emb})$
additional parameters.
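The mechanism described in the abstract can be made concrete with a small example. Below is a minimal, greedy-decoding sketch of look-ahead drafting and verification written from the abstract alone: the `model`/`embed` interfaces, the function name, and the exact acceptance rule are illustrative assumptions rather than the paper's implementation; the learned look-ahead embeddings are the only added parameters, consistent with the $O(d_{emb})$ claim.

```python
import torch


@torch.no_grad()
def draft_and_verify_step(model, embed, la_embeddings, input_ids):
    """One look-ahead decoding step (sketch): draft several tokens in a single
    forward pass using learned look-ahead embeddings, then verify them with a
    second forward pass of the same model (greedy decoding for simplicity).

    model:         callable mapping embeddings (1, T, d_emb) -> logits (1, T, V)
    embed:         token-embedding module, ids (1, T) -> embeddings (1, T, d_emb)
    la_embeddings: learned look-ahead embeddings, shape (k, d_emb); the only
                   parameters added on top of the base model
    input_ids:     current prefix of token ids, shape (1, T)
    """
    T = input_ids.shape[1]
    k = la_embeddings.shape[0]

    # 1) Drafting pass: append the k look-ahead embeddings to the prefix. The
    #    last prefix position predicts the next token as usual, and each
    #    look-ahead position speculates one token further ahead.
    prefix = embed(input_ids)                                   # (1, T, d_emb)
    drafting_input = torch.cat([prefix, la_embeddings.unsqueeze(0)], dim=1)
    draft_logits = model(drafting_input)                        # (1, T+k, V)
    drafts = draft_logits[:, -(k + 1):, :].argmax(dim=-1)       # (1, k+1)

    # 2) Verification pass: run the model once on prefix + drafts and keep the
    #    longest run of drafts the model itself would have generated.
    candidate = torch.cat([input_ids, drafts], dim=1)           # (1, T+k+1)
    verify_logits = model(embed(candidate))                     # (1, T+k+1, V)
    preds = verify_logits[:, T - 1:, :].argmax(dim=-1)          # (1, k+2)

    matches = (preds[:, :k + 1] == drafts).squeeze(0)           # (k+1,)
    n_accept = int(matches.long().cumprod(dim=0).sum().item())
    # The prediction at the first rejected (or final) position is the model's
    # own next token, so every step yields at least one new token.
    bonus = preds[:, n_accept:n_accept + 1]                     # (1, 1)
    return torch.cat([input_ids, drafts[:, :n_accept], bonus], dim=1)
```

With a Hugging Face-style causal LM, `model` could be wrapped as something like `lambda e: lm(inputs_embeds=e).logits`; the paper's actual training of the look-ahead embeddings and its sampling scheme may differ from this sketch.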
Related papers
- TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters [102.1116808722299]
We introduce TokenFormer, a natively scalable architecture for Transformers.
By treating model parameters as tokens, we replace all the linear projections in Transformers with attention between input tokens and learned key-value parameter tokens.
Our model scales from 124M to 1.4B parameters by incrementally adding new key-value parameter pairs.
arXiv Detail & Related papers (2024-10-30T16:19:00Z)
- AMUSD: Asynchronous Multi-Device Speculative Decoding for LLM Acceleration [0.3626013617212667]
We introduce AMUSD (Asynchronous Multi-device Speculative Decoding), a system that accelerates generation by decoupling the draft and verify phases.
Unlike conventional speculative decoding, where only one model (draft or verify) performs token generation at a time, AMUSD enables both models to perform predictions independently on separate devices.
We evaluate our approach over multiple datasets and show that AMUSD achieves an average 29% improvement over speculative decoding and up to 1.96$\times$ speedup over conventional autoregressive decoding.
arXiv Detail & Related papers (2024-10-22T19:15:35Z)
- ParallelSpec: Parallel Drafter for Efficient Speculative Decoding [62.68430939686566]
We present ParallelSpec, an alternative to auto-regressive drafting strategies in state-of-the-art speculative decoding approaches.
In contrast to auto-regressive drafting in the speculative stage, we train a parallel drafter to serve as an efficient speculative model.
arXiv Detail & Related papers (2024-10-08T01:05:08Z)
- LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference [30.722379261991563]
LazyLLM is a method that selectively computes the key-value (KV) states only for tokens important for the next token prediction.
We show that LazyLLM accelerates the prefilling stage of the Llama 2 7B model by 2.34x while maintaining accuracy.
arXiv Detail & Related papers (2024-07-19T06:34:45Z)
- S2D: Sorted Speculative Decoding For More Efficient Deployment of Nested Large Language Models [32.68002253527712]
We introduce a novel multi-target scenario for the deployment of draft models for faster inference.
We present a novel, more efficient sorted speculative decoding mechanism that outperforms regular baselines in multi-target settings.
arXiv Detail & Related papers (2024-07-02T05:14:15Z)
- Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely hidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z)
- Tandem Transformers for Inference Efficient LLMs [49.75726447408795]
We introduce a novel architecture, Tandem transformers, to address these issues.
This architecture uniquely combines a small autoregressive model and a large model operating in block mode.
On the PaLM2 pretraining dataset, a tandem of PaLM2-Bison and PaLM2-Gecko demonstrates a 3.3% improvement in next-token prediction accuracy.
arXiv Detail & Related papers (2024-02-13T18:24:08Z)
- Multi-Candidate Speculative Decoding [82.05519287513444]
Large language models have shown impressive capabilities across a variety of NLP tasks, yet generating text autoregressively is time-consuming.
One way to speed them up is speculative decoding, which generates candidate segments from a fast draft model that are then verified in parallel by the target model (see the sketch of the standard acceptance rule after this list).
This paper proposes sampling multiple candidates from a draft model and then organising them in batches for verification.
We design algorithms for efficient multi-candidate verification while maintaining the distribution of the target model.
arXiv Detail & Related papers (2024-01-12T17:15:23Z)
- Chunk-based Nearest Neighbor Machine Translation [7.747003493657217]
We introduce a chunk-based $k$NN-MT model which retrieves chunks of tokens from the datastore, instead of a single token.
Experiments on machine translation in two settings, static domain adaptation and "on-the-fly" adaptation, show that the chunk-based model leads to a significant speed-up (up to 4 times) with only a small drop in translation quality.
arXiv Detail & Related papers (2022-05-24T17:39:25Z)
- LAVA NAT: A Non-Autoregressive Translation Model with Look-Around Decoding and Vocabulary Attention [54.18121922040521]
Non-autoregressive translation (NAT) models generate multiple tokens in one forward pass.
These NAT models often suffer from the multimodality problem, generating duplicated tokens or missing tokens.
We propose two novel methods to address this issue, the Look-Around (LA) strategy and the Vocabulary Attention (VA) mechanism.
arXiv Detail & Related papers (2020-02-08T04:11:03Z)
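As referenced in the Multi-Candidate Speculative Decoding entry above, draft tokens are kept only in a way that preserves the target model's output distribution. Below is a minimal sketch of the standard single-draft acceptance rule used in speculative sampling; the tensor shapes and the function name `verify_drafts` are illustrative assumptions, not any specific paper's code.

```python
import torch


def verify_drafts(p_target, q_draft, draft_ids):
    """Accept or reject k draft tokens so the output follows the target model.

    p_target:  target-model probabilities at each draft position, shape (k, V)
    q_draft:   draft-model probabilities at the same positions, shape (k, V)
    draft_ids: the k token ids proposed by the draft model, shape (k,)
    Returns a list of accepted token ids (always at least one token).
    """
    out = []
    for i, x in enumerate(draft_ids.tolist()):
        p_x, q_x = p_target[i, x], q_draft[i, x]
        # Accept the draft token with probability min(1, p(x) / q(x)).
        if torch.rand(()) <= torch.clamp(p_x / q_x, max=1.0):
            out.append(x)
            continue
        # On rejection, resample from the residual distribution max(p - q, 0),
        # renormalised; this correction is what keeps the overall output
        # distributed exactly as the target model.
        residual = torch.clamp(p_target[i] - q_draft[i], min=0.0)
        out.append(int(torch.multinomial(residual / residual.sum(), 1)))
        break
    return out
```

When every draft is accepted, the standard algorithm additionally samples one bonus token from the target model at the next position, so each verification pass always yields at least one new token.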
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.