SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices
- URL: http://arxiv.org/abs/2406.02532v2
- Date: Tue, 25 Jun 2024 19:35:06 GMT
- Title: SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices
- Authors: Ruslan Svirschevski, Avner May, Zhuoming Chen, Beidi Chen, Zhihao Jia, Max Ryabinin
- Abstract summary: SpecExec is a simple parallel decoding method that can generate up to 20 tokens per target model iteration for popular LLM families.
We demonstrate inference of 50B+ parameter LLMs on consumer GPUs with RAM offloading at 4-6 tokens per second with 4-bit quantization or 2-3 tokens per second with 16-bit weights.
- Score: 18.81344021150902
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As large language models gain widespread adoption, running them efficiently becomes crucial. Recent works on LLM inference use speculative decoding to achieve extreme speedups. However, most of these works implicitly design their algorithms for high-end datacenter hardware. In this work, we ask the opposite question: how fast can we run LLMs on consumer machines? Consumer GPUs can no longer fit the largest available models (50B+ parameters) and must offload them to RAM or SSD. When running with offloaded parameters, the inference engine can process batches of hundreds or thousands of tokens in the same time as a single token, making it a natural fit for speculative decoding. We propose SpecExec (Speculative Execution), a simple parallel decoding method that can generate up to 20 tokens per target model iteration for popular LLM families. It exploits the high spikiness of the token probability distributions in modern LLMs and the high degree of alignment between the draft and target models' output probabilities. SpecExec takes the most probable token continuations from the draft model to build a "cache" tree for the target model, which is then validated in a single pass. Using SpecExec, we demonstrate inference of 50B+ parameter LLMs on consumer GPUs with RAM offloading at 4-6 tokens per second with 4-bit quantization or 2-3 tokens per second with 16-bit weights.
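To make the description above concrete, here is a minimal, hedged sketch of the idea: a cheap draft model expands the most probable continuations into a token tree, and the (offloaded) target model then keeps the longest drafted path it agrees with. This is not the authors' implementation; `build_draft_tree`, `validate_with_target`, and the toy spiky "models" are illustrative stand-ins, and the validation step here queries the target node by node rather than scoring the whole tree in one batched pass as SpecExec does.

```python
import heapq
import math
from typing import Callable, Dict, List, Tuple

Token = str
# A "model" here is just a map from a token prefix to next-token log-probs.
LogProbFn = Callable[[Tuple[Token, ...]], Dict[Token, float]]


def build_draft_tree(prefix: Tuple[Token, ...],
                     draft: LogProbFn,
                     budget: int) -> List[Tuple[Token, ...]]:
    """Best-first expansion of the `budget` most probable continuations of
    `prefix` under the draft model, ranked by cumulative log-probability."""
    heap: List[Tuple[float, Tuple[Token, ...]]] = [(0.0, ())]  # (-cum logprob, continuation)
    tree: List[Tuple[Token, ...]] = []
    while heap and len(tree) < budget:
        neg_lp, cont = heapq.heappop(heap)
        if cont:                       # skip the empty root
            tree.append(cont)
        for tok, lp in draft(prefix + cont).items():
            heapq.heappush(heap, (neg_lp - lp, cont + (tok,)))
    return tree


def validate_with_target(prefix: Tuple[Token, ...],
                         tree: List[Tuple[Token, ...]],
                         target: LogProbFn) -> Tuple[Token, ...]:
    """Keep the longest drafted continuation whose every token is also the
    target model's greedy choice. (The paper validates the whole tree in a
    single pass; this sketch queries the target per node for clarity.)"""
    best: Tuple[Token, ...] = ()
    for cont in tree:
        ok = True
        for i, tok in enumerate(cont):
            dist = target(prefix + cont[:i])
            if max(dist, key=dist.get) != tok:
                ok = False
                break
        if ok and len(cont) > len(best):
            best = cont
    return best


if __name__ == "__main__":
    VOCAB = ("a", "b", "c")

    def spiky(favored: Token) -> Dict[Token, float]:
        # Highly "spiky" distribution: one token carries most of the mass.
        return {t: math.log(0.9 if t == favored else 0.05) for t in VOCAB}

    def draft(prefix: Tuple[Token, ...]) -> Dict[Token, float]:
        return spiky("a" if len(prefix) % 2 == 0 else "b")

    target = draft  # toy case: draft and target agree perfectly

    prompt = ("<bos>",)
    tree = build_draft_tree(prompt, draft, budget=16)
    print("accepted:", validate_with_target(prompt, tree, target))
```

In this toy setting the draft and target agree on every token, so the whole 16-token drafted chain is accepted in one "target iteration"; with real models, the accepted length depends on how well the draft tree covers the target's probable continuations.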
Related papers
- Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference [19.167604927651073]
Auto-regressive decoding of Large Language Models (LLMs) results in significant overheads in their hardware performance.
We propose a novel parallel prompt decoding that requires only 0.0002% trainable parameters, enabling efficient training on a single A100-40GB GPU in just 16 hours.
Our approach demonstrates up to 2.49× speedup and maintains a minimal memory overhead of just 0.0004%.
arXiv Detail & Related papers (2024-05-28T22:19:30Z) - Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z) - Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely "hidden transfer", which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z) - Tandem Transformers for Inference Efficient LLMs [49.75726447408795]
We introduce a novel architecture, Tandem transformers, to address these issues.
This architecture uniquely combines a small autoregressive model and a large model operating in block mode.
On the PaLM2 pretraining dataset, a tandem of PaLM2-Bison and PaLM2-Gecko demonstrates a 3.3% improvement in next-token prediction accuracy.
arXiv Detail & Related papers (2024-02-13T18:24:08Z) - Break the Sequential Dependency of LLM Inference Using Lookahead Decoding [27.87483106859749]
Lookahead decoding is an exact, parallel decoding algorithm for large language models (LLMs).
Our implementation can speed up autoregressive decoding by up to 1.8x on MT-bench and 4x with strong scaling on multiple GPUs in code completion tasks.
arXiv Detail & Related papers (2024-02-03T06:37:50Z) - Distributed Inference and Fine-tuning of Large Language Models Over The
Internet [91.00270820533272]
Large language models (LLMs) are useful in many NLP tasks and become more capable with size.
These models require high-end hardware, making them inaccessible to most researchers.
We develop fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize the total system throughput.
arXiv Detail & Related papers (2023-12-13T18:52:49Z) - PaSS: Parallel Speculative Sampling [29.23180061749074]
Scaling the size of language models to tens of billions of parameters has led to impressive performance on a wide range of tasks.
At generation, these models are used auto-regressively, requiring a forward pass for each generated token, and thus reading the full set of parameters from memory.
We show promising performance (up to 30% speed-up) while requiring only as few as O(d_emb) additional parameters.
arXiv Detail & Related papers (2023-11-22T18:37:27Z) - Efficient LLM Inference on CPUs [8.802223672775844]
Large language models (LLMs) have demonstrated remarkable performance and tremendous potential across a wide range of tasks.
However, deploying these models has been challenging due to their astronomical number of parameters.
We propose an effective approach that makes the deployment of LLMs more efficient.
arXiv Detail & Related papers (2023-11-01T13:08:50Z) - LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models [83.98062659664785]
Large language models (LLMs) typically train on short text segments (e.g., 4K tokens) due to the quadratic complexity of their Transformer architectures.
This work identifies three major factors contributing to this length generalization failure.
We propose LM-Infinite, a simple and effective method for enhancing LLMs' capabilities of handling long contexts.
arXiv Detail & Related papers (2023-08-30T16:47:51Z) - SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification [13.174386920965107]
SpecInfer is a system that accelerates generative large language model (LLM) serving with tree-based speculative inference and verification.
The correctness of all candidate token sequences represented by a token tree is verified against the LLM in parallel using a novel tree-based parallel decoding mechanism (see the tree-attention sketch after this list).
arXiv Detail & Related papers (2023-05-16T20:12:59Z) - FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU [89.2451963569343]
FlexGen is a generation engine for running large language model (LLM) inference on a single commodity GPU.
When running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems.
On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours.
arXiv Detail & Related papers (2023-03-13T05:19:28Z)
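Both SpecInfer and SpecExec score an entire drafted token tree with the target model in one forward pass. Below is a hedged sketch of the ingredient that makes this possible: a tree attention mask in which each drafted token attends only to itself and its ancestors, so sibling branches do not leak into each other. The parent-index encoding and the name `tree_attention_mask` are illustrative assumptions, not either paper's exact interface.

```python
from typing import List


def tree_attention_mask(parents: List[int]) -> List[List[bool]]:
    """parents[i] is the index of node i's parent in the flattened token tree,
    or -1 for a root that attends only to the shared prompt. Returns
    mask[i][j] == True iff node i may attend to node j (j is i or an ancestor
    of i), so every branch can be scored in one batched forward pass without
    sibling branches seeing each other."""
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:
            mask[i][j] = True
            j = parents[j]
    return mask


if __name__ == "__main__":
    # Example tree: token 0 has two alternative children (1 and 2); 3 extends 1.
    #   0 -- 1 -- 3
    #    \-- 2
    for row in tree_attention_mask([-1, 0, 0, 1]):
        print("".join("x" if allowed else "." for allowed in row))
    # Node 3 attends to {0, 1, 3} but not to the sibling branch {2}.
```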