SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices
- URL: http://arxiv.org/abs/2406.02532v2
- Date: Tue, 25 Jun 2024 19:35:06 GMT
- Title: SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices
- Authors: Ruslan Svirschevski, Avner May, Zhuoming Chen, Beidi Chen, Zhihao Jia, Max Ryabinin
- Abstract summary: SpecExec is a simple parallel decoding method that can generate up to 20 tokens per target model iteration for popular LLM families.
We demonstrate inference of 50B+ parameter LLMs on consumer GPUs with RAM offloading at 4-6 tokens per second with 4-bit quantization or 2-3 tokens per second with 16-bit weights.
- Score: 18.81344021150902
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As large language models gain widespread adoption, running them efficiently becomes crucial. Recent works on LLM inference use speculative decoding to achieve extreme speedups. However, most of these works implicitly design their algorithms for high-end datacenter hardware. In this work, we ask the opposite question: how fast can we run LLMs on consumer machines? Consumer GPUs can no longer fit the largest available models (50B+ parameters) and must offload them to RAM or SSD. When running with offloaded parameters, the inference engine can process batches of hundreds or thousands of tokens in the same time as just one token, making this setting a natural fit for speculative decoding. We propose SpecExec (Speculative Execution), a simple parallel decoding method that can generate up to 20 tokens per target model iteration for popular LLM families. It exploits the high spikiness of the token probability distributions in modern LLMs and the high degree of alignment between the draft and target models' output probabilities. SpecExec takes the most probable token continuations from the draft model to build a "cache" tree for the target model, which is then validated in a single pass. Using SpecExec, we demonstrate inference of 50B+ parameter LLMs on consumer GPUs with RAM offloading at 4-6 tokens per second with 4-bit quantization or 2-3 tokens per second with 16-bit weights.
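To make the mechanism concrete, the following Python sketch mirrors the two phases described above: a draft model grows a tree of the most probable continuations, and the target model then accepts the longest matching path. It is a toy under stated assumptions (hypothetical `draft_step`/`target_step` callables over a 100-token vocabulary, simplified greedy acceptance), not SpecExec itself, which batches the whole tree through the target model in a single offloaded forward pass.

```python
# Minimal sketch of the SpecExec idea, not the authors' implementation: a cheap
# draft model grows a token tree by always expanding the globally most probable
# continuation, and the target model then checks the tree. `draft_step`,
# `target_step`, the toy vocabulary, and the greedy acceptance rule are all
# illustrative stand-ins.
import heapq
import math
import random

VOCAB = list(range(100))  # toy vocabulary of integer token ids

def toy_model(seed):
    """Return a function mapping a token prefix to toy next-token log-probabilities."""
    def step(prefix):
        rng = random.Random(hash((seed,) + tuple(prefix)))
        logits = [rng.gauss(0.0, 3.0) for _ in VOCAB]  # "spiky" toy distribution
        log_z = math.log(sum(math.exp(l) for l in logits))
        return {tok: l - log_z for tok, l in zip(VOCAB, logits)}
    return step

draft_step = toy_model(seed=1)   # small draft model (kept on the GPU)
target_step = toy_model(seed=1)  # large target model (offloaded, expensive per call)

def build_draft_tree(prompt, budget=64, top_k=8):
    """Grow a token tree, always expanding the node with the highest cumulative draft log-prob."""
    tree = {(): None}               # node key = tuple of tokens generated after the prompt
    frontier = [(0.0, ())]          # min-heap over negative cumulative log-prob
    while frontier and len(tree) < budget:
        neg_lp, node = heapq.heappop(frontier)
        for tok, lp in sorted(draft_step(list(prompt) + list(node)).items(),
                              key=lambda kv: -kv[1])[:top_k]:
            child = node + (tok,)
            tree[child] = node      # remember the parent
            heapq.heappush(frontier, (neg_lp - lp, child))
    return tree

def validate_tree(prompt, tree):
    """Accept the longest tree path that matches the target model's greedy choices."""
    accepted = []
    while True:
        scores = target_step(list(prompt) + accepted)
        best = max(scores.items(), key=lambda kv: kv[1])[0]
        if tuple(accepted) + (best,) not in tree:
            return accepted + [best]   # the final token comes from the target pass itself
        accepted.append(best)

prompt = [1, 2, 3]
print(validate_tree(prompt, build_draft_tree(prompt)))  # several tokens per target "iteration"
```

Because offloaded inference is dominated by weight transfer, scoring a tree of hundreds of drafted tokens costs roughly the same as scoring a single token, which is what makes the reported 4-6 tokens per second achievable on consumer hardware.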
Related papers
- ParallelSpec: Parallel Drafter for Efficient Speculative Decoding [62.68430939686566]
We present ParallelSpec, an alternative to auto-regressive drafting strategies in state-of-the-art speculative decoding approaches.
In contrast to auto-regressive drafting in the speculative stage, we train a parallel drafter to serve as an efficient speculative model.
arXiv Detail & Related papers (2024-10-08T01:05:08Z)
- MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models [58.3342517278868]
This paper describes the design of Mixed-precision AutoRegressive LINear kernels.
It shows that batch sizes of up to 16-32 can be supported with close to the maximum ($4\times$) quantization speedup.
MARLIN accomplishes this via a combination of techniques such as asynchronous memory access, complex task scheduling, and pipelining.
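For intuition only, the sketch below shows what a 4-bit weight-only quantized linear layer computes, using plain PyTorch ops with an assumed symmetric scheme and group size of 128; the actual MARLIN kernels fuse unpacking, dequantization, and the matmul in CUDA with the techniques listed above.

```python
# Toy reference for a mixed-precision (activation x 4-bit-weight) linear layer,
# the operation MARLIN-style kernels accelerate. The quantization scheme and
# group size below are assumptions for illustration, not the kernel's format.
import torch

GROUP = 128  # number of input features sharing one scale (assumed)

def quantize_4bit(w):
    """Symmetric 4-bit quantization of a (out_features, in_features) weight matrix."""
    out_f, in_f = w.shape
    wg = w.reshape(out_f, in_f // GROUP, GROUP)
    scales = wg.abs().amax(dim=-1, keepdim=True) / 7.0           # map to integer range [-7, 7]
    q = torch.clamp(torch.round(wg / scales), -8, 7).to(torch.int8)
    return q.reshape(out_f, in_f), scales.squeeze(-1)

def dequant_matmul(x, q, scales):
    """y = x @ W^T with W reconstructed on the fly from 4-bit values and per-group scales."""
    out_f, in_f = q.shape
    w = (q.reshape(out_f, in_f // GROUP, GROUP).float() * scales.unsqueeze(-1)).reshape(out_f, in_f)
    return x @ w.t()

x = torch.randn(16, 512)            # a small batch of activations (FP16 on GPU in practice)
w = torch.randn(256, 512)           # full-precision weight, quantized offline
q, s = quantize_4bit(w)
y = dequant_matmul(x, q, s)
print(y.shape, (y - x @ w.t()).abs().max())   # quantization error stays small
```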
arXiv Detail & Related papers (2024-08-21T16:10:41Z)
- Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling [53.58854856174773]
Speculative decoding is an approach to accelerate inference through a guess-and-verify paradigm.
Token Recycling stores candidate tokens in an adjacency matrix and employs a breadth-first search algorithm.
It significantly outperforms existing training-free methods by 30% and even a training-based method by 25%.
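Based only on the one-line description above, a plausible sketch of the data structure is an adjacency matrix mapping each token to its top-k likely successors (recycled from logits already computed during verification), from which candidate drafts are expanded breadth-first; the paper's exact update rule and tree shape may differ.

```python
# Hedged sketch of the adjacency-matrix + BFS drafting idea described above.
# Matrix sizes, the update rule, and the tree shape are assumptions for
# illustration; the target LLM would still verify the drafted tree.
from collections import deque
import numpy as np

VOCAB, TOP_K = 32000, 8
successors = np.zeros((VOCAB, TOP_K), dtype=np.int64)  # token id -> its k most likely next tokens

def recycle(token_id, logits):
    """Refresh the row for `token_id` using logits already produced during verification."""
    successors[token_id] = np.argsort(logits)[-TOP_K:][::-1]

def draft_bfs(start_token, max_nodes=64, branch=3):
    """Expand a small candidate tree breadth-first over the adjacency matrix.
    Returns flattened (tokens, parent_indices) ready for tree verification."""
    tokens, parents = [start_token], [-1]
    queue = deque([0])
    while queue and len(tokens) < max_nodes:
        idx = queue.popleft()
        for nxt in successors[tokens[idx]][:branch]:
            if len(tokens) >= max_nodes:
                break
            tokens.append(int(nxt))
            parents.append(idx)
            queue.append(len(tokens) - 1)
    return tokens, parents

recycle(42, np.random.randn(VOCAB))        # pretend we recycled logits for token 42
print(draft_bfs(42, max_nodes=10))
```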
arXiv Detail & Related papers (2024-08-16T12:20:56Z)
- Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference [19.167604927651073]
Auto-regressive decoding of Large Language Models (LLMs) incurs significant overheads in hardware performance.
We propose a novel parallel prompt decoding method that requires only $0.0002\%$ trainable parameters, enabling efficient training on a single A100-40GB GPU in just 16 hours.
Our approach demonstrates up to a 2.49$\times$ speedup while maintaining a minimal memory overhead of just $0.0004\%$.
arXiv Detail & Related papers (2024-05-28T22:19:30Z)
- Tandem Transformers for Inference Efficient LLMs [49.75726447408795]
We introduce a novel architecture, Tandem transformers, to address these issues.
This architecture uniquely combines a small autoregressive model and a large model operating in block mode.
On the PaLM2 pretraining dataset, a tandem of PaLM2-Bison and PaLM2-Gecko demonstrates a 3.3% improvement in next-token prediction accuracy.
arXiv Detail & Related papers (2024-02-13T18:24:08Z)
- Break the Sequential Dependency of LLM Inference Using Lookahead Decoding [27.87483106859749]
Lookahead decoding is an exact, parallel decoding algorithm for large language models (LLMs).
Our implementation can speed up autoregressive decoding by up to 1.8x on MT-bench and 4x with strong scaling on multiple GPUs in code completion tasks.
arXiv Detail & Related papers (2024-02-03T06:37:50Z)
- Distributed Inference and Fine-tuning of Large Language Models Over The Internet [91.00270820533272]
Large language models (LLMs) are useful in many NLP tasks and become more capable with size.
These models require high-end hardware, making them inaccessible to most researchers.
We develop fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize the total system throughput.
arXiv Detail & Related papers (2023-12-13T18:52:49Z)
- Efficient LLM Inference on CPUs [8.802223672775844]
Large language models (LLMs) have demonstrated remarkable performance and tremendous potential across a wide range of tasks.
However, deploying these models has been challenging due to their astronomical number of parameters.
We propose an effective approach that makes the deployment of LLMs more efficient.
arXiv Detail & Related papers (2023-11-01T13:08:50Z)
- LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models [83.98062659664785]
Large language models (LLMs) are typically trained on short text segments (e.g., 4K tokens) due to the quadratic complexity of their Transformer architectures.
This work identifies three major factors contributing to this length generalization failure.
We propose LM-Infinite, a simple and effective method for enhancing LLMs' capabilities of handling long contexts.
arXiv Detail & Related papers (2023-08-30T16:47:51Z)
- SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification [13.174386920965107]
SpecInfer is a system that accelerates generative large language model (LLM) serving with tree-based speculative inference and verification.
The correctness of all candidate token sequences represented by a token tree is verified against the LLM in parallel using a novel tree-based parallel decoding mechanism.
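The essential trick behind verifying a whole token tree in one pass is an attention mask that lets each drafted token attend only to its ancestors. Below is a minimal sketch of such a mask built from parent pointers; it is illustrative only, not SpecInfer's actual kernels or verification rule.

```python
# Illustrative tree-attention mask: each node of the token tree may attend to
# itself and its ancestors only, so the flattened tree can be scored by the LLM
# in a single batched forward pass. (Sketch only, not SpecInfer's implementation.)
import torch

def tree_attention_mask(parents):
    """parents[i] is the index of node i's parent, or -1 for the root."""
    n = len(parents)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:          # walk up to the root, marking ancestors as visible
            mask[i, j] = True
            j = parents[j]
    return mask                  # True = attention allowed

# Tree with root 0 and two branches: 0 -> 1 -> 3 and 0 -> 2
print(tree_attention_mask([-1, 0, 0, 1]).int())
```

In such schemes, position ids typically follow each node's depth in the tree, and only the accepted branch's key/value cache entries are kept after verification.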
arXiv Detail & Related papers (2023-05-16T20:12:59Z)