SSSD: Simply-Scalable Speculative Decoding
- URL: http://arxiv.org/abs/2411.05894v1
- Date: Fri, 08 Nov 2024 14:23:02 GMT
- Title: SSSD: Simply-Scalable Speculative Decoding
- Authors: Michele Marzollo, Jiawei Zhuang, Niklas Roemer, Lorenz K. Müller, Lukas Cavigelli
- Abstract summary: Speculative Decoding has gained popularity as a technique for accelerating Large Language Model inference.
We offer a theoretical explanation of how Speculative Decoding can be effectively utilized with larger batch sizes.
- Score: 4.613725465729454
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Over the past year, Speculative Decoding has gained popularity as a technique for accelerating Large Language Model inference. While several methods have been introduced, most struggle to deliver satisfactory performance at batch sizes typical for data centers ($\geq 8$) and often involve significant deployment complexities. In this work, we offer a theoretical explanation of how Speculative Decoding can be effectively utilized with larger batch sizes. We also introduce a method that integrates seamlessly into existing systems without additional training or the complexity of deploying a small LLM. In a continuous batching setting, we achieve a 4x increase in throughput without any latency impact for short context generation, and a 1.7-2x improvement in both latency and throughput for longer contexts.
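The abstract names the technique but not its mechanics, so the following is a minimal sketch of the draft-and-verify loop that all speculative decoding methods share. In the spirit of SSSD's claim of requiring no extra training and no small draft LLM, the drafter below uses prompt-lookup (proposing the tokens that followed the most recent matching n-gram in the sequence); this stand-in and names like `draft_by_lookup` are illustrative assumptions, not the paper's actual method or API, and the target model is a toy deterministic function in place of an LLM forward pass.

```python
"""Toy draft-and-verify loop illustrating greedy speculative decoding."""

def target_next(tokens):
    # Stand-in for the target LLM's greedy next-token choice:
    # an arbitrary deterministic rule over a vocabulary of 50 ids.
    return (tokens[-1] * 7 + len(tokens)) % 50

def draft_by_lookup(tokens, ngram=2, k=4):
    """Prompt-lookup drafting (an assumed stand-in, not SSSD's method):
    if the trailing n-gram occurred earlier in the sequence, propose
    the k tokens that followed it. No draft LLM, no training."""
    tail = tokens[-ngram:]
    for i in range(len(tokens) - ngram - 1, -1, -1):
        if tokens[i:i + ngram] == tail:
            return tokens[i + ngram:i + ngram + k]
    return []  # no match: degrade gracefully to ordinary decoding

def speculative_step(tokens):
    """One decode step: draft up to k tokens, verify them against the
    target model, and return the tokens actually produced (always >= 1)."""
    draft = draft_by_lookup(tokens)
    produced, ctx = [], list(tokens)
    for d in draft:
        t = target_next(ctx)  # in practice, all draft positions are scored in ONE forward pass
        produced.append(t)
        ctx.append(t)
        if t != d:            # first mismatch: discard the rest of the draft
            return produced
    produced.append(target_next(ctx))  # whole draft accepted: keep the bonus token
    return produced

if __name__ == "__main__":
    seq = [3, 1, 4, 1, 5, 9, 2, 6, 1, 5]
    for _ in range(6):
        new = speculative_step(seq)
        seq.extend(new)
        print(f"produced {len(new)} token(s): {new}")
```

Because verification keeps the longest draft prefix that matches the target's own greedy choices and substitutes the target's token at the first mismatch, the output is token-for-token identical to plain greedy decoding; the speedup comes from scoring all draft positions in a single target forward pass, which is what makes the batch-size trade-off studied in this paper interesting.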
Related papers
- Achieving More with Less: Additive Prompt Tuning for Rehearsal-Free Class-Incremental Learning [76.32953653161417]
Class-incremental learning enables models to learn new classes progressively while preserving knowledge of previously learned ones.
Recent advances in this field have shifted towards parameter-efficient fine-tuning techniques.
We present a novel prompt-based approach that addresses the limitations of current approaches.
arXiv Detail & Related papers (2025-03-11T02:27:37Z)
- LongSpec: Long-Context Speculative Decoding with Efficient Drafting and Verification [42.54363549922909]
Speculative decoding has become a promising technique to mitigate the high inference latency of autoregressive decoding in Large Language Models.
Despite its promise, the effective application of speculative decoding in LLMs still confronts three key challenges.
We enhance the performance of speculative decoding in long-context settings by addressing these challenges.
arXiv Detail & Related papers (2025-02-24T18:53:31Z)
- InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU [48.105361428245736]
We introduce InfiniteHiP, an inference framework for large language models (LLMs).
We dynamically eliminate irrelevant context tokens through a modular hierarchical token pruning algorithm.
Our framework achieves an 18.95x speedup in attention decoding for a 1 million token context without requiring additional training.
arXiv Detail & Related papers (2025-02-13T02:52:01Z)
- Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models [79.41139393080736]
Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities.
In-Context Learning (ICL) and Parameter-Efficient Fine-Tuning (PEFT) are currently the two mainstream methods for adapting LLMs to downstream tasks.
We propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning.
arXiv Detail & Related papers (2024-09-30T10:48:20Z)
- MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding [12.74265334789358]
We show that speculative decoding can deliver speedups even in a high-throughput inference regime for moderate to long sequences.
We propose a theoretical model to select the optimal drafting strategy for maximum speedup; a standard form of such a model is sketched after this list.
For moderate to long sequences, we demonstrate up to 2.51x speedup for Llama3.1-8B when serving batch sizes ranging from 32 to 256.
arXiv Detail & Related papers (2024-08-20T17:57:31Z)
- SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, a minimal number of late pre-trained layers is used to reduce peak memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
- High-Dimensional Distributed Sparse Classification with Scalable Communication-Efficient Global Updates [50.406127962933915]
We develop solutions that enable learning a communication-efficient distributed logistic regression model.
In our experiments we demonstrate a large improvement in accuracy over distributed algorithms while requiring only a few distributed update steps.
arXiv Detail & Related papers (2024-07-08T19:34:39Z)
- KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches [52.02764371205856]
Long context capability is a crucial competency for large language models (LLMs).
This work provides a taxonomy of current methods and evaluates 10+ state-of-the-art approaches across seven categories of long-context tasks.
arXiv Detail & Related papers (2024-07-01T17:59:47Z)
- One QuantLLM for ALL: Fine-tuning Quantized LLMs Once for Efficient Deployments [43.107261545706415]
Large Language Models (LLMs) have advanced rapidly but face significant memory demands.
Current methods typically require lengthy training to alleviate the performance degradation from quantization loss.
We make an initial attempt to extend the once-for-all framework to large language models.
arXiv Detail & Related papers (2024-05-30T16:05:15Z)
- RL-GPT: Integrating Reinforcement Learning and Code-as-policy [82.1804241891039]
We introduce a two-level hierarchical framework, RL-GPT, comprising a slow agent and a fast agent.
The slow agent analyzes actions suitable for coding, while the fast agent executes coding tasks.
This decomposition effectively focuses each agent on specific tasks, proving highly efficient within our pipeline.
arXiv Detail & Related papers (2024-02-29T16:07:22Z)
- Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models [90.14693869269519]
MoE LLMs can achieve higher performance with fewer active parameters, but their immense total parameter counts still make them hard to deploy.
This paper mainly aims to enhance the deployment efficiency of MoE LLMs by introducing plug-and-play expert-level sparsification techniques.
arXiv Detail & Related papers (2024-02-22T18:56:07Z)
- DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference [23.49242865222089]
This paper introduces DeepSpeed-FastGen, a system that delivers up to 2.3x higher effective throughput, 2x lower latency on average, and up to 3.7x lower (token-level) tail latency.
We leverage a synergistic combination of DeepSpeed-MII and DeepSpeed-Inference to provide an efficient and easy-to-use serving system for large language models.
arXiv Detail & Related papers (2024-01-09T06:49:40Z)
- Extending Context Window of Large Language Models via Semantic Compression [21.35020344956721]
Large Language Models (LLMs) often impose limitations on the length of the text input to ensure the generation of fluent and relevant responses.
We propose a novel semantic compression method that enables generalization to texts 6-8 times longer, without incurring significant computational costs or requiring fine-tuning.
arXiv Detail & Related papers (2023-12-15T07:04:33Z)
- One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models [42.95555008229016]
We propose a method based on Hessian sensitivity-aware mixed sparsity pruning to prune LLMs to at least 50% sparsity without the need for any retraining.
The advantages of the proposed method become even more pronounced when the sparsity is extremely high.
arXiv Detail & Related papers (2023-10-14T05:43:09Z)
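The MagicDec entry above references a theoretical model for selecting the optimal drafting strategy. The summary does not give the model's form, so what follows is only the standard expected-speedup model from the speculative decoding literature (Leviathan et al., 2023), stated as a reference point; MagicDec's actual model may differ. If each drafted token is accepted independently with probability $\alpha$ and $\gamma$ tokens are drafted per verification step, the expected number of tokens produced per target-model call is
$$\mathbb{E}[\text{tokens per step}] = \frac{1 - \alpha^{\gamma+1}}{1 - \alpha},$$
and if drafting one token costs a fraction $c$ of a target forward pass, the expected per-step speedup is
$$S(\gamma) = \frac{1 - \alpha^{\gamma+1}}{(1 - \alpha)(c\gamma + 1)}.$$
The optimal draft length $\gamma^{*}$ maximizes $S(\gamma)$: drafting pays off only while each additional speculated token adds more expected accepted tokens than it adds draft cost.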