Speculative Streaming: Fast LLM Inference without Auxiliary Models
- URL: http://arxiv.org/abs/2402.11131v1
- Date: Fri, 16 Feb 2024 23:36:43 GMT
- Title: Speculative Streaming: Fast LLM Inference without Auxiliary Models
- Authors: Nikhil Bhendawade, Irina Belousova, Qichen Fu, Henry Mason, Mohammad
Rastegari, Mahyar Najibi
- Abstract summary: Speculative Streaming is a single-model speculative decoding method.
It fuses drafting into the target model by changing the fine-tuning objective from next token prediction to future n-gram prediction.
It speeds up decoding by 1.8 - 3.1X in a diverse set of tasks.
- Score: 21.454206732725563
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speculative decoding is a prominent technique to speed up the inference of a
large target language model based on predictions of an auxiliary draft model.
While effective, in application-specific settings, it often involves
fine-tuning both draft and target models to achieve high acceptance rates. As
the number of downstream tasks grows, these draft models add significant
complexity to inference systems. We propose Speculative Streaming, a
single-model speculative decoding method that fuses drafting into the target
model by changing the fine-tuning objective from next token prediction to
future n-gram prediction. Speculative Streaming speeds up decoding by 1.8 -
3.1X in a diverse set of tasks, such as Summarization, Structured Queries, and
Meaning Representation, without sacrificing generation quality. Additionally,
Speculative Streaming is parameter-efficient. It achieves speed-ups on par with or higher than Medusa-style architectures while using ~10000X fewer extra
parameters, making it well-suited for resource-constrained devices.
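To make the fused draft-and-verify loop concrete, below is a minimal, runnable sketch in the spirit of the paper, not the authors' implementation: each "forward pass" both verifies the tokens drafted on the previous pass and, via a stand-in for the speculative streams trained with the n-gram objective, drafts the next few. The toy vocabulary, the GAMMA=4 draft length, the 70% draft-accuracy knob, and the names `forward` and `speculative_streams` are all illustrative assumptions.

```python
# Toy sketch of single-model speculative decoding (not the authors' code).
import random

V = 50          # toy vocabulary size
GAMMA = 4       # tokens drafted per forward pass

def forward(context, draft):
    """One (toy) forward pass over context + draft.

    Returns the model's greedy prediction after the context and after each
    drafted token -- len(draft) + 1 values.  The toy "model" always predicts
    (previous token + 1) mod V; a real target model would return logits here.
    """
    seq = context + draft
    return [(seq[len(context) - 1 + i] + 1) % V for i in range(len(draft) + 1)]

def speculative_streams(last_token, rng):
    """Stand-in for the n-gram draft heads: guess the next GAMMA tokens.

    Each guess matches the toy model's true continuation 70% of the time,
    so some drafts get rejected during verification.
    """
    out, cur = [], last_token
    for _ in range(GAMMA):
        cur = (cur + 1) % V
        out.append(cur if rng.random() < 0.7 else rng.randrange(V))
    return out

def generate(prompt, max_new_tokens=20, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt)
    draft = speculative_streams(tokens[-1], rng)
    produced = passes = 0
    while produced < max_new_tokens:
        preds = forward(tokens, draft)           # verifies the draft, yields 1 free token
        passes += 1
        n_ok = 0
        while n_ok < len(draft) and draft[n_ok] == preds[n_ok]:
            n_ok += 1                            # longest draft prefix the model agrees with
        accepted = draft[:n_ok] + [preds[n_ok]]  # accepted prefix + the model's own token
        tokens += accepted
        produced += len(accepted)
        draft = speculative_streams(tokens[-1], rng)  # draft again from the new tail
    return tokens, produced, passes

if __name__ == "__main__":
    _, n_tokens, n_passes = generate([1, 2, 3])
    print(f"{n_tokens} tokens in {n_passes} forward passes "
          f"({n_tokens / n_passes:.2f} tokens per pass)")
```

Run as a script, it reports more than one token generated per forward pass whenever drafted tokens are accepted, which is where the reported speed-up comes from.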
Related papers
- Efficient and Versatile Robust Fine-Tuning of Zero-shot Models [34.27380518351181]
We introduce Robust Adapter (R-Adapter), a novel method for fine-tuning zero-shot models to downstream tasks.
Our method integrates lightweight modules into the pre-trained model and employs novel self-ensemble techniques to boost OOD robustness and reduce storage expenses substantially.
Our experiments demonstrate that R-Adapter achieves state-of-the-art performance across a diverse set of tasks, tuning only 13% of the parameters of the CLIP encoders.
arXiv Detail & Related papers (2024-08-11T11:37:43Z)
- Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion [59.17158389902231]
Speculative decoding has emerged as a widely adopted method to accelerate large language model inference.
This paper proposes an adaptation of speculative decoding which uses discrete diffusion models to generate draft sequences.
arXiv Detail & Related papers (2024-08-10T21:24:25Z)
- Graph-Structured Speculative Decoding [52.94367724136063]
Speculative decoding has emerged as a promising technique to accelerate the inference of Large Language Models.
We introduce an innovative approach utilizing a directed acyclic graph (DAG) to manage the drafted hypotheses.
We observe a remarkable speedup of 1.73x to 1.96x, significantly surpassing standard speculative decoding.
arXiv Detail & Related papers (2024-07-23T06:21:24Z)
- Adaptive Draft-Verification for Efficient Large Language Model Decoding [24.347886232342862]
Large language model (LLM) decoding involves generating a sequence of tokens based on a given context.
The typical autoregressive decoding method requires a separate forward pass through the model for each token generated.
We introduce ADED, which accelerates LLM decoding without requiring fine-tuning.
arXiv Detail & Related papers (2024-06-27T22:20:39Z)
- Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens [15.566726645722657]
We propose a novel framework specifically designed for speculative sampling.
Within this framework, we introduce a lightweight draft model that effectively utilizes previously generated tokens to predict subsequent words.
We demonstrate impressive results, achieving an average latency speedup ratio of 2.7x compared to the vanilla auto-regressive decoding approach.
arXiv Detail & Related papers (2024-02-24T08:10:39Z)
- GliDe with a CaPE: A Low-Hassle Method to Accelerate Speculative Decoding [81.01996600734616]
We introduce GliDe and CaPE, two low-hassle modifications to vanilla speculative decoding.
GliDe is a modified draft model architecture that reuses the cached keys and values from the target LLM.
We will release our code, data, and the trained draft models.
arXiv Detail & Related papers (2024-02-03T08:44:11Z)
- Multi-Candidate Speculative Decoding [82.05519287513444]
Large language models have shown impressive capabilities across a variety of NLP tasks, yet generating text autoregressively is time-consuming.
One way to speed them up is speculative decoding, which generates candidate segments with a fast draft model that are then verified in parallel by the target model.
This paper proposes sampling multiple candidates from a draft model and then organising them in batches for verification.
We design algorithms for efficient multi-candidate verification while maintaining the distribution of the target model; the standard acceptance rule behind this guarantee is sketched after this list.
arXiv Detail & Related papers (2024-01-12T17:15:23Z)
- Cascade Speculative Drafting for Even Faster LLM Inference [25.642604897018852]
Speculative decoding improves the efficiency of large language model (LLM) inference.
We introduce Cascade Speculative Drafting (CS Drafting), a speculative execution algorithm that incorporates two types of cascades.
CS Drafting achieves up to an 81 percent additional speedup over speculative decoding in our experiments.
arXiv Detail & Related papers (2023-12-18T18:59:46Z)
- DistillSpec: Improving Speculative Decoding via Knowledge Distillation [70.61777015900272]
Speculative decoding (SD) accelerates large language model inference by employing a faster draft model for generating multiple tokens.
We propose DistillSpec that uses knowledge distillation to better align the draft model with the target model, before applying SD.
We show that DistillSpec yields impressive 10 - 45% speedups over standard SD on a range of standard benchmarks.
arXiv Detail & Related papers (2023-10-12T16:21:04Z)
- Online Speculative Decoding [34.987825705622555]
We introduce online speculative decoding to accelerate the inference of large language models.
The main idea is to continuously update the (multiple) draft model(s) on observed user query data.
We develop a prototype of online speculative decoding based on knowledge distillation and evaluate it using both synthetic and real query data.
arXiv Detail & Related papers (2023-10-11T04:03:42Z)
- Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation [80.2267931231335]
We propose Speculative Decoding (SpecDec) to study exploiting the idea of speculative execution to accelerate autoregressive (AR) decoding.
SpecDec has two innovations: Spec-Drafter -- an independent model specially optimized for efficient drafting, and Spec-Verification -- a reliable method for verifying the drafted tokens efficiently.
arXiv Detail & Related papers (2022-03-30T17:27:09Z)
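Several of the entries above (Multi-Candidate Speculative Decoding, Chimera, DistillSpec) rely on the standard speculative-sampling acceptance rule, which keeps the output distribution identical to sampling from the target model alone. The sketch below shows that rule; the three-token distributions and the helper name `accept_or_resample` are made-up illustrations, not any one paper's API.

```python
# Standard speculative-sampling acceptance rule (lossless verification sketch).
import random

def accept_or_resample(draft_token, p_draft, p_target, rng):
    """Accept draft_token with probability min(1, p_target/p_draft);
    otherwise resample from the normalized positive part of (p_target - p_draft).

    p_draft and p_target are per-token probability lists over the vocabulary.
    """
    q, p = p_draft[draft_token], p_target[draft_token]
    if rng.random() < min(1.0, p / q):
        return draft_token, True
    residual = [max(pt - pd, 0.0) for pt, pd in zip(p_target, p_draft)]
    z = sum(residual)
    if z == 0.0:                        # identical distributions: nothing to correct
        return draft_token, True
    r, acc = rng.random() * z, 0.0
    for tok, w in enumerate(residual):
        acc += w
        if r <= acc:
            return tok, False
    return len(residual) - 1, False     # numerical safety net

if __name__ == "__main__":
    rng = random.Random(0)
    p_draft  = [0.70, 0.20, 0.10]       # draft model's next-token distribution (made up)
    p_target = [0.30, 0.50, 0.20]       # target model's next-token distribution (made up)
    counts = [0, 0, 0]
    for _ in range(200_000):
        proposal = rng.choices(range(3), weights=p_draft)[0]
        token, _accepted = accept_or_resample(proposal, p_draft, p_target, rng)
        counts[token] += 1
    total = sum(counts)
    print([round(c / total, 3) for c in counts])   # close to p_target
```

Running the script, the empirical frequencies come out close to `p_target` even though every proposal is drawn from `p_draft`, which is why these draft-then-verify methods can claim lossless acceleration.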
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.