S2D: Sorted Speculative Decoding For More Efficient Deployment of Nested Large Language Models
- URL: http://arxiv.org/abs/2407.01955v1
- Date: Tue, 2 Jul 2024 05:14:15 GMT
- Title: S2D: Sorted Speculative Decoding For More Efficient Deployment of Nested Large Language Models
- Authors: Parsa Kavehzadeh, Mohammadreza Pourreza, Mojtaba Valipour, Tianshu Zhu, Haoli Bai, Ali Ghodsi, Boxing Chen, Mehdi Rezagholizadeh
- Abstract summary: We introduce a novel multi-target scenario for the deployment of draft models for faster inference.
We present a novel, more efficient sorted speculative decoding mechanism that outperforms regular baselines in multi-target settings.
- Score: 32.68002253527712
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Deployment of autoregressive large language models (LLMs) is costly, and as these models increase in size, the associated costs will become even more considerable. Consequently, different methods have been proposed to accelerate the token generation process and reduce costs. Speculative decoding (SD) is among the most promising approaches to speed up the LLM decoding process by verifying multiple tokens in parallel and using an auxiliary smaller draft model to generate the possible tokens. In SD, usually, one draft model is used to serve a specific target model; however, in practice, LLMs are diverse, and we might need to deal with many target models or more than one target model simultaneously. In this scenario, it is not clear which draft model should be used for which target model, and searching among different draft models or training customized draft models can further increase deployment costs. In this paper, we first introduce a novel multi-target scenario for the deployment of draft models for faster inference. Then, we present a novel, more efficient sorted speculative decoding mechanism that outperforms regular baselines in multi-target settings. We evaluated our method on Spec-Bench in different settings, including base models such as Vicuna 7B, 13B, and LLama Chat 70B. Our results suggest that our draft models perform better than baselines for multiple target models at the same time.
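For readers unfamiliar with the mechanism the abstract builds on, the sketch below illustrates a plain speculative decoding loop: a small draft model proposes a few tokens, the target model verifies them, and the longest agreeing prefix is kept. This is a minimal greedy illustration, not the paper's S2D method; the `DraftModel`/`TargetModel` callables are hypothetical stand-ins, and a real implementation would verify all draft positions in a single parallel forward pass and use a rejection-sampling acceptance rule to preserve the target distribution.

```python
from typing import Callable, List

# Hypothetical interfaces: each model maps a token sequence to its greedily
# chosen next token. A real implementation would expose full next-token
# distributions and use a rejection-sampling acceptance rule.
DraftModel = Callable[[List[int]], int]
TargetModel = Callable[[List[int]], int]


def speculative_decode(prompt: List[int],
                       draft: DraftModel,
                       target: TargetModel,
                       gamma: int = 4,
                       max_new_tokens: int = 64) -> List[int]:
    """Greedy speculative decoding sketch: draft proposes, target verifies."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. Draft phase: the cheap model proposes gamma tokens autoregressively.
        proposal, ctx = [], list(tokens)
        for _ in range(gamma):
            nxt = draft(ctx)
            proposal.append(nxt)
            ctx.append(nxt)

        # 2. Verification phase: keep the longest prefix the target agrees with.
        #    (Shown sequentially for clarity; in practice all positions are
        #    scored in one parallel forward pass of the target model.)
        accepted = 0
        for i, tok in enumerate(proposal):
            if target(tokens + proposal[:i]) == tok:
                accepted += 1
            else:
                break
        tokens.extend(proposal[:accepted])

        # 3. The target always contributes one token, so progress is guaranteed
        #    even when every drafted token is rejected.
        tokens.append(target(tokens))
    return tokens
```

In the multi-target setting the paper introduces, a single `draft` would have to serve several different `target` models at once; that is the gap the proposed sorted (S2D) draft model aims to close.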
Related papers
- CARD: Cache-Assisted Parallel Speculative Decoding for Efficient Large Language Model Inference [19.14564724894706]
We propose a speculative decoding framework employing a 'query-and-correct' paradigm. CARD decouples drafting and verification: the draft model generates candidate tokens to populate a shared cache, while the target model concurrently rectifies the draft model's generation direction. Our approach achieves up to 4.83× speedup over vanilla decoding without requiring fine-tuning of either the draft or target models.
arXiv Detail & Related papers (2025-08-06T14:02:10Z) - OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding [8.589209709453026]
We propose OmniDraft, a unified framework that enables a single draft model to operate with any target model. We introduce an online n-gram cache with hybrid distillation fine-tuning to address the cross-vocabulary mismatch across draft and target models. We showcase the proficiency of the framework by performing online learning on math reasoning, coding and text generation tasks.
arXiv Detail & Related papers (2025-07-03T14:20:41Z) - Mamba Drafters for Speculative Decoding [58.080550222549064]
We introduce novel drafters based on Mamba, a state-of-the-art state space model (SSM). By leveraging the linear structure of SSMs, our approach avoids the quadratic complexity inherent in traditional Transformer-based methods. We further enhance efficiency with a novel test-time tree search algorithm for generating high-quality draft candidates.
arXiv Detail & Related papers (2025-06-01T22:52:47Z) - Unifying Multimodal Large Language Model Capabilities and Modalities via Model Merging [103.98582374569789]
Model merging aims to combine multiple expert models into a single model, thereby reducing storage and serving costs. Previous studies have primarily focused on merging visual classification models or Large Language Models (LLMs) for code and math tasks. We introduce the model merging benchmark for MLLMs, which includes multiple tasks such as VQA, Geometry, Chart, OCR, and Grounding, providing both LoRA and full fine-tuning models.
arXiv Detail & Related papers (2025-05-26T12:23:14Z) - Leveraging Importance Sampling to Detach Alignment Modules from Large Language Models [50.19188692497892]
Traditional alignment methods often require retraining large pretrained models. We propose a novel Residual Alignment Model (RAM) that formalizes the alignment process as a type of importance sampling. We develop a resampling algorithm with iterative token-level decoding to address the common first-token latency issue in comparable methods.
arXiv Detail & Related papers (2025-05-26T08:53:02Z) - MoD: A Distribution-Based Approach for Merging Large Language Models [0.0]
Large language models (LLMs) have enabled the development of numerous specialized, task-specific variants.
We propose the Mixture of Distributions (MoD) framework, a novel approach for merging LLMs.
Unlike traditional weight-averaging methods, MoD effectively preserves the specialized capabilities of individual models.
arXiv Detail & Related papers (2024-11-01T07:05:29Z) - ParallelSpec: Parallel Drafter for Efficient Speculative Decoding [62.68430939686566]
We present ParallelSpec, an alternative to auto-regressive drafting strategies in state-of-the-art speculative decoding approaches.
In contrast to auto-regressive drafting in the speculative stage, we train a parallel drafter to serve as an efficient speculative model.
arXiv Detail & Related papers (2024-10-08T01:05:08Z) - Improving Multi-candidate Speculative Decoding [1.6291177798903276]
Speculative Decoding (SD) is a technique to accelerate the inference of Large Language Models (LLMs).
In this work, we introduce a new version of MCSD that includes a target model multi-candidate generation.
We also evaluate the effects of using the target model multi-candidate process with different draft models on output quality.
arXiv Detail & Related papers (2024-09-16T18:20:38Z) - Context-Aware Assistant Selection for Improved Inference Acceleration with Large Language Models [28.62382804829694]
Large language models (LLMs) remain prohibitively expensive to use under resource constraints.
The high latency of auto-regressive generation makes large LLMs dependent on advanced computing infrastructure.
Assisted decoding has helped alleviate this, but it remains dependent on alignment between the two models.
arXiv Detail & Related papers (2024-08-16T01:12:21Z) - Graph-Structured Speculative Decoding [52.94367724136063]
Speculative decoding has emerged as a promising technique to accelerate the inference of Large Language Models.
We introduce an innovative approach utilizing a directed acyclic graph (DAG) to manage the drafted hypotheses.
We observe a remarkable speedup of 1.73× to 1.96×, significantly surpassing standard speculative decoding.
arXiv Detail & Related papers (2024-07-23T06:21:24Z) - EMR-Merging: Tuning-Free High-Performance Model Merging [55.03509900949149]
We show that Elect, Mask & Rescale-Merging (EMR-Merging) achieves outstanding performance compared to existing merging methods.
EMR-Merging is tuning-free, thus requiring no data availability or any additional training while showing impressive performance.
arXiv Detail & Related papers (2024-05-23T05:25:45Z) - Multi-Candidate Speculative Decoding [82.05519287513444]
Large language models have shown impressive capabilities across a variety of NLP tasks, yet generating text autoregressively with them is time-consuming.
One way to speed them up is speculative decoding, which generates candidate segments from a fast draft model that are then verified in parallel by the target model.
This paper proposes sampling multiple candidates from a draft model and then organising them in batches for verification (a simplified sketch of this batched verification idea appears after this list).
We design algorithms for efficient multi-candidate verification while maintaining the distribution of the target model.
arXiv Detail & Related papers (2024-01-12T17:15:23Z) - DistillSpec: Improving Speculative Decoding via Knowledge Distillation [70.61777015900272]
Speculative decoding (SD) accelerates large language model inference by employing a faster draft model for generating multiple tokens.
We propose DistillSpec, which uses knowledge distillation to better align the draft model with the target model before applying SD.
We show that DistillSpec yields impressive 10-45% speedups over standard SD on a range of standard benchmarks.
arXiv Detail & Related papers (2023-10-12T16:21:04Z) - When Ensembling Smaller Models is More Efficient than Single Large Models [52.38997176317532]
We show that ensembles can outperform single models, achieving higher accuracy while requiring fewer total FLOPs to compute.
This presents an interesting observation that output diversity in ensembling can often be more efficient than training larger models.
arXiv Detail & Related papers (2020-05-01T18:56:18Z)
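Several of the entries above, notably Multi-Candidate Speculative Decoding and Graph-Structured Speculative Decoding, batch multiple draft continuations and let the target model choose among them during verification. The sketch below is a minimal greedy version of that verification step, under assumed interfaces (the `TargetModel` callable is hypothetical); it is not the acceptance rule from those papers, which is designed to preserve the target model's output distribution exactly.

```python
from typing import Callable, List

# Hypothetical interface: the target maps a token sequence to its greedily
# chosen next token; the cited papers instead compare full distributions so
# that the target model's sampling distribution is preserved exactly.
TargetModel = Callable[[List[int]], int]


def verify_candidates(prefix: List[int],
                      candidates: List[List[int]],
                      target: TargetModel) -> List[int]:
    """Keep the candidate whose prefix the target agrees with for longest."""
    best_span: List[int] = []
    for cand in candidates:
        accepted = 0
        for i, tok in enumerate(cand):
            # Accept token i only if the target would also emit it here.
            if target(prefix + cand[:i]) == tok:
                accepted += 1
            else:
                break
        if accepted > len(best_span):
            best_span = cand[:accepted]

    out = prefix + best_span
    out.append(target(out))  # the target always supplies one extra token
    return out
```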