POSS: Position Specialist Generates Better Draft for Speculative Decoding
- URL: http://arxiv.org/abs/2506.03566v1
- Date: Wed, 04 Jun 2025 04:30:30 GMT
- Title: POSS: Position Specialist Generates Better Draft for Speculative Decoding
- Authors: Langlin Huang, Chengsong Huang, Jixuan Leng, Di Huang, Jiaxin Huang
- Abstract summary: Speculative decoding accelerates Large Language Model (LLM) inference. Recent studies leverage the hidden state of the target model to enhance draft model prediction accuracy. We propose Position Specialists (PosS), which consist of multiple position-specialized draft layers to generate tokens at assigned position(s).
- Score: 29.847485296491204
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Speculative decoding accelerates Large Language Model (LLM) inference by using a small draft model to predict multiple tokens, and a large target model to verify these tokens in parallel. Recent studies leverage the hidden state of the target model to enhance draft model prediction accuracy. However, existing methods suffer from the degrading quality of draft token predictions at later positions, due to error accumulation in draft model generated features. In this paper, we propose Position Specialists (PosS), which consist of multiple position-specialized draft layers to generate tokens at assigned position(s). Position specialists greatly improve token acceptance rate at later positions per drafting round, as each specialist only needs to focus on handling a certain level of draft model feature deviation. Experimental results on Llama-3-8B-Instruct and Llama-2-13B-chat across six datasets demonstrate that PosS effectively improves over baselines on average acceptance length and speed-up ratio. Our codebase is available at https://github.com/shrango/PosS.
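As a concrete illustration of the draft-then-verify loop and the per-position draft heads described in the abstract, here is a minimal, runnable toy sketch. The linear heads, the feature-transition module, and the greedy verification rule are illustrative assumptions, not the authors' architecture; the real implementation is in the linked codebase.

```python
# Toy sketch of one drafting round with position-specialized draft heads.
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB, DIM, GAMMA = 100, 32, 4

target = nn.Linear(DIM, VOCAB)       # toy stand-in for the target model's head
heads = nn.ModuleList(nn.Linear(DIM, VOCAB) for _ in range(GAMMA))  # one specialist per position
feature_step = nn.Linear(DIM, DIM)   # toy draft feature transition

def draft_and_verify(h):
    """Draft GAMMA tokens, one position specialist each, then verify greedily."""
    feats, drafts = [], []
    for pos in range(GAMMA):
        # Each specialist only handles the feature deviation typical of its
        # assigned position, the paper's key idea for later-position accuracy.
        drafts.append(heads[pos](h).argmax(-1))
        feats.append(h)
        h = torch.tanh(feature_step(h))  # feature drift accumulates per step
    # Parallel verification: the target scores all drafted positions at once.
    verified = target(torch.stack(feats)).argmax(-1)
    accepted = []
    for d, v in zip(drafts, verified):
        if d.item() != v.item():
            break                        # first rejection ends the round
        accepted.append(d.item())
    return accepted

print(draft_and_verify(torch.randn(DIM)))
```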
Related papers
- GRIFFIN: Effective Token Alignment for Faster Speculative Decoding [52.905060461479856]
GRIFFIN is a novel framework that incorporates a token-alignable training strategy and a token-alignable draft model. Experiments on LLaMA, Vicuna, Qwen and Mixtral models demonstrate that GRIFFIN achieves an average acceptance length improvement of over 8% and a speedup ratio exceeding 7%.
arXiv Detail & Related papers (2025-02-16T07:06:00Z)
- Jakiro: Boosting Speculative Decoding with Decoupled Multi-Head via MoE [15.003006630308517]
Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to predict multiple tokens. We propose Jakiro, leveraging Mixture of Experts (MoE), where independent experts generate diverse predictions. Our method significantly boosts prediction accuracy and achieves higher inference speedups.
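A toy sketch of the MoE-style drafting idea summarized above: a router dispatches the hidden state to independent experts, each contributing a candidate prediction. The shapes, the top-k routing, and the greedy decoding are assumptions for illustration, not Jakiro's actual design.

```python
# Toy MoE draft head: independent experts yield diverse draft candidates.
import torch
import torch.nn as nn

torch.manual_seed(0)
DIM, VOCAB, N_EXPERTS = 32, 100, 4
experts = nn.ModuleList(nn.Linear(DIM, VOCAB) for _ in range(N_EXPERTS))
router = nn.Linear(DIM, N_EXPERTS)

def moe_draft(h, top_k=2):
    """Route the hidden state to top_k experts; each yields one candidate."""
    weights = router(h).softmax(-1)
    picked = weights.topk(top_k).indices
    # Decoupled experts predict independently, giving the verifier a set of
    # diverse draft candidates rather than a single guess.
    return [experts[int(i)](h).argmax(-1).item() for i in picked]

print(moe_draft(torch.randn(DIM)))
```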
arXiv Detail & Related papers (2025-02-10T09:24:06Z)
- AdaEDL: Early Draft Stopping for Speculative Decoding of Large Language Models via an Entropy-based Lower Bound on Token Acceptance Probability [5.421949344085942]
We show that AdaEDL consistently outperforms static draft-length speculative decoding by 10%-57%.
We also show that AdaEDL is more robust than these techniques and preserves performance in high-temperature scenarios.
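A hedged sketch of the entropy-based stopping idea: drafting halts once an entropy-derived estimate of token acceptance probability falls below a threshold. The exp(-entropy) mapping and the threshold value here are simple heuristic stand-ins; AdaEDL derives a principled lower bound that differs in form.

```python
# Heuristic entropy-based early-stop criterion for drafting.
import torch

def should_stop(draft_logits, threshold=0.4):
    """Stop drafting when the estimated acceptance probability is too low."""
    probs = draft_logits.softmax(-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1)
    # Flatter (higher-entropy) draft distributions are less likely to be
    # accepted by the target, so cut the draft round short.
    return bool(torch.exp(-entropy) < threshold)

torch.manual_seed(0)
print(should_stop(torch.randn(100)))                    # flat logits -> True
print(should_stop(torch.tensor([10.0] + [0.0] * 99)))   # sharp logits -> False
```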
arXiv Detail & Related papers (2024-10-24T01:13:43Z)
- ParallelSpec: Parallel Drafter for Efficient Speculative Decoding [62.68430939686566]
We present ParallelSpec, an alternative to auto-regressive drafting strategies in state-of-the-art speculative decoding approaches.
In contrast to auto-regressive drafting in the speculative stage, we train a parallel drafter to serve as an efficient speculative model.
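A minimal sketch of the non-autoregressive drafting idea: one drafter forward pass emits logits for several future positions at once, removing the token-by-token loop. The toy drafter with K output slots is an assumption, not ParallelSpec's trained architecture.

```python
# Toy parallel drafter: K draft tokens from a single forward pass.
import torch
import torch.nn as nn

torch.manual_seed(0)
DIM, VOCAB, K = 32, 100, 4
drafter = nn.Linear(DIM, K * VOCAB)     # toy drafter with K parallel output slots

def parallel_draft(h):
    """Predict K draft tokens in a single forward pass, no sequential loop."""
    logits = drafter(h).view(K, VOCAB)  # one logit row per future position
    return logits.argmax(-1).tolist()   # all K draft tokens at once

print(parallel_draft(torch.randn(DIM)))
```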
arXiv Detail & Related papers (2024-10-08T01:05:08Z)
- Revisiting SMoE Language Models by Evaluating Inefficiencies with Task Specific Expert Pruning [78.72226641279863]
Sparse Mixture of Experts (SMoE) models have emerged as a scalable alternative to dense models in language modeling.
Our research explores task-specific model pruning to inform decisions about designing SMoE architectures.
We introduce an adaptive task-aware pruning technique UNCURL to reduce the number of experts per MoE layer in an offline manner post-training.
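A toy sketch of offline, task-aware expert pruning in this spirit: rank experts by how often the router selects them on task data, then retain only the most-used ones. The top-1 usage-counting criterion is an assumption; UNCURL's actual pruning signal may differ.

```python
# Offline expert pruning by routing usage on task-specific data.
import torch
import torch.nn as nn

torch.manual_seed(0)
DIM, N_EXPERTS, KEEP = 32, 8, 4
router = nn.Linear(DIM, N_EXPERTS)

def prune_experts(task_batch, keep=KEEP):
    """Return indices of the `keep` experts most used on the task batch."""
    with torch.no_grad():
        choices = router(task_batch).argmax(-1)       # top-1 routing decisions
        usage = torch.bincount(choices, minlength=N_EXPERTS)
    return usage.topk(keep).indices.tolist()          # experts to retain offline

print(prune_experts(torch.randn(256, DIM)))
```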
arXiv Detail & Related papers (2024-09-02T22:35:03Z)
- PEARL: Parallel Speculative Decoding with Adaptive Draft Length [12.166703341906242]
We propose a conceptually simple, flexible, and general framework to boost speculative decoding, namely Parallel spEculative decoding with Adaptive dRaft Length (PEARL). PEARL proposes pre-verify to verify the first draft token in advance during the drafting phase, and post-verify to generate more draft tokens during the verification phase. Experiments on various text generation benchmarks demonstrate the effectiveness of PEARL, yielding speedups of up to 4.43$\times$ and 1.50$\times$ over auto-regressive decoding and vanilla speculative decoding, respectively.
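A control-flow sketch of the two ideas named above, with hypothetical `draft_next` and `target_check` callables. Real PEARL overlaps drafting and verification on parallel hardware; this sequential toy only shows how pre-verify can abandon a round early and how post-verify lets the draft length adapt.

```python
# Sequential toy of PEARL-style adaptive draft length.
def pearl_round(draft_next, target_check, prefix, max_draft=8):
    """Adaptive draft length via pre-verify and post-verify checks."""
    first = draft_next(prefix)
    if not target_check(prefix, first):    # pre-verify the very first draft token
        return []                          # abandon the round early
    drafts = [first]
    while len(drafts) < max_draft:         # post-verify: drafting continues while
        tok = draft_next(prefix + drafts)  # verification proceeds (in parallel
        if not target_check(prefix + drafts, tok):  # in the real system)
            break
        drafts.append(tok)
    return drafts

# Toy usage: a "target" that accepts only even tokens.
print(pearl_round(lambda p: len(p) % 3, lambda p, t: t % 2 == 0, [0, 1]))
```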
arXiv Detail & Related papers (2024-08-13T08:32:06Z)
- Decoding Speculative Decoding [4.56754610152086]
Speculative Decoding is a technique to speed up inference for Large Language Models without sacrificing quality. We study over 350 experiments with LLaMA-65B and OPT-66B using speculative decoding. Our newly designed draft model can provide 111% higher throughput than existing draft models.
arXiv Detail & Related papers (2024-02-02T16:15:24Z)
- Multi-Candidate Speculative Decoding [82.05519287513444]
Large language models have shown impressive capabilities across a variety of NLP tasks, yet generating text autoregressively is time-consuming.
One way to speed them up is speculative decoding, which generates candidate segments from a fast draft model that is then verified in parallel by the target model.
This paper proposes sampling multiple candidates from a draft model and then organising them in batches for verification.
We design algorithms for efficient multi-candidate verification while maintaining the distribution of the target model.
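A simplified sketch of the multi-candidate step: sample several candidates from the draft model and verify them against the target together. The greedy acceptance rule here is a deliberate simplification; the paper's verification algorithm is a sampling scheme that provably preserves the target distribution.

```python
# Simplified multi-candidate draft-and-verify step.
import torch

torch.manual_seed(0)
VOCAB, K = 100, 4

def multi_candidate_step(draft_probs, target_probs):
    """Sample K candidates from the draft; accept one the target agrees with."""
    candidates = torch.multinomial(draft_probs, K, replacement=True)
    target_tok = target_probs.argmax(-1)
    for c in candidates:                   # verified as one batch in practice
        if c == target_tok:
            return int(c)                  # an accepted draft candidate
    return int(target_tok)                 # all rejected: emit the target's token

draft_probs = torch.softmax(torch.randn(VOCAB), -1)
target_probs = torch.softmax(torch.randn(VOCAB), -1)
print(multi_candidate_step(draft_probs, target_probs))
```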
arXiv Detail & Related papers (2024-01-12T17:15:23Z)
- DistillSpec: Improving Speculative Decoding via Knowledge Distillation [70.61777015900272]
Speculative decoding (SD) accelerates large language model inference by employing a faster draft model for generating multiple tokens.
We propose DistillSpec that uses knowledge distillation to better align the draft model with the target model, before applying SD.
We show that DistillSpec yields impressive 10 - 45% speedups over standard SD on a range of standard benchmarks.
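A minimal sketch of the distillation step: minimize a KL divergence between the target's and draft's token distributions before running speculative decoding. The toy linear "models", optimizer settings, and forward-KL choice are assumptions; DistillSpec studies several divergences and data regimes.

```python
# One knowledge-distillation update aligning a draft model with a target.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
DIM, VOCAB = 32, 100
target = nn.Linear(DIM, VOCAB)   # frozen stand-in for the target model
draft = nn.Linear(DIM, VOCAB)    # draft model being aligned
opt = torch.optim.Adam(draft.parameters(), lr=1e-3)

def distill_step(batch):
    """One KD update: pull the draft's token distribution toward the target's."""
    with torch.no_grad():
        t_logprobs = F.log_softmax(target(batch), dim=-1)
    d_logprobs = F.log_softmax(draft(batch), dim=-1)
    loss = F.kl_div(d_logprobs, t_logprobs, log_target=True,
                    reduction="batchmean")   # KL(target || draft)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

print(distill_step(torch.randn(64, DIM)))
```

Better draft/target alignment directly raises the draft token acceptance rate, which is what produces the reported speedups.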
arXiv Detail & Related papers (2023-10-12T16:21:04Z)
- Online Speculative Decoding [34.987825705622555]
We introduce online speculative decoding to accelerate the inference of large language models.
The main idea is to continuously update the (multiple) draft model(s) on observed user query data.
We develop a prototype of online speculative decoding based on knowledge distillation and evaluate it using both synthetic and real query data.
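A sketch of the online update loop described above: serve queries with speculative decoding while periodically distilling the target's behavior on recent queries into the draft model. The replay buffer and update schedule are assumptions; `distill_step` stands for a KD update like the one sketched for DistillSpec above.

```python
# Online draft-model updating on observed user queries.
from collections import deque

def serve(queries, distill_step, speculative_decode, update_every=32):
    """Serve queries while periodically distilling on recent ones."""
    buffer = deque(maxlen=1024)           # replay buffer of recent user queries
    for i, query in enumerate(queries, 1):
        yield speculative_decode(query)   # answer the request as usual
        buffer.append(query)
        if i % update_every == 0:         # periodic online distillation pass
            for sample in buffer:
                distill_step(sample)

# Toy usage with stub distillation and decoding functions:
outputs = list(serve(range(64), lambda q: None, lambda q: q * 2))
print(outputs[:4])
```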
arXiv Detail & Related papers (2023-10-11T04:03:42Z)
- A Frustratingly Easy Improvement for Position Embeddings via Random Padding [68.75670223005716]
In this paper, we propose a simple but effective strategy, Random Padding, without any modifications to existing pre-trained language models.
Experiments show that Random Padding can significantly improve model performance on the instances whose answers are located at rear positions.
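A minimal sketch of the Random Padding idea: move a random share of each training sequence's padding to the front so that answer spans are seen at varied absolute positions. The uniform split is an illustrative assumption; the paper's exact recipe may differ.

```python
# Random Padding: randomize absolute positions by front-loading some padding.
import random

def random_pad(tokens, max_len, pad_id=0):
    """Pad to max_len, placing a random share of the padding at the front."""
    n_pad = max_len - len(tokens)
    front = random.randint(0, n_pad)      # randomizes absolute token positions
    return [pad_id] * front + tokens + [pad_id] * (n_pad - front)

random.seed(0)
print(random_pad([5, 6, 7], max_len=8))
```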
arXiv Detail & Related papers (2023-05-08T17:08:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences of its use.