Related papers: OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding

OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding

URL: http://arxiv.org/abs/2507.02659v2
Date: Thu, 31 Jul 2025 21:00:28 GMT
Title: OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding
Authors: Ramchalam Kinattinkara Ramakrishnan, Zhaocong Yuan, Shaojie Zhuo, Chen Feng, Yicheng Lin, Chenzheng Su, Xiaopeng Zhang,
Abstract summary: We propose OmniDraft, a unified framework that enables a single draft model to operate with any target model.<n>We introduce an online n-gram cache with hybrid distillation fine-tuning to address the cross-vocabulary mismatch across draft and target models.<n>We showcase the proficiency of the framework by performing online learning on math reasoning, coding and text generation tasks.
Score: 8.589209709453026
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Speculative decoding generally dictates having a small, efficient draft model that is either pretrained or distilled offline to a particular target model series, for instance, Llama or Qwen models. However, within online deployment settings, there are two major challenges: 1) usage of a target model that is incompatible with the draft model; 2) expectation of latency improvements over usage and time. In this work, we propose OmniDraft, a unified framework that enables a single draft model to operate with any target model and adapt dynamically to user data. We introduce an online n-gram cache with hybrid distillation fine-tuning to address the cross-vocabulary mismatch across draft and target models; and further improve decoding speed by leveraging adaptive drafting techniques. OmniDraft is particularly suitable for on-device LLM applications where model cost, efficiency and user customization are the major points of contention. This further highlights the need to tackle the above challenges and motivates the \textit{``one drafter for all''} paradigm. We showcase the proficiency of the OmniDraft framework by performing online learning on math reasoning, coding and text generation tasks. Notably, OmniDraft enables a single Llama-68M model to pair with various target models including Vicuna-7B, Qwen2-7B and Llama3-8B models for speculative decoding; and additionally provides up to 1.5-2x speedup.

Related papers

Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter [52.111923076688505]
Training Large Language Models (LLMs) with strong reasoning capabilities marks a significant milestone, unlocking new frontiers in complex problem-solving.<n>We propose TLT, a system that accelerates reasoning RL training losslessly by integrating adaptive speculative decoding.
arXiv Detail & Related papers (2025-11-20T18:59:25Z)
TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs [12.056664630923896]
Speculative decoding substantially improves inference efficiency.<n>It is limited by a fundamental constraint: the draft and target models must share the same vocabulary.<n>We propose the algorithm TokenTiming for universal speculative decoding.
arXiv Detail & Related papers (2025-10-17T11:25:36Z)
CARD: Cache-Assisted Parallel Speculative Decoding for Efficient Large Language Model Inference [19.14564724894706]
We propose a speculative decoding framework employing a 'query-and-correct' paradigm.<n> CARD decouples drafting and verification: the draft model generates candidate tokens to populate a shared cache, while the target model concurrently rectifies the draft model's generation direction.<n>Our approach achieves up to 4.83 speedup over vanilla decoding without requiring fine-tuning of either the draft or target models.
arXiv Detail & Related papers (2025-08-06T14:02:10Z)
Mamba Drafters for Speculative Decoding [58.080550222549064]
We introduce novel drafters based on Mamba, a state-of-the-art state space model (SSM)<n>By leveraging the linear structure of SSMs, our approach avoids the quadratic complexity inherent in traditional Transformer-based methods.<n>We further enhance efficiency with a novel test-time tree search algorithm for generating high-quality draft candidates.
arXiv Detail & Related papers (2025-06-01T22:52:47Z)
DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting [59.57151419673759]
Speculative decoding presents a draft-then-verify framework that reduces generation latency while maintaining output distribution fidelity.<n>We propose DuoDecoding, a novel approach that strategically deploys the draft and target models on the CPU and GPU respectively.<n>Our method incorporates a hardware-aware optimal draft budget to minimize idle times and employs dynamic multi-sequence drafting to enhance draft quality.
arXiv Detail & Related papers (2025-03-02T08:27:48Z)
ParallelSpec: Parallel Drafter for Efficient Speculative Decoding [62.68430939686566]
We present ParallelSpec, an alternative to auto-regressive drafting strategies in state-of-the-art speculative decoding approaches. In contrast to auto-regressive drafting in the speculative stage, we train a parallel drafter to serve as an efficient speculative model.
arXiv Detail & Related papers (2024-10-08T01:05:08Z)
S2D: Sorted Speculative Decoding For More Efficient Deployment of Nested Large Language Models [32.68002253527712]
We introduce a novel multi-target scenario for the deployment of draft models for faster inference. We present a novel, more efficient sorted speculative decoding mechanism that outperforms regular baselines in multi-target settings.
arXiv Detail & Related papers (2024-07-02T05:14:15Z)
Direct Alignment of Draft Model for Speculative Decoding with Chat-Fine-Tuned LLMs [11.245862832561176]
Training a high-quality draft model is required to enable inference acceleration via speculative decoding. We train Llama 2 Chat Drafter 115M, a draft model for Llama 2 Chat 7B or larger, with only 1.64% of the original size. Our results show that Llama 2 Chat Drafter 115M with speculative decoding achieves up to 2.3 block efficiency and 2.4$times$ speed-up.
arXiv Detail & Related papers (2024-02-29T19:55:06Z)
Towards Robust and Efficient Cloud-Edge Elastic Model Adaptation via Selective Entropy Distillation [56.79064699832383]
We establish a Cloud-Edge Elastic Model Adaptation (CEMA) paradigm in which the edge models only need to perform forward propagation. In our CEMA, to reduce the communication burden, we devise two criteria to exclude unnecessary samples from uploading to the cloud.
arXiv Detail & Related papers (2024-02-27T08:47:19Z)
DistillSpec: Improving Speculative Decoding via Knowledge Distillation [70.61777015900272]
Speculative decoding (SD) accelerates large language model inference by employing a faster draft model for generating multiple tokens. We propose DistillSpec that uses knowledge distillation to better align the draft model with the target model, before applying SD. We show that DistillSpec yields impressive 10 - 45% speedups over standard SD on a range of standard benchmarks.
arXiv Detail & Related papers (2023-10-12T16:21:04Z)
Online Speculative Decoding [34.987825705622555]
We introduce online speculative decoding to accelerate the inference of large language models. The main idea is to continuously update the (multiple) draft model(s) on observed user query data. We develop a prototype of online speculative decoding based on knowledge distillation and evaluate it using both synthetic and real query data.
arXiv Detail & Related papers (2023-10-11T04:03:42Z)
Speculative Decoding with Big Little Decoder [108.95187338417541]
Big Little Decoder (BiLD) is a framework that can improve inference efficiency and latency for a wide range of text generation applications. On an NVIDIA T4 GPU, our framework achieves a speedup of up to 2.12x speedup with minimal generation quality degradation. Our framework is fully plug-and-play and can be applied without any modifications in the training process or model architecture.
arXiv Detail & Related papers (2023-02-15T18:55:29Z)

This list is automatically generated from the titles and abstracts of the papers in this site.