Learning to Draft: Adaptive Speculative Decoding with Reinforcement Learning
- URL: http://arxiv.org/abs/2603.01639v1
- Date: Mon, 02 Mar 2026 09:17:48 GMT
- Title: Learning to Draft: Adaptive Speculative Decoding with Reinforcement Learning
- Authors: Jiebin Zhang, Zhenghan Yu, Liang Wang, Nan Yang, Eugene J. Yu, Zheng Li, Yifan Song, Dawei Zhu, Xingxing Zhang, Furu Wei, Sujian Li
- Abstract summary: We introduce Learning to Draft (LTD), a novel method that directly optimizes for the throughput of each draft-and-verify cycle. LTD achieves speedup ratios ranging from 2.24x to 4.32x, outperforming the state-of-the-art method Eagle3 by up to 36.4%.
- Score: 67.88087883391475
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speculative decoding accelerates large language model (LLM) inference by using a small draft model to generate candidate tokens for a larger target model to verify. The efficacy of this technique hinges on the trade-off between the time spent on drafting candidates and verifying them. However, current state-of-the-art methods rely on a static time allocation, while recent dynamic approaches optimize for proxy metrics like acceptance length, often neglecting the true time cost and treating the drafting and verification phases in isolation. To address these limitations, we introduce Learning to Draft (LTD), a novel method that directly optimizes the throughput of each draft-and-verify cycle. We formulate the problem as a reinforcement learning environment and train two co-adaptive policies to dynamically coordinate the draft and verification phases. This encourages the policies to adapt to each other and explicitly maximize decoding efficiency. We conducted extensive evaluations on five diverse LLMs and four distinct tasks. Our results show that LTD achieves speedup ratios ranging from 2.24x to 4.32x, outperforming the state-of-the-art method Eagle3 by up to 36.4%.
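The abstract's core idea is to reward each draft-and-verify cycle by its actual throughput (tokens accepted per unit of wall-clock time) rather than a proxy such as acceptance length. The toy sketch below illustrates that cycle and reward under stated assumptions; the stand-in models, the 70% acceptance probability, and the verification cost model are illustrative placeholders, not the paper's LTD implementation.

```python
# Minimal, self-contained sketch of one draft-and-verify cycle with a
# throughput-style reward. toy_draft_model / toy_target_model and their
# costs are assumptions for illustration, not the paper's method.
import random
import time

def toy_draft_model(context, k):
    """Cheap draft model: propose k candidate tokens (random ints here)."""
    return [random.randint(0, 9) for _ in range(k)]

def toy_target_model(context, candidates):
    """Expensive target model: accept the longest agreeing prefix."""
    time.sleep(0.001 * (len(candidates) + 1))  # verification cost grows with k
    accepted = []
    for tok in candidates:
        if random.random() < 0.7:              # assumed 70% per-token acceptance
            accepted.append(tok)
        else:
            break
    return accepted

def draft_and_verify_cycle(context, k):
    """One cycle: draft k tokens, verify them, reward = tokens accepted / second."""
    start = time.perf_counter()
    candidates = toy_draft_model(context, k)
    accepted = toy_target_model(context, candidates)
    elapsed = time.perf_counter() - start
    reward = len(accepted) / elapsed            # throughput of this cycle
    return accepted, reward

if __name__ == "__main__":
    context = [1, 2, 3]
    for k in (2, 4, 8):  # a learned policy would pick k adaptively per cycle
        _, reward = draft_and_verify_cycle(context, k)
        print(f"draft length {k}: reward (tokens/sec) = {reward:.1f}")
```

Because the reward divides accepted tokens by the cycle's measured time, a policy trained on it must balance drafting cost against verification cost, which is the trade-off a proxy metric like acceptance length ignores.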
Related papers
- Interaction-Grounded Learning for Contextual Markov Decision Processes with Personalized Feedback [59.287761696290865]
We propose a computationally efficient algorithm that achieves a sublinear regret guarantee for contextual episodic Markov Decision Processes (MDPs) with personalized feedback. We demonstrate the effectiveness of our method in learning personalized objectives from multi-turn interactions through experiments on both a synthetic episodic MDP and a real-world user booking dataset.
arXiv Detail & Related papers (2026-02-09T06:29:54Z) - TABED: Test-Time Adaptive Ensemble Drafting for Robust Speculative Decoding in LVLMs [14.030784220154151]
We propose Test-time Adaptive Batched Ensemble Drafting (TABED) for Large Vision-Language Models. TABED ensembles multiple drafts obtained via batch inference by leveraging deviations from past ground truths available in the speculative decoding (SD) setting. It achieves an average robust walltime speedup of 1.74x over autoregressive decoding and a 5% improvement over single drafting methods.
arXiv Detail & Related papers (2026-01-28T08:16:57Z) - Training-Free Loosely Speculative Decoding: Accepting Semantically Correct Drafts Beyond Exact Match [21.810129153556044]
Training-Free Loosely Speculative Decoding (FLy) is a novel method that loosens the rigid verification criterion. We show that FLy preserves more than 99% of the target model's accuracy while achieving an average 2.81x speedup.
arXiv Detail & Related papers (2025-11-28T08:23:30Z) - One-Step Generative Policies with Q-Learning: A Reformulation of MeanFlow [56.13949180229929]
We introduce a one-step generative policy for offline reinforcement learning that maps noise directly to actions via a residual reformulation of MeanFlow. Our method achieves strong performance in both offline and offline-to-online reinforcement learning settings.
arXiv Detail & Related papers (2025-11-17T06:34:17Z) - Alignment-Augmented Speculative Decoding with Alignment Sampling and Conditional Verification [48.17448109580635]
We present a training-free alignment-augmented speculative decoding algorithm. Our method achieves a mean acceptance length of up to 2.39 and speeds up generation by 2.23x.
arXiv Detail & Related papers (2025-05-19T14:55:41Z) - Automatic Task Detection and Heterogeneous LLM Speculative Decoding [1.0485739694839669]
We propose a speculative decoding algorithm tailored for downstream task optimization. It includes an automatic task partitioning and assignment method that categorizes downstream tasks into different sub-tasks. Experimental results demonstrate that the proposed method improves draft accuracy by 6% to 50% over vanilla speculative decoding.
arXiv Detail & Related papers (2025-05-13T14:16:12Z) - CORAL: Learning Consistent Representations across Multi-step Training with Lighter Speculative Drafter [9.631036588583248]
Speculative decoding is a powerful technique that accelerates Large Language Model (LLM) inference by leveraging a lightweight speculative draft model. Recent methods have tried to solve this issue by adopting a multi-step training strategy, but the complex inputs of different training steps make it harder for the draft model to converge. We propose CORAL, a novel framework that improves both accuracy and efficiency in speculative drafting.
arXiv Detail & Related papers (2025-02-24T06:28:26Z) - Bidirectional Decoding: Improving Action Chunking via Guided Test-Time Sampling [51.38330727868982]
We show how action chunking impacts the divergence between a learner and a demonstrator. We propose Bidirectional Decoding (BID), a test-time inference algorithm that bridges action chunking with closed-loop adaptation. Our method boosts the performance of two state-of-the-art generative policies across seven simulation benchmarks and two real-world tasks.
arXiv Detail & Related papers (2024-08-30T15:39:34Z) - OverPrompt: Enhancing ChatGPT through Efficient In-Context Learning [49.38867353135258]
We propose OverPrompt, leveraging the in-context learning capability of LLMs to handle multiple task inputs.
Our experiments show that OverPrompt can achieve cost-efficient zero-shot classification without causing significant detriment to task performance.
arXiv Detail & Related papers (2023-05-24T10:08:04Z)