Beat the long tail: Distribution-Aware Speculative Decoding for RL Training
- URL: http://arxiv.org/abs/2511.13841v1
- Date: Mon, 17 Nov 2025 19:02:12 GMT
- Title: Beat the long tail: Distribution-Aware Speculative Decoding for RL Training
- Authors: Zelei Shao, Vikranth Srivatsa, Sanjana Srivastava, Qingyang Wu, Alpay Ariyak, Xiaoxia Wu, Ameen Patel, Jue Wang, Percy Liang, Tri Dao, Ce Zhang, Yiying Zhang, Ben Athiwaratkun, Chenfeng Xu, Junxiong Wang
- Abstract summary: We propose DAS, a Distribution-Aware Speculative decoding framework that accelerates RL rollouts without altering model outputs. Experiments on math and code reasoning tasks show that DAS reduces rollout time by up to 50% while preserving identical training curves.
- Score: 75.75462952580796
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning (RL) post-training has become essential for aligning large language models (LLMs), yet its efficiency is increasingly constrained by the rollout phase, where long trajectories are generated token by token. We identify a major bottleneck, the long-tail distribution of rollout lengths, where a small fraction of long generations dominates wall-clock time, and a complementary opportunity: the availability of historical rollouts that reveal stable prompt-level patterns across training epochs. Motivated by these observations, we propose DAS, a Distribution-Aware Speculative decoding framework that accelerates RL rollouts without altering model outputs. DAS integrates two key ideas: an adaptive, nonparametric drafter built from recent rollouts using an incrementally maintained suffix tree, and a length-aware speculation policy that allocates more aggressive draft budgets to the long trajectories that dominate the makespan. This design exploits rollout history to sustain acceptance while balancing base- and token-level costs during decoding. Experiments on math and code reasoning tasks show that DAS reduces rollout time by up to 50% while preserving identical training curves, demonstrating that distribution-aware speculative decoding can significantly accelerate RL post-training without compromising learning quality.
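The two mechanisms the abstract names are concrete enough to sketch. Below is a minimal, illustrative Python sketch of both: a nonparametric drafter that proposes tokens by matching the current suffix against recent rollout history (the paper maintains an incremental suffix tree; the n-gram index here is a simplification of the same longest-suffix-match idea), and a length-aware policy that grants larger draft budgets to already-long trajectories. All names and parameters here (HistoryDrafter, draft_budget, long_threshold, the linear budget ramp) are assumptions for illustration, not the authors' code.

```python
# Illustrative sketch only: DAS maintains an incremental suffix tree over
# rollout history; this n-gram index approximates the same
# longest-suffix-match lookup with simpler bookkeeping.

class HistoryDrafter:
    """Drafts continuations by suffix-matching against recent rollouts."""

    def __init__(self, max_context: int = 8):
        self.max_context = max_context
        # n-gram (tuple of token ids) -> token that most recently followed it
        self.index: dict[tuple, int] = {}

    def observe(self, rollout: list[int]) -> None:
        """Ingest a finished rollout so future drafts can reuse its patterns."""
        for n in range(1, self.max_context + 1):
            for i in range(len(rollout) - n):
                self.index[tuple(rollout[i:i + n])] = rollout[i + n]

    def draft(self, prefix: list[int], budget: int) -> list[int]:
        """Greedily extend `prefix`, preferring the longest matching suffix."""
        out, ctx = [], list(prefix)
        for _ in range(budget):
            nxt = None
            for n in range(min(self.max_context, len(ctx)), 0, -1):
                nxt = self.index.get(tuple(ctx[-n:]))
                if nxt is not None:
                    break
            if nxt is None:
                break  # no historical evidence; stop drafting early
            out.append(nxt)
            ctx.append(nxt)
        return out


def draft_budget(tokens_so_far: int, base: int = 2, max_draft: int = 16,
                 long_threshold: int = 2048) -> int:
    """Length-aware speculation: longer trajectories dominate the makespan,
    so they get more aggressive draft budgets. The linear ramp and the
    specific constants are assumptions, not the paper's actual schedule."""
    if tokens_so_far <= long_threshold:
        return base
    scale = min(1.0, (tokens_so_far - long_threshold) / long_threshold)
    return base + int(scale * (max_draft - base))
```

In a rollout loop, each finished trajectory would be fed back via `observe`, and each decoding step would call `drafter.draft(prefix, draft_budget(len(prefix)))` before a single verification pass of the target model; acceptance-based verification is what keeps outputs identical to standard decoding.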
Related papers
- Lightweight Latent Reasoning for Narrative Tasks [89.94576985780549]
Large language models (LLMs) tackle complex tasks by generating long chains of thought, or "reasoning traces." We propose LiteReason, a latent reasoning method that can be interleaved with standard token sampling and easily combined with reinforcement learning. LiteReason employs a lightweight Reasoning Projector module, trained to produce continuous latent tokens that help the model "skip" reasoning steps.
arXiv Detail & Related papers (2025-12-01T22:07:32Z) - Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter [52.111923076688505]
Training Large Language Models (LLMs) with strong reasoning capabilities marks a significant milestone, unlocking new frontiers in complex problem-solving. We propose TLT, a system that accelerates reasoning RL training losslessly by integrating adaptive speculative decoding.
arXiv Detail & Related papers (2025-11-20T18:59:25Z) - Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning [6.742598086990326]
Reinforcement Learning (RL) has become critical for advancing modern Large Language Models (LLMs), yet existing synchronous RL systems face severe performance bottlenecks. We present Seer, a novel online context learning system that addresses these challenges by exploiting previously overlooked similarities in output lengths and generation patterns among requests sharing the same prompt. Seer introduces three key techniques: divided rollout for dynamic load balancing, context-aware scheduling, and adaptive grouped speculative decoding.
arXiv Detail & Related papers (2025-11-18T16:12:21Z) - CoPRIS: Efficient and Stable Reinforcement Learning via Concurrency-Controlled Partial Rollout with Importance Sampling [11.252930904797]
We propose Concurrency-Controlled Partial Rollout with Importance Sampling (CoPRIS). CoPRIS mitigates long-tail inefficiencies by maintaining a fixed number of concurrent rollouts, early-terminating once sufficient samples are collected, and reusing unfinished trajectories in subsequent rollouts. Experiments show that CoPRIS achieves up to 1.94x faster training while maintaining comparable or superior performance to synchronous RL systems.
arXiv Detail & Related papers (2025-11-05T11:39:32Z) - DLER: Doing Length pEnalty Right - Incentivizing More Intelligence per Token via Reinforcement Learning [134.03095505580276]
Doing Length pEnalty Right (DLER) is a training recipe combining batch-wise reward normalization, higher clipping, dynamic sampling, and a simple truncation length penalty. DLER achieves state-of-the-art accuracy-efficiency trade-offs, cutting output length by over 70 percent while surpassing the accuracy of all previous baselines.
arXiv Detail & Related papers (2025-10-16T20:05:57Z) - BroRL: Scaling Reinforcement Learning via Broadened Exploration [88.69554867685243]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key ingredient for unlocking complex reasoning capabilities in large language models. Recent work, ProRL, has shown promise in scaling RL by increasing the number of training steps. We investigate a complementary paradigm for scaling RL, BroRL: increasing the number of rollouts per example to hundreds.
arXiv Detail & Related papers (2025-10-01T17:59:02Z) - DiffuSpec: Unlocking Diffusion Language Models for Speculative Decoding [66.40658898418316]
We present DiffuSpec, a training-free drop-in framework that uses a pretrained diffusion language model (DLM) to produce multi-token drafts in a single forward pass. Across benchmarks, DiffuSpec yields up to 3x wall-clock speedup, establishing diffusion-based drafting as a robust alternative to autoregressive drafters for speculative decoding.
arXiv Detail & Related papers (2025-09-28T07:00:15Z) - SPEC-RL: Accelerating On-Policy Reinforcement Learning via Speculative Rollouts [35.82325476805143]
SPEC-RL is a framework that integrates SPECulative decoding with the RL rollout process. It reduces rollout time by 2-3x without compromising policy quality. As a purely rollout-stage enhancement, SPEC-RL integrates seamlessly with mainstream algorithms.
arXiv Detail & Related papers (2025-09-27T10:32:34Z) - History Rhymes: Accelerating LLM Reinforcement Learning with RhymeRL [14.506189610798929]
Reinforcement learning (RL) has emerged as a pivotal methodology for enhancing the reasoning capabilities of large language models (LLMs). We introduce RhymeRL, an LLM RL system designed to accelerate RL training with two key innovations. First, to enhance rollout generation, we present HistoSpec, a speculative decoding inference engine. Second, to tackle rollout bubbles, we introduce HistoPipe, a two-tier scheduling strategy.
arXiv Detail & Related papers (2025-08-26T01:42:46Z) - Writing-RL: Advancing Long-form Writing via Adaptive Curriculum Reinforcement Learning [55.41828729623907]
We present Writing-RL, an Adaptive Curriculum Reinforcement Learning framework to advance long-form writing capabilities beyond supervised fine-tuning. The framework consists of three key components: a Margin-aware Data Selection strategy that prioritizes samples with high learning potential, a Pairwise Comparison Reward mechanism that provides discriminative learning signals, and a Dynamic Reference Scheduling approach.
arXiv Detail & Related papers (2025-06-06T05:40:39Z) - StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation [55.75008325187133]
Reinforcement learning (RL) has become the core post-training technique for large language models (LLMs). StreamRL is designed with disaggregation from first principles to address two types of performance bottlenecks. Experiments show that StreamRL improves throughput by up to 2.66x compared to existing state-of-the-art systems.
arXiv Detail & Related papers (2025-04-22T14:19:06Z) - Demystifying Long Chain-of-Thought Reasoning in LLMs [46.352406501403465]
Long chains-of-thought (CoTs) enable strategies like backtracking and error correction. Reinforcement learning (RL) has emerged as a crucial method for developing these capabilities. We identify the key factors that enable models to generate long CoT trajectories.
arXiv Detail & Related papers (2025-02-05T17:13:32Z)
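Several of the systems above (DAS, TLT, Seer, SPEC-RL, RhymeRL) rest on the same draft-then-verify primitive: a cheap drafter proposes several tokens, and one batched forward pass of the target model keeps only the prefix the target itself would have produced, which is what makes the speedup lossless under greedy decoding. A minimal sketch follows, assuming a stand-in `target_argmax` callback (one greedy next-token choice per position) rather than any of these systems' actual APIs.

```python
from typing import Callable, List

def speculative_step(prefix: List[int],
                     draft: List[int],
                     target_argmax: Callable[[List[int]], List[int]]) -> List[int]:
    """Verify `draft` against the target model; return the accepted tokens.

    `target_argmax(seq)` is assumed to return the target model's greedy
    next-token choice at every position of `seq`, which one batched forward
    pass provides. Under greedy decoding this step is lossless: the output
    is exactly what the target would have generated token by token.
    """
    # One forward pass over prefix + draft scores all draft positions at once.
    preds = target_argmax(prefix + draft)
    accepted: List[int] = []
    for i, tok in enumerate(draft):
        # preds[len(prefix) + i - 1] is the target's choice after seeing
        # prefix + draft[:i].
        if preds[len(prefix) + i - 1] != tok:
            break
        accepted.append(tok)
    # The verify pass also yields one free "bonus" token past the match.
    accepted.append(preds[len(prefix) + len(accepted) - 1])
    return accepted
```

Per step, this turns k drafted tokens plus one verification pass into between 1 and k+1 accepted tokens; the systems above differ mainly in where the draft comes from (rollout history, an adaptive drafter, a diffusion model) and in how k is scheduled.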