FastGRPO: Accelerating Policy Optimization via Concurrency-aware Speculative Decoding and Online Draft Learning
- URL: http://arxiv.org/abs/2509.21792v1
- Date: Fri, 26 Sep 2025 02:48:41 GMT
- Title: FastGRPO: Accelerating Policy Optimization via Concurrency-aware Speculative Decoding and Online Draft Learning
- Authors: Yizhou Zhang, Ning Lv, Teng Wang, Jisheng Dang
- Abstract summary: Group relative policy optimization (GRPO) has demonstrated significant potential in improving the reasoning capabilities of large language models. We propose a speculative decoding framework that adjusts the drafting and verification strategy according to real-time concurrency levels. The proposed method achieves end-to-end speedups of 2.35x to 2.72x, significantly surpassing baseline approaches in efficiency.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Group relative policy optimization (GRPO) has demonstrated significant potential in improving the reasoning capabilities of large language models (LLMs) via reinforcement learning. However, its practical deployment is impeded by an excessively slow training process, primarily attributed to the computationally intensive autoregressive generation of multiple responses per query, which makes the generation phase the primary performance bottleneck. Although speculative decoding presents a promising direction for acceleration, its direct application in GRPO achieves limited speedup under high-concurrency training conditions. To overcome this limitation, we propose a concurrency-aware speculative decoding framework that dynamically adjusts the drafting and verification strategy according to real-time concurrency levels, thereby maximizing the acceleration of the generation process. Furthermore, to address performance degradation arising from distributional drift between the evolving target model and the fixed draft model during training, we introduce an online draft learning mechanism that enables the draft model to continuously adapt using feedback signals from the target model. Experimental results across multiple mathematical reasoning datasets and models demonstrate that the proposed method achieves end-to-end speedups of 2.35x to 2.72x, significantly surpassing baseline approaches in efficiency. The code is available at https://github.com/yedaotian9/GRPO_speculative.
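For intuition, here is a minimal sketch of the two mechanisms the abstract describes: a draft-length schedule keyed to real-time concurrency, and an online distillation step that pulls the draft model toward the evolving target model. The function names, the concurrency thresholds, and the KL-distillation objective are illustrative assumptions rather than the paper's actual implementation (for that, see the repository linked above).

```python
# Hypothetical sketch of the two ideas in the abstract; the names,
# thresholds, and loss are assumptions, not the authors' code.
import torch
import torch.nn.functional as F

def draft_length_for(concurrency: int, max_len: int = 8) -> int:
    """Draft fewer speculative tokens as concurrency rises: under heavy
    batch pressure, long drafts waste verification compute on rejects."""
    if concurrency <= 8:
        return max_len          # low load: draft aggressively
    if concurrency <= 32:
        return max_len // 2     # medium load: moderate drafts
    return 1                    # saturated: close to plain decoding

def online_draft_update(draft_logits, target_logits, optimizer, T=2.0):
    """One online adaptation step: distill the target's distribution on
    verified tokens into the draft model via a temperature-scaled KL."""
    loss = F.kl_div(
        F.log_softmax(draft_logits / T, dim=-1),
        F.softmax(target_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In a GRPO rollout loop, draft_length_for would be consulted at each generation step with the number of in-flight requests, while online_draft_update would consume logits cached from the target model's verification passes, reusing signals the generation phase already produces.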
Related papers
- Mean Flow Policy with Instantaneous Velocity Constraint for One-step Action Generation [65.13627721310613]
Mean velocity policy (MVP) is a new generative policy function that models the mean velocity field to achieve the fastest one-step action generation. MVP achieves state-of-the-art success rates across several challenging robotic manipulation tasks from Robomimic and OGBench.
arXiv Detail & Related papers (2026-02-14T14:44:06Z) - AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering [52.67783579040657]
AceGRPO is a machine learning system that prioritizes tasks at the agent's learning frontier to maximize learning efficiency. Our trained Ace-30B model achieves a 100% valid submission rate on MLE-Bench-Lite, approaches the performance of proprietary frontier models, and outperforms larger open-source baselines.
arXiv Detail & Related papers (2026-02-08T10:55:03Z) - RL-VLA$^3$: Reinforcement Learning VLA Accelerating via Full Asynchronism [42.27384804295299]
Vision-Language-Action (VLA) models have emerged as a crucial pathway towards general embodied intelligence. This paper proposes and implements a fully-asynchronous policy training framework encompassing the entire pipeline from environment interaction to actor policy updates. On the LIBERO benchmark, the framework achieves throughput improvements of up to 59.25% compared to existing synchronous strategies.
arXiv Detail & Related papers (2026-02-05T15:30:23Z) - TIDE: Temporal Incremental Draft Engine for Self-Improving LLM Inference [1.0091292967761423]
TIDE is a serving-engine-native framework that integrates online draft adaptation directly into high-performance LLM inference systems. TIDE reuses target model hidden states generated during inference as training signals, enabling zero-overhead draft adaptation without reloading the target model. Across diverse real-world workloads, TIDE achieves up to 1.15x throughput improvement over static speculative decoding.
arXiv Detail & Related papers (2026-02-05T00:06:12Z) - Generative Actor Critic [74.04971271003869]
Generative Actor Critic (GAC) is a novel framework that decouples sequential decision-making by reframing policy evaluation as learning a generative model of the joint distribution over trajectories and returns. Experiments on Gym-MuJoCo and Maze2D benchmarks demonstrate GAC's strong offline performance and significantly enhanced offline-to-online improvement compared to state-of-the-art methods.
arXiv Detail & Related papers (2025-12-25T06:31:11Z) - DiRL: An Efficient Post-Training Framework for Diffusion Language Models [54.405206032785706]
Diffusion Language Models (dLLMs) have emerged as promising alternatives to Auto-Regressive (AR) models. Existing methods suffer from computational inefficiency and objective mismatches between training and inference. We introduce DiRL, an efficient post-training framework that tightly integrates FlexAttention-accelerated blockwise training with LMDeploy-optimized inference.
arXiv Detail & Related papers (2025-12-23T08:33:19Z) - Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter [52.111923076688505]
Training Large Language Models (LLMs) with strong reasoning capabilities marks a significant milestone, unlocking new frontiers in complex problem-solving. We propose TLT, a system that accelerates reasoning RL training losslessly by integrating adaptive speculative decoding.
arXiv Detail & Related papers (2025-11-20T18:59:25Z) - OM2P: Offline Multi-Agent Mean-Flow Policy [40.346958259814514]
We propose OM2P (Offline Multi-Agent Mean-Flow Policy), a novel offline MARL algorithm that achieves efficient one-step action sampling. We show that OM2P achieves superior performance, with up to a 3.8x reduction in GPU memory usage and up to a 10.8x speed-up in training time.
arXiv Detail & Related papers (2025-08-08T12:38:56Z) - Can Prompt Difficulty be Online Predicted for Accelerating RL Finetuning of Reasoning Models? [62.579951798437115]
This work investigates iterative approximate evaluation for arbitrary prompts. It introduces Model Predictive Prompt Selection (MoPPS), a Bayesian risk-predictive framework. MoPPS reliably predicts prompt difficulty and accelerates training with significantly reduced rollouts.
arXiv Detail & Related papers (2025-07-07T03:20:52Z) - World Models as Reference Trajectories for Rapid Motor Adaptation [0.0]
Reflexive World Models (RWM) is a dual control framework that uses world model predictions as implicit reference trajectories for rapid adaptation. Our method separates the control problem into long-term reward optimization through reinforcement learning and robust motor execution.
arXiv Detail & Related papers (2025-05-21T14:46:41Z) - Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion [55.0194604505437]
Speculative decoding has emerged as a widely adopted method to accelerate large language model inference. This paper proposes an adaptation of speculative decoding which uses discrete diffusion models to generate draft sequences.
arXiv Detail & Related papers (2024-08-10T21:24:25Z) - Lossless Acceleration of Large Language Model via Adaptive N-gram Parallel Decoding [2.642212767247493]
We introduce Adaptive N-gram Parallel Decoding (ANPD), which accelerates inference by allowing the simultaneous generation of multiple tokens (see the n-gram drafting sketch after this list).
ANPD preserves the integrity of the original output while enhancing processing speed.
In experiments, models such as LLaMA and its fine-tuned variants have shown speed improvements of up to 3.67x.
arXiv Detail & Related papers (2024-04-10T16:11:09Z) - Multiplicative update rules for accelerating deep learning training and increasing robustness [69.90473612073767]
We propose an optimization framework that fits a wide range of machine learning algorithms and enables one to apply alternative update rules.
We claim that the proposed framework accelerates training while leading to more robust models compared to the traditionally used additive update rule.
arXiv Detail & Related papers (2023-07-14T06:44:43Z) - When to Update Your Model: Constrained Model-based Reinforcement Learning [50.74369835934703]
We propose a novel and general theoretical scheme for a non-decreasing performance guarantee of model-based RL (MBRL).
Our follow-up derived bounds reveal the relationship between model shifts and performance improvement.
A further example demonstrates that learning models from a dynamically-varying number of explorations benefits the eventual returns.
arXiv Detail & Related papers (2022-10-15T17:57:43Z)
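As noted in the ANPD entry above, n-gram parallel drafting can be sketched compactly. The class below is an illustrative toy in that spirit; the cache policy and the most-frequent-continuation heuristic are assumptions for exposition, not ANPD's actual algorithm.

```python
# Toy n-gram drafter: propose continuations from patterns already
# emitted, then let the target model verify them in one parallel pass.
from collections import defaultdict

class NGramDrafter:
    def __init__(self, n: int = 3):
        self.n = n
        self.table = defaultdict(list)  # (n-1)-gram prefix -> next tokens

    def observe(self, tokens):
        """Index every n-gram in the generated-so-far token sequence."""
        for i in range(len(tokens) - self.n + 1):
            prefix = tuple(tokens[i : i + self.n - 1])
            self.table[prefix].append(tokens[i + self.n - 1])

    def draft(self, tokens, k: int = 4):
        """Greedily chain n-gram lookups to draft up to k tokens."""
        out, ctx = [], list(tokens)
        for _ in range(k):
            prefix = tuple(ctx[-(self.n - 1):])
            candidates = self.table.get(prefix)
            if not candidates:
                break
            nxt = max(set(candidates), key=candidates.count)  # modal next token
            out.append(nxt)
            ctx.append(nxt)
        return out
```

Because the target model rescans the drafted tokens in a single forward pass and accepts only the longest matching prefix, the final output is identical to plain autoregressive decoding, which is what makes such schemes lossless.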
This list is automatically generated from the titles and abstracts of the papers on this site.