When RL Meets Adaptive Speculative Training: A Unified Training-Serving System
- URL: http://arxiv.org/abs/2602.06932v1
- Date: Fri, 06 Feb 2026 18:28:54 GMT
- Title: When RL Meets Adaptive Speculative Training: A Unified Training-Serving System
- Authors: Junxiong Wang, Fengxiang Bie, Jisen Li, Zhongzhu Zhou, Zelei Shao, Yubo Wang, Yinghui Liu, Qingyang Wu, Avner May, Sri Yanamandra, Yineng Zhang, Ce Zhang, Tri Dao, Percy Liang, Ben Athiwaratkun, Shuaiwen Leon Song, Chenfeng Xu, Xiaoxia Wu
- Abstract summary: We present Aurora, a unified training-serving system that closes the loop by continuously learning a speculator directly from live inference traces. Our design integrates an SGLang-based inference server with an asynchronous training server, enabling hot-swapped speculator updates without service interruption.
- Score: 71.98182665273575
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speculative decoding can significantly accelerate LLM serving, yet most deployments today disentangle speculator training from serving, treating speculator training as a standalone offline modeling problem. We show that this decoupled formulation introduces substantial deployment and adaptation lag: (1) high time-to-serve, since a speculator must be trained offline for a considerable period before deployment; (2) delayed utility feedback, since the true end-to-end decoding speedup is only known after training and cannot be inferred reliably from acceptance rate alone due to model-architecture and system-level overheads; and (3) domain-drift degradation, as the target model is repurposed to new domains and the speculator becomes stale and less effective. To address these issues, we present Aurora, a unified training-serving system that closes the loop by continuously learning a speculator directly from live inference traces. Aurora reframes online speculator learning as an asynchronous reinforcement-learning problem: accepted tokens provide positive feedback, while rejected speculator proposals provide implicit negative feedback that we exploit to improve sample efficiency. Our design integrates an SGLang-based inference server with an asynchronous training server, enabling hot-swapped speculator updates without service interruption. Crucially, Aurora supports day-0 deployment: a speculator can be served immediately and rapidly adapted to live traffic, improving system performance while providing immediate utility feedback. Across experiments, Aurora achieves a 1.5x day-0 speedup on recently released frontier models (e.g., MiniMax M2.1 229B and Qwen3-Coder-Next 80B). Aurora also adapts effectively to distribution shifts in user traffic, delivering an additional 1.25x speedup over a well-trained but static speculator on widely used models (e.g., Qwen3 and Llama3).
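The abstract's core reframing, in which accepted draft tokens provide positive feedback and rejected speculator proposals provide implicit negative feedback, can be illustrated with a minimal sketch. All names below are illustrative assumptions for exposition, not Aurora's actual API; the toy draft and target stand in for real models.

```python
import random

def speculative_step(draft_propose, target_accepts, k=4):
    """One speculative decoding step: the draft proposes k tokens and the
    target verifies them left to right, accepting a prefix.

    Returns (accepted_prefix, feedback), where feedback pairs each verified
    token with +1.0 (accepted) or -1.0 (the first rejection). Tokens after
    the first rejection are never verified and so carry no signal -- this is
    the sample-efficiency point: one step yields several labeled examples.
    """
    proposal = draft_propose(k)
    accepted, feedback = [], []
    for tok in proposal:
        if target_accepts(tok):
            accepted.append(tok)
            feedback.append((tok, +1.0))   # positive feedback for the speculator
        else:
            feedback.append((tok, -1.0))   # implicit negative feedback
            break
    return accepted, feedback

# Toy usage: a draft proposing random token ids and a target that
# happens to accept even-valued ones.
random.seed(0)
draft = lambda k: [random.randint(0, 9) for _ in range(k)]
target = lambda t: t % 2 == 0
prefix, fb = speculative_step(draft, target)
assert all(t % 2 == 0 for t in prefix)          # only accepted tokens kept
assert len(fb) == 4 or fb[-1][1] == -1.0        # verification stops at first reject
```

In the system described by the abstract, such per-token feedback would be collected from live traffic on the inference server and streamed to an asynchronous trainer, with updated speculator weights hot-swapped back without interrupting serving.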
Related papers
- TIDE: Temporal Incremental Draft Engine for Self-Improving LLM Inference [1.0091292967761423]
TIDE is a serving-engine-native framework that integrates online draft adaptation directly into high-performance LLM inference systems. TIDE reuses target model hidden states generated during inference as training signals, enabling zero-overhead draft adaptation without reloading the target model. Across diverse real-world workloads, TIDE achieves up to 1.15x throughput improvement over static speculative decoding.
arXiv Detail & Related papers (2026-02-05T00:06:12Z) - HALO: Semantic-Aware Distributed LLM Inference in Lossy Edge Network [50.33808558714122]
Inference of large language models (LLMs) at the edge can provide prompt service responsiveness while protecting user privacy. We propose HALO, a novel framework that can boost distributed LLM inference in lossy edge networks. Experimental results from a Raspberry Pi cluster demonstrate that HALO achieves a 3.41x end-to-end speedup for LLaMA-series LLMs under unreliable network conditions.
arXiv Detail & Related papers (2026-01-16T07:37:23Z) - RollArt: Scaling Agentic RL Training via Disaggregated Infrastructure [49.88201789074532]
Agentic Reinforcement Learning (RL) enables Large Language Models (LLMs) to perform autonomous decision-making and long-term planning. We present RollArt, a distributed system designed to maximize throughput for multi-task agentic RL on disaggregated infrastructure.
arXiv Detail & Related papers (2025-12-27T11:14:23Z) - Offline Reinforcement Learning for End-to-End Autonomous Driving [1.2891210250935148]
End-to-end (E2E) autonomous driving models take only camera images as input and directly predict a future trajectory. Online reinforcement learning (RL) could mitigate issues induced by imitation learning (IL). We introduce a camera-only E2E offline RL framework that performs no additional exploration and trains solely on a fixed simulator dataset.
arXiv Detail & Related papers (2025-12-21T09:21:04Z) - RAST-MoE-RL: A Regime-Aware Spatio-Temporal MoE Framework for Deep Reinforcement Learning in Ride-Hailing [11.542008509248836]
The Regime-Aware Spatio-Temporal MoE framework (RAST-MoE) formalizes adaptive delayed matching as a regime-aware MDP equipped with a self-attention MoE encoder. A physics-informed congestion model preserves realistic density-speed feedback, enabling millions of efficient rollouts, while an adaptive reward scheme guards against pathological strategies.
arXiv Detail & Related papers (2025-12-13T20:49:15Z) - ReSpec: Towards Optimizing Speculative Decoding in Reinforcement Learning Systems [36.535922134181995]
Adapting large language models (LLMs) via reinforcement learning (RL) is often bottlenecked by the generation stage. We present ReSpec, a system that adapts speculative decoding (SD) to RL through three complementary mechanisms. On Qwen models (3B-14B), ReSpec achieves up to 4.5x speedup while preserving reward convergence and training stability.
arXiv Detail & Related papers (2025-10-30T13:27:42Z) - FastGRPO: Accelerating Policy Optimization via Concurrency-aware Speculative Decoding and Online Draft Learning [11.68914161151634]
Group relative policy optimization (GRPO) has demonstrated significant potential in improving the reasoning capabilities of large language models. We propose a speculative decoding framework that adjusts the drafting and verification strategy according to real-time concurrency levels. The proposed method achieves end-to-end speedups of 2.35x to 2.72x, significantly surpassing baseline approaches in efficiency.
arXiv Detail & Related papers (2025-09-26T02:48:41Z) - UI-S1: Advancing GUI Automation via Semi-online Reinforcement Learning [78.86567400365392]
We present Semi-online Reinforcement Learning, a novel paradigm that simulates online RL on offline trajectories. To capture long-term training signals, Semi-online RL introduces discounted future returns into the reward computation. Experiments show that Semi-online RL achieves SOTA performance among 7B models across four dynamic benchmarks.
arXiv Detail & Related papers (2025-09-15T03:24:08Z) - Reinforcement Learning for Machine Learning Engineering Agents [52.03168614623642]
We show that agents backed by weaker models that improve via reinforcement learning can outperform agents backed by much larger but static models. We propose duration-aware gradient updates in a distributed asynchronous RL framework to amplify high-cost but high-reward actions. We also propose environment instrumentation to offer partial credit, distinguishing almost-correct programs from those that fail early.
arXiv Detail & Related papers (2025-09-01T18:04:10Z) - Unlocking FedNL: Self-Contained Compute-Optimized Implementation [56.16884466478886]
Federated Learning (FL) is an emerging paradigm that enables intelligent agents to collaboratively train Machine Learning (ML) models in a distributed manner. Recent work introduces a family of Federated Newton Learn (FedNL) algorithms, marking a significant step towards applying second-order methods to FL and large-scale optimization. We present a self-contained implementation of FedNL, FedNL-LS, and FedNL-PP for single-node and multi-node settings.
arXiv Detail & Related papers (2024-10-11T12:19:18Z) - Efficient Motion Prediction: A Lightweight & Accurate Trajectory Prediction Model With Fast Training and Inference Speed [56.27022390372502]
We propose a new efficient motion prediction model, which achieves highly competitive benchmark results while training only a few hours on a single GPU.
Its low inference latency makes it particularly suitable for deployment in autonomous applications with limited computing resources.
arXiv Detail & Related papers (2024-09-24T14:58:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.