Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning
- URL: http://arxiv.org/abs/2511.14617v2
- Date: Sun, 23 Nov 2025 13:41:11 GMT
- Title: Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning
- Authors: Ruoyu Qin, Weiran He, Weixiao Huang, Yangkun Zhang, Yikai Zhao, Bo Pang, Xinran Xu, Yingdi Shan, Yongwei Wu, Mingxing Zhang,
- Abstract summary: Reinforcement Learning (RL) has become critical for advancing modern Large Language Models (LLMs), yet existing synchronous RL systems face severe performance bottlenecks.<n>We present Seer, a novel online context learning system that addresses these challenges by exploiting previously overlooked similarities in output lengths and generation patterns among requests sharing the same prompt.<n>Seer introduces three key techniques: divided rollout for dynamic load balancing, context-aware scheduling, and adaptive grouped speculative decoding.
- Score: 6.742598086990326
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement Learning (RL) has become critical for advancing modern Large Language Models (LLMs), yet existing synchronous RL systems face severe performance bottlenecks. The rollout phase, which dominates end-to-end iteration time, suffers from substantial long-tail latency and poor resource utilization due to inherent workload imbalance. We present Seer, a novel online context learning system that addresses these challenges by exploiting previously overlooked similarities in output lengths and generation patterns among requests sharing the same prompt. Seer introduces three key techniques: divided rollout for dynamic load balancing, context-aware scheduling, and adaptive grouped speculative decoding. Together, these mechanisms substantially reduce long-tail latency and improve resource efficiency during rollout. Evaluations on production-grade RL workloads demonstrate that Seer improves end-to-end rollout throughput by 74% to 97% and reduces long-tail latency by 75% to 93% compared to state-of-the-art synchronous RL systems, significantly accelerating RL training iterations.
Related papers
- Role-Based Fault Tolerance System for LLM RL Post-Training [24.163081819380647]
RL post-training interleaves training and inference workloads, exposing the system to faults from both sides.<n>Our key insight is role-based fault isolation so the failure in one machine does not affect the others.<n>We present RobustRL, the first comprehensive robust system to handle GPU machine errors for RL post-training Effective Training Time Ratio improvement.
arXiv Detail & Related papers (2025-12-27T06:30:18Z) - Beat the long tail: Distribution-Aware Speculative Decoding for RL Training [75.75462952580796]
We propose a Distribution Aware Speculative decoding framework that accelerates RL rollouts without altering model outputs.<n>Experiments on math and code reasoning tasks show that DAS reduces rollout time up to 50% while preserving identical training curves.
arXiv Detail & Related papers (2025-11-17T19:02:12Z) - CoPRIS: Efficient and Stable Reinforcement Learning via Concurrency-Controlled Partial Rollout with Importance Sampling [11.252930904797]
We propose Concurrency- Controlled Partial Rollout with Importance Sampling (CoPRIS)<n>CoPRIS mitigates long-tail inefficiencies by maintaining a fixed number of concurrent rollouts, early-terminating once sufficient samples are collected, and reusing unfinished trajectories in subsequent rollouts.<n>Experiments show that CoPRIS achieves up to 1.94x faster training while maintaining comparable or superior performance to synchronous RL systems.
arXiv Detail & Related papers (2025-11-05T11:39:32Z) - ReSpec: Towards Optimizing Speculative Decoding in Reinforcement Learning Systems [36.535922134181995]
Adapting large language models (LLMs) via reinforcement learning (RL) is often bottlenecked by the generation stage.<n>We present ReSpec, a system that adapts Speculative decoding (SD) to RL through three complementary mechanisms.<n>On Qwen models (3B--14B), ReSpec achieves up to 4.5x speedup while preserving reward convergence and training stability.
arXiv Detail & Related papers (2025-10-30T13:27:42Z) - RLBoost: Harvesting Preemptible Resources for Cost-Efficient Reinforcement Learning on LLMs [48.94639777633359]
We present RLBoost, a systematic solution for cost-efficient RL training that harvests preemptible GPU resources.<n> RLBoost increases training throughput by 1.51x-1.97x while improving cost efficiency by 28%-49% compared to using only on-demand GPU resources.
arXiv Detail & Related papers (2025-10-22T04:19:37Z) - QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs [80.76334908639745]
We propose QeRL, a Quantization-enhanced Reinforcement Learning framework for large language models (LLMs)<n>QeRL addresses issues by combining NVFP4 quantization with Low-Rank Adaptation (LoRA)<n>Experiments demonstrate that QeRL delivers over 1.5 times speedup in the rollout phase.
arXiv Detail & Related papers (2025-10-13T17:55:09Z) - RollPacker: Mitigating Long-Tail Rollouts for Fast, Synchronous RL Post-Training [19.00988498482758]
Reinforcement Learning (RL) is a pivotal post-training technique for enhancing the reasoning capabilities of Large Language Models (LLMs)<n>Many RL systems attempt to alleviate this problem by relaxing synchronization, but this can compromise accuracy training.<n>We introduce tail, a novel rollout scheduling strategy for synchronous RL that systematically consolidates prompts leading to long-tail responses into a small subset of rollout steps (long rounds)<n>RollPacker achieves a 2.03x-2.56x end-to-end training time reduction compared to veRL and up to 2.24x speedup compared to RLHFuse for the Qwen2.5
arXiv Detail & Related papers (2025-09-25T11:13:22Z) - Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle [65.14124923451077]
Reinforcement learning (RL) has emerged as an effective post-training paradigm for enhancing the reasoning capabilities of multimodal large language model (MLLM)<n>However, current RL pipelines often suffer from training inefficiencies caused by two underexplored issues: Advantage Collapsing and Rollout Silencing.<n>We propose Shuffle-R1, a simple yet principled framework that improves RL fine-tuning efficiency by dynamically restructuring trajectory sampling and batch composition.
arXiv Detail & Related papers (2025-08-07T17:53:47Z) - StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation [55.75008325187133]
Reinforcement learning (RL) has become the core post-training technique for large language models (LLMs)<n>StreamRL is designed with disaggregation from first principles to address two types of performance bottlenecks.<n> Experiments show that StreamRL improves throughput by up to 2.66x compared to existing state-of-the-art systems.
arXiv Detail & Related papers (2025-04-22T14:19:06Z) - SPEQ: Offline Stabilization Phases for Efficient Q-Learning in High Update-To-Data Ratio Reinforcement Learning [51.10866035483686]
High update-to-data (UTD) ratio algorithms in reinforcement learning (RL) improve sample efficiency but incur high computational costs, limiting real-world scalability.<n>We propose Offline Stabilization Phases for Efficient Q-Learning (SPEQ), an RL algorithm that combines low-UTD online training with periodic offline stabilization phases.<n>During these phases, Q-functions are fine-tuned with high UTD ratios on a fixed replay buffer, reducing redundant updates on suboptimal data.
arXiv Detail & Related papers (2025-01-15T09:04:19Z) - Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models [11.624678008637623]
We propose separating generation and learning in RLHF.<n>Online DPO is found to be most robust to off-policy data.<n>Asynchronous training relies on an underexplored regime, online but off-policy RLHF.
arXiv Detail & Related papers (2024-10-23T19:59:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.