Related papers: CoPRIS: Efficient and Stable Reinforcement Learning via Concurrency-Controlled Partial Rollout with Importance Sampling

URL: http://arxiv.org/abs/2511.05589v1
Date: Wed, 05 Nov 2025 11:39:32 GMT
Title: CoPRIS: Efficient and Stable Reinforcement Learning via Concurrency-Controlled Partial Rollout with Importance Sampling
Authors: Zekai Qu, Yinxu Pan, Ao Sun, Chaojun Xiao, Xu Han,
Abstract summary: We propose Concurrency-Controlled Partial Rollout with Importance Sampling (CoPRIS). CoPRIS mitigates long-tail inefficiencies by maintaining a fixed number of concurrent rollouts, early-terminating once sufficient samples are collected, and reusing unfinished trajectories in subsequent rollouts. Experiments show that CoPRIS achieves up to 1.94x faster training while maintaining comparable or superior performance to synchronous RL systems.
Score: 11.252930904797
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement learning (RL) post-training has become a trending paradigm for enhancing the capabilities of large language models (LLMs). Most existing RL systems for LLMs operate in a fully synchronous manner, where training must wait for the rollout of an entire batch to complete. This design leads to severe inefficiencies, as extremely long trajectories can stall the entire rollout process and leave many GPUs idle. To address this issue, we propose Concurrency-Controlled Partial Rollout with Importance Sampling (CoPRIS), which mitigates long-tail inefficiencies by maintaining a fixed number of concurrent rollouts, early-terminating once sufficient samples are collected, and reusing unfinished trajectories in subsequent rollouts. To mitigate the impact of off-policy trajectories, we introduce Cross-stage Importance Sampling Correction, which concatenates buffered log probabilities from the previous policy with those recomputed under the current policy for importance sampling correction. Experiments on challenging mathematical reasoning benchmarks show that CoPRIS achieves up to 1.94x faster training while maintaining comparable or superior performance to synchronous RL systems. The code of CoPRIS is available at https://github.com/777pomingzi/CoPRIS.
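The mechanism described in the abstract can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the authors' implementation: the function names (`run_stage`, `behavior_logprobs`, `importance_ratios`), the one-token-per-step loop, and the per-token ratio form are all hypothetical, and the reading of "concatenates buffered log probabilities" as a simple per-segment join is one plausible interpretation of the abstract.

```python
import math
from collections import deque


def run_stage(queue, target_len, tokens, concurrency, needed, version):
    """Toy rollout stage: each loop pass decodes one token per active trajectory.

    queue      -- deque of trajectory ids waiting to run (resumed ones at the front)
    target_len -- hypothetical final length of each trajectory, in tokens
    tokens     -- dict id -> list of policy versions, one entry per generated token
    """
    active, done = [], []
    while len(done) < needed and (queue or active):
        # Concurrency control: refill free slots so up to `concurrency` rollouts run.
        while len(active) < concurrency and queue:
            active.append(queue.popleft())
        unfinished = []
        for tid in active:
            tokens.setdefault(tid, []).append(version)
            if len(tokens[tid]) >= target_len[tid]:
                done.append(tid)
            else:
                unfinished.append(tid)
        active = unfinished
    # Early termination: unfinished trajectories keep their tokens and resume first.
    for tid in reversed(active):
        queue.appendleft(tid)
    return done


def behavior_logprobs(buffered_prefix, latest_segment):
    """Join log-probs buffered in earlier stages with those of the newest segment."""
    return list(buffered_prefix) + list(latest_segment)


def importance_ratios(current_logprobs, behavior):
    """Per-token ratio pi_current / pi_behavior, computed from log-probabilities."""
    if len(current_logprobs) != len(behavior):
        raise ValueError("log-prob sequences must have the same length")
    return [math.exp(c - b) for c, b in zip(current_logprobs, behavior)]
```

In this toy setup, a long trajectory that is cut off in one stage resumes at the front of the queue in the next, so its tokens end up tagged with more than one policy version. That mix is the off-policy situation the importance ratios are meant to correct for; tokens whose behavior and current log-probabilities agree get a ratio of 1.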

Related papers

Contextual Rollout Bandits for Reinforcement Learning with Verifiable Rewards [69.74686029941881]
Reinforcement Learning with Verifiable Rewards (RLVR) is an effective paradigm for improving the reasoning capabilities of large language models. We propose a unified neural scheduling framework that adaptively selects high-value rollouts throughout training. Experiments on six mathematical reasoning benchmarks demonstrate consistent gains in performance and training efficiency across multiple RLVR optimization methods.
arXiv Detail & Related papers (2026-02-09T10:51:58Z)
HeaPA: Difficulty-Aware Heap Sampling and On-Policy Query Augmentation for LLM Reinforcement Learning [78.12979615107564]
HeaPA consistently improves accuracy and reaches target performance with fewer computations. Our analyses suggest these gains come from frontier-focused sampling and on-policy pool growth.
arXiv Detail & Related papers (2026-01-30T01:31:17Z)
Beat the long tail: Distribution-Aware Speculative Decoding for RL Training [75.75462952580796]
We propose a Distribution-Aware Speculative decoding (DAS) framework that accelerates RL rollouts without altering model outputs. Experiments on math and code reasoning tasks show that DAS reduces rollout time by up to 50% while preserving identical training curves.
arXiv Detail & Related papers (2025-11-17T19:02:12Z)
RollPacker: Mitigating Long-Tail Rollouts for Fast, Synchronous RL Post-Training [19.00988498482758]
Reinforcement Learning (RL) is a pivotal post-training technique for enhancing the reasoning capabilities of Large Language Models (LLMs). Many RL systems attempt to alleviate this problem by relaxing synchronization, but this can compromise training accuracy. We introduce tail batching, a novel rollout scheduling strategy for synchronous RL that systematically consolidates prompts leading to long-tail responses into a small subset of rollout steps (long rounds). RollPacker achieves a 2.03x-2.56x end-to-end training time reduction compared to veRL and up to 2.24x speedup compared to RLHFuse for the Qwen2.5
arXiv Detail & Related papers (2025-09-25T11:13:22Z)
History Rhymes: Accelerating LLM Reinforcement Learning with RhymeRL [14.506189610798929]
Reinforcement learning (RL) has emerged as a pivotal methodology for enhancing the reasoning capabilities of large language models (LLMs). We introduce RhymeRL, an LLM RL system designed to accelerate RL training with two key innovations. First, to enhance rollout generation, we present HistoSpec, a speculative decoding inference engine. Second, to tackle rollout bubbles, we introduce HistoPipe, a two-tier scheduling strategy.
arXiv Detail & Related papers (2025-08-26T01:42:46Z)
Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle [65.14124923451077]
Reinforcement learning (RL) has emerged as an effective post-training paradigm for enhancing the reasoning capabilities of multimodal large language models (MLLMs). However, current RL pipelines often suffer from training inefficiencies caused by two underexplored issues: Advantage Collapsing and Rollout Silencing. We propose Shuffle-R1, a simple yet principled framework that improves RL fine-tuning efficiency by dynamically restructuring trajectory sampling and batch composition.
arXiv Detail & Related papers (2025-08-07T17:53:47Z)
AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning [23.24949857136035]
Reinforcement learning (RL) has become a dominant paradigm for training large language models (LLMs). We present AReaL, a fully asynchronous RL system that completely decouples generation from training.
arXiv Detail & Related papers (2025-05-30T07:18:25Z)
TACO: Think-Answer Consistency for Optimized Long-Chain Reasoning and Efficient Data Learning via Reinforcement Learning in LVLMs [50.820065021136024]
DeepSeek R1 has significantly advanced complex reasoning for large language models (LLMs). Recent methods have attempted to replicate R1's reasoning capabilities in multimodal settings. We propose TACO, a novel reinforcement learning algorithm for visual reasoning.
arXiv Detail & Related papers (2025-05-27T06:30:48Z)
StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation [55.75008325187133]
Reinforcement learning (RL) has become the core post-training technique for large language models (LLMs). StreamRL is designed with disaggregation from first principles to address two types of performance bottlenecks. Experiments show that StreamRL improves throughput by up to 2.66x compared to existing state-of-the-art systems.
arXiv Detail & Related papers (2025-04-22T14:19:06Z)
Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training [71.16258800411696]
Reinforcement learning (RL) is a critical component of large language model (LLM) post-training. Existing on-policy algorithms used for post-training are inherently incompatible with the use of experience replay buffers. We propose efficiently obtaining this benefit of replay buffers via Trajectory Balance with Asynchrony (TBA).
arXiv Detail & Related papers (2025-03-24T17:51:39Z)

This list is automatically generated from the titles and abstracts of the papers on this site.