RoRecomp: Enhancing Reasoning Efficiency via Rollout Response Recomposition in Reinforcement Learning
- URL: http://arxiv.org/abs/2509.25958v1
- Date: Tue, 30 Sep 2025 08:54:38 GMT
- Title: RoRecomp: Enhancing Reasoning Efficiency via Rollout Response Recomposition in Reinforcement Learning
- Authors: Gang Li, Yulei Qin, Xiaoyu Tan, Dingkang Yang, Yuchen Shi, Zihan Xu, Xiang Li, Xing Sun, Ke Li
- Abstract summary: Reinforcement learning with verifiable rewards (RLVR) has proven effective in eliciting complex reasoning in large language models (LLMs). We propose Rollout Response Recomposition (RoRecomp), a plug-and-play method that guides models toward concise reasoning by strategically recomposing the training data.
- Score: 45.850748558862364
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning with verifiable rewards (RLVR) has proven effective in eliciting complex reasoning in large language models (LLMs). However, standard RLVR training often leads to excessively verbose processes (in reasoning tasks) and inefficient exploration trajectories (in agentic settings), as outcome-only rewards provide no incentive for efficiency and the high variance in response length within relatively small rollout groups results in noisy optimization signals. To address this, we propose Rollout Response Recomposition (RoRecomp), a plug-and-play method that guides models toward concise reasoning by strategically recomposing the training data. RoRecomp separates responses into two distinct batch types: 1) priority batches, which combine short-correct and long-incorrect responses selected from online batches to provide a clear gradient signal for brevity, and 2) compensation batches, which utilize remaining responses from a replay buffer to maintain stability and prevent model collapse. To comprehensively evaluate effectiveness, we test RoRecomp across three settings where results demonstrate substantial efficiency gains: reducing reasoning length by 27.7% in zero RL training, reducing unnecessary tool calls by 46.8% while improving accuracy in agentic RL, and achieving up to 52.5% length reduction in thinking compression, all with minimal performance impact.
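The two batch types described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Response` type, the `keep_frac` selection fraction, the `comp_batch_size` threshold, and the first-in-first-out buffer policy are all assumptions made for illustration, since the abstract does not specify how responses are ranked or how the replay buffer is drained.

```python
import math
from collections import deque
from dataclasses import dataclass


@dataclass
class Response:
    prompt_id: int
    length: int    # number of generated tokens
    correct: bool  # outcome of the verifiable reward


def recompose(rollouts, keep_frac=0.5):
    """Split one online batch into (priority, remaining).

    keep_frac is a hypothetical knob: the fraction of the shortest correct
    and of the longest incorrect responses that enter the priority batch.
    """
    correct = sorted((r for r in rollouts if r.correct), key=lambda r: r.length)
    wrong = sorted((r for r in rollouts if not r.correct), key=lambda r: -r.length)
    n_c = math.ceil(len(correct) * keep_frac)
    n_w = math.ceil(len(wrong) * keep_frac)
    priority = correct[:n_c] + wrong[:n_w]    # short-correct + long-incorrect
    remaining = correct[n_c:] + wrong[n_w:]   # everything else
    return priority, remaining


def training_batches(rollouts, buffer, comp_batch_size=4, keep_frac=0.5):
    """Return the batches to train on for one step.

    Remaining responses are parked in a replay buffer; a compensation batch
    is emitted only once the buffer holds comp_batch_size responses
    (an assumed policy, oldest responses first).
    """
    priority, remaining = recompose(rollouts, keep_frac)
    buffer.extend(remaining)
    batches = [("priority", priority)]
    if len(buffer) >= comp_batch_size:
        comp = [buffer.popleft() for _ in range(comp_batch_size)]
        batches.append(("compensation", comp))
    return batches


rollouts = [
    Response(0, 100, True),
    Response(0, 400, True),
    Response(0, 300, False),
    Response(0, 900, False),
]
replay = deque()
step1 = training_batches(rollouts, replay)
step2 = training_batches(rollouts, replay)
```

In this toy run the first step yields only a priority batch (the 100-token correct response and the 900-token incorrect one), while the second step also yields a compensation batch because the buffer has accumulated four leftover responses.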
Related papers
- Stepwise Penalization for Length-Efficient Chain-of-Thought Reasoning [66.22060690012512]
Large reasoning models improve with more test-time computation, but often overthink, producing unnecessarily long chains-of-thought that raise cost without improving accuracy. We propose Step-wise Adaptive Penalization (SWAP), a fine-grained framework that allocates length reduction across steps based on intrinsic contribution.
Date: 2026-02-27T20:23:59Z
- Recycling Failures: Salvaging Exploration in RLVR via Fine-Grained Off-Policy Guidance [39.88432082145946]
We propose SCOPE, a novel framework that utilizes Process Reward Models to pinpoint the first erroneous step in suboptimal rollouts. By applying precise refinement to partially correct rollouts, our method effectively salvages partially correct trajectories and increases diversity score by 13.5%. Our approach establishes new state-of-the-art results, achieving an average accuracy of 46.6% on math reasoning and exhibiting robust generalization with 53.4% accuracy on out-of-distribution reasoning tasks.
Date: 2026-02-27T15:49:23Z
- Contextual Rollout Bandits for Reinforcement Learning with Verifiable Rewards [69.74686029941881]
Reinforcement Learning with Verifiable Rewards (RLVR) is an effective paradigm for improving the reasoning capabilities of large language models. We propose a unified neural scheduling framework that adaptively selects high-value rollouts throughout training. Experiments on six mathematical reasoning benchmarks demonstrate consistent gains in performance and training efficiency across multiple RLVR optimization methods.
Date: 2026-02-09T10:51:58Z
- Beyond Correctness: Learning Robust Reasoning via Transfer [51.403609251508904]
We adopt a simple philosophical view: robust reasoning should remain useful beyond the mind that produced it. We introduce Reinforcement Learning with Transferable Reward, which operationalizes robustness via transfer reward. Our approach improves sampling consistency while improving final answer accuracy, and it reaches comparable performance in substantially fewer training steps.
Date: 2026-02-09T10:41:44Z
- Efficient Reasoning via Reward Model [24.105621725286497]
Reinforcement learning with verifiable rewards (RLVR) has been shown to enhance the reasoning capabilities of large language models (LLMs). LRMs such as DeepSeek-R1 and OpenAI o1 often generate verbose responses containing redundant or irrelevant reasoning steps, a phenomenon known as overthinking. We introduce a novel reward formulation named Conciseness Reward Function (CRF) with explicit dependency between the outcome reward and conciseness score.
Date: 2025-11-12T09:51:07Z
- Explore Data Left Behind in Reinforcement Learning for Reasoning Language Models [61.78513830395669]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach for improving the reasoning abilities of large language models (LLMs). As models train longer and scale larger, more training prompts become residual prompts, those with zero variance rewards that provide no training signal. We propose the Explore Residual Prompts in Policy Optimization framework, which encourages exploration on residual prompts and reactivates their training signals.
Date: 2025-11-06T20:40:27Z
- SPARK: Synergistic Policy And Reward Co-Evolving Framework [84.22494672256894]
We introduce the Synergistic Policy And Reward Co-Evolving Framework (SPARK), an efficient, on-policy, and stable method that builds on RLVR. Instead of discarding rollouts and correctness data, SPARK recycles this valuable information to simultaneously train the model itself as a generative reward model. We show that SPARK achieves significant performance gains on multiple LLM and LVLM models and multiple reasoning, reward models, and general benchmarks.
Date: 2025-09-26T17:50:12Z
- Self-Aligned Reward: Towards Effective and Efficient Reasoners [16.740993145927774]
Self-aligned reward (SAR) is a self-guided signal that complements verifiable rewards to encourage both reasoning accuracy and efficiency. We show that SAR reliably distinguishes answer quality: concise, correct answers score higher than redundant ones, and partially correct answers score higher than entirely incorrect ones.
Date: 2025-09-05T20:39:43Z
- Learning to Reason for Factuality [48.08503522255537]
We propose a novel reward function that simultaneously considers the factual precision, response detail level, and answer relevance. Our model achieves an average reduction of 23.1 percentage points in hallucination rate, a 23% increase in answer detail level, and no degradation in the overall response helpfulness.
Date: 2025-08-07T17:57:09Z
- Stable Reinforcement Learning for Efficient Reasoning [2.838966689544288]
GRPO-$\lambda$ is an efficient and stabilized variant of GRPO. It dynamically adjusts the reward strategy by monitoring the correctness ratio. It improves average accuracy by 1.48% while reducing CoT sequence length by 47.3%.
Date: 2025-05-23T16:43:03Z
- When to Continue Thinking: Adaptive Thinking Mode Switching for Efficient Reasoning [20.233873556056487]
Large reasoning models (LRMs) achieve remarkable performance via long reasoning chains, but often incur excessive computational overhead due to redundant reasoning. We propose Adaptive Self-Recovery Reasoning (ASRR), a framework that suppresses unnecessary reasoning and enables implicit recovery. Our results highlight the potential of ASRR for enabling efficient, adaptive, and safer reasoning in LRMs.
Date: 2025-05-21T11:41:39Z
- Efficient RL Training for Reasoning Models via Length-Aware Optimization [104.97188611117353]
We propose three critical reward designs integrated directly into the reinforcement learning process of large reasoning models. Our method significantly decreases response length while maintaining or even improving performance.
Date: 2025-05-18T07:46:43Z
- HRLAIF: Improvements in Helpfulness and Harmlessness in Open-domain Reinforcement Learning From AI Feedback [47.12549302721597]
We propose Hybrid Reinforcement Learning from AI Feedback (HRLAIF).
This method enhances the accuracy of AI annotations for responses, making the model's helpfulness more robust during training.
HRLAIF inherits the ability of RLAIF to enhance human preference for outcomes at a low cost while also improving the satisfaction rate of responses.
Date: 2024-03-13T07:38:20Z
- Solving Offline Reinforcement Learning with Decision Tree Regression [0.0]
This study presents a novel approach to addressing offline reinforcement learning problems by reframing them as regression tasks.
We introduce two distinct frameworks: return-conditioned and return-weighted decision tree policies.
Despite the simplification inherent in this reformulated approach to offline RL, our agents demonstrate performance that is at least on par with the established methods.
Date: 2024-01-21T23:50:46Z