Stabilizing Long-term Multi-turn Reinforcement Learning with Gated Rewards
- URL: http://arxiv.org/abs/2508.10548v1
- Date: Thu, 14 Aug 2025 11:37:02 GMT
- Authors: Zetian Sun, Dongfang Li, Zhuoen Chen, Yuhuai Qin, Baotian Hu
- Abstract summary: We introduce the SWE-oriented RL Framework, a unified system supporting multi-turn interaction, Docker-based execution, and customizable reward functions. We also propose Gated Reward Accumulation (G-RA), a novel method that accumulates immediate rewards only when high-level (long-term) rewards meet a predefined threshold.
- Score: 13.70228195630989
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Reward sparsity in long-horizon reinforcement learning (RL) tasks remains a significant challenge, while existing outcome-based reward shaping struggles to define meaningful immediate rewards without introducing bias or requiring explicit task decomposition. Alternatively, verification-based reward shaping uses stepwise critics, but misalignment between immediate rewards and long-term objectives can lead to reward hacking and suboptimal policies. In this work, we address this problem in the context of software engineering (SWE) tasks, where multi-turn reasoning and rule-based verification are critical. We introduce the SWE-oriented RL Framework, a unified system supporting multi-turn interaction, Docker-based execution, and customizable reward functions. Additionally, we propose Gated Reward Accumulation (G-RA), a novel method that accumulates immediate rewards only when high-level (long-term) rewards meet a predefined threshold, ensuring stable RL optimization. Experiments on SWE-bench Verified and kBench demonstrate that G-RA leads to an increase in completion rates (47.6% → 93.8% and 22.0% → 86.0%) and modification rates (19.6% → 23.8% and 12.0% → 42.0%), while avoiding policy degradation caused by reward misalignment. Our findings highlight the importance of balanced reward accumulation in long-horizon RL and provide a practical solution.
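The gating idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything below is an assumption made for illustration: the function name `gated_reward`, the choice to gate once per trajectory, the `>=` comparison, and the decision to always keep the high-level reward in the total are all hypothetical and may differ from the paper's actual method.

```python
from typing import Sequence


def gated_reward(
    immediate_rewards: Sequence[float],
    high_level_reward: float,
    threshold: float,
) -> float:
    """Illustrative gated accumulation (hypothetical, not the paper's code).

    Immediate (stepwise) rewards are added to the total only when the
    high-level (long-term) reward reaches the threshold. Otherwise the
    total is the high-level reward alone.
    """
    total = high_level_reward
    if high_level_reward >= threshold:
        # Gate is open: stepwise rewards are allowed to accumulate.
        total += sum(immediate_rewards)
    return total


if __name__ == "__main__":
    steps = [0.5, 0.25]
    # Gate open: the long-term reward meets the threshold.
    print(gated_reward(steps, high_level_reward=1.0, threshold=1.0))
    # Gate closed: stepwise rewards are ignored.
    print(gated_reward(steps, high_level_reward=0.0, threshold=1.0))
```

Under these assumptions, a policy cannot raise its return by collecting stepwise rewards alone while the long-term objective is unmet, which is the kind of reward hacking the abstract says the method is meant to avoid.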
Related papers
- IR$^3$: Contrastive Inverse Reinforcement Learning for Interpretable Detection and Mitigation of Reward Hacking [67.20568716300272]
Reinforcement Learning from Human Feedback (RLHF) enables powerful LLM alignment but can introduce reward hacking. We introduce IR3 (Interpretable Reward Reconstruction and Rectification), a framework that reverse-engineers, interprets, and surgically repairs the implicit objectives driving RLHF-tuned models. We show that IR3 achieves 0.89 correlation with ground-truth rewards, identifies hacking features with over 90% precision, and significantly reduces hacking behaviors while maintaining capabilities within 3% of the original model.
arXiv Detail & Related papers (2026-02-23T01:14:53Z)
- SPARK: Synergistic Policy And Reward Co-Evolving Framework [84.22494672256894]
We introduce the Synergistic Policy And Reward Co-Evolving Framework (SPARK), an efficient, on-policy, and stable method that builds on RLVR. Instead of discarding rollouts and correctness data, SPARK recycles this valuable information to simultaneously train the model itself as a generative reward model. We show that SPARK achieves significant performance gains across multiple LLMs and LVLMs on reasoning, reward-model, and general benchmarks.
arXiv Detail & Related papers (2025-09-26T17:50:12Z)
- Agentic Reinforcement Learning with Implicit Step Rewards [92.26560379363492]
Large language models (LLMs) are increasingly developed as autonomous agents using reinforcement learning (agentic RL). We introduce implicit step rewards for agentic RL (iStar), a general credit-assignment strategy that integrates seamlessly with standard RL algorithms. We evaluate our method on three challenging agent benchmarks, including WebShop and VisualSokoban, as well as open-ended social interactions with unverifiable rewards in SOTOPIA.
arXiv Detail & Related papers (2025-09-23T16:15:42Z)
- Stable Reinforcement Learning for Efficient Reasoning [2.838966689544288]
GRPO-$\lambda$ is an efficient and stabilized variant of GRPO. It dynamically adjusts the reward strategy by monitoring the correctness ratio. It improves average accuracy by 1.48% while reducing CoT sequence length by 47.3%.
arXiv Detail & Related papers (2025-05-23T16:43:03Z)
- Towards Improving Reward Design in RL: A Reward Alignment Metric for RL Practitioners [15.25763345316458]
Reinforcement learning agents are fundamentally limited by the quality of the reward functions they learn from. We introduce the Trajectory Alignment Coefficient to quantify the similarity between a human stakeholder's ranking of trajectory distributions and those induced by a given reward function.
arXiv Detail & Related papers (2025-03-08T00:38:17Z)
- Process Reinforcement through Implicit Rewards [95.7442934212076]
Dense process rewards have proven a more effective alternative to the sparse outcome-level rewards in the inference-time scaling of large language models (LLMs). Dense rewards also offer an appealing choice for the reinforcement learning (RL) of LLMs since their fine-grained rewards have the potential to address some inherent issues of outcome rewards. This can be primarily attributed to the challenges of training process reward models (PRMs) online, where collecting high-quality process labels is prohibitively expensive. We propose PRIME, which enables online PRM updates using only policy rollouts and outcome labels through implicit process rewards.
arXiv Detail & Related papers (2025-02-03T15:43:48Z)
- Reinforcement Learning from Bagged Reward [46.16904382582698]
In Reinforcement Learning (RL), it is commonly assumed that an immediate reward signal is generated for each action taken by the agent. In many real-world scenarios, designing immediate reward signals is difficult. We propose a novel reward redistribution method equipped with a bidirectional attention mechanism.
arXiv Detail & Related papers (2024-02-06T07:26:44Z)
- The Distributional Reward Critic Framework for Reinforcement Learning Under Perturbed Rewards [31.550669983576544]
The reward signal plays a central role in defining the desired behaviors of agents in reinforcement learning. We introduce a distributional reward critic framework for estimating reward distributions and perturbations during training. Our results broaden and deepen our ability to perform RL in reward-perturbed environments.
arXiv Detail & Related papers (2024-01-11T07:25:28Z)
- REBEL: Reward Regularization-Based Approach for Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and human preferences can lead to catastrophic outcomes in the real world. Recent methods aim to mitigate misalignment by learning reward functions from human preferences. We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z)
- The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models [85.68751244243823]
Reward hacking -- where RL agents exploit gaps in misspecified reward functions -- has been widely observed, but not yet systematically studied. We investigate reward hacking as a function of agent capabilities: model capacity, action space resolution, observation space noise, and training time. We find instances of phase transitions: capability thresholds at which the agent's behavior qualitatively shifts, leading to a sharp decrease in the true reward.
arXiv Detail & Related papers (2022-01-10T18:58:52Z)