Rethinking Reward Miscalibration of GRPO in Agentic RL
- URL: http://arxiv.org/abs/2509.23870v2
- Date: Mon, 13 Oct 2025 08:28:52 GMT
- Title: Rethinking Reward Miscalibration of GRPO in Agentic RL
- Authors: Jingyu Liu, Xiaopeng Wu, Jingquan Peng, Kehan Chen, Chuan Yu, Lizhong Ding, Yong Liu,
- Abstract summary: We show that outcome-based reward ensures an expected negative advantage for flawed middle steps. We propose training the actor to classify actions as good or bad, separating the embeddings of good and bad actions.
- Score: 18.495499496405635
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Building autonomous agents capable of solving long-horizon, real-world tasks has garnered significant research interest. Outcome-based rewards are often blamed for reward miscalibration: they may mistakenly assign positive reward to flawed intermediate steps, which is regarded as the key reason bad actions get reinforced during training. However, we show that outcome-based reward in fact guarantees a negative expected advantage for those flawed intermediate steps, meaning they should be penalized during training. Even accounting for the "squeezing effect", the probability mass of good actions should increase and the actor should gradually shed harmful actions. We instead identify gradient coupling between similar samples as a key issue in agentic RL: input prompts are extremely similar and the output action space is limited, so gradients from well-performing samples can inadvertently strengthen suboptimal or incorrect actions that share similar observations and outputs. We show that, under gradient coupling, some flawed actions may still be reinforced. To address this, we propose training the actor to also classify actions as good or bad, separating the embeddings of good and bad actions and alleviating gradient interference; extensive experiments demonstrate its effectiveness.
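For intuition about the two claims above, the sketch below is a minimal, assumed PyTorch implementation, not the paper's actual code: the function names, the broadcast of one advantage per trajectory, and the `aux_coef` weight are illustrative. It computes GRPO-style group-relative advantages from outcome rewards and adds an auxiliary good/bad-action classification loss intended to separate the embeddings of good and bad actions.

```python
# Minimal sketch (PyTorch, assumed): GRPO group-relative advantages from outcome
# rewards, plus an auxiliary good/bad-action classification loss that pushes the
# embeddings of good and bad actions apart. Names and weighting are illustrative.
import torch
import torch.nn.functional as F

def grpo_advantages(outcome_rewards: torch.Tensor) -> torch.Tensor:
    """outcome_rewards: (G,) one scalar outcome reward per rollout in a group.
    A rollout containing a flawed step tends to receive a below-mean reward,
    so in expectation that step gets a negative group-relative advantage."""
    mean = outcome_rewards.mean()
    std = outcome_rewards.std().clamp_min(1e-6)
    return (outcome_rewards - mean) / std

def policy_and_aux_loss(logprobs, advantages, action_embeddings, good_labels,
                        classifier_head, aux_coef: float = 0.1):
    """logprobs: (G, T) log-probs of the sampled actions/tokens per trajectory
    advantages: (G,) broadcast to every step of the corresponding trajectory
    action_embeddings: (N, d) hidden states of individual actions
    good_labels: (N,) 1 for good actions, 0 for flawed ones
    classifier_head: e.g. nn.Linear(d, 1) scoring whether an action is good."""
    # Standard policy-gradient surrogate with group-relative advantages.
    pg_loss = -(advantages.unsqueeze(1) * logprobs).mean()
    # Auxiliary classification: separates good/bad action embeddings, which the
    # paper argues alleviates gradient coupling between similar samples.
    logits = classifier_head(action_embeddings).squeeze(-1)
    aux_loss = F.binary_cross_entropy_with_logits(logits, good_labels.float())
    return pg_loss + aux_coef * aux_loss
```

The intended effect, per the abstract's argument, is that even when two trajectories share near-identical observations, the classification objective keeps their good and bad actions apart in representation space, so gradient updates from successful samples are less likely to spill over onto flawed ones.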
Related papers
- GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training [62.536191233049614]
Reinforcement learning with verifiable outcome rewards (RLVR) has effectively scaled up chain-of-thought (CoT) reasoning in large language models (LLMs). This work investigates the problem through extensive experiments on complex card games, such as 24 points, and embodied tasks from ALFWorld. We find that when rewards are based solely on action outcomes, RL fails to incentivize CoT reasoning in VLMs, instead leading to a phenomenon we term thought collapse.
arXiv Detail & Related papers (2025-03-11T15:17:02Z) - Reducing Action Space for Deep Reinforcement Learning via Causal Effect Estimation [15.684669299728743]
We propose a method to improve exploration efficiency by estimating the causal effects of actions. We first pre-train an inverse dynamics model to serve as prior knowledge of the environment. We classify actions across the entire action space at each time step and estimate the causal effect of each action to suppress redundant actions.
arXiv Detail & Related papers (2025-01-24T14:47:33Z) - Preemptive Detection and Correction of Misaligned Actions in LLM Agents [58.39520480675366]
InferAct is a novel approach to detect misaligned actions before execution. It alerts users for timely correction, preventing adverse outcomes. InferAct achieves up to 20% improvements on Marco-F1 against baselines in misaligned action detection.
arXiv Detail & Related papers (2024-07-16T15:24:44Z) - Can Active Sampling Reduce Causal Confusion in Offline Reinforcement Learning? [58.942118128503104]
Causal confusion is a phenomenon where an agent learns a policy that reflects imperfect spurious correlations in the data.
This phenomenon is particularly pronounced in domains such as robotics.
In this paper, we study causal confusion in offline reinforcement learning.
arXiv Detail & Related papers (2023-12-28T17:54:56Z) - Would I have gotten that reward? Long-term credit assignment by counterfactual contribution analysis [50.926791529605396]
We introduce Counterfactual Contribution Analysis (COCOA), a new family of model-based credit assignment algorithms.
Our algorithms achieve precise credit assignment by measuring the contribution of actions towards obtaining subsequent rewards.
arXiv Detail & Related papers (2023-06-29T09:27:27Z) - Distributional Reward Estimation for Effective Multi-Agent Deep Reinforcement Learning [19.788336796981685]
We propose a novel Distributional Reward Estimation framework for effective Multi-Agent Reinforcement Learning (DRE-MARL).
Our main idea is to design the multi-action-branch reward estimation and policy-weighted reward aggregation for stabilized training.
The superiority of the DRE-MARL is demonstrated using benchmark multi-agent scenarios, compared with the SOTA baselines in terms of both effectiveness and robustness.
arXiv Detail & Related papers (2022-10-14T08:31:45Z) - The Equalization Losses: Gradient-Driven Training for Long-tailed Object Recognition [84.51875325962061]
We propose a gradient-driven training mechanism to tackle the long-tail problem.
We introduce a new family of gradient-driven loss functions, namely equalization losses.
Our method consistently outperforms the baseline models.
arXiv Detail & Related papers (2022-10-11T16:00:36Z) - Utilizing Skipped Frames in Action Repeats via Pseudo-Actions [13.985534521589253]
In many deep reinforcement learning settings, when an agent takes an action, it repeats the same action a predefined number of times without observing the states until the next action-decision point.
Since the amount of training data is inversely proportional to the action-repeat interval, action repeats can have a negative impact on the sample efficiency of training.
We propose a simple but effective approach to alleviate this problem by introducing the concept of pseudo-actions.
arXiv Detail & Related papers (2021-05-07T02:43:44Z) - Combating False Negatives in Adversarial Imitation Learning [67.99941805086154]
In adversarial imitation learning, a discriminator is trained to differentiate agent episodes from expert demonstrations representing the desired behavior.
As the trained policy learns to be more successful, the negative examples become increasingly similar to expert ones.
We propose a method to alleviate the impact of false negatives and test it on the BabyAI environment.
arXiv Detail & Related papers (2020-02-02T14:56:39Z) - Effects of sparse rewards of different magnitudes in the speed of learning of model-based actor critic methods [0.4640835690336653]
We show that we can influence an agent to learn faster by applying an external environmental pressure during training.
Results have been shown to be valid for Deep Deterministic Policy Gradients using Hindsight Experience Replay in a well-known MuJoCo environment.
arXiv Detail & Related papers (2020-01-18T20:52:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.