RC-GRPO: Reward-Conditioned Group Relative Policy Optimization for Multi-Turn Tool Calling Agents
- URL: http://arxiv.org/abs/2602.03025v1
- Date: Tue, 03 Feb 2026 02:47:32 GMT
- Title: RC-GRPO: Reward-Conditioned Group Relative Policy Optimization for Multi-Turn Tool Calling Agents
- Authors: Haitian Zhong, Jixiu Zhai, Lei Song, Jiang Bian, Qiang Liu, Tieniu Tan,
- Abstract summary: Multi-turn tool calling is challenging for Large Language Models because rewards are sparse and exploration is expensive. A common recipe, SFT followed by GRPO, can stall when within-group reward variation is low. We propose RC-GRPO, which treats exploration as a controllable steering problem via discrete reward tokens.
- Score: 40.88916135445381
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-turn tool calling is challenging for Large Language Models (LLMs) because rewards are sparse and exploration is expensive. A common recipe, SFT followed by GRPO, can stall when within-group reward variation is low (e.g., most rollouts in a group receive all-0 or all-1 rewards), making the group-normalized advantage uninformative and yielding vanishing updates. To address this problem, we propose RC-GRPO (Reward-Conditioned Group Relative Policy Optimization), which treats exploration as a controllable steering problem via discrete reward tokens. We first fine-tune a Reward-Conditioned Trajectory Policy (RCTP) on mixed-quality trajectories with reward-goal special tokens (e.g., <|high_reward|>, <|low_reward|>) injected into the prompts, teaching the model to generate trajectories of distinct quality on demand. During RL, we then sample diverse reward tokens within each GRPO group and condition rollouts on the sampled token, which increases within-group diversity and yields more informative advantages. On the Berkeley Function Calling Leaderboard v4 (BFCLv4) multi-turn benchmark, our method consistently outperforms baselines, and with Qwen-2.5-7B-Instruct it even surpasses all closed-source API models.
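As a concrete illustration of the mechanism described in the abstract, here is a minimal Python sketch of reward-token conditioning: one helper builds RCTP fine-tuning examples with the quality-matched reward token injected into the prompt, and another samples a GRPO group in which each rollout is conditioned on a freshly drawn reward token. The helpers `generate_rollout` and `terminal_reward`, the uniform token sampling, and the example format are assumptions for illustration, not the paper's implementation.

```python
import random

REWARD_TOKENS = ["<|high_reward|>", "<|low_reward|>"]  # discrete reward-goal tokens

def make_rctp_example(prompt: str, trajectory: str, success: bool) -> dict:
    """Build one SFT example for the Reward-Conditioned Trajectory Policy: the reward
    token matching the trajectory's observed quality is injected into the prompt, so
    the model learns to produce trajectories of that quality on demand."""
    token = "<|high_reward|>" if success else "<|low_reward|>"
    return {"prompt": f"{token}\n{prompt}", "completion": trajectory}

def sample_rc_grpo_group(prompt: str, group_size: int, generate_rollout, terminal_reward):
    """During RL, condition each rollout in a GRPO group on a freshly sampled reward
    token so the group mixes trajectory qualities and within-group reward variance
    (and hence the group-normalized advantage) stays informative."""
    rollouts, rewards = [], []
    for _ in range(group_size):
        token = random.choice(REWARD_TOKENS)      # uniform sampling is an assumption
        traj = generate_rollout(f"{token}\n{prompt}")
        rollouts.append(traj)
        rewards.append(terminal_reward(prompt, traj))
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    advantages = [(r - mean) / (std + 1e-6) for r in rewards]  # group-normalized advantage
    return rollouts, advantages
```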
Related papers
- iGRPO: Self-Feedback-Driven LLM Reasoning [88.83313431248473]
Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts. Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models.
arXiv Detail & Related papers (2026-02-09T18:45:11Z)
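A rough sketch of what the two-stage, draft-conditioned sampling could look like, inferred only from the abstract; the single-draft first stage and the revision prompt template are assumptions.

```python
def igrpo_style_group(prompt: str, group_size: int, generate):
    """Two-stage sampling guessed from the abstract (not the paper's exact recipe):
    stage one produces a model-generated draft, stage two conditions every GRPO
    rollout in the group on that draft via a revision prompt."""
    draft = generate(prompt)  # stage 1: self-generated draft
    conditioned = f"{prompt}\n\nPrevious draft:\n{draft}\n\nRevise and give a final answer:"
    return [generate(conditioned) for _ in range(group_size)]  # stage 2: group rollouts
```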
- Self-Hinting Language Models Enhance Reinforcement Learning [37.311361929798714]
We propose self-hint aligned GRPO with privileged supervision (SAGE). SAGE injects privileged hints during training to reshape the rollout distribution under the same terminal verifier reward. Experiments over 6 benchmarks with 3 LLMs show that SAGE consistently outperforms GRPO.
arXiv Detail & Related papers (2026-02-03T05:56:20Z)
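A hedged sketch of hint-injected rollouts in the spirit of SAGE's description; the hint format, the injection probability `p_hint`, and the `generate`/`verify` callables are assumptions, and only the generation prompt sees the hint while the verifier reward is unchanged.

```python
import random

def sage_style_rollouts(prompt, hint, group_size, generate, verify, p_hint=0.5):
    """Hint-injected rollouts (mechanics assumed, see lead-in): a privileged hint is
    prepended to the generation prompt with probability p_hint to reshape the rollout
    distribution, while the reward stays the unchanged terminal verifier score."""
    rollouts, rewards = [], []
    for _ in range(group_size):
        gen_prompt = f"Hint: {hint}\n{prompt}" if random.random() < p_hint else prompt
        out = generate(gen_prompt)
        rollouts.append(out)
        rewards.append(verify(prompt, out))  # the verifier never sees the hint
    return rollouts, rewards
```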
- GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization [133.27496265096445]
Group Relative Policy Optimization is often applied in multi-reward settings without examining its suitability. We introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method that resolves the resulting issues. GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
arXiv Detail & Related papers (2026-01-08T18:59:24Z)
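The name "Group reward-Decoupled Normalization" suggests normalizing each reward component within the group before combining, in contrast to normalizing an already-combined reward; the sketch below contrasts the two, with the exact decoupled formula being an assumption.

```python
import numpy as np

def grpo_advantage(rewards):
    """Standard GRPO: normalize the single combined reward within the group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-6)

def gdpo_style_advantage(reward_components):
    """Decoupled variant suggested by the GDPO name (exact formula assumed):
    normalize each reward component within the group separately, then sum,
    so no single objective's scale or variance dominates the advantage."""
    comps = np.asarray(reward_components, dtype=float)  # shape (group_size, num_rewards)
    normed = (comps - comps.mean(axis=0)) / (comps.std(axis=0) + 1e-6)
    return normed.sum(axis=1)
```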
- $\lambda$-GRPO: Unifying the GRPO Frameworks with Learnable Token Preferences [22.199479724764725]
We introduce a learnable parameter $\lambda$ that adaptively controls token-level weighting. We find that $\lambda$-GRPO achieves consistent improvements over vanilla GRPO and DAPO. These gains come without any modifications to the training data or additional computational cost.
arXiv Detail & Related papers (2025-10-08T10:39:07Z)
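A speculative sketch of a learnable token-weighting exponent; the specific parameterization (weights proportional to $|y_i|^{\lambda-1}$, normalized over the group) is an assumption chosen so that $\lambda = 0$ recovers GRPO-style per-sequence averaging and $\lambda = 1$ recovers DAPO-style flat per-token averaging, and may differ from the paper's definition.

```python
import torch

class LambdaTokenWeighting(torch.nn.Module):
    """Learnable token-weighting exponent (parameterization assumed, see lead-in):
    the weight applied to every token of response i is |y_i|**(lam - 1), normalized
    over the group, so lam = 0 gives GRPO-style per-sequence averaging and
    lam = 1 gives DAPO-style flat per-token averaging."""
    def __init__(self):
        super().__init__()
        self.lam = torch.nn.Parameter(torch.zeros(()))  # learnable exponent, init at 0

    def forward(self, lengths: torch.Tensor) -> torch.Tensor:
        lengths = lengths.float()                       # (group_size,) response lengths |y_i|
        per_token = lengths ** (self.lam - 1.0)         # weight of each token in response i
        return per_token / (lengths ** self.lam).sum()  # total weight over all tokens sums to 1
```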
- MO-GRPO: Mitigating Reward Hacking of Group Relative Policy Optimization on Multi-Objective Problems [18.92779479033295]
Group Relative Policy Optimization has been shown to be an effective algorithm when an accurate reward model is available. We propose MO-GRPO, an extension of GRPO with a simple normalization method that automatically reweights the reward functions according to the variances of their values. We show that MO-GRPO ensures that all reward functions contribute evenly to the loss function while preserving the order of preferences.
arXiv Detail & Related papers (2025-09-26T08:32:22Z)
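The variance-based reweighting described in the abstract can be sketched as dividing each reward function by its within-group standard deviation before summing, which keeps every objective on a comparable scale and preserves per-objective orderings; the exact estimator and the epsilon are assumptions.

```python
import numpy as np

def mo_grpo_style_reward(reward_matrix):
    """Variance-based reweighting as described in the abstract (estimator assumed):
    each reward function is divided by its within-group standard deviation so all
    objectives contribute on a comparable scale, and per-objective rankings of the
    samples are preserved."""
    r = np.asarray(reward_matrix, dtype=float)  # shape (group_size, num_rewards)
    scaled = r / (r.std(axis=0) + 1e-6)         # rescale each reward function independently
    return scaled.sum(axis=1)                   # combined scalar reward fed to GRPO
```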
- Reinforcing Video Reasoning with Focused Thinking [65.85683941058916]
We propose TW-GRPO, a novel framework that enhances visual reasoning with focused thinking and dense reward granularity. Specifically, we employ a token weighting mechanism that prioritizes tokens with high informational density. We also reformulate RL training by shifting from single-choice to multi-choice QA tasks.
arXiv Detail & Related papers (2025-05-30T15:42:19Z)
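One way the shift from single-choice to multi-choice QA can yield denser rewards is partial credit over the selected option set; the intersection-over-union scoring rule below is an assumed example, not necessarily TW-GRPO's.

```python
def multi_choice_reward(predicted: set, gold: set) -> float:
    """Partial-credit scoring for multi-choice QA (an assumed rule, see lead-in):
    intersection-over-union between selected and gold option sets gives a dense
    reward in [0, 1] instead of a 0/1 exact-match signal."""
    if not predicted and not gold:
        return 1.0
    return len(predicted & gold) / len(predicted | gold)
```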
- Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning [55.15106182268834]
Reinforcement learning with verifiable rewards (RLVR) has emerged as the leading approach for enhancing reasoning capabilities in large language models. It faces a fundamental compute and memory asymmetry: rollout generation is embarrassingly parallel and memory-light, whereas policy updates are communication-heavy and memory-intensive. We introduce PODS (Policy Optimization with Down-Sampling), which decouples rollout generation from policy updates by training only on a strategically selected subset of rollouts.
arXiv Detail & Related papers (2025-04-18T17:49:55Z)
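A minimal sketch of the generation/update decoupling: produce a large batch of rollouts, then update only on a small subset. The selection rule here (keeping the highest- and lowest-reward rollouts to maximize reward spread) is one plausible choice and is an assumption, not necessarily the paper's.

```python
def downsample_rollouts(rollouts, rewards, keep: int):
    """Decouple generation from updates: generate many rollouts, train on `keep` of them.
    The selection rule (retain the highest- and lowest-reward rollouts to maximize the
    reward spread within the subset) is an assumed choice, see lead-in."""
    assert keep <= len(rollouts)
    order = sorted(range(len(rollouts)), key=lambda i: rewards[i])
    half = keep // 2
    chosen = order[:half] + order[len(order) - (keep - half):]
    return [rollouts[i] for i in chosen], [rewards[i] for i in chosen]
```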
- Optimizing Safe and Aligned Language Generation: A Multi-Objective GRPO Approach [2.8626097661711394]
Reinforcement Learning from Human Feedback has achieved notable success in steering models, but is complex and can be unstable. Recent approaches such as Direct Preference Optimization (DPO) simplify preference-based fine-tuning but may introduce bias or trade off certain objectives. We propose a Group Relative Policy Optimization framework with a multi-label reward regression model to achieve safe and aligned language generation.
arXiv Detail & Related papers (2025-03-26T05:50:33Z)
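A sketch of what a multi-label reward regression head feeding GRPO could look like; the label set, the linear head, and the weighted-sum combination are assumptions for illustration.

```python
import torch

class MultiLabelRewardModel(torch.nn.Module):
    """Multi-label reward regression head (architecture assumed, see lead-in):
    a pooled backbone representation is regressed onto several objective scores,
    which are combined into one scalar reward per sample for GRPO."""
    def __init__(self, hidden_size: int, num_labels: int = 3):
        super().__init__()
        self.head = torch.nn.Linear(hidden_size, num_labels)  # e.g. safety, helpfulness, coherence

    def forward(self, pooled: torch.Tensor, label_weights: torch.Tensor) -> torch.Tensor:
        scores = self.head(pooled)      # (batch, num_labels) regressed objective scores
        return scores @ label_weights   # weighted sum -> (batch,) scalar rewards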
- REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models. In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL. We find that REBEL provides a unified approach to language modeling and image generation with stronger or similar performance as PPO and DPO.
arXiv Detail & Related papers (2024-04-25T17:20:45Z)
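REBEL's central idea is to regress the scaled difference of policy log-ratios for a pair of completions onto their reward difference; below is a compact sketch of that squared-error objective (not the authors' code), with the pairing scheme and how sequence log-probabilities are computed left to the caller.

```python
def rebel_pair_loss(logp_new_a, logp_old_a, logp_new_b, logp_old_b,
                    reward_a, reward_b, eta: float = 1.0):
    """Squared-error regression on a pair of completions (a, b) for the same prompt:
    the scaled difference of policy log-ratios is fit to the reward difference.
    Arguments are torch tensors (or numpy arrays) of matching shape."""
    ratio_diff = (logp_new_a - logp_old_a) - (logp_new_b - logp_old_b)
    return ((ratio_diff / eta - (reward_a - reward_b)) ** 2).mean()
```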