XRPO: Pushing the limits of GRPO with Targeted Exploration and Exploitation
- URL: http://arxiv.org/abs/2510.06672v2
- Date: Thu, 09 Oct 2025 01:39:47 GMT
- Title: XRPO: Pushing the limits of GRPO with Targeted Exploration and Exploitation
- Authors: Udbhav Bamba, Minghao Fang, Yifan Yu, Haizhong Zheng, Fan Lai
- Abstract summary: Reinforcement learning algorithms such as GRPO have driven recent advances in large language model (LLM) reasoning. Existing approaches suffer from limited exploration on challenging prompts and leave informative feedback signals underexploited. This paper presents XRPO (eXplore-eXploit GRPO), a unified framework that recasts policy optimization through the principled lens of rollout exploration-exploitation.
- Score: 8.511469090666077
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning algorithms such as GRPO have driven recent advances in large language model (LLM) reasoning. While scaling the number of rollouts stabilizes training, existing approaches suffer from limited exploration on challenging prompts and leave informative feedback signals underexploited, due to context-independent rollout allocation across prompts (e.g., generating 16 rollouts per prompt) and heavy reliance on sparse rewards. This paper presents XRPO (eXplore-eXploit GRPO), a unified framework that recasts policy optimization through the principled lens of rollout exploration-exploitation. To enhance exploration, XRPO introduces a mathematically grounded rollout allocator that adaptively prioritizes prompts with higher potential for uncertainty reduction. It further addresses stagnation on zero-reward prompts through an in-context seeding strategy that injects curated exemplars, steering the model into more difficult reasoning trajectories. To strengthen exploitation, XRPO develops a group-relative, novelty-aware advantage sharpening mechanism that leverages sequence likelihoods to amplify low-probability yet correct responses, thereby extending the policy's reach beyond sparse rewards. Experiments across diverse math and coding benchmarks, on both reasoning and non-reasoning models, demonstrate that XRPO outperforms existing methods (e.g., GRPO and GSPO) by up to 4% pass@1 and 6% cons@32, while accelerating training convergence by up to 2.7x.
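As context for the GRPO variants listed below, the group-relative advantage at the core of GRPO can be sketched as follows. The optional novelty bonus is only a loose illustration of likelihood-aware advantage sharpening of the kind the abstract describes; the `novelty_weight` parameter and the additive form are assumptions, not XRPO's actual formulation.

```python
import math

def grpo_advantages(rewards, seq_logprobs=None, novelty_weight=0.5):
    """Group-relative advantages as in GRPO: normalize each rollout's
    reward by the group mean and standard deviation. The optional
    novelty term (a hypothetical sketch, not the paper's method)
    boosts rewarded rollouts the policy assigned low likelihood."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0  # avoid division by zero for uniform groups
    adv = [(r - mean) / std for r in rewards]
    if seq_logprobs is not None:
        # Lower sequence log-likelihood => larger bonus, but only for
        # rollouts that actually earned a positive reward.
        max_lp = max(seq_logprobs)
        adv = [a + novelty_weight * (max_lp - lp) if r > 0 else a
               for a, lp, r in zip(adv, seq_logprobs, rewards)]
    return adv
```

With `seq_logprobs=None` this reduces to plain GRPO-style normalization; passing per-rollout log-likelihoods shifts extra credit toward rare correct responses.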
Related papers
- Unbiased Dynamic Pruning for Efficient Group-Based Policy Optimization [60.87651283510059]
Group Relative Policy Optimization (GRPO) effectively scales LLM reasoning but incurs prohibitive computational costs. We propose Dynamic Pruning Policy Optimization (DPPO), a framework that enables dynamic pruning while preserving unbiased gradient estimation. To mitigate the data sparsity induced by pruning, we introduce Dense Prompt Packing, a window-based greedy strategy.
arXiv Detail & Related papers (2026-03-04T14:48:53Z) - WS-GRPO: Weakly-Supervised Group-Relative Policy Optimization for Rollout-Efficient Reasoning [67.45237332694025]
Group Relative Policy Optimization is effective for training language models on complex reasoning. We propose Weakly Supervised GRPO, which improves rollout efficiency by converting terminal rewards into correctness-aware guidance.
arXiv Detail & Related papers (2026-02-19T02:43:35Z) - iGRPO: Self-Feedback-Driven LLM Reasoning [88.83313431248473]
Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts. Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models.
arXiv Detail & Related papers (2026-02-09T18:45:11Z) - Back to Basics: Revisiting Exploration in Reinforcement Learning for LLM Reasoning via Generative Probabilities [10.235183326885794]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an indispensable paradigm for enhancing reasoning in Large Language Models (LLMs). We analyze this issue from the perspective of sampling probability dynamics, identifying that the standard objective disproportionately reinforces the highest-likelihood paths. We propose a novel Advantage Re-weighting Mechanism (ARM) designed to equilibrate confidence levels across all correct responses.
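One way to picture the re-weighting idea this abstract describes is the sketch below, which equilibrates credit across correct responses in a rollout group. The softmax over negative log-likelihoods and the `temperature` parameter are assumptions for illustration, not the paper's published ARM formulation.

```python
import math

def arm_reweighted_advantages(rewards, seq_logprobs, temperature=1.0):
    """Hypothetical sketch of an advantage re-weighting mechanism:
    among the correct responses in a rollout group, up-weight
    low-likelihood ones so credit is spread across all correct paths
    instead of concentrating on the highest-likelihood one."""
    correct = [i for i, r in enumerate(rewards) if r > 0]
    weights = [0.0] * len(rewards)
    if correct:
        # Softmax over negative log-likelihoods: rarer correct
        # responses receive larger weights; mean weight is 1.0.
        exps = [math.exp(-seq_logprobs[i] / temperature) for i in correct]
        total = sum(exps)
        for i, e in zip(correct, exps):
            weights[i] = len(correct) * e / total
    return [w * r for w, r in zip(weights, rewards)]
```

When all correct responses are equally likely, the weights collapse to 1.0 and the rewards pass through unchanged.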
arXiv Detail & Related papers (2026-02-05T04:06:55Z) - Reward and Guidance through Rubrics: Promoting Exploration to Improve Multi-Domain Reasoning [79.365697698062]
We propose RGR-GRPO (Reward and Guidance through Rubrics), a framework for multi-domain reasoning. RGR-GRPO consistently outperforms RL methods that rely solely on alternative reward schemes or offline guidance.
arXiv Detail & Related papers (2025-11-15T20:14:51Z) - Think Outside the Policy: In-Context Steered Policy Optimization [13.24687763539952]
In-Context Steered Policy Optimization (ICPO) provides expert guidance using existing datasets. ICPO consistently enhances reinforcement learning performance and training stability on mathematical reasoning benchmarks.
arXiv Detail & Related papers (2025-10-30T14:14:15Z) - Unlocking Reasoning Capabilities in LLMs via Reinforcement Learning Exploration [8.839121572048018]
We propose RAPO, an algorithm that promotes broader yet focused exploration. We train Qwen2.5-3B and 7B models with RAPO on the 8K SimpleRL-Zero dataset. Results show that RAPO consistently improves problem-solving performance.
arXiv Detail & Related papers (2025-10-04T16:22:19Z) - $\text{G}^2$RPO: Granular GRPO for Precise Reward in Flow Models [74.21206048155669]
We propose a novel Granular GRPO ($\text{G}^2$RPO) framework that achieves precise and comprehensive reward assessment of sampling directions. We also introduce a Multi-Granularity Advantage Integration module that aggregates advantages computed at multiple diffusion scales, producing a more comprehensive and robust evaluation of the sampling directions.
arXiv Detail & Related papers (2025-10-02T12:57:12Z) - Risk-Sensitive RL for Alleviating Exploration Dilemmas in Large Language Models [22.50153462109328]
Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for enhancing Large Language Models (LLMs). We introduce a Risk-Sensitive Reinforcement Learning framework. Our approach employs a risk-seeking objective that interpolates between mean and maximum rewards, leading to a novel algorithm, RS-GRPO. Remarkably, RS-GRPO is simple to implement, requiring only minor code modifications.
arXiv Detail & Related papers (2025-09-29T04:12:20Z) - REX-RAG: Reasoning Exploration with Policy Correction in Retrieval-Augmented Generation [35.0649927279081]
Reinforcement learning (RL) is emerging as a powerful paradigm for enabling large language models (LLMs) to perform complex reasoning tasks. We propose REX-RAG, a framework that explores alternative reasoning paths while maintaining rigorous policy learning. We evaluate it on seven question-answering benchmarks; the experimental results show that REX-RAG achieves average performance gains of 5.1% on Qwen2.5-3B and 3.6% on Qwen2.5-7B.
arXiv Detail & Related papers (2025-08-11T16:25:25Z) - Exploration by Random Reward Perturbation [6.293868056239738]
We introduce Random Reward Perturbation (RRP), a novel exploration strategy for reinforcement learning (RL). Our theoretical analyses demonstrate that adding zero-mean noise to environmental rewards effectively enhances policy diversity during training. RRP is fully compatible with action-perturbation-based exploration strategies such as ε-greedy policies and entropy regularization.
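The zero-mean perturbation idea in this abstract is simple enough to sketch directly. The Gaussian noise model and the `sigma` scale below are assumptions for illustration; the paper's analysis only requires the noise to have zero mean.

```python
import random

def perturbed_reward(env_reward, sigma=0.1, rng=random):
    """Sketch of random reward perturbation: add zero-mean Gaussian
    noise to the environment reward so that, early in training,
    different rollouts of similar quality receive slightly different
    reinforcement, encouraging policy diversity. `sigma` is a
    hypothetical noise scale, not taken from the paper."""
    return env_reward + rng.gauss(0.0, sigma)
```

Because the noise has zero mean, the expected reward (and hence the optimal policy in expectation) is unchanged; only the per-sample gradient signal is diversified.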
arXiv Detail & Related papers (2025-06-10T12:34:00Z) - DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization [50.91849555841057]
Group Relative Policy Optimization is a reinforcement learning method for large reasoning models (LRMs). We introduce a new Discriminative Constrained Optimization (DisCO) framework for reinforcing LRMs, grounded in the principle of discriminative learning. DisCO significantly outperforms GRPO and its improved variants such as DAPO, achieving average gains of 7% over GRPO and 6% over DAPO.
arXiv Detail & Related papers (2025-05-18T11:08:32Z) - A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce [68.99924691391048]
We revisit GRPO from a REINFORCE-like algorithm perspective and analyze its core components. We find that a simple rejection sampling baseline, RAFT, yields performance competitive with GRPO and PPO. Motivated by this insight, we propose Reinforce-Rej, a minimal extension of policy gradient that filters out both entirely incorrect and entirely correct samples.
arXiv Detail & Related papers (2025-04-15T16:15:02Z)
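The filtering step described in the Reinforce-Rej abstract above can be sketched in a few lines: drop any prompt whose rollout group is all-correct or all-incorrect, since a uniform group produces zero group-relative advantage and contributes no learning signal. The dict-of-rewards representation is an assumption for illustration.

```python
def filter_informative_groups(prompt_groups):
    """Sketch of Reinforce-Rej-style filtering: keep only rollout
    groups with a mix of correct (reward > 0) and incorrect rollouts.
    Entirely correct or entirely incorrect groups are discarded, as
    they carry no group-relative gradient signal."""
    kept = []
    for group in prompt_groups:
        n_correct = sum(r > 0 for r in group["rewards"])
        if 0 < n_correct < len(group["rewards"]):
            kept.append(group)
    return kept
```

In practice such a filter runs after reward computation and before the policy-gradient update, shrinking the batch to only the prompts that can move the policy.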
This list is automatically generated from the titles and abstracts of the papers on this site.