Related papers: Reward and Guidance through Rubrics: Promoting Exploration to Improve Multi-Domain Reasoning

URL: http://arxiv.org/abs/2511.12344v2
Date: Tue, 18 Nov 2025 20:39:14 GMT
Title: Reward and Guidance through Rubrics: Promoting Exploration to Improve Multi-Domain Reasoning
Authors: Baolong Bi, Shenghua Liu, Yiwei Wang, Siqian Tong, Lingrui Mei, Yuyao Ge, Yilong Xu, Jiafeng Guo, Xueqi Cheng
Abstract summary: We propose $\textbf{RGR-GRPO}$ (Reward and Guidance through Rubrics), a framework for multi-domain reasoning. RGR-GRPO consistently outperforms RL methods that rely solely on alternative reward schemes or offline guidance.
Score: 79.365697698062
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in reinforcement learning (RL) have significantly improved the complex reasoning capabilities of large language models (LLMs). Despite these successes, existing methods mainly focus on single-domain RL (e.g., mathematics) with verifiable rewards (RLVR), and their reliance on purely online RL frameworks restricts the exploration space, thereby limiting reasoning performance. In this paper, we address these limitations by leveraging rubrics to provide both fine-grained reward signals and offline guidance. We propose $\textbf{RGR-GRPO}$ (Reward and Guidance through Rubrics), a rubric-driven RL framework for multi-domain reasoning. RGR-GRPO enables LLMs to receive dense and informative rewards while exploring a larger solution space during GRPO training. Extensive experiments across 14 benchmarks spanning multiple domains demonstrate that RGR-GRPO consistently outperforms RL methods that rely solely on alternative reward schemes or offline guidance. Compared with the verifiable online RL baseline, RGR-GRPO achieves average improvements of +7.0%, +5.4%, +8.4%, and +6.6% on mathematics, physics, chemistry, and general reasoning tasks, respectively. Notably, RGR-GRPO maintains stable entropy fluctuations during off-policy training and achieves superior pass@k performance, reflecting sustained exploration and an effective breakthrough beyond existing performance bottlenecks.
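
To make the idea of dense rubric rewards feeding into GRPO concrete, here is a minimal sketch. It is not the paper's implementation: the weighted-fraction rubric score, the function names, and the toy rubric are all assumptions made for illustration. The only part taken from standard GRPO is the group-relative advantage, which standardizes each rollout's reward against the mean and standard deviation of its own rollout group.

```python
from statistics import mean, pstdev


def rubric_reward(checks, weights):
    """Hypothetical dense reward: weighted fraction of rubric criteria met, in [0, 1]."""
    total = sum(weights)
    return sum(w for ok, w in zip(checks, weights) if ok) / total


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each reward within its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Assumed toy rubric with three weighted criteria, and four rollouts for one prompt.
weights = [2.0, 1.0, 1.0]
rollout_checks = [
    [True, True, True],     # satisfies every criterion
    [True, False, False],   # satisfies only the heaviest criterion
    [False, True, False],   # satisfies one light criterion
    [False, False, False],  # satisfies nothing
]

rewards = [rubric_reward(c, weights) for c in rollout_checks]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.3f}")
```

With a binary verifiable reward, the second and third rollouts would both score zero and be indistinguishable from the fourth. Under the assumed rubric score they receive different partial credit, so the group-relative advantages separate them.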

Related papers

iGRPO: Self-Feedback-Driven LLM Reasoning [88.83313431248473]
Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts. Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models.
arXiv Detail & Related papers (2026-02-09T18:45:11Z)
DARL: Encouraging Diverse Answers for General Reasoning without Verifiers [41.35516261603945]
We propose DARL, a reinforcement learning framework that encourages the generation of diverse answers within a controlled deviation range from the reference. Our framework is fully compatible with existing general reinforcement learning methods and can be seamlessly integrated without additional verifiers.
arXiv Detail & Related papers (2026-01-21T06:23:55Z)
XRPO: Pushing the limits of GRPO with Targeted Exploration and Exploitation [8.511469090666077]
Reinforcement learning algorithms such as GRPO have driven recent advances in large language model (LLM) reasoning. Existing approaches suffer from limited exploration on challenging prompts and leave informative feedback signals underexploited. This paper presents eXplore-eXploit GRPO, a unified framework that recasts policy optimization through the principled lens of rollout exploration-exploitation.
arXiv Detail & Related papers (2025-10-08T05:53:56Z)
Risk-Sensitive RL for Alleviating Exploration Dilemmas in Large Language Models [22.50153462109328]
Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for enhancing Large Language Models (LLMs). We introduce a Risk-Sensitive Reinforcement Learning framework. Our approach employs a risk-seeking objective that interpolates between mean and maximum rewards, leading to a novel algorithm. Remarkably, RS-GRPO is simple to implement, requiring only minor code modifications.
arXiv Detail & Related papers (2025-09-29T04:12:20Z)
Domain-Aware RAG: MoL-Enhanced RL for Efficient Training and Scalable Retrieval [5.640810636056805]
MoLER is a domain-aware RAG method that uses MoL-Enhanced Reinforcement Learning to optimize retrieval. MoLER bridges the knowledge gap in RAG systems, enabling robust and scalable retrieval in specialized domains.
arXiv Detail & Related papers (2025-09-08T13:04:07Z)
RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization [111.1749164063616]
We propose RL-PLUS, a novel hybrid-policy optimization approach for Large Language Models (LLMs). RL-PLUS synergizes internal exploitation with external data to achieve stronger reasoning capabilities and surpass the boundaries of base models. We provide both theoretical analysis and extensive experiments to demonstrate the superiority and generalizability of our approach.
arXiv Detail & Related papers (2025-07-31T23:55:29Z)
EFRame: Deeper Reasoning via Exploration-Filter-Replay Reinforcement Learning Framework [10.632701939913007]
Group Relative Policy Optimization (GRPO) improves efficiency but suffers from limited exploration and training instability. We introduce EFRame, an Exploration-Filter-Replay framework that augments GRPO across three dimensions. This unified framework establishes a principled training cycle that balances exploration, efficiency, and stability.
arXiv Detail & Related papers (2025-06-27T13:09:05Z)
GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark with complex real-world videos requiring balanced perception and reasoning. Using SEED-Bench-R1, we find that standard GRPO, while improving answer accuracy, often reduces logical coherence between reasoning steps and answers, with only a 57.9% consistency rate. We propose GRPO-CARE, a consistency-aware RL framework optimizing both answer correctness and reasoning coherence without explicit supervision.
arXiv Detail & Related papers (2025-06-19T08:49:13Z)
Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective [82.24301452333577]
Reinforcement learning (RL) has emerged as a promising approach to improve large language model (LLM) reasoning. A key challenge lies in the lack of reliable, scalable RL reward signals across diverse reasoning domains. We introduce Guru, a curated RL reasoning corpus of 92K verifiable examples spanning six reasoning domains.
arXiv Detail & Related papers (2025-06-17T20:24:00Z)
R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO [91.25793883692036]
We aim to incentivize the reasoning ability of Multimodal Large Language Models (MLLMs) via reinforcement learning (RL). We propose Share-GRPO, a novel RL approach that tackles these issues by exploring and sharing diverse reasoning trajectories over an expanded question space. In addition, Share-GRPO shares reward information during advantage computation, which estimates solution advantages hierarchically across and within question variants.
arXiv Detail & Related papers (2025-05-22T13:39:32Z)
A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce [68.99924691391048]
We revisit GRPO from a reinforce-like algorithm perspective and analyze its core components. We find that a simple rejection sampling baseline, RAFT, yields performance competitive with GRPO and PPO. Motivated by this insight, we propose Reinforce-Rej, a minimal extension of policy gradient that filters both entirely incorrect and entirely correct samples.
arXiv Detail & Related papers (2025-04-15T16:15:02Z)
Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains [92.36624674516553]
Reinforcement learning with verifiable rewards (RLVR) has demonstrated significant success in enhancing mathematical reasoning and coding performance of large language models (LLMs). We investigate the effectiveness and scalability of RLVR across diverse real-world domains including medicine, chemistry, psychology, economics, and education. We utilize a generative scoring technique that yields soft, model-based reward signals to overcome limitations posed by binary verifications.
arXiv Detail & Related papers (2025-03-31T08:22:49Z)

This list is automatically generated from the titles and abstracts of the papers on this site.