Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains
- URL: http://arxiv.org/abs/2507.17746v2
- Date: Fri, 03 Oct 2025 01:55:55 GMT
- Title: Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains
- Authors: Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, Sean Hendryx,
- Abstract summary: Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for complex reasoning tasks with clear correctness signals such as math and coding.<n>Rubrics have recently been used in evaluation benchmarks to capture such judgments, but their potential as reward signals for on-policy post-training remains underexplored.<n>We introduce RaR, an on-policy reinforcement learning method that extends RLVR beyond verifiable domains by using rubric-based feedback.
- Score: 9.917318870162365
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for complex reasoning tasks with clear correctness signals such as math and coding. However, extending it to real-world reasoning tasks is challenging, as evaluation depends on nuanced, multi-criteria judgments rather than binary correctness. Instance-specific rubrics have recently been used in evaluation benchmarks to capture such judgments, but their potential as reward signals for on-policy post-training remains underexplored. We introduce $\textbf{Rubrics as Rewards}$ (RaR), an on-policy reinforcement learning method that extends RLVR beyond verifiable domains by using rubric-based feedback. Across both medical and science domains, we evaluate multiple strategies for aggregating rubric feedback into rewards. The best RaR variant achieves relative improvements of up to $31\%$ on HealthBench and $7\%$ on GPQA-Diamond over popular LLM-as-judge baselines that rely on direct Likert-based rewards. These results demonstrate that RaR-trained policies adapt well to diverse evaluation formats, performing strongly on both rubric-based and multiple-choice tasks. Moreover, we find that using rubrics as structured reward signals yields better alignment for smaller judges and reduces performance variance across judge scales.
Related papers
- Rethinking Rubric Generation for Improving LLM Judge and Reward Modeling for Open-ended Tasks [17.117706938140078]
We propose RRD, a principled framework for rubric refinement built on a decompose-filter cycle.<n> RRD decomposes coarse rubrics into fine-grained, discriminative criteria, expanding coverage while sharpening separation between responses.<n>It delivers large, consistent gains across both evaluation and training.
arXiv Detail & Related papers (2026-02-04T23:16:09Z) - From Absolute to Relative: Rethinking Reward Shaping in Group-Based Reinforcement Learning [7.6602542594279335]
We propose Reinforcement Learning with Relative Rewards to shift reward shaping from absolute scoring to relative ranking.<n>We show that RLRR yields consistent performance improvements over standard group-based baselines across reasoning benchmarks and open-ended generation tasks.
arXiv Detail & Related papers (2026-01-30T15:07:06Z) - From Verifiable Dot to Reward Chain: Harnessing Verifiable Reference-based Rewards for Reinforcement Learning of Open-ended Generation [52.62655622099456]
We propose reinforcement learning with verifiable reference-based rewards (RLVRR)<n>Instead of checking the final answer, RLVRR extracts an ordered linguistic signal from high-quality references (i.e., reward chain)<n>In this way, RLVRR decomposes rewards into two dimensions: content, which preserves deterministic core concepts, and style, which evaluates adherence to stylistic properties.
arXiv Detail & Related papers (2026-01-26T14:39:58Z) - VeRPO: Verifiable Dense Reward Policy Optimization for Code Generation [43.206705536310245]
We introduce textbfVeRPO (textbfVerifiable Dtextbfense textbfReward textbfPolicy textbfOptimization), a novel RL framework for code generation that synthesizes textitrobust and dense rewards fully grounded in verifiable execution feedback.<n>VeRPO consistently outperforms outcome-driven and RM-based baselines, achieving up to +8.83% gain in pass@1 with negligible time cost ( 0.02%) and zero
arXiv Detail & Related papers (2026-01-07T02:29:49Z) - Reward and Guidance through Rubrics: Promoting Exploration to Improve Multi-Domain Reasoning [79.365697698062]
We propose $textbfRGR-GRPO (Reward and Guidance through rubrics), a framework for multi-domain reasoning.<n>RGR-GRPO consistently outperforms RL methods that rely solely on alternative reward schemes or offline guidance.
arXiv Detail & Related papers (2025-11-15T20:14:51Z) - PROF: An LLM-based Reward Code Preference Optimization Framework for Offline Imitation Learning [29.373324685358753]
We propose PROF, a framework to generate and improve executable reward function codes from natural language descriptions and a single expert trajectory.<n>We also propose Reward Preference Ranking (RPR), a novel reward function quality assessment and ranking strategy without requiring environment interactions or RL training.
arXiv Detail & Related papers (2025-11-14T14:38:02Z) - GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark with complex real-world videos requiring balanced perception and reasoning.<n>Using SEED-Bench-R1, we find that standard GRPO, while improving answer accuracy, often reduces logical coherence between reasoning steps and answers, with only a 57.9% consistency rate.<n>We propose GRPO-CARE, a consistency-aware RL framework optimizing both answer correctness and reasoning coherence without explicit supervision.
arXiv Detail & Related papers (2025-06-19T08:49:13Z) - Intra-Trajectory Consistency for Reward Modeling [67.84522106537274]
We develop an intra-trajectory consistency regularization to enforce that adjacent processes with higher next-token generation probability maintain more consistent rewards.<n>We show that the reward model trained with the proposed regularization induces better DPO-aligned policies and achieves better best-of-N (BON) inference-time verification results.
arXiv Detail & Related papers (2025-06-10T12:59:14Z) - Response-Level Rewards Are All You Need for Online Reinforcement Learning in LLMs: A Mathematical Perspective [6.069069082518759]
We study the Zero-Reward Assumption in reinforcement learning for large language models (LLMs)<n>We show that the policy gradient based on true, unknown token-level rewards can be unbiasedly estimated using only a response-level reward model.<n>We propose a new algorithm: Token-Reinforced Policy Optimization (TRePO)
arXiv Detail & Related papers (2025-06-03T07:44:31Z) - Writing-Zero: Bridge the Gap Between Non-verifiable Tasks and Verifiable Rewards [11.149294285483782]
We propose a unified RLVR-based training paradigm that bridges the gap between non-verifiable tasks and verifiable rewards.<n>We introduce a writing-principle-based pairwise Generative Reward Model (GenRM) and a novel Bootstrapped Relative Policy Optimization (BRPO) algorithm.<n>Our approach empowers LLMs to develop robust writing capabilities without supervised fine-tuning.
arXiv Detail & Related papers (2025-05-30T14:34:57Z) - RAG-Zeval: Towards Robust and Interpretable Evaluation on RAG Responses through End-to-End Rule-Guided Reasoning [64.46921169261852]
RAG-Zeval is a novel end-to-end framework that formulates faithfulness and correctness evaluation as a rule-guided reasoning task.<n>Our approach trains evaluators with reinforcement learning, facilitating compact models to generate comprehensive and sound assessments.<n>Experiments demonstrate RAG-Zeval's superior performance, achieving the strongest correlation with human judgments.
arXiv Detail & Related papers (2025-05-28T14:55:33Z) - Learning to Reason without External Rewards [100.27210579418562]
Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision.<n>We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data.<n>We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal.
arXiv Detail & Related papers (2025-05-26T07:01:06Z) - A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce [68.99924691391048]
We revisit GRPO from a reinforce-like algorithm perspective and analyze its core components.<n>We find that a simple rejection sampling baseline, RAFT, yields competitive performance than GRPO and PPO.<n>Motivated by this insight, we propose Reinforce-Rej, a minimal extension of policy gradient that filters both entirely incorrect and entirely correct samples.
arXiv Detail & Related papers (2025-04-15T16:15:02Z) - RewardBench: Evaluating Reward Models for Language Modeling [100.28366840977966]
We present RewardBench, a benchmark dataset and code-base for evaluation of reward models.
The dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety.
On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods.
arXiv Detail & Related papers (2024-03-20T17:49:54Z) - Improving Reinforcement Learning from Human Feedback Using Contrastive Rewards [26.40009657912622]
Reinforcement learning from human feedback (RLHF) is the mainstream paradigm used to align large language models (LLMs) with human preferences.
Yet existing RLHF heavily relies on accurate and informative reward models, which are vulnerable and sensitive to noise from various sources.
In this work, we improve the effectiveness of the reward model by introducing a penalty term on the reward, named as textitcontrastive rewards
arXiv Detail & Related papers (2024-03-12T14:51:57Z) - REBEL: Reward Regularization-Based Approach for Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and human preferences can lead to catastrophic outcomes in the real world.<n>Recent methods aim to mitigate misalignment by learning reward functions from human preferences.<n>We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z) - Deep Reinforcement Learning from Hierarchical Preference Design [99.46415116087259]
This paper shows by exploiting certain structures, one can ease the reward design process.
We propose a hierarchical reward modeling framework -- HERON for scenarios: (I) The feedback signals naturally present hierarchy; (II) The reward is sparse, but with less important surrogate feedback to help policy learning.
arXiv Detail & Related papers (2023-09-06T00:44:29Z) - Supervised Advantage Actor-Critic for Recommender Systems [76.7066594130961]
We propose negative sampling strategy for training the RL component and combine it with supervised sequential learning.
Based on sampled (negative) actions (items), we can calculate the "advantage" of a positive action over the average case.
We instantiate SNQN and SA2C with four state-of-the-art sequential recommendation models and conduct experiments on two real-world datasets.
arXiv Detail & Related papers (2021-11-05T12:51:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.