RewardAnything: Generalizable Principle-Following Reward Models
- URL: http://arxiv.org/abs/2506.03637v2
- Date: Mon, 07 Jul 2025 09:53:22 GMT
- Title: RewardAnything: Generalizable Principle-Following Reward Models
- Authors: Zhuohao Yu, Jiali Zeng, Weizheng Gu, Yidong Wang, Jindong Wang, Fandong Meng, Jie Zhou, Yue Zhang, Shikun Zhang, Wei Ye
- Abstract summary: Reward models are typically trained on fixed preference datasets. This prevents adaptation to diverse real-world needs, from conciseness in one task to detailed explanations in another. We introduce generalizable, principle-following reward models. We present RewardAnything, a novel RM designed and trained to explicitly follow natural language principles.
- Score: 82.16312590749052
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reward Models, essential for guiding Large Language Model optimization, are typically trained on fixed preference datasets, resulting in rigid alignment to single, implicit preference distributions. This prevents adaptation to diverse real-world needs, from conciseness in one task to detailed explanations in another. The standard practice of collecting task-specific preference data and retraining reward models is resource-intensive, often produces biased rewards, and limits practical application. We introduce generalizable, principle-following reward models. We propose that RMs should understand and adhere to dynamically provided natural language specifications of reward principles, similar to instruction-following in LLMs. To measure this capability, we develop RABench, a comprehensive benchmark for RMs focusing on generalization across diverse principles. Evaluations on RABench reveal poor generalization of current RMs. As a solution, we present RewardAnything, a novel RM designed and trained to explicitly follow natural language principles. RewardAnything achieves SotA performance on traditional RM benchmarks simply by specifying a well-defined principle, and results on RABench show that it excels at adapting to novel principles without retraining. Furthermore, RewardAnything integrates seamlessly with existing RLHF methods, and we show through a case study how to automatically and efficiently align LLMs using only natural language principles.
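The abstract describes reward models that judge responses under a dynamically supplied natural-language principle rather than a fixed, implicit preference. Below is a minimal sketch of how such a principle-following RM might be queried; the prompt format, score parsing, and `judge_fn` callable are illustrative assumptions, not the paper's actual interface.

```python
# Minimal sketch of querying a principle-following reward model.
# Prompt format, parsing scheme, and `judge_fn` are illustrative assumptions.
from typing import Callable, List
import re


def score_with_principle(
    principle: str,
    query: str,
    candidates: List[str],
    judge_fn: Callable[[str], str],  # e.g. a call into a generative RM / LLM judge
) -> List[float]:
    """Ask a principle-following RM to score each candidate under `principle`."""
    numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
    prompt = (
        f"Evaluation principle: {principle}\n\n"
        f"User query: {query}\n\n"
        f"Candidate responses:\n{numbered}\n\n"
        "For each candidate, output a line 'index: score' with a score from 1 "
        "(violates the principle) to 10 (follows it closely)."
    )
    reply = judge_fn(prompt)
    scores = [0.0] * len(candidates)
    for idx, val in re.findall(r"(\d+)\s*:\s*(\d+(?:\.\d+)?)", reply):
        i = int(idx)
        if 0 <= i < len(scores):
            scores[i] = float(val)
    return scores


# Example: swapping the principle changes the preference ordering without retraining.
# scores = score_with_principle(
#     "Prefer concise answers that directly resolve the question.",
#     "How do I undo my last git commit?",
#     [long_detailed_answer, short_direct_answer],
#     judge_fn=my_generative_rm,  # hypothetical callable wrapping the RM
# )
```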
Related papers
- Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains [8.143110220871614]
We introduce RaR, a framework that uses structured, checklist-style rubrics as interpretable reward signals. By treating rubrics as structured reward signals, we show that RaR enables smaller-scale judge models to better align with human preferences.
arXiv Detail & Related papers (2025-07-23T17:57:55Z)
- GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark with complex real-world videos requiring balanced perception and reasoning. Using SEED-Bench-R1, we find that standard GRPO, while improving answer accuracy, often reduces logical coherence between reasoning steps and answers, with only a 57.9% consistency rate. We propose GRPO-CARE, a consistency-aware RL framework optimizing both answer correctness and reasoning coherence without explicit supervision.
arXiv Detail & Related papers (2025-06-19T08:49:13Z)
- Response-Level Rewards Are All You Need for Online Reinforcement Learning in LLMs: A Mathematical Perspective [6.069069082518759]
We study the Zero-Reward Assumption in reinforcement learning for large language models (LLMs). We show that the policy gradient based on true, unknown token-level rewards can be unbiasedly estimated using only a response-level reward model. We propose a new algorithm: Token-Reinforced Policy Optimization (TRePO).
arXiv Detail & Related papers (2025-06-03T07:44:31Z)
- Writing-Zero: Bridge the Gap Between Non-verifiable Problems and Verifiable Rewards [1.1981384995161284]
We propose a unified RLVR-based training paradigm that bridges the gap between non-verifiable tasks and verifiable rewards. We introduce a writing-principle-based pairwise Generative Reward Model (GenRM) and a novel Bootstrapped Relative Policy Optimization (BRPO) algorithm. Our approach empowers LLMs to develop robust writing capabilities without supervised fine-tuning.
arXiv Detail & Related papers (2025-05-30T14:34:57Z)
- Latent Principle Discovery for Language Model Self-Improvement [14.137106102563514]
We propose eliciting the latent attributes that guide model reasoning toward human-preferred responses by explicitly modeling them in a self-correction setting. Our approach mines new principles from the LM itself and compresses the discovered elements into an interpretable set via clustering. We demonstrate that bootstrapping our algorithm over multiple iterations enables smaller language models to self-improve, achieving +8-10% in AlpacaEval win-rate, an average of +0.3 on MT-Bench, and +19-23% in principle-following win-rate on IFEval.
arXiv Detail & Related papers (2025-05-22T17:20:18Z)
- RM-R1: Reward Modeling as Reasoning [81.50471199906738]
Reasoning Reward Models (ReasRMs) formulate reward modeling as a reasoning task. We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1. Our models achieve state-of-the-art performance across three reward model benchmarks on average.
arXiv Detail & Related papers (2025-05-05T06:11:12Z)
- Inference-Time Scaling for Generalist Reward Modeling [25.62000059973935]
Reinforcement learning (RL) has been widely adopted in post-training for large language models (LLMs) at scale. A key challenge of RL is obtaining accurate reward signals for LLMs in various domains beyond verifiable questions or artificial rules. In this work, we investigate how to improve reward modeling with more inference compute for general queries.
arXiv Detail & Related papers (2025-04-03T11:19:49Z)
- RewardBench: Evaluating Reward Models for Language Modeling [100.28366840977966]
We present RewardBench, a benchmark dataset and code-base for evaluation of reward models.
The dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety.
On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods.
arXiv Detail & Related papers (2024-03-20T17:49:54Z)
- SALMON: Self-Alignment with Instructable Reward Models [80.83323636730341]
This paper presents a novel approach, namely SALMON, to align base language models with minimal human supervision.
We develop an AI assistant named Dromedary-2 with only 6 exemplars for in-context learning and 31 human-defined principles.
arXiv Detail & Related papers (2023-10-09T17:56:53Z)
- Language Reward Modulation for Pretraining Reinforcement Learning [61.76572261146311]
We propose leveraging the capabilities of learned reward functions (LRFs) as a pretraining signal for reinforcement learning.
Our VLM pretraining approach, which is a departure from previous attempts to use LRFs, can warmstart sample-efficient learning on robot manipulation tasks.
arXiv Detail & Related papers (2023-08-23T17:37:51Z)
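The RewardBench entry above describes evaluation data organized as prompt-chosen-rejected trios. A minimal sketch of how pairwise accuracy is typically computed on such trios is shown below; the record fields and `reward_fn` callable are illustrative assumptions rather than the benchmark's actual schema or code.

```python
# Minimal sketch of accuracy evaluation on prompt-chosen-rejected trios.
# Record fields and `reward_fn` are illustrative assumptions.
from typing import Callable, Dict, List


def pairwise_accuracy(
    trios: List[Dict[str, str]],             # each: {"prompt", "chosen", "rejected"}
    reward_fn: Callable[[str, str], float],  # scalar reward for (prompt, response)
) -> float:
    """Fraction of trios where the reward model prefers the chosen response."""
    if not trios:
        return 0.0
    correct = 0
    for t in trios:
        chosen_score = reward_fn(t["prompt"], t["chosen"])
        rejected_score = reward_fn(t["prompt"], t["rejected"])
        correct += chosen_score > rejected_score
    return correct / len(trios)


# Example with a toy length-penalizing reward (purely for illustration):
# acc = pairwise_accuracy(
#     [{"prompt": "Summarize:", "chosen": "Short summary.", "rejected": "A very long reply"}],
#     reward_fn=lambda p, r: -len(r),
# )
```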
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.