Reward Modeling from Natural Language Human Feedback
- URL: http://arxiv.org/abs/2601.07349v1
- Date: Mon, 12 Jan 2026 09:23:43 GMT
- Title: Reward Modeling from Natural Language Human Feedback
- Authors: Zongqi Wang, Rui Wang, Yuchuan Wu, Yiyao Yu, Pinyi Zhang, Shaoning Sun, Yujiu Yang, Yongbin Li
- Abstract summary: Reinforcement Learning with Verifiable Reward (RLVR) on preference data has become the mainstream approach for training Generative Reward Models (GRMs). In this paper, we demonstrate that such binary classification tasks make GRMs susceptible to guessing correct outcomes without sound critiques. We propose Reward Modeling from Natural Language Human Feedback (RM-NLHF), which leverages natural language feedback to obtain process reward signals.
- Score: 77.75758630455357
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement Learning with Verifiable Reward (RLVR) on preference data has become the mainstream approach for training Generative Reward Models (GRMs). Typically, in pairwise rewarding tasks, GRMs generate reasoning chains ending with critiques and preference labels, and RLVR then relies on the correctness of the preference labels as the training reward. However, in this paper, we demonstrate that such binary classification tasks make GRMs susceptible to guessing correct outcomes without sound critiques. Consequently, these spurious successes introduce substantial noise into the reward signal, thereby impairing the effectiveness of reinforcement learning. To address this issue, we propose Reward Modeling from Natural Language Human Feedback (RM-NLHF), which leverages natural language feedback to obtain process reward signals, thereby mitigating the limited solution space inherent in binary tasks. Specifically, we compute the similarity between GRM-generated and human critiques as the training reward, which provides more accurate reward signals than outcome-only supervision. Additionally, since human critiques are difficult to scale up, we introduce a Meta Reward Model (MetaRM) that learns to predict process rewards from datasets with human critiques and then generalizes to data without them. Experiments on multiple benchmarks demonstrate that our method consistently outperforms state-of-the-art GRMs trained with outcome-only reward, confirming the advantage of natural language over binary human feedback as supervision.
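The abstract does not spell out its similarity function, so the following is a minimal sketch, assuming an off-the-shelf sentence-embedding encoder as the critique-similarity measure and a simple blend of outcome and process rewards; the encoder choice, the weight `alpha`, and all function names are hypothetical, not from the paper.

```python
# Sketch of a critique-similarity process reward, per the RM-NLHF abstract.
# Assumptions (not from the paper): sentence-embedding cosine similarity as
# the similarity function, and a linear outcome/process blend with weight alpha.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical encoder choice

def process_reward(grm_critique: str, human_critique: str) -> float:
    """Score a GRM critique by its embedding similarity to a human critique."""
    emb = encoder.encode([grm_critique, human_critique], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

def training_reward(grm_critique: str, human_critique: str,
                    predicted_label: str, gold_label: str,
                    alpha: float = 0.5) -> float:
    """Blend outcome correctness with the critique-similarity process reward."""
    outcome = 1.0 if predicted_label == gold_label else 0.0
    return alpha * outcome + (1.0 - alpha) * process_reward(grm_critique, human_critique)
```

In the paper's terms, MetaRM would then be trained to predict this process reward from the GRM critique alone, so the signal can extend to data that lacks human critiques.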
Related papers
- R-Align: Enhancing Generative Reward Models through Rationale-Centric Meta-Judging [69.96389360650072]
We show that reasoning fidelity is highly predictive of downstream RLHF outcomes, beyond standard label accuracy. We propose Rationale-Centric Alignment (R-Align), which augments training with gold judgments and explicitly supervises rationale alignment.
arXiv Detail & Related papers (2026-02-06T15:17:11Z) - OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment [38.1645520104553]
We introduce OpenRubrics, a large-scale collection of (prompt, rubric) pairs for training rubric-generation and rubric-based reward models. To elicit discriminative and comprehensive evaluation signals, we introduce Contrastive Rubric Generation (CRG), which derives both hard rules (explicit constraints) and principles (implicit qualities) by contrasting preferred and rejected responses. Our results show that rubrics provide scalable alignment signals that narrow the gap between costly human evaluation and automated reward modeling.
arXiv Detail & Related papers (2025-10-09T03:31:26Z) - Off-Policy Corrected Reward Modeling for Reinforcement Learning from Human Feedback [52.1410307583181]
We use Reinforcement Learning from Human Feedback (RLHF) to train language models (LMs) to follow complex human preferences. As training progresses, the responses generated by the LM no longer resemble the responses seen by the reward model (RM). We propose Off-Policy Corrected Reward Modeling to correct the RM using importance weighting, without requiring new labels or samples (see the sketch after this list).
arXiv Detail & Related papers (2025-07-21T11:19:04Z) - Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models [50.4652276723694]
Think-RM generates flexible, self-guided reasoning traces that support advanced capabilities. Think-RM achieves state-of-the-art results on RM-Bench, outperforming both Bradley-Terry reward models (BT RM) and vertically scaled GenRM by 8%.
arXiv Detail & Related papers (2025-05-22T05:56:11Z) - Two Minds Better Than One: Collaborative Reward Modeling for LLM Alignment [35.80989342492335]
Noisy preferences in human feedback can lead to reward misgeneralization. This paper aims to identify how noisy preferences differ from human-aligned preferences in reward modeling. We propose an online Collaborative Reward Modeling framework to achieve robust preference learning.
arXiv Detail & Related papers (2025-05-15T10:58:20Z) - RM-R1: Reward Modeling as Reasoning [81.50471199906738]
Reasoning Reward Models (ReasRMs) formulate reward modeling as a reasoning task. We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1. Our models achieve state-of-the-art performance across three reward model benchmarks on average.
arXiv Detail & Related papers (2025-05-05T06:11:12Z) - Zero-Shot LLMs in Human-in-the-Loop RL: Replacing Human Feedback for Reward Shaping [2.427844597259453]
Reinforcement learning (RL) often struggles with reward misalignment. Human-in-the-loop (HITL) methods can mitigate this issue, but they also introduce biases. We propose two key contributions to address these challenges.
arXiv Detail & Related papers (2025-03-26T03:17:12Z) - RED: Unleashing Token-Level Rewards from Holistic Feedback via Reward Redistribution [50.171320156632866]
Reinforcement learning from human feedback offers a promising approach to aligning large language models with human preferences. Current reward models operate as sequence-to-one models, allocating a single, sparse, and delayed reward to an entire output sequence. We propose a more fine-grained, token-level guidance approach for RL training.
arXiv Detail & Related papers (2024-11-13T02:45:21Z) - REBEL: Reward Regularization-Based Approach for Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and human preferences can lead to catastrophic outcomes in the real world. Recent methods aim to mitigate misalignment by learning reward functions from human preferences. We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z)
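As referenced in the Off-Policy Corrected Reward Modeling entry above, here is a minimal sketch of reward-model correction via importance weighting. This is a hedged reconstruction from the one-line summary, not the paper's actual estimator; the clipping threshold and function names are assumptions.

```python
# Sketch: reweight reward-model (RM) training pairs by the likelihood ratio
# between the current policy and the behavior policy that generated the data,
# so stale, off-policy pairs contribute less. Illustrative only.
import torch
import torch.nn.functional as F

def importance_weights(logp_current: torch.Tensor,
                       logp_behavior: torch.Tensor,
                       clip: float = 10.0) -> torch.Tensor:
    """Per-pair weights pi_current(y|x) / pi_behavior(y|x), clipped for stability."""
    return torch.exp(logp_current - logp_behavior).clamp(max=clip)

def weighted_pairwise_rm_loss(reward_chosen: torch.Tensor,
                              reward_rejected: torch.Tensor,
                              weights: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss, importance-weighted per preference pair."""
    loss = -F.logsigmoid(reward_chosen - reward_rejected)
    return (weights * loss).mean()
```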
This list is automatically generated from the titles and abstracts of the papers on this site.