Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training
- URL: http://arxiv.org/abs/2509.21500v1
- Date: Thu, 25 Sep 2025 19:57:39 GMT
- Title: Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training
- Authors: Junkai Zhang, Zihao Wang, Lin Gui, Swarnashree Mysore Sathyendra, Jaehwan Jeong, Victor Veitch, Wei Wang, Yunzhong He, Bing Liu, Lifeng Jin
- Abstract summary: Reinforcement fine-tuning (RFT) often suffers from reward over-optimization, where a policy model hacks the reward signals to achieve high scores. Our theoretical analysis shows that the key lies in reward misspecification at the high-reward tail. While off-policy exemplars are easier to obtain, naively training on them yields a misspecified reward for the policy we aim to align.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement fine-tuning (RFT) often suffers from \emph{reward over-optimization}, where a policy model hacks the reward signals to achieve high scores while producing low-quality outputs. Our theoretical analysis shows that the key lies in reward misspecification at the high-reward tail: the inability to reliably distinguish Excellent responses from merely Great ones. This motivates us to focus on the high-reward region. However, such tail examples are scarce under the base LLM. While off-policy exemplars (e.g., from stronger models or rewrites) are easier to obtain, naively training on them yields a misspecified reward for the policy we aim to align. To address this, we study rubric-based rewards. By design, rubrics can leverage off-policy examples while remaining insensitive to their artifacts. To elicit rubrics that capture the high-reward tail, we highlight the importance of distinguishing among great and diverse responses, and introduce a workflow to implement this idea. We empirically demonstrate that rubric-based rewards substantially mitigate reward over-optimization and deliver effective LLM post-training improvements. Our code can be accessed at https://github.com/Jun-Kai-Zhang/rubrics.git .
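The mechanism the abstract describes can be pictured with a short sketch: score a response against weighted, criterion-level rubric items and aggregate the per-criterion judgments into a scalar reward. Everything below (the `RubricItem` type, the `judge` callable, and the weighted average) is an illustrative assumption about the general recipe, not the authors' implementation; their actual code is in the linked repository.

```python
# Minimal sketch of a rubric-based reward, under the assumptions stated above.
# Each rubric item targets a fine-grained criterion intended to separate
# "Excellent" responses from merely "Great" ones in the high-reward tail.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RubricItem:
    criterion: str   # e.g. "Addresses every sub-question the prompt raises"
    weight: float    # relative importance of this criterion


def rubric_reward(
    prompt: str,
    response: str,
    rubric: List[RubricItem],
    judge: Callable[[str], float],  # hypothetical LLM judge returning a score in [0, 1]
) -> float:
    """Aggregate per-criterion judge scores into a scalar reward."""
    total, norm = 0.0, 0.0
    for item in rubric:
        query = (
            f"Prompt:\n{prompt}\n\nResponse:\n{response}\n\n"
            f"Criterion: {item.criterion}\n"
            "Rate how well the response satisfies the criterion from 0 to 1."
        )
        total += item.weight * judge(query)
        norm += item.weight
    return total / max(norm, 1e-8)
```

Because each rubric item checks a concrete criterion rather than an overall preference, such a reward can in principle score off-policy exemplars (e.g., from stronger models) without inheriting their surface-level artifacts, which is the property the abstract emphasizes.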
Related papers
- Inference-Time Reward Hacking in Large Language Models [29.829648695171425]
Reward models function as proxies for complex desiderata such as correctness, helpfulness, and safety. By overoptimizing for a misspecified reward, we can subvert intended alignment goals and reduce overall performance. We show that hedging mitigates reward hacking and achieves superior reward-distortion tradeoffs on math, reasoning, and human-preference setups.
arXiv Detail & Related papers (2025-06-24T02:05:25Z) - Intra-Trajectory Consistency for Reward Modeling [67.84522106537274]
We develop an intra-trajectory consistency regularization to enforce that adjacent processes with higher next-token generation probability maintain more consistent rewards. We show that the reward model trained with the proposed regularization induces better DPO-aligned policies and achieves better best-of-N (BON) inference-time verification results.
arXiv Detail & Related papers (2025-06-10T12:59:14Z) - Learning Explainable Dense Reward Shapes via Bayesian Optimization [45.34810347865996]
We frame reward shaping as an optimization problem focused on token-level credit assignment. We use explainability methods such as SHAP and LIME to estimate per-token rewards from the reward model. Our experiments show that achieving a better balance of token-level reward attribution leads to performance improvements over baselines.
arXiv Detail & Related papers (2025-04-22T21:09:33Z) - RED: Unleashing Token-Level Rewards from Holistic Feedback via Reward Redistribution [50.171320156632866]
Reinforcement learning from human feedback offers a promising approach to aligning large language models with human preferences. Current reward models operate as sequence-to-one models, allocating a single, sparse, and delayed reward to an entire output sequence. We propose a more fine-grained, token-level guidance approach for RL training.
arXiv Detail & Related papers (2024-11-13T02:45:21Z) - A Critical Look At Tokenwise Reward-Guided Text Generation [23.908449840589284]
We show that reward models trained on full sequences are not compatible with scoring partial sequences. We propose to train a Bradley-Terry reward model on partial sequences explicitly, and autoregressively sample from the implied tokenwise policy during decoding time.
arXiv Detail & Related papers (2024-06-12T00:19:40Z) - Rethinking the Role of Proxy Rewards in Language Model Alignment [39.53237479058083]
We study the role of proxy rewards in Large Language Model alignment via 'reverse reward engineering'.
We aim to replicate the ground truth (gold) reward signal by achieving a monotonic relationship between the proxy and gold reward signals.
Our findings indicate that successfully emulating the gold reward requires generating responses that are relevant and of sufficient length for open-ended questions.
arXiv Detail & Related papers (2024-02-02T11:58:08Z) - Dense Reward for Free in Reinforcement Learning from Human Feedback [64.92448888346125]
We leverage the fact that the reward model contains more information than just its scalar output.
We use these attention weights to redistribute the reward along the whole completion.
Empirically, we show that it stabilises training, accelerates the rate of learning, and, in practical cases, may lead to better local optima.
arXiv Detail & Related papers (2024-02-01T17:10:35Z) - Beyond Sparse Rewards: Enhancing Reinforcement Learning with Language Model Critique in Text Generation [29.6763730290473]
Reinforcement learning can align language models with non-differentiable reward signals, such as human preferences.
This paper introduces a novel framework that utilizes the critique capability of Large Language Models to produce intermediate-step rewards.
arXiv Detail & Related papers (2024-01-14T22:05:11Z) - Deep Reinforcement Learning from Hierarchical Preference Design [99.46415116087259]
This paper shows that, by exploiting certain structures, one can ease the reward design process.
We propose a hierarchical reward modeling framework, HERON, for two scenarios: (I) the feedback signals naturally present a hierarchy; (II) the reward is sparse, but less important surrogate feedback is available to help policy learning.
arXiv Detail & Related papers (2023-09-06T00:44:29Z) - Reward Collapse in Aligning Large Language Models [64.98482888193267]
We study the phenomenon of 'reward collapse', an empirical observation where the prevailing ranking-based approach results in an identical reward distribution.
Our experimental results suggest that our proposed prompt-aware utility functions significantly alleviate reward collapse during the training of reward models.
arXiv Detail & Related papers (2023-05-28T02:12:00Z)