CHARM: Calibrating Reward Models With Chatbot Arena Scores
- URL: http://arxiv.org/abs/2504.10045v1
- Date: Mon, 14 Apr 2025 09:51:09 GMT
- Title: CHARM: Calibrating Reward Models With Chatbot Arena Scores
- Authors: Xiao Zhu, Chenmien Tan, Pinzhen Chen, Rico Sennrich, Yanlin Zhang, Hanxu Hu,
- Abstract summary: Reward models (RMs) play a crucial role in Reinforcement Learning from Human Feedback by serving as proxies for human preferences in aligning large language models.<n>We identify a model preference bias in RMs, where they systematically assign disproportionately high scores to responses from certain policy models.<n>To address this issue, we propose a calibration method named CHatbot Arena Reward Modeling (CHARM) that leverages Elo scores from the Arena leaderboard to mitigate RM overvaluation.
- Score: 31.599659350165354
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reward models (RMs) play a crucial role in Reinforcement Learning from Human Feedback by serving as proxies for human preferences in aligning large language models. In this paper, we identify a model preference bias in RMs, where they systematically assign disproportionately high scores to responses from certain policy models. This bias distorts ranking evaluations and leads to unfair judgments. To address this issue, we propose a calibration method named CHatbot Arena calibrated Reward Modeling (CHARM) that leverages Elo scores from the Chatbot Arena leaderboard to mitigate RM overvaluation. We also introduce a Mismatch Degree metric to measure this preference bias. Our approach is computationally efficient, requiring only a small preference dataset for continued training of the RM. We conduct extensive experiments on reward model benchmarks and human preference alignment. Results demonstrate that our calibrated RMs (1) achieve improved evaluation accuracy on RM-Bench and the Chat-Hard domain of RewardBench, and (2) exhibit a stronger correlation with human preferences by producing scores more closely aligned with Elo rankings. By mitigating model preference bias, our method provides a generalizable and efficient solution for building fairer and more reliable reward models.
Related papers
- Self-rationalization improves LLM as a fine-grained judge [21.917301609125417]
We introduce Self-Rationalization, an iterative process of improving the rationales for the judge models.
Self-rationalization works by having the model generate multiple judgments with rationales for the same input.
We show that our model learns to produce higher quality rationales, with a win rate of $62%$ on average compared to models just trained via SFT on rationale.
arXiv Detail & Related papers (2024-10-07T21:05:53Z) - Beyond Bradley-Terry Models: A General Preference Model for Language Model Alignment [51.14207112118503]
We introduce preference embedding, an approach that embeds responses into a latent space to capture preferences efficiently.<n>We also propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback.<n>Our method may enhance the alignment of foundation models with nuanced human values.
arXiv Detail & Related papers (2024-10-03T04:22:55Z) - Post-hoc Reward Calibration: A Case Study on Length Bias [28.266675778940133]
Reward models (RMs) can develop biases by exploiting spurious correlations in their training data.
These biases can lead to incorrect output rankings, sub-optimal model evaluations, and the amplification of undesirable behaviours.
This paper addresses the challenge of correcting such biases without additional data and training.
arXiv Detail & Related papers (2024-09-25T22:30:42Z) - Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts [23.27203570485055]
Reinforcement learning from human feedback (RLHF) has emerged as the primary method for aligning large language models with human preferences.
We propose a two-stage approach to train a reward model (RM) with multi-dimensional absolute-rating data.
We efficiently trained an ArmoRM with Llama-3 8B and a gating network consisting of a shallow on top of the ArmoRM.
arXiv Detail & Related papers (2024-06-18T17:58:28Z) - Prior Constraints-based Reward Model Training for Aligning Large Language Models [58.33118716810208]
This paper proposes a Prior Constraints-based Reward Model (namely PCRM) training method to mitigate this problem.
PCRM incorporates prior constraints, specifically, length ratio and cosine similarity between outputs of each comparison pair, during reward model training to regulate optimization magnitude and control score margins.
Experimental results demonstrate that PCRM significantly improves alignment performance by effectively constraining reward score scaling.
arXiv Detail & Related papers (2024-04-01T07:49:11Z) - RewardBench: Evaluating Reward Models for Language Modeling [100.28366840977966]
We present RewardBench, a benchmark dataset and code-base for evaluation of reward models.
The dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety.
On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods.
arXiv Detail & Related papers (2024-03-20T17:49:54Z) - Aligning Large Language Models by On-Policy Self-Judgment [49.31895979525054]
Existing approaches for aligning large language models with human preferences face a trade-off that requires a separate reward model (RM) for on-policy learning.
We present a novel alignment framework, SELF-JUDGE, that does on-policy learning and is parameter efficient.
We show that the rejecting sampling by itself can improve performance further without an additional evaluator.
arXiv Detail & Related papers (2024-02-17T11:25:26Z) - WARM: On the Benefits of Weight Averaged Reward Models [63.08179139233774]
We propose Weight Averaged Reward Models (WARM) to mitigate reward hacking.
Experiments on summarization tasks, using best-of-N and RL methods, shows that WARM improves the overall quality and alignment of LLM predictions.
arXiv Detail & Related papers (2024-01-22T18:27:08Z) - Confronting Reward Model Overoptimization with Constrained RLHF [114.71591361764547]
We show that correlation between component RMs has a significant effect on the locations of these points.
Our method addresses the problem of weighting component RMs by learning dynamic weights, naturally expressed by Lagrange multipliers.
arXiv Detail & Related papers (2023-10-06T16:59:17Z) - The Trickle-down Impact of Reward (In-)consistency on RLHF [71.37987812944971]
We show that reward inconsistency exhibits a trickle-down effect on the downstream Reinforcement Learning from Human Feedback process.
We propose Contrast Instructions -- a benchmarking strategy for the consistency of RM.
We show that RLHF models trained with a more consistent RM yield more useful responses.
arXiv Detail & Related papers (2023-09-28T04:05:13Z) - Large Language Models are not Fair Evaluators [60.27164804083752]
We find that the quality ranking of candidate responses can be easily hacked by altering their order of appearance in the context.
This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other.
We propose a framework with three simple yet effective strategies to mitigate this issue.
arXiv Detail & Related papers (2023-05-29T07:41:03Z) - Scaling Laws for Reward Model Overoptimization [19.93331579503503]
We study how the gold reward model score changes as we optimize against the proxy reward model using either reinforcement learning or best-of-$n$ sampling.
We also study the effect on this relationship of the size of the reward model dataset, the number of reward model and policy parameters, and the coefficient of the KL penalty added to the reward in the reinforcement learning setup.
arXiv Detail & Related papers (2022-10-19T17:56:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.