HAF-RM: A Hybrid Alignment Framework for Reward Model Training
- URL: http://arxiv.org/abs/2407.04185v2
- Date: Thu, 11 Jul 2024 07:35:06 GMT
- Title: HAF-RM: A Hybrid Alignment Framework for Reward Model Training
- Authors: Shujun Liu, Xiaoyu Shen, Yuhang Lai, Siyuan Wang, Shengbin Yue, Zengfeng Huang, Xuanjing Huang, Zhongyu Wei
- Abstract summary: We propose a hybrid alignment framework HaF-RM for reward model training.
It offers a principled and effective approach to enhancing the performance and alignment of reward models.
- Score: 51.59246299566669
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The reward model has become increasingly important in alignment, assessment, and data construction for large language models (LLMs). Most existing research focuses on enhancing reward models through data improvements, following the conventional training framework that directly optimizes the predicted rewards. In this paper, we propose a hybrid alignment framework HaF-RM for reward model training by introducing an additional constraint on token-level policy probabilities in addition to the reward score. It can simultaneously supervise the internal preference model at the token level and optimize the mapping layer of the reward model at the sequence level. Theoretical justifications and experimental results on five datasets show the validity and effectiveness of our proposed hybrid framework for training a high-quality reward model. By decoupling the reward modeling procedure and incorporating hybrid supervision, our HaF-RM framework offers a principled and effective approach to enhancing the performance and alignment of reward models, a critical component in the responsible development of powerful language models. We release our code at https://haf-rm.github.io.
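As a concrete illustration of such a hybrid objective, the sketch below pairs a sequence-level ranking loss on the reward head with a DPO-style constraint on token-level policy log-probabilities against a frozen reference model. It is a minimal PyTorch sketch under assumed loss weights and tensor names, not the authors' released implementation (see https://haf-rm.github.io for the official code).

```python
# Minimal sketch of a hybrid reward-model objective (assumptions, not the
# paper's code): sequence-level ranking loss on predicted rewards plus a
# token-level, DPO-style constraint on policy log-probabilities.
import torch
import torch.nn.functional as F

def hybrid_loss(
    chosen_reward, rejected_reward,      # (B,) scalar rewards from the mapping head
    chosen_logp, rejected_logp,          # (B,) summed token log-probs under the policy
    ref_chosen_logp, ref_rejected_logp,  # (B,) the same under a frozen reference model
    beta=0.1,                            # assumed temperature of the token-level term
    lam=0.5,                             # assumed weight balancing the two terms
):
    # Sequence-level Bradley-Terry ranking loss on the reward head.
    reward_loss = -F.logsigmoid(chosen_reward - rejected_reward).mean()
    # Token-level constraint on policy probabilities relative to the reference.
    margin = beta * ((chosen_logp - ref_chosen_logp) - (rejected_logp - ref_rejected_logp))
    policy_loss = -F.logsigmoid(margin).mean()
    return reward_loss + lam * policy_loss

# Toy usage with dummy tensors.
b = 4
loss = hybrid_loss(*[torch.randn(b) for _ in range(6)])
```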
Related papers
- DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging [65.41765072566287]
We propose Domain knowledge merged Reward Model (DogeRM), a novel framework that integrates domain-specific knowledge into a general reward model by model merging.
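For intuition, the sketch below shows the simplest form of weight-space model merging: linearly interpolating a general reward model's parameters with those of a domain-finetuned model. The interpolation coefficient and the assumption of a shared parameter layout are illustrative, not DogeRM's exact recipe.

```python
# Hedged sketch of weight-space model merging between two models that share a
# parameter layout; alpha controls how much domain knowledge is mixed in.
import torch

def merge_state_dicts(general_sd, domain_sd, alpha=0.5):
    """Return a merged state dict: (1 - alpha) * general + alpha * domain."""
    return {
        name: (1.0 - alpha) * weight + alpha * domain_sd[name]
        for name, weight in general_sd.items()
    }

# Toy usage with two tiny "models".
general = {"head.weight": torch.ones(2, 2)}
domain = {"head.weight": torch.zeros(2, 2)}
merged = merge_state_dicts(general, domain, alpha=0.3)  # 0.7 * ones + 0.3 * zeros
```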
arXiv Detail & Related papers (2024-07-01T17:01:54Z)
- RewardBench: Evaluating Reward Models for Language Modeling [100.28366840977966]
We present RewardBench, a benchmark dataset and code-base for evaluation of reward models.
The dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety.
On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods.
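For illustration only, such a trio-style benchmark can be scored by checking how often a reward model rates the chosen response above the rejected one; the field names and `score_fn` interface below are assumptions, not RewardBench's actual API.

```python
# Illustrative representation of a prompt-chosen-rejected trio and a simple
# accuracy metric: the fraction of trios where the chosen response outscores
# the rejected one under a given reward model.
from dataclasses import dataclass

@dataclass
class PreferenceTrio:
    prompt: str
    chosen: str
    rejected: str
    subset: str  # e.g. "chat", "reasoning", or "safety"

def accuracy(trios, score_fn):
    correct = sum(
        score_fn(t.prompt, t.chosen) > score_fn(t.prompt, t.rejected)
        for t in trios
    )
    return correct / len(trios)
```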
arXiv Detail & Related papers (2024-03-20T17:49:54Z)
- ALaRM: Align Language Models via Hierarchical Rewards Modeling [41.79125107279527]
We introduce ALaRM, the first framework modeling hierarchical rewards in reinforcement learning from human feedback.
The framework addresses the limitations of current alignment approaches, by integrating holistic rewards with aspect-specific rewards.
We validate our approach through applications in long-form question answering and machine translation tasks.
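As a rough illustration of combining holistic and aspect-specific rewards (a plain weighted sum, not necessarily ALaRM's actual combination scheme):

```python
# Toy combination of a holistic reward with aspect-specific rewards; the
# uniform default weights are an assumption for illustration.
def hierarchical_reward(holistic, aspects, weights=None):
    """holistic: float; aspects: dict mapping aspect name -> float reward."""
    weights = weights or {name: 1.0 / len(aspects) for name in aspects}
    return holistic + sum(weights[name] * value for name, value in aspects.items())

# Example: a long-form QA response scored on factuality and completeness.
print(hierarchical_reward(0.6, {"factuality": 0.9, "completeness": 0.4}))
```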
arXiv Detail & Related papers (2024-03-11T14:28:40Z)
- Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble [67.4269821365504]
Reinforcement Learning from Human Feedback (RLHF) is a widely adopted approach for aligning large language models with human values.
However, RLHF relies on a reward model that is trained with a limited amount of human preference data.
We contribute a reward ensemble method that allows the reward model to make more accurate predictions.
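A hedged sketch of reward ensembling appears below: aggregate scores from several reward models by their mean, optionally shifted down by their standard deviation for a conservative estimate. Both aggregation rules are common choices, not necessarily the ones used in this paper.

```python
# Aggregate per-model reward scores into a single ensemble score.
import statistics

def ensemble_reward(scores, conservative=False):
    """scores: list of scalar rewards, one per ensemble member."""
    mean = statistics.fmean(scores)
    if conservative and len(scores) > 1:
        # Penalize disagreement between ensemble members.
        return mean - statistics.stdev(scores)
    return mean

print(ensemble_reward([0.8, 0.7, 0.9]))                     # plain mean
print(ensemble_reward([0.8, 0.7, 0.9], conservative=True))  # mean - std
```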
arXiv Detail & Related papers (2024-01-30T00:17:37Z)
- Iterative Data Smoothing: Mitigating Reward Overfitting and Overoptimization in RLHF [79.98542868281471]
Reinforcement Learning from Human Feedback (RLHF) is a technique that aligns language models closely with human-centric values.
It is observed that the performance of the reward model degrades after one epoch of training, and optimizing too much against the learned reward model eventually hinders the true objective.
This paper delves into these issues, leveraging theoretical insights to design an improved reward learning algorithm termed 'Iterative Data Smoothing' (IDS).
arXiv Detail & Related papers (2024-01-29T17:43:42Z)
- Secrets of RLHF in Large Language Models Part II: Reward Modeling [134.97964938009588]
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset.
We also introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses.
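For illustration, one way to write such a contrastive objective over reward scores is a softmax classification that prefers the chosen response over the rejected one; the temperature and single-negative setup below are assumptions, not the paper's exact formulation.

```python
# Contrastive-style loss over reward scores: treat each (chosen, rejected)
# pair as a two-way classification with the chosen response as the target.
import torch
import torch.nn.functional as F

def contrastive_reward_loss(chosen_scores, rejected_scores, temperature=1.0):
    # logits: (B, 2), chosen score in column 0.
    logits = torch.stack([chosen_scores, rejected_scores], dim=1) / temperature
    targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)
```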
arXiv Detail & Related papers (2024-01-11T17:56:59Z)
- Scaling Laws for Reward Model Overoptimization [19.93331579503503]
We study how the gold reward model score changes as we optimize against the proxy reward model using either reinforcement learning or best-of-$n$ sampling.
We also study the effect on this relationship of the size of the reward model dataset, the number of reward model and policy parameters, and the coefficient of the KL penalty added to the reward in the reinforcement learning setup.
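For reference, the two optimization modes mentioned here can be illustrated with small assumed-interface helpers: best-of-n selection under a proxy reward model, and the KL-penalized reward typically used in the reinforcement learning setup.

```python
# Best-of-n sampling against a proxy reward model, and a KL-penalized reward;
# the function interfaces are assumptions for illustration.
def best_of_n(prompt, sample_fn, proxy_reward_fn, n=16):
    """Draw n samples and keep the one the proxy reward model scores highest."""
    candidates = [sample_fn(prompt) for _ in range(n)]
    return max(candidates, key=lambda resp: proxy_reward_fn(prompt, resp))

def kl_penalized_reward(proxy_reward, logp_policy, logp_ref, beta=0.1):
    """Proxy reward minus a KL-style penalty toward the reference policy."""
    return proxy_reward - beta * (logp_policy - logp_ref)
```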
arXiv Detail & Related papers (2022-10-19T17:56:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.