Related papers: Robust Preference Optimization through Reward Model Distillation

Robust Preference Optimization through Reward Model Distillation

URL: http://arxiv.org/abs/2405.19316v1
Date: Wed, 29 May 2024 17:39:48 GMT
Title: Robust Preference Optimization through Reward Model Distillation
Authors: Adam Fisch, Jacob Eisenstein, Vicky Zayats, Alekh Agarwal, Ahmad Beirami, Chirag Nagpal, Pete Shaw, Jonathan Berant,
Abstract summary: Language model (LM) post-training involves maximizing a reward function that is derived from preference annotations. DPO is a popular offline alignment method that trains a policy directly on preference data without the need to train a reward model or apply reinforcement learning. We analyze this phenomenon and propose distillation to get a better proxy for the true preference distribution over generation pairs.
Score: 68.65844394615702
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Language model (LM) post-training (or alignment) involves maximizing a reward function that is derived from preference annotations. Direct Preference Optimization (DPO) is a popular offline alignment method that trains a policy directly on preference data without the need to train a reward model or apply reinforcement learning. However, typical preference datasets have only a single, or at most a few, annotation per preference pair, which causes DPO to overconfidently assign rewards that trend towards infinite magnitude. This frequently leads to degenerate policies, sometimes causing even the probabilities of the preferred generations to go to zero. In this work, we analyze this phenomenon and propose distillation to get a better proxy for the true preference distribution over generation pairs: we train the LM to produce probabilities that match the distribution induced by a reward model trained on the preference data. Moreover, to account for uncertainty in the reward model we are distilling from, we optimize against a family of reward models that, as a whole, is likely to include at least one reasonable proxy for the preference distribution. Our results show that distilling from such a family of reward models leads to improved robustness to distribution shift in preference annotations, while preserving the simple supervised nature of DPO.

Related papers

Positive-Unlabeled Reinforcement Learning Distillation for On-Premise Small Models [130.8912476550625]
We propose a positive-unlabeled (PU) reinforcement learning distillation method for on-premise small-model deployment.<n>Our method distills the teacher's preference-optimization capability from black-box generations into a locally trainable student.<n>Experiments demonstrate that our method achieves consistently strong performance under a low-cost setting.
arXiv Detail & Related papers (2026-01-28T15:14:50Z)
A Principled Loss Function for Direct Language Model Alignment [0.0]
We propose a novel loss function derived directly from the RLHF optimality condition.<n>Our proposed loss targets a specific finite value for the logits, which is dictated by the underlying reward, rather than its difference.<n>This inherent stability prevents reward hacking and leads to more effective alignment.
arXiv Detail & Related papers (2025-08-10T01:56:58Z)
Towards Reliable, Uncertainty-Aware Alignment [12.63619480522393]
We study the variability of reward model training on open-source benchmarks.<n>We propose a variance-aware policy optimization framework for preference-based alignment.
arXiv Detail & Related papers (2025-07-21T12:51:29Z)
Reward Models Can Improve Themselves: Reward-Guided Adversarial Failure Mode Discovery for Robust Reward Modeling [27.11560841914813]
We introduce REFORM, a self-improving reward modeling framework that enhances robustness by using the reward model itself to guide the generation of falsely scored responses.<n>We evaluate REFORM on two widely used preference datasets Anthropic Helpful Harmless (HH) and PKU Beavertails.
arXiv Detail & Related papers (2025-07-08T21:56:33Z)
Fake it till You Make it: Reward Modeling as Discriminative Prediction [49.31309674007382]
GAN-RM is an efficient reward modeling framework that eliminates manual preference annotation and explicit quality dimension engineering.<n>Our method trains the reward model through discrimination between a small set of representative, unpaired target samples.<n>Experiments demonstrate our GAN-RM's effectiveness across multiple key applications.
arXiv Detail & Related papers (2025-06-16T17:59:40Z)
Self-NPO: Negative Preference Optimization of Diffusion Models by Simply Learning from Itself without Explicit Preference Annotations [60.143658714894336]
Diffusion models have demonstrated remarkable success in various visual generation tasks, including image, video, and 3D content generation.<n> Preference optimization (PO) is a prominent and growing area of research that aims to align these models with human preferences.<n>We introduce Self-NPO, a Negative Preference Optimization approach that learns exclusively from the model itself.
arXiv Detail & Related papers (2025-05-17T01:03:46Z)
DPR: Diffusion Preference-based Reward for Offline Reinforcement Learning [30.654668373387214]
We propose a novel preference-based reward acquisition method: Diffusion Preference-based Reward (DPR) DPR uses diffusion models to directly model preference distributions for state-action pairs, allowing rewards to be discriminatively obtained from these distributions. We apply the above methods to existing offline reinforcement learning algorithms and a series of experiment results demonstrate that the diffusion-based reward acquisition approach outperforms previous-based and Transformer-based methods.
arXiv Detail & Related papers (2025-03-03T03:49:38Z)
Calibrated Multi-Preference Optimization for Aligning Diffusion Models [92.90660301195396]
Calibrated Preference Optimization (CaPO) is a novel method to align text-to-image (T2I) diffusion models. CaPO incorporates the general preference from multiple reward models without human annotated data. Experimental results show that CaPO consistently outperforms prior methods.
arXiv Detail & Related papers (2025-02-04T18:59:23Z)
Uncertainty-Penalized Direct Preference Optimization [52.387088396044206]
We develop a pessimistic framework for DPO by introducing preference uncertainty penalization schemes. The penalization serves as a correction to the loss which attenuates the loss gradient for uncertain samples. We show improved overall performance compared to vanilla DPO, as well as better completions on prompts from high-uncertainty chosen/rejected responses.
arXiv Detail & Related papers (2024-10-26T14:24:37Z)
Simultaneous Reward Distillation and Preference Learning: Get You a Language Model Who Can Do Both [6.102274021710727]
Direct Reward Distillation and policy-Optimization (DRDO) is a supervised knowledge distillation-based preference alignment method. DRDO directly mimics rewards assigned by an oracle while learning human preferences from a novel preference likelihood formulation. Our experimental results on the Ultrafeedback and TL;DR datasets demonstrate that policies trained using DRDO surpass previous methods.
arXiv Detail & Related papers (2024-10-11T02:19:11Z)
General Preference Modeling with Preference Representations for Aligning Language Models [51.14207112118503]
We introduce preference representation learning, an approach that embeds responses into a latent space to capture intricate preference structures efficiently. We also propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback. Our method may enhance the alignment of foundation models with nuanced human values.
arXiv Detail & Related papers (2024-10-03T04:22:55Z)
Quantile Regression for Distributional Reward Models in RLHF [1.8130068086063336]
We introduce Quantile Reward Models (QRMs), a novel approach to reward modeling that learns a distribution over rewards instead of a single scalar value. Our method uses quantile regression to estimate a full, potentially multimodal distribution over preferences, providing a more powerful and nuanced representation of preferences. Our experimental results show that QRM outperforms comparable traditional point-estimate models on RewardBench.
arXiv Detail & Related papers (2024-09-16T10:54:04Z)
Bridging Model-Based Optimization and Generative Modeling via Conservative Fine-Tuning of Diffusion Models [54.132297393662654]
We introduce a hybrid method that fine-tunes cutting-edge diffusion models by optimizing reward models through RL. We demonstrate the capability of our approach to outperform the best designs in offline data, leveraging the extrapolation capabilities of reward models.
arXiv Detail & Related papers (2024-05-30T03:57:29Z)
Rejection via Learning Density Ratios [50.91522897152437]
Classification with rejection emerges as a learning paradigm which allows models to abstain from making predictions. We propose a different distributional perspective, where we seek to find an idealized data distribution which maximizes a pretrained model's performance. Our framework is tested empirically over clean and noisy datasets.
arXiv Detail & Related papers (2024-05-29T01:32:17Z)
Direct Preference Optimization With Unobserved Preference Heterogeneity [16.91835461818937]
This paper presents a new method to align generative models with varied human preferences. We propose an Expectation-Maximization adaptation to DPO, generating a mixture of models based on latent preference types of the annotators. Our algorithms leverage the simplicity of DPO while accommodating diverse preferences.
arXiv Detail & Related papers (2024-05-23T21:25:20Z)
Soft Preference Optimization: Aligning Language Models to Expert Distributions [40.84391304598521]
SPO is a method for aligning generative models, such as Large Language Models (LLMs), with human preferences. SPO integrates preference loss with a regularization term across the model's entire output distribution. We showcase SPO's methodology, its theoretical foundation, and its comparative advantages in simplicity, computational efficiency, and alignment precision.
arXiv Detail & Related papers (2024-04-30T19:48:55Z)
RewardBench: Evaluating Reward Models for Language Modeling [100.28366840977966]
We present RewardBench, a benchmark dataset and code-base for evaluation of reward models. The dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety. On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods.
arXiv Detail & Related papers (2024-03-20T17:49:54Z)
Aligning Crowd Feedback via Distributional Preference Reward Modeling [28.754532173765686]
We propose the Distributional Preference Reward Model (DPRM) to align large language models with diverse human preferences. Our experiments show that DPRM significantly enhances the alignment of LLMs with population preference, yielding more accurate, unbiased, and contextually appropriate responses.
arXiv Detail & Related papers (2024-02-15T07:29:43Z)
Statistical Rejection Sampling Improves Preference Optimization [42.57245965632205]
We introduce a novel approach to source preference data from the target optimal policy using rejection sampling. We also propose a unified framework that enhances the loss functions used in both Sequence Likelihood (SLiC) and Direct Preference Optimization (DPO) from a preference modeling standpoint.
arXiv Detail & Related papers (2023-09-13T01:07:25Z)
Scaling Laws for Reward Model Overoptimization [19.93331579503503]
We study how the gold reward model score changes as we optimize against the proxy reward model using either reinforcement learning or best-of-$n$ sampling. We also study the effect on this relationship of the size of the reward model dataset, the number of reward model and policy parameters, and the coefficient of the KL penalty added to the reward in the reinforcement learning setup.
arXiv Detail & Related papers (2022-10-19T17:56:10Z)
Models of human preference for learning reward functions [80.39289349661364]
We learn the reward function from human-generated preferences between pairs of trajectory segments. We find this assumption to be flawed and propose modeling human preferences as informed by each segment's regret. Our proposed regret preference model better predicts real human preferences and also learns reward functions from these preferences that lead to policies that are better human-aligned.
arXiv Detail & Related papers (2022-06-05T17:58:02Z)

This list is automatically generated from the titles and abstracts of the papers in this site.