Everyone Deserves A Reward: Learning Customized Human Preferences
- URL: http://arxiv.org/abs/2309.03126v2
- Date: Fri, 15 Sep 2023 09:24:30 GMT
- Title: Everyone Deserves A Reward: Learning Customized Human Preferences
- Authors: Pengyu Cheng, Jiawen Xie, Ke Bai, Yong Dai, Nan Du
- Abstract summary: Reward models (RMs) are essential for aligning large language models with human preferences to improve interaction quality.
We propose a three-stage customized RM learning scheme, then empirically verify its effectiveness on both general preference datasets and our DSP set.
We find several ways to better preserve the general preference ability while training the customized RMs.
- Score: 25.28261194665836
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reward models (RMs) are essential for aligning large language models (LLMs)
with human preferences to improve interaction quality. However, the real world
is pluralistic, which leads to diversified human preferences with respect to
different religions, politics, cultures, etc. Moreover, each individual can
have their unique preferences on various topics. Neglecting this diversity,
current human-feedback alignment methods consider only a general reward model,
which falls short in customized or personalized application scenarios. To
explore customized preference learning,
we collect a domain-specific preference (DSP) dataset, which includes preferred
responses for each given query from four practical domains. Besides, from the
perspective of data efficiency, we propose a three-stage customized RM learning
scheme, then empirically verify its effectiveness on both general preference
datasets and our DSP set. Furthermore, we test multiple training and data
strategies on the three learning stages. We find several ways to better
preserve the general preference ability while training the customized RMs,
especially general preference enrichment and customized preference imitation
learning. The DSP dataset and code are available at
https://github.com/Linear95/DSP.
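For context, reward models like those in the abstract are typically fit to preference pairs with a pairwise Bradley-Terry objective: the model should score the preferred response above the rejected one. The sketch below illustrates that standard objective only, not the paper's three-stage scheme; the function name `pairwise_rm_loss` and the toy scores are illustrative assumptions.

```python
import math

def pairwise_rm_loss(chosen_scores, rejected_scores):
    """Negative log-likelihood of the Bradley-Terry preference model:
    mean over pairs of -log sigmoid(r_chosen - r_rejected)."""
    losses = [
        math.log1p(math.exp(-(c - r)))  # equals -log sigmoid(c - r)
        for c, r in zip(chosen_scores, rejected_scores)
    ]
    return sum(losses) / len(losses)

# The loss shrinks as the model separates preferred from rejected responses.
barely = pairwise_rm_loss([0.1], [0.0])  # scores nearly tied
clear = pairwise_rm_loss([3.0], [0.0])   # preferred response scored much higher
assert clear < barely
```

Under this objective, equal scores give a loss of log 2 per pair, and the loss decays toward zero as the margin between chosen and rejected scores grows.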
Related papers
- Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback [110.16220825629749]
Learning from preference feedback has emerged as an essential step for improving the generation quality and performance of modern language models.
In this work, we identify four core aspects of preference-based learning: preference data, learning algorithm, reward model, and policy training prompts.
Our findings indicate that all aspects are important for performance, with better preference data leading to the largest improvements.
arXiv Detail & Related papers (2024-06-13T16:17:21Z)
- PAL: Pluralistic Alignment Framework for Learning from Heterogeneous Preferences [6.398937923320069]
We propose PAL, a framework to model human preference complementary to existing pretraining strategies.
We show that PAL achieves competitive reward model accuracy compared to strong baselines.
arXiv Detail & Related papers (2024-06-12T17:54:54Z)
- Adaptive Preference Scaling for Reinforcement Learning with Human Feedback [103.36048042664768]
Reinforcement learning from human feedback (RLHF) is a prevalent approach to align AI systems with human values.
We propose a novel adaptive preference loss, underpinned by distributionally robust optimization (DRO).
Our method is versatile and can be readily adapted to various preference optimization frameworks.
arXiv Detail & Related papers (2024-06-04T20:33:22Z)
- MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences [101.57443597426374]
Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a singular reward model derived from preference data.
We learn a mixture of preference distributions via an expectation-maximization algorithm to better represent diverse human preferences.
Our algorithm achieves an average improvement of more than 16% in win-rates over conventional RLHF algorithms.
arXiv Detail & Related papers (2024-02-14T03:56:27Z)
- Personalized Language Modeling from Personalized Human Feedback [49.344833339240566]
Reinforcement Learning from Human Feedback (RLHF) is commonly used to fine-tune large language models to better align with human preferences.
In this work, we address the lack of personalization by developing methods for building personalized language models.
arXiv Detail & Related papers (2024-02-06T04:18:58Z)
- Promptable Behaviors: Personalizing Multi-Objective Rewards from Human Preferences [53.353022588751585]
We present Promptable Behaviors, a novel framework that facilitates efficient personalization of robotic agents to diverse human preferences.
We introduce three distinct methods to infer human preferences by leveraging different types of interactions.
We evaluate the proposed method in personalized object-goal navigation and flee navigation tasks in ProcTHOR and RoboTHOR.
arXiv Detail & Related papers (2023-12-14T21:00:56Z)
- Personalized Soups: Personalized Large Language Model Alignment via Post-hoc Parameter Merging [148.77027765872006]
We study the Reinforcement Learning from Personalized Human Feedback (RLPHF) problem.
LLMs are aligned to multiple preferences by modeling alignment as a Multi-Objective Reinforcement Learning (MORL) problem.
We show that we can achieve personalized alignment by decomposing preferences into multiple dimensions.
arXiv Detail & Related papers (2023-10-17T20:22:13Z)
- Models of human preference for learning reward functions [80.39289349661364]
We learn the reward function from human-generated preferences between pairs of trajectory segments.
Prior work assumes such preferences are informed by each segment's partial return; we find this assumption to be flawed and propose modeling human preferences as informed by each segment's regret.
Our proposed regret preference model better predicts real human preferences and also learns reward functions from these preferences that lead to policies that are better human-aligned.
arXiv Detail & Related papers (2022-06-05T17:58:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.