Related papers: Everyone Deserves A Reward: Learning Customized Human Preferences

Everyone Deserves A Reward: Learning Customized Human Preferences

URL: http://arxiv.org/abs/2309.03126v2
Date: Fri, 15 Sep 2023 09:24:30 GMT
Title: Everyone Deserves A Reward: Learning Customized Human Preferences
Authors: Pengyu Cheng, Jiawen Xie, Ke Bai, Yong Dai, Nan Du
Abstract summary: Reward models (RMs) are essential for aligning large language models with human preferences to improve interaction quality. We propose a three-stage customized RM learning scheme, then empirically verify its effectiveness on both general preference datasets and our DSP set. We find several ways to better preserve the general preferring ability while training the customized RMs.
Score: 25.28261194665836
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reward models (RMs) are essential for aligning large language models (LLMs) with human preferences to improve interaction quality. However, the real world is pluralistic, which leads to diversified human preferences with respect to different religions, politics, cultures, etc. Moreover, each individual can have their unique preferences on various topics. Neglecting the diversity of human preferences, current human feedback aligning methods only consider a general reward model, which is below satisfaction for customized or personalized application scenarios. To explore customized preference learning, we collect a domain-specific preference (DSP) dataset, which includes preferred responses for each given query from four practical domains. Besides, from the perspective of data efficiency, we propose a three-stage customized RM learning scheme, then empirically verify its effectiveness on both general preference datasets and our DSP set. Furthermore, we test multiple training and data strategies on the three learning stages. We find several ways to better preserve the general preferring ability while training the customized RMs, especially general preference enrichment, and customized preference imitation learning. The DSP dataset and code are available at https://github.com/Linear95/DSP.

Related papers

FSPO: Few-Shot Preference Optimization of Synthetic Preference Data in LLMs Elicits Effective Personalization to Real Users [111.56469697145519]
We propose Few-Shot Preference Optimization, which reframes reward modeling as a meta-learning problem. Under this framework, an LLM learns to quickly adapt to a user via a few labeled preferences from that user, constructing a personalized reward function for them. We generate over 1M synthetic personalized preferences using publicly available LLMs. We evaluate FSPO on personalized open-ended generation for up to 1,500 synthetic users across three domains: movie reviews, pedagogical adaptation based on educational background, and general question answering, along with a controlled human study.
arXiv Detail & Related papers (2025-02-26T17:08:46Z)
When Personalization Meets Reality: A Multi-Faceted Analysis of Personalized Preference Learning [23.557084253364174]
Reinforcement Learning from Human Feedback (RLHF) typically assumes homogeneous preferences across users, overlooking diverse human values and minority viewpoints. We present a multi-faceted evaluation framework that measures not only performance but also fairness, unintended effects, and adaptability across varying levels of preference divergence. These findings highlight the critical need for holistic evaluation approaches to advance the development of more effective and inclusive preference learning systems.
arXiv Detail & Related papers (2025-02-26T14:14:58Z)
Rethinking Diverse Human Preference Learning through Principal Component Analysis [22.123631189289963]
We introduce Decomposed Reward Models (DRMs), a novel approach that extracts diverse human preferences from binary comparisons. Our key insight is to represent human preferences as vectors and analyze them using Principal Component Analysis (PCA) DRMs effectively extract meaningful preference dimensions (e.g., helpfulness, safety, humor) and adapt to new users without additional training.
arXiv Detail & Related papers (2025-02-18T18:55:26Z)
Few-shot Steerable Alignment: Adapting Rewards and LLM Policies with Neural Processes [50.544186914115045]
Large language models (LLMs) are increasingly embedded in everyday applications. Ensuring their alignment with the diverse preferences of individual users has become a critical challenge. We present a novel framework for few-shot steerable alignment.
arXiv Detail & Related papers (2024-12-18T16:14:59Z)
Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback [87.37721254914476]
We introduce a routing framework that combines inputs from humans and LMs to achieve better annotation quality. We train a performance prediction model to predict a reward model's performance on an arbitrary combination of human and LM annotations. We show that the selected hybrid mixture achieves better reward model performance compared to using either one exclusively.
arXiv Detail & Related papers (2024-10-24T20:04:15Z)
LRHP: Learning Representations for Human Preferences via Preference Pairs [45.056558199304554]
We introduce a preference representation learning task that aims to construct a richer and more structured representation of human preferences. We verify the utility of preference representations in two downstream tasks: preference data selection and preference margin prediction.
arXiv Detail & Related papers (2024-10-06T14:48:28Z)
Aligning LLMs with Individual Preferences via Interaction [51.72200436159636]
We train large language models (LLMs) that can ''interact to align'' We develop a multi-turn preference dataset containing 3K+ multi-turn conversations in tree structures. For evaluation, we establish the ALOE benchmark, consisting of 100 carefully selected examples and well-designed metrics to measure the customized alignment performance during conversations.
arXiv Detail & Related papers (2024-10-04T17:48:29Z)
Towards Data-Centric RLHF: Simple Metrics for Preference Dataset Comparison [9.324894567200582]
We systematically study preference datasets through three perspectives: scale, label noise, and information content. Our work is a first step towards a data-centric approach to alignment by providing perspectives that aid in training efficiency and iterative data collection for RLHF.
arXiv Detail & Related papers (2024-09-15T03:55:03Z)
Personality Alignment of Large Language Models [26.071445846818914]
Current methods for aligning large language models (LLMs) typically aim to reflect general human values and behaviors. We introduce the concept of Personality Alignment. This approach tailors LLMs' responses and decisions to match the specific preferences of individual users or closely related groups.
arXiv Detail & Related papers (2024-08-21T17:09:00Z)
Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning [12.742158403867002]
Reinforcement Learning from Human Feedback is a powerful paradigm for aligning foundation models to human values and preferences. Current RLHF techniques cannot account for the naturally occurring differences in individual human preferences across a diverse population. We develop a class of multimodal RLHF methods to address the need for pluralistic alignment.
arXiv Detail & Related papers (2024-08-19T15:18:30Z)
Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback [110.16220825629749]
Learning from preference feedback has emerged as an essential step for improving the generation quality and performance of modern language models. In this work, we identify four core aspects of preference-based learning: preference data, learning algorithm, reward model, and policy training prompts. Our findings indicate that all aspects are important for performance, with better preference data leading to the largest improvements.
arXiv Detail & Related papers (2024-06-13T16:17:21Z)
Adaptive Preference Scaling for Reinforcement Learning with Human Feedback [103.36048042664768]
Reinforcement learning from human feedback (RLHF) is a prevalent approach to align AI systems with human values. We propose a novel adaptive preference loss, underpinned by distributionally robust optimization (DRO) Our method is versatile and can be readily adapted to various preference optimization frameworks.
arXiv Detail & Related papers (2024-06-04T20:33:22Z)
Promptable Behaviors: Personalizing Multi-Objective Rewards from Human Preferences [53.353022588751585]
We present Promptable Behaviors, a novel framework that facilitates efficient personalization of robotic agents to diverse human preferences. We introduce three distinct methods to infer human preferences by leveraging different types of interactions. We evaluate the proposed method in personalized object-goal navigation and flee navigation tasks in ProcTHOR and RoboTHOR.
arXiv Detail & Related papers (2023-12-14T21:00:56Z)
Sample Efficient Preference Alignment in LLMs via Active Exploration [63.84454768573154]
We take advantage of the fact that one can often choose contexts at which to obtain human feedback to most efficiently identify a good policy. We propose an active exploration algorithm to efficiently select the data and provide theoretical proof that it has a worst-case regret bound. Our method outperforms the baselines with limited samples of human preferences on several language models and four real-world datasets.
arXiv Detail & Related papers (2023-12-01T00:54:02Z)
Personalized Soups: Personalized Large Language Model Alignment via Post-hoc Parameter Merging [148.77027765872006]
We study Reinforcement Learning from Personalized Human Feedback (RLPHF) problem. LLMs are aligned to multiple preferences by modeling alignment as a Multi-Objective Reinforcement Learning (MORL) problem. We show that we can achieve personalized alignment by decomposing preferences into multiple dimensions.
arXiv Detail & Related papers (2023-10-17T20:22:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.