LRHP: Learning Representations for Human Preferences via Preference Pairs
- URL: http://arxiv.org/abs/2410.04503v1
- Date: Sun, 6 Oct 2024 14:48:28 GMT
- Title: LRHP: Learning Representations for Human Preferences via Preference Pairs
- Authors: Chenglong Wang, Yang Gan, Yifu Huo, Yongyu Mu, Qiaozhi He, Murun Yang, Tong Xiao, Chunliang Zhang, Tongran Liu, Jingbo Zhu
- Abstract summary: We introduce a preference representation learning task that aims to construct a richer and more structured representation of human preferences.
We verify the utility of preference representations in two downstream tasks: preference data selection and preference margin prediction.
- Score: 45.056558199304554
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To improve human-preference alignment training, current research has developed numerous preference datasets consisting of preference pairs labeled as "preferred" or "dispreferred". These preference pairs are typically used to encode human preferences into a single numerical value through reward modeling, which acts as a reward signal during reinforcement learning from human feedback (RLHF). However, representing human preferences as a single numerical value complicates the analysis of those preferences and restricts their broader application beyond RLHF. In contrast, in this work, we introduce a preference representation learning task that aims to construct a richer and more structured representation of human preferences. We further develop a more generalizable framework, Learning Representations for Human Preferences via preference pairs (LRHP), which extends beyond traditional reward modeling to tackle this task. We verify the utility of preference representations in two downstream tasks: preference data selection and preference margin prediction. Building on these preference representations, we achieve strong performance in both tasks, significantly outperforming baselines.
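For readers less familiar with the pipeline the abstract contrasts against, the sketch below shows standard Bradley-Terry reward modeling over preference pairs, which compresses each pair into a single reward difference. The encoder interface and last-token pooling are assumptions made for illustration; this is not the LRHP implementation.

```python
# Minimal Bradley-Terry reward-modeling sketch (illustrative only, not the LRHP code).
# A scalar head on top of a hypothetical text encoder maps each (prompt, response)
# pair to one number; the pairwise loss pushes r(preferred) above r(dispreferred),
# which is exactly the "single numerical value" compression discussed above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScalarRewardModel(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_size: int):
        super().__init__()
        self.encoder = encoder                        # assumed to return [batch, seq, hidden]
        self.reward_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids, attention_mask)   # [B, T, H] (assumed interface)
        pooled = hidden[:, -1, :]                          # last-token pooling (one common choice)
        return self.reward_head(pooled).squeeze(-1)        # [B] scalar rewards

def bradley_terry_loss(r_preferred: torch.Tensor, r_dispreferred: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_w - r_l): the standard pairwise reward-modeling objective used in RLHF.
    return -F.logsigmoid(r_preferred - r_dispreferred).mean()
```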
Related papers
- Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data [56.72693923645586]
RefAlign is a versatile REINFORCE-style alignment algorithm free of reference and reward models.
It can be readily extended to diverse scenarios, such as safety and confidence alignment, by incorporating the similarity reward with task-related objectives.
arXiv Detail & Related papers (2025-04-14T05:43:21Z)
- Learning a Canonical Basis of Human Preferences from Binary Ratings [28.975782992900065]
This paper shifts the focus to understanding the preferences encoded in such datasets and identifying common human preferences.
We find that a small subset of 21 preference categories captures >89% of preference variation across individuals.
This small set of preferences is analogous to a canonical basis of human preferences, similar to established findings that characterize human variation in psychology or facial recognition studies.
arXiv Detail & Related papers (2025-03-31T14:35:48Z)
- Rethinking Diverse Human Preference Learning through Principal Component Analysis [22.123631189289963]
We introduce Decomposed Reward Models (DRMs), a novel approach that extracts diverse human preferences from binary comparisons.
Our key insight is to represent human preferences as vectors and analyze them using Principal Component Analysis (PCA).
DRMs effectively extract meaningful preference dimensions (e.g., helpfulness, safety, humor) and adapt to new users without additional training; a toy sketch of this vectors-plus-PCA idea appears after this list.
arXiv Detail & Related papers (2025-02-18T18:55:26Z)
- Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback [87.37721254914476]
We introduce a routing framework that combines inputs from humans and LMs to achieve better annotation quality.
We train a performance prediction model to predict a reward model's performance on an arbitrary combination of human and LM annotations.
We show that the selected hybrid mixture achieves better reward model performance compared to using either one exclusively.
arXiv Detail & Related papers (2024-10-24T20:04:15Z)
- PREDICT: Preference Reasoning by Evaluating Decomposed preferences Inferred from Candidate Trajectories [3.0102456679931944]
This paper introduces PREDICT, a method designed to enhance the precision and adaptability of inferring preferences.
We evaluate PREDICT on two distinct environments: a gridworld setting and a new text-domain environment.
arXiv Detail & Related papers (2024-10-08T18:16:41Z)
- General Preference Modeling with Preference Representations for Aligning Language Models [51.14207112118503]
We introduce preference representation learning, an approach that embeds responses into a latent space to capture intricate preference structures efficiently.
We also propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback.
Our method may enhance the alignment of foundation models with nuanced human values.
arXiv Detail & Related papers (2024-10-03T04:22:55Z)
- PAL: Pluralistic Alignment Framework for Learning from Heterogeneous Preferences [6.398937923320069]
We propose PAL, a framework to model human preference complementary to existing pretraining strategies.
We show that PAL achieves competitive reward model accuracy compared to strong baselines.
arXiv Detail & Related papers (2024-06-12T17:54:54Z)
- Pragmatic Feature Preferences: Learning Reward-Relevant Preferences from Human Input [17.131441665935128]
We study how to extract fine-grained data about why an example is preferred, which is useful for learning more accurate reward models.
Our findings suggest that incorporating pragmatic feature preferences is a promising approach for more efficient user-aligned reward learning.
arXiv Detail & Related papers (2024-05-23T16:36:16Z)
- Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization [105.3612692153615]
A common technique for aligning large language models (LLMs) relies on acquiring human preferences.
We propose a new axis based on eliciting preferences jointly over instruction-response pairs.
We find that joint preferences over instruction and response pairs can significantly enhance the alignment of LLMs.
arXiv Detail & Related papers (2024-03-31T02:05:40Z)
- MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences [101.57443597426374]
Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a single reward model derived from preference data.
We learn a mixture of preference distributions via an expectation-maximization algorithm to better represent diverse human preferences.
Our algorithm achieves an average improvement of more than 16% in win-rates over conventional RLHF algorithms.
arXiv Detail & Related papers (2024-02-14T03:56:27Z)
- Models of human preference for learning reward functions [80.39289349661364]
We learn the reward function from human-generated preferences between pairs of trajectory segments.
We find the common assumption that preferences arise from each segment's partial return to be flawed and instead propose modeling human preferences as informed by each segment's regret.
Our proposed regret preference model better predicts real human preferences and also learns reward functions from these preferences that lead to better human-aligned policies (see the toy sketch after this list).
arXiv Detail & Related papers (2022-06-05T17:58:02Z)
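As a toy illustration of the last entry above ("Models of human preference for learning reward functions"), the sketch below contrasts a partial-return preference probability with a regret-based one. The logistic link and the simple undiscounted notion of regret used here are assumptions for illustration, not the paper's exact formulation.

```python
# Toy contrast between a partial-return preference model and a regret-based one.
# All quantities are plain floats; the value estimates v_* stand in for an (assumed)
# optimal state-value function evaluated at a segment's first and last states.
import numpy as np

def partial_return_pref(ret1: float, ret2: float) -> float:
    # P(segment 1 preferred) as a logistic in the difference of summed rewards.
    return 1.0 / (1.0 + np.exp(-(ret1 - ret2)))

def segment_regret(ret: float, v_start: float, v_end: float) -> float:
    # Undiscounted shortfall: how far the achieved return (plus the value of where the
    # segment ends up) falls below acting optimally from the segment's start state.
    return v_start - (ret + v_end)

def regret_pref(ret1, v1_start, v1_end, ret2, v2_start, v2_end) -> float:
    # P(segment 1 preferred): lower-regret segments are more likely to be preferred.
    r1 = segment_regret(ret1, v1_start, v1_end)
    r2 = segment_regret(ret2, v2_start, v2_end)
    return 1.0 / (1.0 + np.exp(-(r2 - r1)))

# Two segments with identical summed reward but different end states: the partial-return
# model is indifferent (0.5), while the regret model prefers the one ending in a better state.
print(partial_return_pref(1.0, 1.0))                  # 0.5
print(regret_pref(1.0, 2.0, 3.0, 1.0, 2.0, 0.0))      # > 0.5
```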
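And for the Decomposed Reward Models entry referenced earlier in the list, here is a minimal sketch of the "preferences as vectors plus PCA" idea. The random stand-in embeddings and the chosen-minus-rejected feature construction are illustrative assumptions, not the paper's actual pipeline.

```python
# Toy sketch: treat each preference pair as a vector and look for shared preference axes.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_pairs, dim = 500, 64
emb_chosen = rng.normal(size=(n_pairs, dim))     # stand-ins for chosen-response embeddings
emb_rejected = rng.normal(size=(n_pairs, dim))   # stand-ins for rejected-response embeddings

# One "preference vector" per pair: the direction separating chosen from rejected.
pref_vectors = emb_chosen - emb_rejected

pca = PCA(n_components=5)
weights = pca.fit_transform(pref_vectors)        # per-pair weights on each candidate axis
print(pca.explained_variance_ratio_)             # variance explained by each latent dimension

# Each principal axis (a row of pca.components_) can then be inspected as a candidate
# preference dimension (e.g., helpfulness-like or safety-like) and used to score responses.
```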