LRHP: Learning Representations for Human Preferences via Preference Pairs
- URL: http://arxiv.org/abs/2410.04503v1
- Date: Sun, 6 Oct 2024 14:48:28 GMT
- Title: LRHP: Learning Representations for Human Preferences via Preference Pairs
- Authors: Chenglong Wang, Yang Gan, Yifu Huo, Yongyu Mu, Qiaozhi He, Murun Yang, Tong Xiao, Chunliang Zhang, Tongran Liu, Jingbo Zhu
- Abstract summary: We introduce a preference representation learning task that aims to construct a richer and more structured representation of human preferences.
We verify the utility of preference representations in two downstream tasks: preference data selection and preference margin prediction.
- Score: 45.056558199304554
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To improve human-preference alignment training, current research has developed numerous preference datasets consisting of preference pairs labeled as "preferred" or "dispreferred". These preference pairs are typically used to encode human preferences into a single numerical value through reward modeling, which acts as a reward signal during reinforcement learning from human feedback (RLHF). However, representing human preferences as a single numerical value complicates the analysis of those preferences and restricts their broader application beyond RLHF. In contrast, in this work, we introduce a preference representation learning task that aims to construct a richer and more structured representation of human preferences. We further develop a more generalizable framework, Learning Representations for Human Preferences via preference pairs (LRHP), which extends beyond traditional reward modeling to tackle this task. We verify the utility of preference representations in two downstream tasks: preference data selection and preference margin prediction. Building on these preference representations, we achieve strong performance in both tasks, significantly outperforming baselines.
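For readers less familiar with the pipeline the abstract contrasts against, the sketch below shows standard Bradley-Terry reward modeling over preference pairs, which compresses each pair into a single reward difference. The encoder interface and last-token pooling are assumptions made for illustration; this is not the LRHP implementation.

```python
# Minimal Bradley-Terry reward-modeling sketch (illustrative only, not the LRHP code).
# A scalar head on top of a hypothetical text encoder maps each (prompt, response)
# pair to one number; the pairwise loss pushes r(preferred) above r(dispreferred),
# which is exactly the "single numerical value" compression discussed above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScalarRewardModel(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_size: int):
        super().__init__()
        self.encoder = encoder                        # assumed to return [batch, seq, hidden]
        self.reward_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids, attention_mask)   # [B, T, H] (assumed interface)
        pooled = hidden[:, -1, :]                          # last-token pooling (one common choice)
        return self.reward_head(pooled).squeeze(-1)        # [B] scalar rewards

def bradley_terry_loss(r_preferred: torch.Tensor, r_dispreferred: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_w - r_l): the standard pairwise reward-modeling objective used in RLHF.
    return -F.logsigmoid(r_preferred - r_dispreferred).mean()
```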
Related papers
- Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data [56.72693923645586]
RefAlign is a versatile REINFORCE-style alignment algorithm free of reference and reward models.
It can be readily extended to diverse scenarios, such as safety and confidence alignment, by incorporating the similarity reward with task-related objectives.
arXiv Detail & Related papers (2025-04-14T05:43:21Z)
- Learning a Canonical Basis of Human Preferences from Binary Ratings [28.975782992900065]
This paper shifts the focus to understanding the preferences encoded in such datasets and identifying common human preferences.
We find that a small subset of 21 preference categories captures >89% of preference variation across individuals.
This small set of preferences is analogous to a canonical basis of human preferences, similar to established findings that characterize human variation in psychology or facial recognition studies.
arXiv Detail & Related papers (2025-03-31T14:35:48Z)
- Rethinking Diverse Human Preference Learning through Principal Component Analysis [22.123631189289963]
We introduce Decomposed Reward Models (DRMs), a novel approach that extracts diverse human preferences from binary comparisons.
Our key insight is to represent human preferences as vectors and analyze them using Principal Component Analysis (PCA).
DRMs effectively extract meaningful preference dimensions (e.g., helpfulness, safety, humor) and adapt to new users without additional training; a toy sketch of this vectors-plus-PCA idea appears after this list.
arXiv Detail & Related papers (2025-02-18T18:55:26Z)
- Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback [87.37721254914476]
We introduce a routing framework that combines inputs from humans and LMs to achieve better annotation quality.
We train a performance prediction model to predict a reward model's performance on an arbitrary combination of human and LM annotations.
We show that the selected hybrid mixture achieves better reward model performance compared to using either one exclusively.
arXiv Detail & Related papers (2024-10-24T20:04:15Z)
- PREDICT: Preference Reasoning by Evaluating Decomposed preferences Inferred from Candidate Trajectories [3.0102456679931944]
This paper introduces PREDICT, a method designed to enhance the precision and adaptability of inferring preferences.
We evaluate PREDICT on two distinct environments: a gridworld setting and a new text-domain environment.
arXiv Detail & Related papers (2024-10-08T18:16:41Z)
- General Preference Modeling with Preference Representations for Aligning Language Models [51.14207112118503]
We introduce preference representation learning, an approach that embeds responses into a latent space to capture intricate preference structures efficiently.
We also propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback.
Our method may enhance the alignment of foundation models with nuanced human values.
arXiv Detail & Related papers (2024-10-03T04:22:55Z)
- PAL: Pluralistic Alignment Framework for Learning from Heterogeneous Preferences [6.398937923320069]
We propose PAL, a framework to model human preference complementary to existing pretraining strategies.
We show that PAL achieves competitive reward model accuracy compared to strong baselines.
arXiv Detail & Related papers (2024-06-12T17:54:54Z)
- Pragmatic Feature Preferences: Learning Reward-Relevant Preferences from Human Input [17.131441665935128]
We study how to extract fine-grained data about why an example is preferred, which is useful for learning more accurate reward models.
Our findings suggest that incorporating pragmatic feature preferences is a promising approach for more efficient user-aligned reward learning.
arXiv Detail & Related papers (2024-05-23T16:36:16Z)
- Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization [105.3612692153615]
A common technique for aligning large language models (LLMs) relies on acquiring human preferences.
We propose a new axis based on eliciting preferences jointly over instruction-response pairs.
We find that joint preferences over instruction and response pairs can significantly enhance the alignment of LLMs.
arXiv Detail & Related papers (2024-03-31T02:05:40Z)
- MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences [101.57443597426374]
Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a single reward model derived from preference data.
We learn a mixture of preference distributions via an expectation-maximization algorithm to better represent diverse human preferences.
Our algorithm achieves an average improvement of more than 16% in win-rates over conventional RLHF algorithms.
arXiv Detail & Related papers (2024-02-14T03:56:27Z)
- Models of human preference for learning reward functions [80.39289349661364]
We learn the reward function from human-generated preferences between pairs of trajectory segments.
We find the common assumption that preferences arise from each segment's partial return to be flawed and instead propose modeling human preferences as informed by each segment's regret.
Our proposed regret preference model better predicts real human preferences and also learns reward functions from these preferences that lead to better human-aligned policies (see the toy sketch after this list).
arXiv Detail & Related papers (2022-06-05T17:58:02Z)
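As a toy illustration of the last entry above ("Models of human preference for learning reward functions"), the sketch below contrasts a partial-return preference probability with a regret-based one. The logistic link and the simple undiscounted notion of regret used here are assumptions for illustration, not the paper's exact formulation.

```python
# Toy contrast between a partial-return preference model and a regret-based one.
# All quantities are plain floats; the value estimates v_* stand in for an (assumed)
# optimal state-value function evaluated at a segment's first and last states.
import numpy as np

def partial_return_pref(ret1: float, ret2: float) -> float:
    # P(segment 1 preferred) as a logistic in the difference of summed rewards.
    return 1.0 / (1.0 + np.exp(-(ret1 - ret2)))

def segment_regret(ret: float, v_start: float, v_end: float) -> float:
    # Undiscounted shortfall: how far the achieved return (plus the value of where the
    # segment ends up) falls below acting optimally from the segment's start state.
    return v_start - (ret + v_end)

def regret_pref(ret1, v1_start, v1_end, ret2, v2_start, v2_end) -> float:
    # P(segment 1 preferred): lower-regret segments are more likely to be preferred.
    r1 = segment_regret(ret1, v1_start, v1_end)
    r2 = segment_regret(ret2, v2_start, v2_end)
    return 1.0 / (1.0 + np.exp(-(r2 - r1)))

# Two segments with identical summed reward but different end states: the partial-return
# model is indifferent (0.5), while the regret model prefers the one ending in a better state.
print(partial_return_pref(1.0, 1.0))                  # 0.5
print(regret_pref(1.0, 2.0, 3.0, 1.0, 2.0, 0.0))      # > 0.5
```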
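And for the Decomposed Reward Models entry referenced earlier in the list, here is a minimal sketch of the "preferences as vectors plus PCA" idea. The random stand-in embeddings and the chosen-minus-rejected feature construction are illustrative assumptions, not the paper's actual pipeline.

```python
# Toy sketch: treat each preference pair as a vector and look for shared preference axes.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_pairs, dim = 500, 64
emb_chosen = rng.normal(size=(n_pairs, dim))     # stand-ins for chosen-response embeddings
emb_rejected = rng.normal(size=(n_pairs, dim))   # stand-ins for rejected-response embeddings

# One "preference vector" per pair: the direction separating chosen from rejected.
pref_vectors = emb_chosen - emb_rejected

pca = PCA(n_components=5)
weights = pca.fit_transform(pref_vectors)        # per-pair weights on each candidate axis
print(pca.explained_variance_ratio_)             # variance explained by each latent dimension

# Each principal axis (a row of pca.components_) can then be inspected as a candidate
# preference dimension (e.g., helpfulness-like or safety-like) and used to score responses.
```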