Related papers: Rethinking Diverse Human Preference Learning through Principal Component Analysis

URL: http://arxiv.org/abs/2502.13131v1
Date: Tue, 18 Feb 2025 18:55:26 GMT
Title: Rethinking Diverse Human Preference Learning through Principal Component Analysis
Authors: Feng Luo, Rui Yang, Hao Sun, Chunyuan Deng, Jiarui Yao, Jingyan Shen, Huan Zhang, Hanjie Chen
Abstract summary: We introduce Decomposed Reward Models (DRMs), a novel approach that extracts diverse human preferences from binary comparisons. Our key insight is to represent human preferences as vectors and analyze them using Principal Component Analysis (PCA). DRMs effectively extract meaningful preference dimensions (e.g., helpfulness, safety, humor) and adapt to new users without additional training.
Score: 22.123631189289963
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Understanding human preferences is crucial for improving foundation models and building personalized AI systems. However, preferences are inherently diverse and complex, making it difficult for traditional reward models to capture their full range. While fine-grained preference data can help, collecting it is expensive and hard to scale. In this paper, we introduce Decomposed Reward Models (DRMs), a novel approach that extracts diverse human preferences from binary comparisons without requiring fine-grained annotations. Our key insight is to represent human preferences as vectors and analyze them using Principal Component Analysis (PCA). By constructing a dataset of embedding differences between preferred and rejected responses, DRMs identify orthogonal basis vectors that capture distinct aspects of preference. These decomposed rewards can be flexibly combined to align with different user needs, offering an interpretable and scalable alternative to traditional reward models. We demonstrate that DRMs effectively extract meaningful preference dimensions (e.g., helpfulness, safety, humor) and adapt to new users without additional training. Our results highlight DRMs as a powerful framework for personalized and interpretable LLM alignment.
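The decomposition described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`decompose_preferences`, `combined_reward`), the mean-centering step, the random placeholder embeddings, and the per-user weights are all assumptions made for the example.

```python
import numpy as np


def decompose_preferences(chosen_emb, rejected_emb, n_components):
    """Run PCA on embedding differences (preferred minus rejected).

    Returns an array of shape (n_components, dim) whose rows are
    orthonormal directions. Mean-centering is an assumption of this
    sketch; the abstract does not say whether it is applied.
    """
    diffs = chosen_emb - rejected_emb
    centered = diffs - diffs.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by explained variance.
    # The sign of each direction is arbitrary.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:n_components]


def combined_reward(embedding, directions, weights):
    """Score one response embedding as a weighted sum of per-direction rewards."""
    return float(weights @ (directions @ embedding))


# Placeholder data: 200 comparison pairs with 16-dimensional embeddings.
rng = np.random.default_rng(0)
chosen = rng.normal(size=(200, 16))
rejected = rng.normal(size=(200, 16))

directions = decompose_preferences(chosen, rejected, n_components=4)

# Hypothetical weights for one user; choosing weights rather than retraining
# is one way to read "adapt to new users without additional training".
weights = np.array([0.7, 0.2, 0.1, 0.0])
score = combined_reward(chosen[0], directions, weights)
```

Because the directions are orthonormal, each one scores a response along an axis that does not overlap with the others, which is what lets the per-direction rewards be reweighted independently.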

Related papers

SparseRM: A Lightweight Preference Modeling with Sparse Autoencoder [54.31950189922548]
Reward models (RMs) are proxies for human preference evaluation and guiding model alignment. We propose SparseRM, which leverages Sparse Autoencoder (SAE) to extract preference-relevant information encoded in model representations. SparseRM achieves superior performance over most mainstream RMs while using less than 1% of trainable parameters.
arXiv Detail & Related papers (2025-11-11T06:51:56Z)
Robust Preference Alignment via Directional Neighborhood Consensus [13.313830197011983]
We introduce Robust Preference Selection (RPS), a post-hoc, training-free method by leveraging directional neighborhood consensus. RPS samples multiple responses from a local neighborhood of related preferences to create a superior candidate pool. Our work presents a practical, theoretically-grounded solution for enhancing the reliability of preference-aligned models.
arXiv Detail & Related papers (2025-10-23T12:39:20Z)
Steerable Pluralism: Pluralistic Alignment via Few-Shot Comparative Regression [9.624392327607833]
Large language models (LLMs) are currently aligned using techniques such as reinforcement learning from human feedback. We propose a steerable pluralistic model based on few-shot comparative regression that can adapt to individual user preferences.
arXiv Detail & Related papers (2025-08-11T22:40:31Z)
MiCRo: Mixture Modeling and Context-aware Routing for Personalized Preference Learning [22.154640547329738]
We introduce MiCRo, a two-stage framework that enhances personalized preference learning by leveraging large-scale binary preference datasets. In the first stage, MiCRo introduces a context-aware mixture modeling approach to capture diverse human preferences. In the second stage, MiCRo integrates an online routing strategy that dynamically adapts mixture weights based on specific context to resolve ambiguity.
arXiv Detail & Related papers (2025-05-30T17:44:28Z)
Few-shot Steerable Alignment: Adapting Rewards and LLM Policies with Neural Processes [50.544186914115045]
Large language models (LLMs) are increasingly embedded in everyday applications. Ensuring their alignment with the diverse preferences of individual users has become a critical challenge. We present a novel framework for few-shot steerable alignment.
arXiv Detail & Related papers (2024-12-18T16:14:59Z)
Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback [87.37721254914476]
We introduce a routing framework that combines inputs from humans and LMs to achieve better annotation quality. We train a performance prediction model to predict a reward model's performance on an arbitrary combination of human and LM annotations. We show that the selected hybrid mixture achieves better reward model performance compared to using either one exclusively.
arXiv Detail & Related papers (2024-10-24T20:04:15Z)
ComPO: Community Preferences for Language Model Personalization [122.54846260663922]
ComPO is a method to personalize preference optimization in language models. We collect and release ComPRed, a question answering dataset with community-level preferences from Reddit.
arXiv Detail & Related papers (2024-10-21T14:02:40Z)
LRHP: Learning Representations for Human Preferences via Preference Pairs [45.056558199304554]
We introduce a preference representation learning task that aims to construct a richer and more structured representation of human preferences. We verify the utility of preference representations in two downstream tasks: preference data selection and preference margin prediction.
arXiv Detail & Related papers (2024-10-06T14:48:28Z)
Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning [12.742158403867002]
Reinforcement Learning from Human Feedback is a powerful paradigm for aligning foundation models to human values and preferences. Current RLHF techniques cannot account for the naturally occurring differences in individual human preferences across a diverse population. We develop a class of multimodal RLHF methods to address the need for pluralistic alignment.
arXiv Detail & Related papers (2024-08-19T15:18:30Z)
PAL: Pluralistic Alignment Framework for Learning from Heterogeneous Preferences [6.398937923320069]
We propose PAL, a framework to model human preference complementary to existing pretraining strategies. We show that PAL achieves competitive reward model accuracy compared to strong baselines.
arXiv Detail & Related papers (2024-06-12T17:54:54Z)
Adaptive Preference Scaling for Reinforcement Learning with Human Feedback [103.36048042664768]
Reinforcement learning from human feedback (RLHF) is a prevalent approach to align AI systems with human values. We propose a novel adaptive preference loss, underpinned by distributionally robust optimization (DRO). Our method is versatile and can be readily adapted to various preference optimization frameworks.
arXiv Detail & Related papers (2024-06-04T20:33:22Z)
RLHF from Heterogeneous Feedback via Personalization and Preference Aggregation [24.374185140811115]
Reinforcement learning from human feedback (RLHF) has been an effective technique for aligning AI systems with human values. In this paper, we focus on addressing the issues due to the inherent heterogeneity in human preferences, as well as their potential strategic behavior in providing feedback. We propose two frameworks to address heterogeneous human feedback in principled ways: a personalization-based one and an aggregation-based one.
arXiv Detail & Related papers (2024-04-30T23:57:23Z)
Dissecting Human and LLM Preferences [80.55271307662365]
We find that humans are less sensitive to errors, favor responses that support their stances, and show clear dislike when models admit their limits. Advanced LLMs like GPT-4-Turbo emphasize correctness, clarity, and harmlessness more. We show that preference-based evaluation can be intentionally manipulated.
arXiv Detail & Related papers (2024-02-17T14:34:31Z)
Aligning Crowd Feedback via Distributional Preference Reward Modeling [28.754532173765686]
We propose the Distributional Preference Reward Model (DPRM) to align large language models with diverse human preferences. Our experiments show that DPRM significantly enhances the alignment of LLMs with population preference, yielding more accurate, unbiased, and contextually appropriate responses.
arXiv Detail & Related papers (2024-02-15T07:29:43Z)
Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts [95.09994361995389]
Relative Preference Optimization (RPO) is designed to discern between more and less preferred responses derived from both identical and related prompts. RPO has demonstrated a superior ability to align large language models with user preferences and to improve their adaptability during the training process.
arXiv Detail & Related papers (2024-02-12T22:47:57Z)
On Diversified Preferences of Large Language Model Alignment [51.26149027399505]
This paper presents the first quantitative analysis of the experimental scaling law for reward models with varying sizes. Our analysis reveals that the impact of diversified human preferences depends on both model size and data size. Larger models with sufficient capacity mitigate the negative effects of diverse preferences, while smaller models struggle to accommodate them.
arXiv Detail & Related papers (2023-12-12T16:17:15Z)
Personalized Soups: Personalized Large Language Model Alignment via Post-hoc Parameter Merging [148.77027765872006]
We study the Reinforcement Learning from Personalized Human Feedback (RLPHF) problem. LLMs are aligned to multiple preferences by modeling alignment as a Multi-Objective Reinforcement Learning (MORL) problem. We show that we can achieve personalized alignment by decomposing preferences into multiple dimensions.
arXiv Detail & Related papers (2023-10-17T20:22:13Z)
Everyone Deserves A Reward: Learning Customized Human Preferences [25.28261194665836]
Reward models (RMs) are essential for aligning large language models with human preferences to improve interaction quality. We propose a three-stage customized RM learning scheme, then empirically verify its effectiveness on both general preference datasets and our DSP set. We find several ways to better preserve the general preferring ability while training the customized RMs.
arXiv Detail & Related papers (2023-09-06T16:03:59Z)
Models of human preference for learning reward functions [80.39289349661364]
We learn the reward function from human-generated preferences between pairs of trajectory segments. We find this assumption to be flawed and propose modeling human preferences as informed by each segment's regret. Our proposed regret preference model better predicts real human preferences and also learns reward functions from these preferences that lead to policies that are better human-aligned.
arXiv Detail & Related papers (2022-06-05T17:58:02Z)

This list is automatically generated from the titles and abstracts of the papers on this site.