Aligning Crowd Feedback via Distributional Preference Reward Modeling
- URL: http://arxiv.org/abs/2402.09764v3
- Date: Thu, 30 May 2024 15:39:17 GMT
- Title: Aligning Crowd Feedback via Distributional Preference Reward Modeling
- Authors: Dexun Li, Cong Zhang, Kuicai Dong, Derrick Goh Xin Deik, Ruiming Tang, Yong Liu,
- Abstract summary: We propose the Distributional Preference Reward Model (DPRM) to align large language models with diverse human preferences.
Our experiments show that DPRM significantly enhances the alignment of LLMs with population preference, yielding more accurate, unbiased, and contextually appropriate responses.
- Score: 28.754532173765686
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep Reinforcement Learning is widely used for aligning Large Language Models (LLM) with human preference. However, the conventional reward modelling is predominantly dependent on human annotations provided by a select cohort of individuals. Such dependence may unintentionally result in skewed models that reflect the inclinations of these annotators, thereby failing to adequately represent the wider population's expectations. We propose the Distributional Preference Reward Model (DPRM), a simple yet effective framework to align large language models with diverse human preferences. To this end, we characterize multiple preferences by a categorical distribution and introduce a Bayesian updater to accommodate shifted or new preferences. On top of that, we design an optimal-transportation-based loss to calibrate DPRM to align with the preference distribution. Finally, the expected reward is utilized to fine-tune an LLM policy to generate responses favoured by the population. Our experiments show that DPRM significantly enhances the alignment of LLMs with population preference, yielding more accurate, unbiased, and contextually appropriate responses.
Related papers
- Rethinking Diverse Human Preference Learning through Principal Component Analysis [22.123631189289963]
We introduce Decomposed Reward Models (DRMs), a novel approach that extracts diverse human preferences from binary comparisons.
Our key insight is to represent human preferences as vectors and analyze them using Principal Component Analysis (PCA)
DRMs effectively extract meaningful preference dimensions (e.g., helpfulness, safety, humor) and adapt to new users without additional training.
arXiv Detail & Related papers (2025-02-18T18:55:26Z) - Calibrated Multi-Preference Optimization for Aligning Diffusion Models [92.90660301195396]
Calibrated Preference Optimization (CaPO) is a novel method to align text-to-image (T2I) diffusion models.
CaPO incorporates the general preference from multiple reward models without human annotated data.
Experimental results show that CaPO consistently outperforms prior methods.
arXiv Detail & Related papers (2025-02-04T18:59:23Z) - No Preference Left Behind: Group Distributional Preference Optimization [46.98320272443297]
Group Distribution Preference Optimization (GDPO) is a novel framework that aligns language models with the distribution of preferences within a group.
GDPO calibrates a language model using statistical estimation of the group's belief distribution.
GDPO consistently reduces this alignment gap during training.
arXiv Detail & Related papers (2024-12-28T23:30:47Z) - ComPO: Community Preferences for Language Model Personalization [122.54846260663922]
ComPO is a method to personalize preference optimization in language models.
We collect and release ComPRed, a question answering dataset with community-level preferences from Reddit.
arXiv Detail & Related papers (2024-10-21T14:02:40Z) - Quantile Regression for Distributional Reward Models in RLHF [1.8130068086063336]
We introduce Quantile Reward Models (QRMs), a novel approach to reward modeling that learns a distribution over rewards instead of a single scalar value.
Our method uses quantile regression to estimate a full, potentially multimodal distribution over preferences, providing a more powerful and nuanced representation of preferences.
Our experimental results show that QRM outperforms comparable traditional point-estimate models on RewardBench.
arXiv Detail & Related papers (2024-09-16T10:54:04Z) - Deep Bayesian Active Learning for Preference Modeling in Large Language Models [84.817400962262]
We propose the Bayesian Active Learner for Preference Modeling (BAL-PM) for Preference Modeling.
BAL-PM requires 33% to 68% fewer preference labels in two popular human preference datasets and exceeds previous Bayesian acquisition policies.
Our experiments demonstrate that BAL-PM requires 33% to 68% fewer preference labels in two popular human preference datasets and exceeds previous Bayesian acquisition policies.
arXiv Detail & Related papers (2024-06-14T13:32:43Z) - Robust Preference Optimization through Reward Model Distillation [68.65844394615702]
Language model (LM) post-training involves maximizing a reward function that is derived from preference annotations.
DPO is a popular offline alignment method that trains a policy directly on preference data without the need to train a reward model or apply reinforcement learning.
We analyze this phenomenon and propose distillation to get a better proxy for the true preference distribution over generation pairs.
arXiv Detail & Related papers (2024-05-29T17:39:48Z) - Multi-Reference Preference Optimization for Large Language Models [56.84730239046117]
We introduce a novel closed-form formulation for direct preference optimization using multiple reference models.
The resulting algorithm, Multi-Reference Preference Optimization (MRPO), leverages broader prior knowledge from diverse reference models.
Our experiments demonstrate that LLMs finetuned with MRPO generalize better in various preference data, regardless of data scarcity or abundance.
arXiv Detail & Related papers (2024-05-26T00:29:04Z) - MaxMin-RLHF: Alignment with Diverse Human Preferences [101.57443597426374]
Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a singular reward model derived from preference data.
We learn a mixture of preference distributions via an expectation-maximization algorithm to better represent diverse human preferences.
Our algorithm achieves an average improvement of more than 16% in win-rates over conventional RLHF algorithms.
arXiv Detail & Related papers (2024-02-14T03:56:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.