Aligning Language Models with Human Preferences via a Bayesian Approach
- URL: http://arxiv.org/abs/2310.05782v3
- Date: Sat, 13 Jan 2024 11:37:57 GMT
- Title: Aligning Language Models with Human Preferences via a Bayesian Approach
- Authors: Jiashuo Wang, Haozhao Wang, Shichao Sun, Wenjie Li
- Abstract summary: In the quest to advance human-centric natural language generation (NLG) systems, ensuring alignment between NLG models and human preferences is crucial.
This paper proposes a novel approach that employs a Bayesian framework to account for the distribution of disagreements among human preferences when training a preference model.
Our method consistently exceeds previous SOTA models in both automatic and human evaluations.
- Score: 11.984246334043673
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the quest to advance human-centric natural language generation (NLG)
systems, ensuring alignment between NLG models and human preferences is
crucial. For this alignment, current popular methods leverage a reinforcement
learning (RL) approach with a reward model trained on feedback from humans.
However, inherent disagreements due to the subjective nature of human
preferences pose a significant challenge for training the reward model,
resulting in a deterioration of the NLG performance. To tackle this issue,
previous approaches typically rely on majority voting or averaging to
consolidate multiple inconsistent preferences into a merged one. Although
straightforward to understand and execute, such methods fail to capture the
nuanced degrees of disagreement among humans and may represent only a
specialized subset of individuals, thereby lacking the ability to
quantitatively disclose the universality of human preferences. To address
this challenge, this paper proposes a novel approach that employs a Bayesian
framework to account for the distribution of disagreements among human
preferences when training a preference model, termed d-PM. In addition, given
the inefficiency and complexity of the RL training process, we further propose
utilizing a contrastive learning strategy to train the NLG model with the
preference scores derived from the
d-PM model. Extensive experiments on two human-centric NLG tasks, i.e.,
emotional support conversation and integrity "Rule-of-Thumb" generation, show
that our method consistently exceeds previous SOTA models in both automatic and
human evaluations.
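The paper's own implementation is not reproduced on this page; the following is a minimal PyTorch sketch of the two ideas described in the abstract, under assumed interfaces. It shows (a) a preference model trained against a smoothed empirical distribution of annotator labels rather than a single majority-vote label, which is one plausible reading of the Bayesian treatment of disagreement behind d-PM, and (b) a pairwise contrastive ranking loss that uses the resulting preference scores to train the NLG model instead of RL. All class names, tensor shapes, and hyperparameters are illustrative.

```python
# Minimal sketch (not the authors' code): a preference model trained on the
# distribution of annotator labels, and a contrastive objective that uses its
# scores to train the generator. Names, shapes, and hyperparameters are
# illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PreferenceModel(nn.Module):
    """Maps a pooled (context, response) encoding to K preference levels."""

    def __init__(self, hidden_size: int, num_levels: int = 3):
        super().__init__()
        self.head = nn.Linear(hidden_size, num_levels)

    def forward(self, encodings: torch.Tensor) -> torch.Tensor:
        # encodings: (batch, hidden_size); returns (batch, num_levels) logits
        return self.head(encodings)


def disagreement_aware_loss(logits: torch.Tensor,
                            annotator_counts: torch.Tensor,
                            prior: float = 1.0) -> torch.Tensor:
    """Match the predicted label distribution to a smoothed empirical one.

    Instead of collapsing the annotators' votes into one majority label, the
    target is the posterior-mean label distribution of a Dirichlet-multinomial
    model, (counts + prior) / sum(counts + prior), so the degree of
    disagreement among annotators stays in the training signal.
    """
    target = annotator_counts + prior
    target = target / target.sum(dim=-1, keepdim=True)
    return F.kl_div(F.log_softmax(logits, dim=-1), target,
                    reduction="batchmean")


def contrastive_generation_loss(seq_logprobs: torch.Tensor,
                                pref_scores: torch.Tensor,
                                margin: float = 0.1) -> torch.Tensor:
    """Pairwise ranking loss over candidate responses for one context.

    seq_logprobs: (num_candidates,) length-normalized log-probabilities the
        NLG model assigns to each candidate response.
    pref_scores: (num_candidates,) scalar scores from the preference model,
        e.g. the expected preference level.
    Candidates scored higher by the preference model should also receive
    higher likelihood; pairs that violate this ordering are penalized.
    """
    order = torch.argsort(pref_scores, descending=True)
    ranked_logprobs = seq_logprobs[order]
    losses = []
    for i in range(len(ranked_logprobs)):
        for j in range(i + 1, len(ranked_logprobs)):
            # candidate i is preferred over candidate j; require a
            # rank-scaled margin between their log-probabilities
            losses.append(F.relu(margin * (j - i)
                                 - (ranked_logprobs[i] - ranked_logprobs[j])))
    if not losses:  # fewer than two candidates
        return seq_logprobs.new_zeros(())
    return torch.stack(losses).mean()
```

The Dirichlet-style smoothing and the margin-based ranking loss are stand-ins for the paper's exact objectives; they illustrate the mechanism (keeping annotator disagreement in the training signal and ranking candidate responses by preference score) rather than the authors' precise formulation.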
Related papers
- Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning [12.742158403867002]
Reinforcement Learning from Human Feedback is a powerful paradigm for aligning foundation models to human values and preferences.
Current RLHF techniques cannot account for the naturally occurring differences in individual human preferences across a diverse population.
We develop a class of multimodal RLHF methods to address the need for pluralistic alignment.
arXiv Detail & Related papers (2024-08-19T15:18:30Z)
- Joint Demonstration and Preference Learning Improves Policy Alignment with Human Feedback [58.049113055986375]
We develop a single-stage approach named Alignment with Integrated Human Feedback (AIHF) to train reward models and the policy.
The proposed approach admits a suite of efficient algorithms, which can easily reduce to, and leverage, popular alignment algorithms.
We demonstrate the efficiency of the proposed solutions with extensive experiments involving alignment problems in LLMs and robotic control problems in MuJoCo.
arXiv Detail & Related papers (2024-06-11T01:20:53Z)
- RLHF from Heterogeneous Feedback via Personalization and Preference Aggregation [24.374185140811115]
Reinforcement learning from human feedback (RLHF) has been an effective technique for aligning AI systems with human values.
In this paper, we focus on addressing the issues due to the inherent heterogeneity in human preferences, as well as their potential strategic behavior in providing feedback.
We propose two frameworks to address heterogeneous human feedback in principled ways: a personalization-based one and an aggregation-based one.
arXiv Detail & Related papers (2024-04-30T23:57:23Z)
- MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences [101.57443597426374]
Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a singular reward model derived from preference data.
We learn a mixture of preference distributions via an expectation-maximization algorithm to better represent diverse human preferences.
Our algorithm achieves an average improvement of more than 16% in win-rates over conventional RLHF algorithms.
arXiv Detail & Related papers (2024-02-14T03:56:27Z)
- Secrets of RLHF in Large Language Models Part II: Reward Modeling [134.97964938009588]
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset.
We also introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses.
arXiv Detail & Related papers (2024-01-11T17:56:59Z)
- Aligning Large Language Models with Human Preferences through Representation Engineering [41.81020951061438]
Drawing inspiration from the emerging field of representation engineering (RepE), this study aims to identify relevant representations for high-level human preferences embedded in patterns of activity within an LLM.
This novel approach, denoted as Representation Alignment from Human Feedback (RAHF), proves to be effective, computationally efficient, and easy to implement.
arXiv Detail & Related papers (2023-12-26T11:01:36Z)
- SALMON: Self-Alignment with Instructable Reward Models [80.83323636730341]
This paper presents a novel approach, namely SALMON, to align base language models with minimal human supervision.
We develop an AI assistant named Dromedary-2 with only 6 exemplars for in-context learning and 31 human-defined principles.
arXiv Detail & Related papers (2023-10-09T17:56:53Z)
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model [119.65409513119963]
We introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form.
The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight.
Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods (a minimal sketch of the DPO objective appears after this list).
arXiv Detail & Related papers (2023-05-29T17:57:46Z)
- Weak Human Preference Supervision For Deep Reinforcement Learning [48.03929962249475]
Current reward learning from human preferences can be used to resolve complex reinforcement learning (RL) tasks without access to a reward function.
We propose a weak human preference supervision framework, for which we developed a human preference scaling model.
Our established human-demonstration estimator requires human feedback only for less than 0.01% of the agent's interactions with the environment.
arXiv Detail & Related papers (2020-07-25T10:37:15Z)
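For the Direct Preference Optimization entry above, the standard published pairwise DPO objective can be sketched in a few lines. The function below assumes summed token log-probabilities of the chosen and rejected responses under the trained policy and a frozen reference model; the variable names are illustrative and no library beyond PyTorch is assumed.

```python
# Minimal sketch of the DPO pairwise objective (standard published form, not
# code from any of the listed papers). Inputs are summed token
# log-probabilities of the chosen/rejected responses under the trained policy
# and a frozen reference model.
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```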