MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with
Diverse Human Preferences
- URL: http://arxiv.org/abs/2402.08925v1
- Date: Wed, 14 Feb 2024 03:56:27 GMT
- Title: MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with
Diverse Human Preferences
- Authors: Souradip Chakraborty, Jiahao Qiu, Hui Yuan, Alec Koppel, Furong Huang,
Dinesh Manocha, Amrit Singh Bedi, and Mengdi Wang
- Abstract summary: Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a singular reward model derived from preference data.
We learn a mixture of preference distributions via an expectation-maximization algorithm to better represent diverse human preferences.
Our algorithm achieves an average improvement of more than 16% in win-rates over conventional RLHF algorithms.
- Score: 101.57443597426374
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement Learning from Human Feedback (RLHF) aligns language models to
human preferences by employing a singular reward model derived from preference
data. However, such an approach overlooks the rich diversity of human
preferences inherent in data collected from multiple users. In this work, we
first derive an impossibility result of alignment with single reward RLHF,
thereby highlighting its insufficiency in representing diverse human
preferences. To provide an equitable solution to the problem, we learn a
mixture of preference distributions via an expectation-maximization algorithm
and propose a MaxMin alignment objective for policy learning inspired by the
Egalitarian principle in social choice theory to better represent diverse human
preferences. We elucidate the connection of our proposed approach to
distributionally robust optimization and general utility RL, thereby
highlighting the generality and robustness of our proposed solution. We present
comprehensive experimental results on small-scale (GPT-2) and large-scale
language models (with Tulu2-7B) and show the efficacy of the proposed approach
in the presence of diversity among human preferences. Our algorithm achieves an
average improvement of more than 16% in win-rates over conventional RLHF
algorithms and improves the win-rate (accuracy) for minority groups by over 33%
without compromising the performance of majority groups, showcasing the
robustness and fairness of our approach. We remark that our findings in this
work are not only limited to language models but also extend to reinforcement
learning in general.
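Below is a minimal, self-contained PyTorch-style sketch of the two ingredients the abstract describes: an E-step that assigns each preference pair soft responsibilities under a mixture of Bradley-Terry reward models, and a MaxMin policy update that maximizes the value of the worst-off group. The embedding-based reward heads, the linear "policy", the tensor shapes, and the hyperparameters are illustrative assumptions rather than the authors' implementation, and responsibilities are computed per preference pair here purely for brevity.

```python
# Illustrative sketch of MaxMin-RLHF's two components, under the assumptions
# stated above (toy embeddings, linear reward heads, linear "policy").
import torch
import torch.nn as nn

torch.manual_seed(0)
EMB_DIM, K = 16, 3  # response-embedding size, number of user groups

# Stand-ins for per-group reward models r_1, ..., r_K.
group_reward_heads = nn.ModuleList([nn.Linear(EMB_DIM, 1) for _ in range(K)])

# Stand-in "policy": scores candidate responses; softmax over them gives pi(y|x).
policy = nn.Linear(EMB_DIM, 1)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def e_step(chosen_embs, rejected_embs, mixture_weights):
    """E-step over a mixture of Bradley-Terry models: posterior over groups per pair."""
    with torch.no_grad():
        # Likelihood of each (chosen, rejected) pair under each group's reward model.
        lik = torch.stack([
            torch.sigmoid(head(chosen_embs).squeeze(-1) - head(rejected_embs).squeeze(-1))
            for head in group_reward_heads
        ], dim=-1)                                    # (num_pairs, K)
        resp = mixture_weights * lik                  # weight * likelihood
        return resp / resp.sum(dim=-1, keepdim=True)  # normalized responsibilities

def maxmin_step(candidate_embs):
    """One ascent step on the egalitarian objective min_k E_{y~pi}[r_k(y|x)]."""
    log_pi = torch.log_softmax(policy(candidate_embs).squeeze(-1), dim=0)
    pi = log_pi.exp()
    # Expected reward of the current policy under each (fixed) group reward model.
    group_values = torch.stack([
        (pi * head(candidate_embs).squeeze(-1).detach()).sum()
        for head in group_reward_heads
    ])
    worst_group_value = group_values.min()   # value of the worst-off group
    loss = -worst_group_value                # maximize it
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return worst_group_value.detach()

# Toy usage with random "embeddings" of preference pairs and candidate responses.
chosen, rejected = torch.randn(10, EMB_DIM), torch.randn(10, EMB_DIM)
responsibilities = e_step(chosen, rejected, torch.full((K,), 1.0 / K))  # (10, K)
for _ in range(5):
    maxmin_step(torch.randn(8, EMB_DIM))
```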
Related papers
- ComPO: Community Preferences for Language Model Personalization [122.54846260663922]
ComPO is a method to personalize preference optimization in language models.
We collect and release ComPRed, a question answering dataset with community-level preferences from Reddit.
arXiv Detail & Related papers (2024-10-21T14:02:40Z)
- GDPO: Learning to Directly Align Language Models with Diversity Using GFlowNets [19.485572131953937]
We propose a practical application of a diversity-seeking RL algorithm called GFlowNet-DPO (GDPO) in an offline preference alignment setting.
Empirical results show GDPO can generate far more diverse responses than the baseline methods.
arXiv Detail & Related papers (2024-10-19T13:07:52Z)
- Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning [12.742158403867002]
Reinforcement Learning from Human Feedback is a powerful paradigm for aligning foundation models to human values and preferences.
Current RLHF techniques cannot account for the naturally occurring differences in individual human preferences across a diverse population.
We develop a class of multimodal RLHF methods to address the need for pluralistic alignment.
arXiv Detail & Related papers (2024-08-19T15:18:30Z)
- Joint Demonstration and Preference Learning Improves Policy Alignment with Human Feedback [58.049113055986375]
We develop a single-stage approach named Alignment with Integrated Human Feedback (AIHF) to jointly train reward models and the policy.
The proposed approach admits a suite of efficient algorithms, which can easily reduce to, and leverage, popular alignment algorithms.
We demonstrate the efficiency of the proposed solutions with extensive experiments involving alignment problems in LLMs and robotic control problems in MuJoCo.
arXiv Detail & Related papers (2024-06-11T01:20:53Z)
- On the Algorithmic Bias of Aligning Large Language Models with RLHF: Preference Collapse and Matching Regularization [33.331389392270665]
Preference matching (PM) RLHF is a novel approach that aligns large language models with the preference distribution of the reward model under the Bradley-Terry-Luce/Plackett-Luce model.
Central to our approach is a PM regularizer that takes the form of the negative logarithm of the LLM's policy probability distribution over responses.
For practical implementation, we introduce a conditional variant of PM RLHF that is tailored to natural language generation.
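The preference-matching idea admits a tiny numerical illustration: for a fixed candidate set, the objective combining expected reward with the -log pi(y|x) regularizer is maximized by pi(y) proportional to exp(r(y)). The reward values below are made-up numbers, and the snippet only sketches the objective's form under that assumption, not the paper's conditional PM RLHF implementation.

```python
# Expected reward plus entropy: sum_y pi(y) * (r(y) - log pi(y)).
# Its maximizer over the simplex is pi(y) ∝ exp(r(y)), i.e. preference matching.
import torch

def pm_value(pi: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    return (pi * (rewards - torch.log(pi))).sum()

rewards = torch.tensor([1.2, 0.3, -0.5])           # made-up reward-model scores
uniform = torch.full_like(rewards, 1.0 / rewards.numel())
matched = torch.softmax(rewards, dim=0)            # pi(y) ∝ exp(r(y))

print(pm_value(uniform, rewards))                  # lower objective value
print(pm_value(matched, rewards))                  # maximized by the matching policy
```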
arXiv Detail & Related papers (2024-05-26T07:00:05Z)
- RLHF from Heterogeneous Feedback via Personalization and Preference Aggregation [24.374185140811115]
Reinforcement learning from human feedback (RLHF) has been an effective technique for aligning AI systems with human values.
In this paper, we focus on addressing the issues due to the inherent heterogeneity in human preferences, as well as their potential strategic behavior in providing feedback.
We propose two frameworks to address heterogeneous human feedback in principled ways: a personalization-based one and an aggregation-based one.
arXiv Detail & Related papers (2024-04-30T23:57:23Z)
- Provable Multi-Party Reinforcement Learning with Diverse Human Feedback [63.830731470186855]
Reinforcement learning with human feedback (RLHF) is an emerging paradigm to align models with human preferences.
We show how traditional RLHF approaches can fail since learning a single reward function cannot capture and balance the preferences of multiple individuals.
We incorporate meta-learning to learn multiple preferences and adopt different social welfare functions to aggregate the preferences across multiple parties.
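As a toy illustration of such aggregation, the snippet below scores one candidate policy's per-party rewards with three common social welfare functions (utilitarian, egalitarian, Nash); the reward values are made-up and the particular choice of welfare functions is an assumption, not the paper's exact setup.

```python
# Aggregating per-party rewards with different social welfare functions.
import numpy as np

party_rewards = np.array([0.9, 0.4, 0.7])   # one (made-up) value per party

utilitarian = party_rewards.mean()           # maximize average welfare
egalitarian = party_rewards.min()            # maximize the worst-off party (Rawlsian / MaxMin)
nash = np.prod(party_rewards) ** (1.0 / len(party_rewards))  # geometric mean (Nash welfare)

print(utilitarian, egalitarian, nash)
```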
arXiv Detail & Related papers (2024-03-08T03:05:11Z)
- Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble [67.4269821365504]
Reinforcement Learning from Human Feedback (RLHF) is a widely adopted approach for aligning large language models with human values.
However, RLHF relies on a reward model that is trained with a limited amount of human preference data.
We contribute a reward ensemble method that allows the reward model to make more accurate predictions.
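A minimal sketch of the ensemble idea, assuming several independently trained reward heads over a shared response embedding; the aggregation rules (plain mean, or mean minus standard deviation as a conservative score) and all names are illustrative assumptions rather than the paper's method.

```python
# Toy reward-model ensemble: score a response with several heads and aggregate.
import torch
import torch.nn as nn

torch.manual_seed(0)
EMB_DIM, N_MODELS = 16, 4
ensemble = [nn.Linear(EMB_DIM, 1) for _ in range(N_MODELS)]   # stand-in reward models

def ensemble_reward(response_emb: torch.Tensor, conservative: bool = True) -> torch.Tensor:
    scores = torch.stack([m(response_emb).squeeze(-1) for m in ensemble])
    if conservative:
        return scores.mean() - scores.std()   # penalize disagreement across members
    return scores.mean()

print(ensemble_reward(torch.randn(EMB_DIM)))
```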
arXiv Detail & Related papers (2024-01-30T00:17:37Z)
- Secrets of RLHF in Large Language Models Part II: Reward Modeling [134.97964938009588]
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset.
We also introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses.
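For context, the snippet below shows the standard pairwise ranking loss that trains a reward model to score chosen above rejected responses; the paper's specific contrastive enhancements are not reproduced here, and the example scores are made-up numbers.

```python
# Standard pairwise (Bradley-Terry style) reward-model loss:
# -log sigmoid(r_chosen - r_rejected), averaged over the batch.
import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    return -F.logsigmoid(r_chosen - r_rejected).mean()

r_chosen = torch.tensor([1.5, 0.2], requires_grad=True)   # scores for chosen responses
r_rejected = torch.tensor([0.3, 0.4])                     # scores for rejected responses
loss = pairwise_reward_loss(r_chosen, r_rejected)
loss.backward()
```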
arXiv Detail & Related papers (2024-01-11T17:56:59Z)
- Aligning Language Models with Human Preferences via a Bayesian Approach [11.984246334043673]
In the quest to advance human-centric natural language generation (NLG) systems, ensuring alignment between NLG models and human preferences is crucial.
This paper proposes a novel approach that employs a Bayesian framework to account for the distribution of disagreements among human preferences when training a preference model.
Our method consistently exceeds previous SOTA models in both automatic and human evaluations.
arXiv Detail & Related papers (2023-10-09T15:15:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.