Jackpot! Alignment as a Maximal Lottery
- URL: http://arxiv.org/abs/2501.19266v1
- Date: Fri, 31 Jan 2025 16:26:28 GMT
- Title: Jackpot! Alignment as a Maximal Lottery
- Authors: Roberto-Rafael Maura-Rivero, Marc Lanctot, Francesco Visin, Kate Larson,
- Abstract summary: We propose the use of a probabilistic Social Choice rule called emphmaximal lotteries as a replacement for RLHF.
We show that a family of alignment techniques, namely Nash Learning from Human Feedback (NLHF) citemunos2023nash and variants, approximate maximal lottery outcomes and thus inherit its beneficial properties.
- Score: 13.984371386519424
- License:
- Abstract: Reinforcement Learning from Human Feedback (RLHF), the standard for aligning Large Language Models (LLMs) with human values, is known to fail to satisfy properties that are intuitively desirable, such as respecting the preferences of the majority \cite{ge2024axioms}. To overcome these issues, we propose the use of a probabilistic Social Choice rule called \emph{maximal lotteries} as a replacement for RLHF. We show that a family of alignment techniques, namely Nash Learning from Human Feedback (NLHF) \cite{munos2023nash} and variants, approximate maximal lottery outcomes and thus inherit its beneficial properties. We confirm experimentally that our proposed methodology handles situations that arise when working with preferences more robustly than standard RLHF, including supporting the preferences of the majority, providing principled ways of handling non-transitivities in the preference data, and robustness to irrelevant alternatives. This results in systems that better incorporate human values and respect human intentions.
Related papers
- Optimal Design for Reward Modeling in RLHF [83.3614658277817]
We formalize the reward training model in Reinforcement Learning from Human Feedback.
We frame the selection of an effective dataset as a simple regret minimization task.
We derive bounds on the simple regret under appropriate assumptions.
arXiv Detail & Related papers (2024-10-22T14:36:44Z) - Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning [12.742158403867002]
Reinforcement Learning from Human Feedback is a powerful paradigm for aligning foundation models to human values and preferences.
Current RLHF techniques cannot account for the naturally occurring differences in individual human preferences across a diverse population.
We develop a class of multimodal RLHF methods to address the need for pluralistic alignment.
arXiv Detail & Related papers (2024-08-19T15:18:30Z) - Robust Reinforcement Learning from Corrupted Human Feedback [86.17030012828003]
Reinforcement learning from human feedback (RLHF) provides a principled framework for aligning AI systems with human preference data.
We propose a robust RLHF approach -- $R3M$, which models the potentially corrupted preference label as sparse outliers.
Our experiments on robotic control and natural language generation with large language models (LLMs) show that $R3M$ improves robustness of the reward against several types of perturbations to the preference data.
arXiv Detail & Related papers (2024-06-21T18:06:30Z) - Aligning Large Language Models from Self-Reference AI Feedback with one General Principle [61.105703857868775]
We propose a self-reference-based AI feedback framework that enables a 13B Llama2-Chat to provide high-quality feedback.
Specifically, we allow the AI to first respond to the user's instructions, then generate criticism of other answers based on its own response as a reference.
Finally, we determine which answer better fits human preferences according to the criticism.
arXiv Detail & Related papers (2024-06-17T03:51:46Z) - RLHF from Heterogeneous Feedback via Personalization and Preference Aggregation [24.374185140811115]
Reinforcement learning from human feedback (RLHF) has been an effective technique for aligning AI systems with human values.
In this paper, we focus on addressing the issues due to the inherent heterogeneity in human preferences, as well as their potential strategic behavior in providing feedback.
We propose two frameworks to address heterogeneous human feedback in principled ways: personalization-based one and aggregation-based one.
arXiv Detail & Related papers (2024-04-30T23:57:23Z) - Provable Multi-Party Reinforcement Learning with Diverse Human Feedback [63.830731470186855]
Reinforcement learning with human feedback (RLHF) is an emerging paradigm to align models with human preferences.
We show how traditional RLHF approaches can fail since learning a single reward function cannot capture and balance the preferences of multiple individuals.
We incorporate meta-learning to learn multiple preferences and adopt different social welfare functions to aggregate the preferences across multiple parties.
arXiv Detail & Related papers (2024-03-08T03:05:11Z) - MaxMin-RLHF: Alignment with Diverse Human Preferences [101.57443597426374]
Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a singular reward model derived from preference data.
We learn a mixture of preference distributions via an expectation-maximization algorithm to better represent diverse human preferences.
Our algorithm achieves an average improvement of more than 16% in win-rates over conventional RLHF algorithms.
arXiv Detail & Related papers (2024-02-14T03:56:27Z) - KTO: Model Alignment as Prospect Theoretic Optimization [67.44320255397506]
Kahneman & Tversky's $textitprospect theory$ tells us that humans perceive random variables in a biased but well-defined manner.
We show that objectives for aligning LLMs with human feedback implicitly incorporate many of these biases.
We propose a HALO that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences.
arXiv Detail & Related papers (2024-02-02T10:53:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.