Multi-turn Reinforcement Learning from Preference Human Feedback
- URL: http://arxiv.org/abs/2405.14655v1
- Date: Thu, 23 May 2024 14:53:54 GMT
- Title: Multi-turn Reinforcement Learning from Preference Human Feedback
- Authors: Lior Shani, Aviv Rosenberg, Asaf Cassel, Oran Lang, Daniele Calandriello, Avital Zipori, Hila Noga, Orgad Keller, Bilal Piot, Idan Szpektor, Avinatan Hassidim, Yossi Matias, Rémi Munos
- Abstract summary: Reinforcement Learning from Human Feedback (RLHF) has become the standard approach for aligning Large Language Models with human preferences.
Existing methods work by emulating the preferences at the single decision (turn) level.
We develop novel methods for Reinforcement Learning from preference feedback between two full multi-turn conversations.
- Score: 41.327438095745315
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement Learning from Human Feedback (RLHF) has become the standard approach for aligning Large Language Models (LLMs) with human preferences, allowing LLMs to demonstrate remarkable abilities in various tasks. Existing methods work by emulating the preferences at the single decision (turn) level, limiting their capabilities in settings that require planning or multi-turn interactions to achieve a long-term goal. In this paper, we address this issue by developing novel methods for Reinforcement Learning (RL) from preference feedback between two full multi-turn conversations. In the tabular setting, we present a novel mirror-descent-based policy optimization algorithm for the general multi-turn preference-based RL problem, and prove its convergence to Nash equilibrium. To evaluate performance, we create a new environment, Education Dialogue, where a teacher agent guides a student in learning a random topic, and show that a deep RL variant of our algorithm outperforms RLHF baselines. Finally, we show that in an environment with explicit rewards, our algorithm recovers the same performance as a reward-based RL baseline, despite relying solely on a weaker preference signal.
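The abstract's mirror-descent idea can be illustrated with a minimal sketch. It is not the paper's algorithm: it collapses the multi-turn problem to a single decision, and the preference matrix `PREF`, the step size `eta`, and the function names are all hypothetical choices made for illustration. Each step reweights the current policy by how often each action is preferred over the policy's own play (self-play) and then renormalizes, which is the standard exponentiated-gradient form of mirror descent under a KL divergence.

```python
import math

# Hypothetical 3-action preference matrix (an assumption, not from the paper):
# PREF[a][b] is the probability that action a is preferred over action b,
# so PREF[a][b] + PREF[b][a] == 1 and the diagonal is 0.5.
PREF = [
    [0.5, 0.7, 0.9],
    [0.3, 0.5, 0.8],
    [0.1, 0.2, 0.5],
]


def win_rates(pref, policy):
    """Probability that each action is preferred over a draw from `policy`."""
    return [sum(p_b * row[b] for b, p_b in enumerate(policy)) for row in pref]


def mirror_descent_self_play(pref, steps=200, eta=1.0):
    """Exponentiated-gradient (KL mirror descent) self-play on a preference matrix."""
    n = len(pref)
    policy = [1.0 / n] * n  # start from the uniform policy
    for _ in range(steps):
        wins = win_rates(pref, policy)
        # Multiplicative update: upweight actions that beat the current policy.
        weights = [p * math.exp(eta * w) for p, w in zip(policy, wins)]
        total = sum(weights)
        policy = [w / total for w in weights]
    return policy
```

In this toy matrix, action 0 is preferred over every other action, so playing it deterministically is the Nash equilibrium of the induced symmetric game, and `mirror_descent_self_play(PREF)` concentrates its probability mass there. Preference matrices with cycles generally need additional regularization or iterate averaging, which this sketch omits.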
Related papers
- Joint Demonstration and Preference Learning Improves Policy Alignment with Human Feedback [58.049113055986375]
We develop a single stage approach named Alignment with Integrated Human Feedback (AIHF) to train reward models and the policy.
The proposed approach admits a suite of efficient algorithms, which can easily reduce to, and leverage, popular alignment algorithms.
We demonstrate the efficiency of the proposed solutions with extensive experiments involving alignment problems in LLMs and robotic control problems in MuJoCo.
arXiv Detail & Related papers (2024-06-11T01:20:53Z)
- LIRE: listwise reward enhancement for preference alignment [27.50204023448716]
We propose a gradient-based reward optimization approach that incorporates the offline rewards of multiple responses into a streamlined listwise framework.
LIRE is straightforward to implement, requiring minimal parameter tuning, and seamlessly aligns with the pairwise paradigm.
Our experiments demonstrate that LIRE consistently outperforms existing methods across several benchmarks on dialogue and summarization tasks.
arXiv Detail & Related papers (2024-05-22T10:21:50Z)
- MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences [101.57443597426374]
Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a singular reward model derived from preference data.
We learn a mixture of preference distributions via an expectation-maximization algorithm to better represent diverse human preferences.
Our algorithm achieves an average improvement of more than 16% in win-rates over conventional RLHF algorithms.
arXiv Detail & Related papers (2024-02-14T03:56:27Z)
- Nash Learning from Human Feedback [86.09617990412941]
We introduce an alternative pipeline for the fine-tuning of large language models using pairwise human feedback.
We term this approach Nash learning from human feedback (NLHF).
We present a novel algorithmic solution, Nash-MD, founded on the principles of mirror descent.
arXiv Detail & Related papers (2023-12-01T19:26:23Z)
- Contrastive Preference Learning: Learning from Human Feedback without RL [71.77024922527642]
We introduce Contrastive Preference Learning (CPL), an algorithm for learning optimal policies from preferences without learning reward functions.
CPL is fully off-policy, uses only a simple contrastive objective, and can be applied to arbitrary MDPs.
arXiv Detail & Related papers (2023-10-20T16:37:56Z)
- Text Generation with Efficient (Soft) Q-Learning [91.47743595382758]
Reinforcement learning (RL) offers a more flexible solution by allowing users to plug in arbitrary task metrics as reward.
We introduce a new RL formulation for text generation from the soft Q-learning perspective.
We apply the approach to a wide range of tasks, including learning from noisy/negative examples, adversarial attacks, and prompt generation.
arXiv Detail & Related papers (2021-06-14T18:48:40Z)