Related papers: Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

URL: http://arxiv.org/abs/2307.15217v2
Date: Mon, 11 Sep 2023 17:25:24 GMT
Title: Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Authors: Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Bıyık, Anca Dragan, David Krueger, Dorsa Sadigh, Dylan Hadfield-Menell
Abstract summary: Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. Our work emphasizes the limitations of RLHF and highlights the importance of a multi-faceted approach to the development of safer AI systems.
Score: 46.701165912225086
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. RLHF has emerged as the central method used to finetune state-of-the-art large language models (LLMs). Despite this popularity, there has been relatively little public work systematizing its flaws. In this paper, we (1) survey open problems and fundamental limitations of RLHF and related methods; (2) overview techniques to understand, improve, and complement RLHF in practice; and (3) propose auditing and disclosure standards to improve societal oversight of RLHF systems. Our work emphasizes the limitations of RLHF and highlights the importance of a multi-faceted approach to the development of safer AI systems.

Related papers

Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities [62.05713042908654]
This paper provides a review of advances in Large Language Model (LLM) alignment through the lens of inverse reinforcement learning (IRL). We highlight the necessity of constructing neural reward models from human data and discuss the formal and practical implications of this paradigm shift.
arXiv Detail & Related papers (2025-07-17T14:22:24Z)
One Framework to Rule Them All: Unifying RL-Based and RL-Free Methods in RLHF [2.1212179660694104]
We primarily examine a variety of RL-based and RL-free methods designed to address Reinforcement Learning from Human Feedback (RLHF) and Large Reasoning Models (LRMs). We reinterpret several RL-based and RL-free algorithms through the perspective of neural structured bandit prediction. This leads to a detailed derivation of the standard RLHF objective within a full RL context, demonstrating its equivalence to neural structured bandit prediction.
arXiv Detail & Related papers (2025-03-25T10:23:26Z)
Provably Efficient RLHF Pipeline: A Unified View from Contextual Bandits [59.30310692855397]
We propose a unified framework for the RLHF pipeline from the view of contextual bandits. We decompose the RLHF process into two distinct stages: (post-)training and deployment. We then develop novel algorithms for each stage, demonstrating significant improvements in both statistical and computational efficiency.
arXiv Detail & Related papers (2025-02-11T02:36:01Z)
AI Alignment through Reinforcement Learning from Human Feedback? Contradictions and Limitations [0.2106667480549292]
We show the shortcomings of the broadly pursued alignment goals of honesty, harmlessness, and helpfulness. We highlight tensions and contradictions inherent in the goals of RLxF. We conclude by urging researchers and practitioners alike to critically assess the sociotechnical ramifications of RLxF.
arXiv Detail & Related papers (2024-06-26T13:42:13Z)
Joint Demonstration and Preference Learning Improves Policy Alignment with Human Feedback [58.049113055986375]
We develop a single stage approach named Alignment with Integrated Human Feedback (AIHF) to train reward models and the policy. The proposed approach admits a suite of efficient algorithms, which can easily reduce to, and leverage, popular alignment algorithms. We demonstrate the efficiency of the proposed solutions with extensive experiments involving alignment problems in LLMs and robotic control problems in MuJoCo.
arXiv Detail & Related papers (2024-06-11T01:20:53Z)
OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework [27.336483161388777]
We introduce OpenRLHF, a user-friendly, scalable, and easy-to-learn open-source RLHF framework built upon Ray, vLLM, DeepSpeed, and HuggingFace Transformers. Experimental results show that OpenRLHF achieves superior training efficiency with speedups ranging from 1.22x to 1.68x across different model sizes.
arXiv Detail & Related papers (2024-05-20T01:04:40Z)
RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs [49.386699863989335]
Training large language models (LLMs) to serve as effective assistants for humans requires careful consideration. A promising approach is reinforcement learning from human feedback (RLHF), which leverages human feedback to update the model in accordance with human preferences. In this paper, we analyze RLHF through the lens of reinforcement learning principles to develop an understanding of its fundamentals.
arXiv Detail & Related papers (2024-04-12T15:54:15Z)
Principled Penalty-based Methods for Bilevel Reinforcement Learning and RLHF [82.73541793388]
We introduce the first principled algorithmic framework for solving bilevel RL problems through the lens of penalty formulation. We provide theoretical studies of the problem landscape and its penalty-based gradient (policy) algorithms. We demonstrate the effectiveness of our algorithms via simulations in the Stackelberg Markov game, RL from human feedback and incentive design.
arXiv Detail & Related papers (2024-02-10T04:54:15Z)
A Survey of Reinforcement Learning from Human Feedback [28.92654784501927]
Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning (RL) that learns from human feedback instead of relying on an engineered reward function. This article provides a comprehensive overview of the fundamentals of RLHF, exploring the intricate dynamics between RL agents and human input.
arXiv Detail & Related papers (2023-12-22T18:58:06Z)
SuperHF: Supervised Iterative Learning from Human Feedback [20.22920163075946]
We focus on two prevalent methods used to align large language models, Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF). We propose a novel approach, Supervised Iterative Learning from Human Feedback (SuperHF), which seeks to leverage the strengths of both methods. Our experimental results show SuperHF exceeds PPO-based RLHF on the training objective, easily and favorably trades off high reward with low reward hacking, improves downstream calibration, and performs the same on our GPT-4 based qualitative evaluation scheme, all the while being significantly simpler to implement.
arXiv Detail & Related papers (2023-10-25T16:52:00Z)
Contrastive Preference Learning: Learning from Human Feedback without RL [71.77024922527642]
We introduce Contrastive Preference Learning (CPL), an algorithm for learning optimal policies from preferences without learning reward functions. CPL is fully off-policy, uses only a simple contrastive objective, and can be applied to arbitrary MDPs.
arXiv Detail & Related papers (2023-10-20T16:37:56Z)
The History and Risks of Reinforcement Learning and Human Feedback [0.16843915833103415]
Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique to make large language models easier to use and more effective. A core piece of the RLHF process is the training and utilization of a model of human preferences that acts as a reward function for optimization. RLHF reward models are often cited as being central to achieving performance, yet very few descriptors of capabilities, evaluations, training methods, or open-source models exist.
arXiv Detail & Related papers (2023-10-20T15:45:16Z)
Secrets of RLHF in Large Language Models Part I: PPO [81.01936993929127]
Large language models (LLMs) have formulated a blueprint for the advancement of artificial general intelligence. Reinforcement learning with human feedback (RLHF) emerges as the pivotal technological paradigm underpinning this pursuit. In this report, we dissect the framework of RLHF, re-evaluate the inner workings of PPO, and explore how the parts comprising PPO algorithms impact policy agent training.
arXiv Detail & Related papers (2023-07-11T01:55:24Z)

This list is automatically generated from the titles and abstracts of the papers on this site.