Sample Efficient Reinforcement Learning from Human Feedback via Active
Exploration
- URL: http://arxiv.org/abs/2312.00267v1
- Date: Fri, 1 Dec 2023 00:54:02 GMT
- Title: Sample Efficient Reinforcement Learning from Human Feedback via Active
Exploration
- Authors: Viraj Mehta and Vikramjeet Das and Ojash Neopane and Yijia Dai and
Ilija Bogunovic and Jeff Schneider and Willie Neiswanger
- Abstract summary: Preference-based feedback is important for many applications in reinforcement learning.
In this work, we take advantage of the fact that one can often choose contexts to obtain human feedback.
We show that our method is able to reach better performance with fewer samples of human preferences than multiple baselines.
- Score: 29.935758027209292
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Preference-based feedback is important for many applications in reinforcement
learning where direct evaluation of a reward function is not feasible. A
notable recent example arises in reinforcement learning from human feedback
(RLHF) on large language models. For many applications of RLHF, the cost of
acquiring the human feedback can be substantial. In this work, we take
advantage of the fact that one can often choose contexts at which to obtain
human feedback in order to most efficiently identify a good policy, and
formalize this as an offline contextual dueling bandit problem. We give an
upper-confidence-bound style algorithm for this problem and prove a polynomial
worst-case regret bound. We then provide empirical confirmation in a synthetic
setting that our approach outperforms existing methods. Afterwards, we extend the
setting and methodology for practical use in RLHF training of large language
models. Here, our method is able to reach better performance with fewer samples
of human preferences than multiple baselines on three real-world datasets.
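The abstract names an upper-confidence-bound (UCB) style algorithm but includes no pseudocode. Below is a minimal sketch of that idea under an assumed linear reward model; the class and function names (LinearUCBDueling, select_context, featurize) are hypothetical, not the authors' implementation, and the paper itself works with more general confidence sets.

```python
import numpy as np

class LinearUCBDueling:
    """Hypothetical sketch: UCB-style active context selection for an
    offline contextual dueling bandit, assuming a linear reward model
    r(x, a) ~= featurize(x, a) @ theta."""

    def __init__(self, dim, reg=1.0, beta=2.0):
        self.A = reg * np.eye(dim)  # regularized design matrix
        self.b = np.zeros(dim)      # accumulated preference signal
        self.beta = beta            # confidence-width multiplier

    def _theta(self):
        return np.linalg.solve(self.A, self.b)

    def _width(self, phi):
        # beta * sqrt(phi^T A^{-1} phi): uncertainty of the reward estimate
        return self.beta * np.sqrt(phi @ np.linalg.solve(self.A, phi))

    def select_context(self, contexts, actions, featurize):
        # Query the context where the gap between the optimistic (UCB) and
        # pessimistic (LCB) value of the best action is widest, i.e. where
        # one human preference is most informative about the optimal policy.
        theta = self._theta()

        def gap(x):
            feats = [featurize(x, a) for a in actions]
            ucb = max(f @ theta + self._width(f) for f in feats)
            lcb = max(f @ theta - self._width(f) for f in feats)
            return ucb - lcb

        return max(contexts, key=gap)

    def update(self, phi_winner, phi_loser):
        # One duel outcome enters the regression as a feature difference.
        z = phi_winner - phi_loser
        self.A += np.outer(z, z)
        self.b += z
```

The sketch only conveys the query-where-uncertain principle; the paper proves its regret bound in a more general setting and extends the method to choosing prompts during RLHF training.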
Related papers
- Self-Evolved Reward Learning for LLMs [45.6910747154447]
Reinforcement Learning from Human Feedback (RLHF) is a crucial technique for aligning language models with human preferences.
We propose Self-Evolved Reward Learning (SER), a novel approach where the reward model (RM) generates additional training data to iteratively improve itself.
Our results demonstrate that even with limited human-annotated data, learning from self-feedback can robustly enhance RM performance.
arXiv Detail & Related papers (2024-11-01T07:29:03Z)
- Navigating Noisy Feedback: Enhancing Reinforcement Learning with Error-Prone Language Models [8.025808955214957]
This paper studies the advantages and limitations of reinforcement learning from large language model feedback.
We propose a simple yet effective method for soliciting and applying feedback as a potential-based shaping function (a minimal shaping sketch appears after this list).
arXiv Detail & Related papers (2024-10-22T19:52:08Z)
- Optimal Design for Reward Modeling in RLHF [83.3614658277817]
We formalize reward model training in Reinforcement Learning from Human Feedback.
We frame the selection of an effective dataset as a simple regret minimization task.
We derive bounds on the simple regret under appropriate assumptions.
arXiv Detail & Related papers (2024-10-22T14:36:44Z)
- Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble [67.4269821365504]
Reinforcement Learning from Human Feedback (RLHF) is a widely adopted approach for aligning large language models with human values.
However, RLHF relies on a reward model that is trained with a limited amount of human preference data.
We contribute a reward ensemble method that allows the reward model to make more accurate predictions (an illustrative aggregation sketch appears after this list).
arXiv Detail & Related papers (2024-01-30T00:17:37Z)
- Constructive Large Language Models Alignment with Diverse Feedback [76.9578950893839]
We introduce Constructive and Diverse Feedback (CDF) as a novel method to enhance large language models alignment.
We exploit critique feedback for easy problems, refinement feedback for medium problems, and preference feedback for hard problems.
By training our model with this diversified feedback, we achieve enhanced alignment performance while using less training data.
arXiv Detail & Related papers (2023-10-10T09:20:14Z)
- UltraFeedback: Boosting Language Models with Scaled AI Feedback [99.4633351133207]
We present UltraFeedback, a large-scale, high-quality, and diversified AI feedback dataset.
Our work validates the effectiveness of scaled AI feedback data in constructing strong open-source chat language models.
arXiv Detail & Related papers (2023-10-02T17:40:01Z)
- Provable Benefits of Policy Learning from Human Preferences in Contextual Bandit Problems [82.92678837778358]
Preference-based methods have demonstrated substantial success in empirical applications such as InstructGPT.
We show how human bias and uncertainty in feedback modeling can affect the theoretical guarantees of these approaches.
arXiv Detail & Related papers (2023-07-24T17:50:24Z)
- Kernelized Offline Contextual Dueling Bandits [15.646879026749168]
In this work, we take advantage of the fact that the agent can often choose contexts at which to obtain human feedback.
We give an upper-confidence-bound style algorithm for this setting and prove a regret bound.
arXiv Detail & Related papers (2023-07-21T01:17:31Z)
- SLiC-HF: Sequence Likelihood Calibration with Human Feedback [35.74135968442311]
We show how the recently introduced Sequence Likelihood Calibration (SLiC) can also be used to effectively learn from human preferences (a sketch of the calibration loss appears after this list).
Experiments on the TL;DR summarization task show that SLiC-HF significantly improves supervised fine-tuning baselines.
arXiv Detail & Related papers (2023-05-17T17:57:10Z)
- Chain of Hindsight Aligns Language Models with Feedback [62.68665658130472]
We propose a novel technique, Chain of Hindsight, that is easy to optimize and can learn from any form of feedback, regardless of its polarity.
We convert all types of feedback into sequences of sentences, which are then used to fine-tune the model.
By doing so, the model is trained to generate outputs based on feedback, while learning to identify and correct negative attributes or errors.
arXiv Detail & Related papers (2023-02-06T10:28:16Z)
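For the error-prone-LLM-feedback entry above, the named technique is potential-based reward shaping, which has a standard form: adding F(s, s') = gamma * Phi(s') - Phi(s) to the environment reward provably leaves optimal policies unchanged (Ng et al., 1999). A minimal sketch, assuming a hypothetical LLM-derived potential phi; none of this is the paper's own code:

```python
# Potential-based reward shaping: r' = r + gamma * phi(s_next) - phi(s).
# For any bounded potential phi, the shaped problem has the same optimal
# policies as the original, so even a noisy LLM-derived phi can only
# change learning speed, not which policy is optimal.
def shaped_reward(reward, s, s_next, phi, gamma=0.99):
    return reward + gamma * phi(s_next) - phi(s)

# Hypothetical usage with an (error-prone) LLM judgment of state quality:
# phi = lambda state: llm_score(describe(state))  # assumed helper, cached in practice
```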
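For the reward-ensemble entry, one common way to aggregate an ensemble conservatively is to score a response with every reward model and penalize disagreement. The summary does not say which aggregation the paper uses, so treat mean-minus-std below as an assumed variant:

```python
import torch

def ensemble_reward(reward_models, prompt_ids, response_ids, k=1.0):
    # Score one (prompt, response) pair with each reward model (assumed to be
    # callables returning a scalar tensor per example) and combine them as
    # mean - k * std: responses the ensemble disagrees on are down-weighted,
    # which mitigates over-optimizing against any single reward model.
    with torch.no_grad():
        scores = torch.stack([rm(prompt_ids, response_ids) for rm in reward_models])
    return scores.mean(dim=0) - k * scores.std(dim=0)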
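For the SLiC-HF entry, the core objective is a rank-calibration loss on sequence log-likelihoods. The hinge form below follows the SLiC papers; the exact margin and regularization used in SLiC-HF may differ:

```python
import torch

def slic_rank_loss(logp_chosen, logp_rejected, delta=1.0):
    # Hinge-style calibration: push the policy's sequence log-likelihood of
    # the human-preferred response above the rejected one by margin delta.
    # logp_* are summed token log-probabilities, shape (batch,).
    return torch.clamp(delta - logp_chosen + logp_rejected, min=0.0).mean()
```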