Weak Human Preference Supervision For Deep Reinforcement Learning
- URL: http://arxiv.org/abs/2007.12904v2
- Date: Sat, 26 Dec 2020 02:02:31 GMT
- Title: Weak Human Preference Supervision For Deep Reinforcement Learning
- Authors: Zehong Cao, KaiChiu Wong, Chin-Teng Lin
- Abstract summary: Reward learning from human preferences can solve complex reinforcement learning (RL) tasks without access to a reward function.
We propose a weak human preference supervision framework built around a human preference scaling model.
Our human-demonstration estimator requires human feedback for less than 0.01% of the agent's interactions with the environment.
- Score: 48.03929962249475
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reward learning from human preferences can be used to solve
complex reinforcement learning (RL) tasks without access to a reward function
by defining a single fixed preference between pairs of trajectory segments.
However, this judgement of preferences between trajectories is not dynamic and
still requires human input over thousands of iterations. In this study, we
propose a weak human preference supervision framework for which we developed a
human preference scaling model that naturally reflects the human perception of
the degree of weak choices between trajectories, together with a
human-demonstration estimator, trained via supervised learning, that generates
predicted preferences to reduce the number of human inputs. The proposed weak
human preference supervision framework can effectively solve complex RL tasks
and achieve higher cumulative rewards in simulated robot locomotion (MuJoCo
games) than single fixed human preferences. Furthermore, our established
human-demonstration estimator requires human feedback for less than 0.01%
of the agent's interactions with the environment and reduces the cost of human
input by up to 30% compared with existing approaches. To demonstrate the
flexibility of our approach, we released a video
(https://youtu.be/jQPe1OILT0M) showing comparisons of the behaviours of agents
trained on different types of human input. We believe that our naturally
inspired human preferences with weakly supervised learning are beneficial for
precise reward learning and can be applied to state-of-the-art RL systems,
such as human-autonomy teaming systems.
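The recipe described above is, at its core, preference-based reward learning in which the human label is a graded ("weak") preference rather than a hard binary choice, and a supervised human-demonstration estimator supplies most of those labels. Below is a minimal PyTorch sketch of the reward-model update under such scaled labels, assuming a Bradley-Terry style preference model with soft targets; the network architecture, segment shapes and the `mu` values are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn as nn

# Minimal sketch (not the authors' code): a reward model trained on *scaled*
# preference labels mu in [0, 1] rather than hard 0/1 choices.

class RewardModel(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        # obs: (batch, T, obs_dim), act: (batch, T, act_dim)
        r = self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)  # per-step reward
        return r.sum(dim=1)  # return of each trajectory segment

def weak_preference_loss(reward_model, seg_a, seg_b, mu):
    """Bradley-Terry style loss with soft targets.

    mu is the scaled preference for segment A over segment B, e.g. 1.0 for a
    strong preference, 0.75 for a weak one, 0.5 for indifference.
    """
    ra, rb = reward_model(*seg_a), reward_model(*seg_b)
    p_a = torch.sigmoid(ra - rb)  # P(A preferred over B) under the learned reward
    # soft cross-entropy against the scaled human (or estimator-predicted) label
    return -(mu * torch.log(p_a + 1e-8) + (1 - mu) * torch.log(1 - p_a + 1e-8)).mean()
```

In this picture, the human-demonstration estimator would be a separate supervised model that predicts `mu` for unlabelled segment pairs, so that genuine human feedback is needed for only a small fraction of the agent's interactions with the environment.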
Related papers
- Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning [12.742158403867002]
Reinforcement Learning from Human Feedback is a powerful paradigm for aligning foundation models to human values and preferences.
Current RLHF techniques cannot account for the naturally occurring differences in individual human preferences across a diverse population.
We develop a class of multimodal RLHF methods to address the need for pluralistic alignment.
arXiv Detail & Related papers (2024-08-19T15:18:30Z)
- MEReQ: Max-Ent Residual-Q Inverse RL for Sample-Efficient Alignment from Intervention [81.56607128684723]
We introduce MEReQ (Maximum-Entropy Residual-Q Inverse Reinforcement Learning), designed for sample-efficient alignment from human intervention.
MEReQ infers a residual reward function that captures the discrepancy between the human expert's and the prior policy's underlying reward functions.
It then employs Residual Q-Learning (RQL) to align the policy with human preferences using this residual reward function.
arXiv Detail & Related papers (2024-06-24T01:51:09Z)
- MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences [101.57443597426374]
Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a singular reward model derived from preference data.
We learn a mixture of preference distributions via an expectation-maximization algorithm to better represent diverse human preferences (a generic EM sketch for such a mixture appears after this list).
Our algorithm achieves an average improvement of more than 16% in win-rates over conventional RLHF algorithms.
arXiv Detail & Related papers (2024-02-14T03:56:27Z)
- Contrastive Preference Learning: Learning from Human Feedback without RL [71.77024922527642]
We introduce Contrastive Preference Learning (CPL), an algorithm for learning optimal policies from preferences without learning reward functions.
CPL is fully off-policy, uses only a simple contrastive objective, and can be applied to arbitrary MDPs.
arXiv Detail & Related papers (2023-10-20T16:37:56Z)
- Aligning Language Models with Human Preferences via a Bayesian Approach [11.984246334043673]
In the quest to advance human-centric natural language generation (NLG) systems, ensuring alignment between NLG models and human preferences is crucial.
This paper proposes a novel approach that employs a Bayesian framework to account for the distribution of disagreements among human preferences when training a preference model.
Our method consistently outperforms previous SOTA models in both automatic and human evaluations.
arXiv Detail & Related papers (2023-10-09T15:15:05Z)
- Humans are not Boltzmann Distributions: Challenges and Opportunities for Modelling Human Feedback and Interaction in Reinforcement Learning [13.64577704565643]
We argue that such Boltzmann-style models of human feedback are too simplistic and that RL researchers need to develop more realistic human models to design and evaluate their algorithms.
This paper calls for research from different disciplines to address key questions about how humans provide feedback to AIs and how we can build more robust human-in-the-loop RL systems.
arXiv Detail & Related papers (2022-06-27T13:58:51Z)
- Reward Uncertainty for Exploration in Preference-based Reinforcement Learning [88.34958680436552]
We present an exploration method specifically for preference-based reinforcement learning algorithms.
Our main idea is to design an intrinsic reward that measures novelty based on the learned reward.
Our experiments show that an exploration bonus derived from uncertainty in the learned reward improves both the feedback- and sample-efficiency of preference-based RL algorithms (a sketch of this ensemble-disagreement bonus appears after this list).
arXiv Detail & Related papers (2022-05-24T23:22:10Z)
- Skill Preferences: Learning to Extract and Execute Robotic Skills from Human Feedback [82.96694147237113]
We present Skill Preferences, an algorithm that learns a model over human preferences and uses it to extract human-aligned skills from offline data.
We show that SkiP enables a simulated kitchen robot to solve complex multi-step manipulation tasks.
arXiv Detail & Related papers (2021-08-11T18:04:08Z)
- Human-guided Robot Behavior Learning: A GAN-assisted Preference-based Reinforcement Learning Approach [2.9764834057085716]
We propose a new GAN-assisted human preference-based reinforcement learning approach.
It uses a generative adversarial network (GAN) to actively learn human preferences and then replaces the human's role in assigning preferences.
Our method achieves a reduction of about 99.8% in human time without sacrificing performance.
arXiv Detail & Related papers (2020-10-15T01:44:06Z)
- Deep reinforcement learning from human preferences [19.871618959160692]
We explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments.
We show that this approach can effectively solve complex RL tasks without access to the reward function.
This reduces the cost of human oversight far enough that it can be practically applied to state-of-the-art RL systems.
arXiv Detail & Related papers (2017-06-12T17:23:59Z)
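Two of the ideas above lend themselves to compact illustrations. First, the exploration bonus from "Reward Uncertainty for Exploration in Preference-based Reinforcement Learning" can be sketched as disagreement across an ensemble of learned reward models; the ensemble size and the bonus scale `beta` below are illustrative assumptions rather than the paper's settings.

```python
import torch
import torch.nn as nn

# Sketch: an uncertainty-based exploration bonus for preference-based RL,
# computed as the disagreement (standard deviation) of an ensemble of
# learned reward models. Shapes and hyperparameters are placeholders.

class RewardEnsemble(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, n_members: int = 3, hidden: int = 64):
        super().__init__()
        self.members = nn.ModuleList(
            nn.Sequential(
                nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )
            for _ in range(n_members)
        )

    def forward(self, obs, act):
        x = torch.cat([obs, act], dim=-1)
        return torch.stack([m(x).squeeze(-1) for m in self.members], dim=0)  # (K, batch)

def shaped_reward(ensemble, obs, act, beta: float = 0.05):
    """Learned reward plus an uncertainty-driven exploration bonus."""
    preds = ensemble(obs, act)
    mean_r = preds.mean(dim=0)    # consensus estimate of the learned reward
    bonus = preds.std(dim=0)      # ensemble disagreement = epistemic uncertainty
    return mean_r + beta * bonus  # reward the policy actually optimises
```

Second, the expectation-maximization step mentioned for MaxMin-RLHF can be illustrated with a generic EM loop over a mixture of Bradley-Terry reward models. This assumes each reward model maps a batch of response (or segment) features to one scalar score per item; it is a textbook-style sketch, not the authors' exact procedure.

```python
import torch
import torch.nn.functional as F

def bt_log_likelihood(reward_model, winners, losers):
    # log P(winner preferred over loser) under one Bradley-Terry reward model
    return F.logsigmoid(reward_model(winners) - reward_model(losers))  # (n_pairs,)

def em_step(reward_models, optimizers, mix_weights, winners, losers, inner_steps=50):
    # E-step: responsibility of each mixture component for each preference pair
    with torch.no_grad():
        log_lik = torch.stack(
            [bt_log_likelihood(m, winners, losers) for m in reward_models]
        )  # (K, n_pairs)
        resp = torch.softmax(torch.log(mix_weights).unsqueeze(1) + log_lik, dim=0)

    # M-step: re-estimate mixing weights, then fit each component by weighted NLL
    new_weights = resp.mean(dim=1)
    for k, (model, opt) in enumerate(zip(reward_models, optimizers)):
        for _ in range(inner_steps):
            loss = -(resp[k] * bt_log_likelihood(model, winners, losers)).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return new_weights
```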