Few-Shot Preference Learning for Human-in-the-Loop RL
- URL: http://arxiv.org/abs/2212.03363v1
- Date: Tue, 6 Dec 2022 23:12:26 GMT
- Title: Few-Shot Preference Learning for Human-in-the-Loop RL
- Authors: Joey Hejna, Dorsa Sadigh
- Abstract summary: Motivated by the success of meta-learning, we pre-train preference models on prior task data and quickly adapt them for new tasks using only a handful of queries.
We reduce the amount of online feedback needed to train manipulation policies in Meta-World by 20$\times$, and demonstrate the effectiveness of our method on a real Franka Panda Robot.
- Score: 13.773589150740898
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While reinforcement learning (RL) has become a more popular approach for
robotics, designing sufficiently informative reward functions for complex tasks
has proven to be extremely difficult due to their inability to capture human
intent and policy exploitation. Preference-based RL algorithms seek to overcome
these challenges by directly learning reward functions from human feedback.
Unfortunately, prior work either requires an unreasonable number of queries
implausible for any human to answer or overly restricts the class of reward
functions to guarantee the elicitation of the most informative queries,
resulting in models that are insufficiently expressive for realistic robotics
tasks. Contrary to most works that focus on query selection to \emph{minimize}
the amount of data required for learning reward functions, we take an opposite
approach: \emph{expanding} the pool of available data by viewing
human-in-the-loop RL through the more flexible lens of multi-task learning.
Motivated by the success of meta-learning, we pre-train preference models on
prior task data and quickly adapt them for new tasks using only a handful of
queries. Empirically, we reduce the amount of online feedback needed to train
manipulation policies in Meta-World by 20$\times$, and demonstrate the
effectiveness of our method on a real Franka Panda Robot. Moreover, this
reduction in query-complexity allows us to train robot policies from actual
human users. Videos of our results and code can be found at
https://sites.google.com/view/few-shot-preference-rl/home.
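The following is a minimal sketch, not the authors' released implementation, of the Bradley-Terry-style preference model that this line of work builds on: a reward network is trained on binary comparisons between trajectory segments, pre-trained on a pool of prior-task comparisons, and then adapted with only a handful of new-task queries. All class names, network sizes, and the example dimensions below are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of a Bradley-Terry preference model that is
# pre-trained on prior-task comparisons and then adapted to a new task with only a
# handful of queries. Network sizes, names, and data shapes are illustrative assumptions.
import torch
import torch.nn as nn


class PreferenceRewardModel(nn.Module):
    """Maps (state, action) pairs to a scalar reward estimate."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def segment_return(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        # obs: (batch, T, obs_dim), act: (batch, T, act_dim)
        x = torch.cat([obs, act], dim=-1)
        return self.net(x).squeeze(-1).sum(dim=-1)  # summed reward per segment, (batch,)


def preference_loss(model, seg_a, seg_b, label):
    """Bradley-Terry cross-entropy: label=1 means segment A is preferred."""
    ret_a = model.segment_return(*seg_a)
    ret_b = model.segment_return(*seg_b)
    logits = ret_a - ret_b
    return nn.functional.binary_cross_entropy_with_logits(logits, label.float())


def train(model, batches, steps, lr=3e-4):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        for seg_a, seg_b, label in batches:
            opt.zero_grad()
            loss = preference_loss(model, seg_a, seg_b, label)
            loss.backward()
            opt.step()
    return model


# Usage sketch: pre-train on a large pool of prior-task comparisons, then fine-tune on a
# handful of queries from the new task before (or while) running RL against the adapted
# reward model. Dimensions below are illustrative.
# model = PreferenceRewardModel(obs_dim=39, act_dim=4)
# train(model, prior_task_batches, steps=100)        # multi-task pre-training
# train(model, new_task_batches, steps=10, lr=1e-4)  # few-shot adaptation
```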
Related papers
- Leveraging Sub-Optimal Data for Human-in-the-Loop Reinforcement Learning [7.07264650720021]
Sub-optimal Data Pre-training (SDP) is an approach that leverages reward-free, sub-optimal data to improve HitL RL algorithms.
We show that SDP can significantly improve upon, or achieve competitive performance with, state-of-the-art HitL RL algorithms.
arXiv Detail & Related papers (2024-04-30T18:58:33Z)
- PREDILECT: Preferences Delineated with Zero-Shot Language-based Reasoning in Reinforcement Learning [2.7387720378113554]
Preference-based reinforcement learning (RL) has emerged as a new field in robot learning.
We use the zero-shot capabilities of a large language model (LLM) to reason from the text provided by humans.
In both a simulated scenario and a user study, we reveal the effectiveness of our work by analyzing the feedback and its implications.
arXiv Detail & Related papers (2024-02-23T16:30:05Z)
- REBEL: A Regularization-Based Solution for Reward Overoptimization in Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and user intentions, values, or social norms can be catastrophic in the real world.
Current methods to mitigate this misalignment work by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z)
- REBOOT: Reuse Data for Bootstrapping Efficient Real-World Dexterous Manipulation [61.7171775202833]
We introduce an efficient system for learning dexterous manipulation skills with reinforcement learning.
The main idea of our approach is the integration of recent advances in sample-efficient RL and replay buffer bootstrapping.
Our system completes the real-world training cycle by incorporating learned resets via an imitation-based pickup policy.
arXiv Detail & Related papers (2023-09-06T19:05:31Z)
- Reward Uncertainty for Exploration in Preference-based Reinforcement Learning [88.34958680436552]
We present an exploration method designed specifically for preference-based reinforcement learning algorithms.
Our main idea is to design an intrinsic reward that measures novelty based on the learned reward (a minimal sketch of this idea appears after this list).
Our experiments show that an exploration bonus derived from uncertainty in the learned reward improves both the feedback- and sample-efficiency of preference-based RL algorithms.
arXiv Detail & Related papers (2022-05-24T23:22:10Z)
- Robot Learning of Mobile Manipulation with Reachability Behavior Priors [38.49783454634775]
Mobile Manipulation (MM) systems are ideal candidates for taking up the role of a personal assistant in unstructured real-world environments.
Among other challenges, MM requires effective coordination of the robot's embodiments for executing tasks that require both mobility and manipulation.
We study the integration of robotic reachability priors in actor-critic RL methods for accelerating the learning of MM for reaching and fetching tasks.
arXiv Detail & Related papers (2022-03-08T12:44:42Z)
- Accelerating Robotic Reinforcement Learning via Parameterized Action Primitives [92.0321404272942]
Reinforcement learning can be used to build general-purpose robotic systems.
However, training RL agents to solve robotics tasks still remains challenging.
In this work, we manually specify a library of robot action primitives (RAPS), parameterized with arguments that are learned by an RL policy.
We find that our simple change to the action interface substantially improves both the learning efficiency and task performance.
arXiv Detail & Related papers (2021-10-28T17:59:30Z)
- PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via Relabeling Experience and Unsupervised Pre-training [94.87393610927812]
We present an off-policy, interactive reinforcement learning algorithm that capitalizes on the strengths of both feedback and off-policy learning.
We demonstrate that our approach is capable of learning tasks of higher complexity than previously considered by human-in-the-loop methods.
arXiv Detail & Related papers (2021-06-09T14:10:50Z)
- A Framework for Efficient Robotic Manipulation [79.10407063260473]
We show that, given only 10 demonstrations, a single robotic arm can learn sparse-reward manipulation policies from pixels.
arXiv Detail & Related papers (2020-12-14T22:18:39Z)
- Active Preference-Based Gaussian Process Regression for Reward Learning [42.697198807877925]
One common approach is to learn reward functions from collected expert demonstrations.
We present a preference-based learning approach in which, as an alternative, human feedback is given only in the form of comparisons between trajectories.
Our approach enables us to tackle both inflexibility and data-inefficiency problems within a preference-based learning framework.
arXiv Detail & Related papers (2020-05-06T03:29:27Z)
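Following up on the Reward Uncertainty for Exploration entry above, here is a minimal sketch of one common way to realize such an uncertainty-based exploration bonus in preference-based RL: train an ensemble of reward models on the preference data and add their disagreement to the mean predicted reward. The class, the ensemble size, and the bonus weight `beta` are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (assumptions, not the paper's code) of an uncertainty-based exploration
# bonus for preference-based RL: the intrinsic reward is the disagreement (standard
# deviation) of an ensemble of learned reward models, added to the mean learned reward.
import torch
import torch.nn as nn


def make_reward_net(obs_dim: int, act_dim: int, hidden: int = 128) -> nn.Module:
    return nn.Sequential(
        nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, 1),
    )


class EnsembleRewardBonus:
    """Wraps an ensemble of reward models trained on preference data."""

    def __init__(self, obs_dim: int, act_dim: int, n_models: int = 3, beta: float = 0.05):
        self.models = [make_reward_net(obs_dim, act_dim) for _ in range(n_models)]
        self.beta = beta  # weight of the exploration bonus (illustrative value)

    @torch.no_grad()
    def reward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        x = torch.cat([obs, act], dim=-1)
        preds = torch.stack([m(x).squeeze(-1) for m in self.models], dim=0)
        # Mean prediction serves as the task reward; ensemble disagreement is the bonus.
        return preds.mean(dim=0) + self.beta * preds.std(dim=0)
```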
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.