Active Preference-Based Gaussian Process Regression for Reward Learning
- URL: http://arxiv.org/abs/2005.02575v2
- Date: Wed, 3 Jun 2020 23:08:00 GMT
- Title: Active Preference-Based Gaussian Process Regression for Reward Learning
- Authors: Erdem Bıyık, Nicolas Huynh, Mykel J. Kochenderfer, Dorsa Sadigh
- Abstract summary: One common approach is to learn reward functions from collected expert demonstrations.
As an alternative, we present a preference-based learning approach in which human feedback is given only in the form of comparisons between trajectories.
Our approach tackles both the inflexibility and data-inefficiency problems within a preference-based learning framework.
- Score: 42.697198807877925
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Designing reward functions is a challenging problem in AI and robotics.
Humans usually have a difficult time directly specifying all the desirable
behaviors that a robot needs to optimize. One common approach is to learn
reward functions from collected expert demonstrations. However, learning reward
functions from demonstrations introduces many challenges: some methods require
highly structured models, e.g., reward functions that are linear in some
predefined set of features, while others adopt less structured reward functions
that, on the other hand, require a tremendous amount of data. In addition, humans
tend to have a difficult time providing demonstrations on robots with high
degrees of freedom, or even quantifying reward values for given demonstrations.
To address these challenges, we present a preference-based learning approach
in which, as an alternative, human feedback is given only in the form of comparisons
between trajectories. Furthermore, we do not assume highly constrained
structures on the reward function. Instead, we model the reward function using
a Gaussian Process (GP) and propose a mathematical formulation to actively find
a GP using only human preferences. Our approach enables us to tackle both the
inflexibility and data-inefficiency problems within a preference-based learning
framework. Our results in simulations and a user study suggest that our
approach can efficiently learn expressive reward functions for robotics tasks.
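The recipe the abstract describes (a GP prior on the reward over trajectories, pairwise human comparisons as the only feedback, and active selection of the next comparison query) can be sketched compactly. The code below is a minimal illustration under stated assumptions rather than the paper's formulation: it uses an RBF kernel, a logistic (Bradley-Terry) preference likelihood, MAP inference by gradient ascent, and a simple maximum-uncertainty heuristic in place of the paper's acquisition rule; the trajectory features and the simulated human responses are synthetic, and NumPy is the only dependency.

```python
# Minimal sketch of active preference-based GP reward learning.
# Assumptions, not the paper's method: RBF kernel, logistic (Bradley-Terry)
# preference likelihood, MAP inference by gradient ascent in a whitened
# parameterization, and a maximum-uncertainty query heuristic standing in for
# the paper's acquisition rule. Trajectory features are synthetic.
import numpy as np

def rbf_kernel(X, Y, lengthscale=2.0, variance=1.0):
    sqdist = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sqdist / lengthscale ** 2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_rewards(K, prefs, n_iters=3000, lr=0.02):
    """MAP estimate of latent rewards f ~ GP(0, K) given pairwise preferences.

    prefs is a list of (i, j) pairs meaning trajectory i was preferred over j.
    Uses the whitened parameterization f = L w (K = L L^T) so plain gradient
    ascent on the log posterior stays well conditioned.
    """
    n = K.shape[0]
    L = np.linalg.cholesky(K + 1e-6 * np.eye(n))
    w = np.zeros(n)
    for _ in range(n_iters):
        f = L @ w
        grad_f = np.zeros(n)
        for i, j in prefs:
            p = sigmoid(f[i] - f[j])      # P(i preferred over j | f)
            grad_f[i] += 1.0 - p
            grad_f[j] -= 1.0 - p
        w += lr * (L.T @ grad_f - w)      # gradient of [log-likelihood - 0.5 ||w||^2]
    return L @ w

def next_query(f, candidate_pairs):
    """Ask about the pair whose outcome is most uncertain under the current fit."""
    probs = np.array([sigmoid(f[i] - f[j]) for i, j in candidate_pairs])
    return candidate_pairs[int(np.argmin(np.abs(probs - 0.5)))]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))                   # hypothetical trajectory features
    true_reward = X @ np.array([1.0, -0.5, 0.3])   # hidden reward, for simulation only
    K = rbf_kernel(X, X)

    pairs = [(i, j) for i in range(len(X)) for j in range(i)]
    prefs, f = [], np.zeros(len(X))
    for _ in range(15):                            # active preference-query loop
        i, j = next_query(f, pairs)
        pairs.remove((i, j))
        prefs.append((i, j) if true_reward[i] > true_reward[j] else (j, i))
        f = map_rewards(K, prefs)

    print("correlation with true reward:",
          round(float(np.corrcoef(f, true_reward)[0, 1]), 2))
```

Most of the design effort in this setting goes into the query selection; the paper derives its own acquisition criterion for choosing informative comparisons, which the discrete maximum-uncertainty stand-in above only gestures at.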
Related papers
- Adaptive Language-Guided Abstraction from Contrastive Explanations [53.48583372522492]
It is necessary to determine which features of the environment are relevant before determining how these features should be used to compute reward.
End-to-end methods for joint feature and reward learning often yield brittle reward functions that are sensitive to spurious state features.
This paper describes a method named ALGAE, which uses language models to iteratively identify human-meaningful features.
arXiv Detail & Related papers (2024-09-12T16:51:58Z) - Learning Reward for Robot Skills Using Large Language Models via Self-Alignment [11.639973274337274]
Large Language Models (LLM) contain valuable task-related knowledge that can potentially aid in the learning of reward functions.
We propose a method to learn rewards more efficiently in the absence of humans.
arXiv Detail & Related papers (2024-05-12T04:57:43Z) - Few-Shot Preference Learning for Human-in-the-Loop RL [13.773589150740898]
Motivated by the success of meta-learning, we pre-train preference models on prior task data and quickly adapt them for new tasks using only a handful of queries.
We reduce the amount of online feedback needed to train manipulation policies in Meta-World by 20×, and demonstrate the effectiveness of our method on a real Franka Panda robot.
arXiv Detail & Related papers (2022-12-06T23:12:26Z) - Learning Reward Functions for Robotic Manipulation by Observing Humans [92.30657414416527]
We use unlabeled videos of humans solving a wide range of manipulation tasks to learn a task-agnostic reward function for robotic manipulation policies.
The learned rewards are based on distances to a goal in an embedding space learned using a time-contrastive objective (a small distance-based reward sketch appears after this list).
arXiv Detail & Related papers (2022-11-16T16:26:48Z) - Learning Preferences for Interactive Autonomy [1.90365714903665]
This thesis is an attempt to learn reward functions from human users by using other, more reliable data modalities.
We first propose various forms of comparative feedback, e.g., pairwise comparisons, best-of-many choices, rankings, and scaled comparisons, and describe how a robot can use these forms of human feedback to infer a reward function.
arXiv Detail & Related papers (2022-10-19T21:34:51Z) - Basis for Intentions: Efficient Inverse Reinforcement Learning using
Past Experience [89.30876995059168]
This paper addresses the problem of inverse reinforcement learning (IRL): inferring the reward function of an agent from observing its behavior.
arXiv Detail & Related papers (2022-08-09T17:29:49Z) - Generative Adversarial Reward Learning for Generalized Behavior Tendency
Inference [71.11416263370823]
We propose a generative inverse reinforcement learning approach for user behavioral preference modelling.
Our model automatically learns rewards from the user's actions using a discriminative actor-critic network and a Wasserstein GAN.
arXiv Detail & Related papers (2021-05-03T13:14:25Z) - Replacing Rewards with Examples: Example-Based Policy Search via
Recursive Classification [133.20816939521941]
In the standard Markov decision process formalism, users specify tasks by writing down a reward function.
In many scenarios, the user is unable to describe the task in words or numbers, but can readily provide examples of what the world would look like if the task were solved.
Motivated by this observation, we derive a control algorithm that aims to visit states that have a high probability of leading to successful outcomes, given only examples of successful outcome states (a simplified classifier-based sketch of this idea appears after this list).
arXiv Detail & Related papers (2021-03-23T16:19:55Z) - Learning Reward Functions from Diverse Sources of Human Feedback:
Optimally Integrating Demonstrations and Preferences [14.683631546064932]
We present a framework to integrate multiple sources of information, which are either passively or actively collected from human users.
In particular, we present an algorithm that first utilizes user demonstrations to initialize a belief about the reward function, and then actively probes the user with preference queries to zero in on their true reward (a minimal sketch of this demonstrations-then-preferences flow appears after this list).
Our approach accounts for the human's ability to provide data, yielding user-friendly preference queries that are also theoretically optimal.
arXiv Detail & Related papers (2020-06-24T22:45:27Z)
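The distance-based reward sketch referenced in the "Learning Reward Functions for Robotic Manipulation by Observing Humans" entry above: a hedged stand-in where the encoder is a random placeholder rather than a network trained with a time-contrastive objective on human videos, and all names and shapes are hypothetical.

```python
# Reward as negative distance to the goal in an embedding space.
# The encoder here is a random placeholder standing in for a learned
# time-contrastive network (not reproduced here).
import numpy as np

def encode(obs, W):
    """Hypothetical stand-in for a learned time-contrastive encoder."""
    return np.tanh(W @ obs)

def embedding_reward(obs, goal_obs, W):
    """Higher (less negative) the closer the observation's embedding is to the goal's."""
    return -np.linalg.norm(encode(obs, W) - encode(goal_obs, W))

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))                  # placeholder encoder weights
obs, goal = rng.normal(size=16), rng.normal(size=16)
print(embedding_reward(obs, goal, W), embedding_reward(goal, goal, W))  # second is 0.0
```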
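The classifier-based sketch referenced in the "Replacing Rewards with Examples" entry: a simplification that fits a logistic success classifier to user-provided outcome examples and uses its log-probability as a surrogate reward. The paper's recursive-classification algorithm is not reproduced, and the data is synthetic.

```python
# Surrogate reward from examples of successful outcome states: fit a logistic
# classifier separating success examples from states the agent has visited,
# then reward states the classifier considers likely to be successful.
import numpy as np

rng = np.random.default_rng(0)
success_states = rng.normal(loc=2.0, size=(50, 2))    # user-provided outcome examples
visited_states = rng.normal(loc=0.0, size=(200, 2))   # states encountered so far

X = np.vstack([success_states, visited_states])
y = np.concatenate([np.ones(len(success_states)), np.zeros(len(visited_states))])
w, b = np.zeros(2), 0.0
for _ in range(3000):                                 # plain logistic regression
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w += 0.1 * X.T @ (y - p) / len(y)
    b += 0.1 * float((y - p).mean())

def surrogate_reward(state):
    """Log-probability of success under the classifier, used in place of a reward."""
    logit = state @ w + b
    return -np.log1p(np.exp(-logit))                  # log sigmoid(logit)

print(surrogate_reward(np.array([2.0, 2.0])), surrogate_reward(np.array([-2.0, -2.0])))
```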
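The demonstrations-then-preferences sketch referenced in the "Learning Reward Functions from Diverse Sources of Human Feedback" entry: a sample-based belief over linear reward weights, initialized with a Boltzmann demonstration likelihood and refined with logistic preference updates. The likelihoods, features, and random (rather than actively optimized) queries are all assumptions for illustration.

```python
# Sketch: initialize a belief over linear reward weights from a demonstration,
# then refine it with preference queries. The Boltzmann demonstration
# likelihood, logistic preference likelihood, and synthetic features are
# illustrative assumptions; queries are random, not actively optimized.
import numpy as np

rng = np.random.default_rng(1)
d = 3
true_w = np.array([1.0, -0.5, 0.3])          # hidden reward, used only to simulate the human
w_samples = rng.normal(size=(5000, d))       # sampled prior belief over reward weights
log_post = np.zeros(len(w_samples))

# 1) Demonstration: the human shows the best of several candidate trajectories,
#    modeled with a Boltzmann (soft-rational) choice likelihood.
candidates = rng.normal(size=(11, d))        # candidate trajectory features
demo_idx = int(np.argmax(candidates @ true_w))
scores = w_samples @ candidates.T            # (num samples, num candidates)
log_post += scores[:, demo_idx] - np.log(np.exp(scores).sum(axis=1))

# 2) Preference queries: "is trajectory A better than B?", logistic likelihood.
for _ in range(5):
    phi_a, phi_b = rng.normal(size=d), rng.normal(size=d)
    delta = w_samples @ (phi_a - phi_b)
    sign = 1.0 if true_w @ (phi_a - phi_b) > 0 else -1.0   # simulated answer
    log_post += -np.log1p(np.exp(-sign * delta))           # log sigmoid of the choice

weights = np.exp(log_post - log_post.max())
weights /= weights.sum()
print("posterior mean reward weights:", weights @ w_samples)
```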
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.