Learning Reward Functions from Diverse Sources of Human Feedback:
Optimally Integrating Demonstrations and Preferences
- URL: http://arxiv.org/abs/2006.14091v2
- Date: Wed, 4 Aug 2021 07:57:25 GMT
- Title: Learning Reward Functions from Diverse Sources of Human Feedback:
Optimally Integrating Demonstrations and Preferences
- Authors: Erdem Bıyık, Dylan P. Losey, Malayandi Palan, Nicholas C.
Landolfi, Gleb Shevchuk, Dorsa Sadigh
- Abstract summary: We present a framework to integrate multiple sources of information, which are either passively or actively collected from human users.
In particular, we present an algorithm that first utilizes user demonstrations to initialize a belief about the reward function, and then actively probes the user with preference queries to zero in on their true reward.
Our approach accounts for the human's ability to provide data, yielding user-friendly preference queries that are also theoretically optimal.
- Score: 14.683631546064932
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reward functions are a common way to specify the objective of a robot. As
designing reward functions can be extremely challenging, a more promising
approach is to directly learn reward functions from human teachers.
Importantly, data from human teachers can be collected either passively or
actively in a variety of forms: passive data sources include demonstrations
(e.g., kinesthetic guidance), whereas preferences (e.g., comparative rankings)
are actively elicited. Prior research has independently applied reward learning
to these different data sources. However, there exist many domains where
multiple sources are complementary and expressive. Motivated by this general
problem, we present a framework to integrate multiple sources of information,
which are either passively or actively collected from human users. In
particular, we present an algorithm that first utilizes user demonstrations to
initialize a belief about the reward function, and then actively probes the
user with preference queries to zero in on their true reward. This algorithm
not only enables us to combine multiple data sources, but it also informs the
robot when it should leverage each type of information. Further, our approach
accounts for the human's ability to provide data, yielding user-friendly
preference queries that are also theoretically optimal. Our extensive
simulated experiments and user studies on a Fetch mobile manipulator
demonstrate the superiority and the usability of our integrated framework.
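To make the two-stage approach concrete, the sketch below is a minimal NumPy illustration, assuming a linear reward over hand-crafted trajectory features, a Boltzmann-rational model for both demonstrations and preference answers, a discrete grid of weight hypotheses, and greedy information-gain query selection. Every name, constant, and simplification here is an illustrative assumption, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothesis set over reward weights w (reward of a trajectory = w @ features).
# A grid/particle approximation keeps the Bayesian belief update simple.
W = rng.normal(size=(500, 3))
W /= np.linalg.norm(W, axis=1, keepdims=True)
belief = np.full(len(W), 1.0 / len(W))          # uniform prior

def demo_likelihood(W, demo_feats, alt_feats, beta=5.0):
    """Boltzmann-rational demo model: the demonstrated trajectory is
    exponentially more likely than each alternative under the true reward."""
    r_demo = W @ demo_feats                      # (num_hypotheses,)
    r_alts = W @ alt_feats.T                     # (num_hypotheses, num_alts)
    return np.exp(beta * r_demo) / (np.exp(beta * r_demo) +
                                    np.exp(beta * r_alts).sum(axis=1))

def pref_likelihood(W, feats_a, feats_b, beta=5.0):
    """Probability the user prefers trajectory A over B (logistic model)."""
    return 1.0 / (1.0 + np.exp(-beta * (W @ (feats_a - feats_b))))

# --- Stage 1: initialize the belief from a passive demonstration ------------
demo_feats = np.array([0.9, 0.1, 0.4])           # features of the demo
alt_feats = rng.normal(size=(20, 3))             # other feasible trajectories
belief *= demo_likelihood(W, demo_feats, alt_feats)
belief /= belief.sum()

# --- Stage 2: actively pick the preference query with highest info gain -----
def info_gain(belief, p_a):
    """Mutual information between the query answer and the reward hypothesis."""
    def H(p):
        p = np.clip(p, 1e-12, 1 - 1e-12)
        return -(p * np.log(p) + (1 - p) * np.log(1 - p))
    p_answer_a = np.sum(belief * p_a)            # marginal P(answer = A)
    return H(p_answer_a) - np.sum(belief * H(p_a))

# Candidate queries: pairs of trajectory feature vectors (illustrative only).
candidate_queries = [(rng.normal(size=3), rng.normal(size=3)) for _ in range(50)]
scores = [info_gain(belief, pref_likelihood(W, fa, fb))
          for fa, fb in candidate_queries]
feats_a, feats_b = candidate_queries[int(np.argmax(scores))]

# Update the belief with the (simulated) answer and repeat as needed.
answer_is_a = True
p_a = pref_likelihood(W, feats_a, feats_b)
belief *= p_a if answer_is_a else (1 - p_a)
belief /= belief.sum()
print("MAP reward weights:", W[np.argmax(belief)])
```

In practice the candidate trajectories and their features would come from the robot's environment, and the paper's query-selection objective additionally accounts for how easy each query is for the human to answer; the grid of weight hypotheses is only a convenience for this sketch.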
Related papers
- Adaptive Language-Guided Abstraction from Contrastive Explanations [53.48583372522492]
It is necessary to determine which features of the environment are relevant before determining how these features should be used to compute reward.
End-to-end methods for joint feature and reward learning often yield brittle reward functions that are sensitive to spurious state features.
This paper describes a method named ALGAE, which alternates between using language models to iteratively identify human-meaningful features and learning a reward function defined over those features.
arXiv Detail & Related papers (2024-09-12T16:51:58Z)
- Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning [12.742158403867002]
Reinforcement Learning from Human Feedback is a powerful paradigm for aligning foundation models to human values and preferences.
Current RLHF techniques cannot account for the naturally occurring differences in individual human preferences across a diverse population.
We develop a class of multimodal RLHF methods to address the need for pluralistic alignment.
arXiv Detail & Related papers (2024-08-19T15:18:30Z)
- A Generalized Acquisition Function for Preference-based Reward Learning [12.158619866176487]
Preference-based reward learning is a popular technique for teaching robots and autonomous systems how a human user wants them to perform a task.
Previous works have shown that actively synthesizing preference queries to maximize information gain about the reward function parameters improves data efficiency.
We show that it is possible to optimize for learning the reward function only up to a behavioral equivalence class, such as inducing the same ranking over behaviors, distribution over choices, or other related definitions of what makes two rewards similar (a sketch of this idea appears after this list).
arXiv Detail & Related papers (2024-03-09T20:32:17Z)
- Reinforcement Learning from Passive Data via Latent Intentions [86.4969514480008]
We show that passive data can still be used to learn features that accelerate downstream RL.
Our approach learns from passive data by modeling intentions.
Our experiments demonstrate the ability to learn from many forms of passive data, including cross-embodiment video data and YouTube videos.
arXiv Detail & Related papers (2023-04-10T17:59:05Z)
- Learning Preferences for Interactive Autonomy [1.90365714903665]
This thesis focuses on learning reward functions from human users by using other, more reliable data modalities.
We first propose various forms of comparative feedback (e.g., pairwise comparisons, best-of-many choices, rankings, and scaled comparisons) and describe how a robot can use each of these forms of human feedback to infer a reward function.
arXiv Detail & Related papers (2022-10-19T21:34:51Z)
- Denoised MDPs: Learning World Models Better Than the World Itself [94.74665254213588]
This work categorizes information out in the wild into four types based on controllability and relation to reward, and defines useful information as that which is both controllable and reward-relevant.
Experiments on variants of DeepMind Control Suite and RoboDesk demonstrate superior performance of our denoised world model over using raw observations alone.
arXiv Detail & Related papers (2022-06-30T17:59:49Z)
- Invariance in Policy Optimisation and Partial Identifiability in Reward Learning [67.4640841144101]
We characterise the partial identifiability of the reward function given popular reward learning data sources.
We also analyse the impact of this partial identifiability for several downstream tasks, such as policy optimisation.
arXiv Detail & Related papers (2022-03-14T20:19:15Z)
- Learning Multimodal Rewards from Rankings [7.266985088439535]
We go beyond learning a unimodal reward and focus on learning a multimodal reward function.
We formulate multimodal reward learning as a mixture learning problem (see the mixture sketch after this list).
We conduct experiments and user studies using a multi-task variant of OpenAI's LunarLander and a real Fetch robot.
arXiv Detail & Related papers (2021-09-27T01:22:01Z)
- Modeling Shared Responses in Neuroimaging Studies through MultiView ICA [94.31804763196116]
Group studies involving large cohorts of subjects are important to draw general conclusions about brain functional organization.
We propose a novel MultiView Independent Component Analysis model for group studies, where data from each subject are modeled as a linear combination of shared independent sources plus noise.
We demonstrate the usefulness of our approach first on fMRI data, where our model demonstrates improved sensitivity in identifying common sources among subjects.
arXiv Detail & Related papers (2020-06-11T17:29:53Z)
- Active Preference-Based Gaussian Process Regression for Reward Learning [42.697198807877925]
One common approach is to learn reward functions from collected expert demonstrations.
We present a preference-based learning approach where, as an alternative, the human feedback comes only in the form of comparisons between trajectories.
Our approach enables us to tackle both inflexibility and data-inefficiency problems within a preference-based learning framework.
arXiv Detail & Related papers (2020-05-06T03:29:27Z)
- Multi-Center Federated Learning [62.57229809407692]
This paper proposes a novel multi-center aggregation mechanism for federated learning.
It learns multiple global models from the non-IID user data and simultaneously derives the optimal matching between users and centers.
Our experimental results on benchmark datasets show that our method outperforms several popular federated learning methods.
arXiv Detail & Related papers (2020-05-03T09:14:31Z)
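For the "A Generalized Acquisition Function for Preference-based Reward Learning" entry above, the sketch below shows one way to make "learning the reward only up to a behavioral equivalence class" concrete: two reward hypotheses are treated as equivalent when they induce the same ranking over a fixed set of evaluation behaviors, and queries are scored by the information they carry about that equivalence class rather than about the raw parameters. The equivalence relation, the logistic response model, and all names are assumptions for illustration, not the paper's exact acquisition function.

```python
import numpy as np

rng = np.random.default_rng(2)

# Belief over reward hypotheses, as in standard preference-based reward learning.
W = rng.normal(size=(400, 3))
W /= np.linalg.norm(W, axis=1, keepdims=True)
belief = np.full(len(W), 1.0 / len(W))

# A small set of evaluation behaviors; two rewards are treated as equivalent
# if they induce the same ranking over these behaviors.
eval_feats = rng.normal(size=(4, 3))
ranking = [tuple(np.argsort(-(W[i] @ eval_feats.T))) for i in range(len(W))]
labels = {lab: j for j, lab in enumerate(sorted(set(ranking)))}
z = np.array([labels[lab] for lab in ranking])   # equivalence class per hypothesis

def pref_prob(W, fa, fb, beta=5.0):
    """Logistic probability that the user prefers option A over option B."""
    return 1.0 / (1.0 + np.exp(-beta * (W @ (fa - fb))))

def H(p):
    """Entropy of a discrete distribution."""
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log(p)).sum()

def class_info_gain(belief, z, p_a):
    """Mutual information between the query answer and the equivalence class,
    rather than the raw reward parameters."""
    def class_dist(weights):
        d = np.bincount(z, weights=weights, minlength=z.max() + 1)
        return d / d.sum()
    p_answer_a = np.sum(belief * p_a)
    prior = class_dist(belief)
    post_a = class_dist(belief * p_a)
    post_b = class_dist(belief * (1 - p_a))
    return H(prior) - (p_answer_a * H(post_a) + (1 - p_answer_a) * H(post_b))

# Pick the candidate query that is most informative about the equivalence class.
queries = [(rng.normal(size=3), rng.normal(size=3)) for _ in range(30)]
best = max(queries, key=lambda q: class_info_gain(belief, z, pref_prob(W, *q)))
```

Scoring queries this way avoids spending questions on reward differences that never change the induced behavior.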
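For the "Learning Multimodal Rewards from Rankings" entry above, here is a hedged reading of "mixture learning" over preference data: a soft-EM loop that fits K linear reward modes to pairwise comparisons under a Bradley-Terry response model (rankings can be decomposed into such pairs). The number of modes, the response model, and the simulated data are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

rng = np.random.default_rng(1)

def bt_prob(w, fa, fb, beta=2.0):
    """Bradley-Terry probability that a user with reward weights w prefers A over B."""
    return 1.0 / (1.0 + np.exp(-beta * (fa - fb) @ w))

# Comparisons: (features of preferred option, features of rejected option),
# simulated from two underlying reward modes to keep the sketch self-contained.
true_modes = np.array([[1.0, 0.0], [0.0, 1.0]])
comparisons = []
for _ in range(200):
    w = true_modes[rng.integers(2)]
    fa, fb = rng.normal(size=2), rng.normal(size=2)
    if fa @ w < fb @ w:
        fa, fb = fb, fa
    comparisons.append((fa, fb))

K, d = 2, 2
modes = rng.normal(size=(K, d))           # reward weights per mode
mix = np.full(K, 1.0 / K)                 # mixing proportions

for _ in range(50):                       # soft EM
    # E-step: responsibility of each mode for each comparison.
    lik = np.array([[mix[k] * bt_prob(modes[k], fa, fb) for k in range(K)]
                    for fa, fb in comparisons])
    resp = lik / lik.sum(axis=1, keepdims=True)
    # M-step: weighted gradient ascent on the Bradley-Terry log-likelihood.
    for k in range(K):
        grad = np.zeros(d)
        for (fa, fb), r in zip(comparisons, resp[:, k]):
            p = bt_prob(modes[k], fa, fb)
            grad += r * (1 - p) * 2.0 * (fa - fb)   # 2.0 matches beta above
        modes[k] += 0.05 * grad / len(comparisons)
    mix = resp.mean(axis=0)

print("estimated reward modes:\n",
      modes / np.linalg.norm(modes, axis=1, keepdims=True))
```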
This list is automatically generated from the titles and abstracts of the papers on this site.