Inverse Reinforcement Learning via Matching of Optimality Profiles
- URL: http://arxiv.org/abs/2011.09264v2
- Date: Thu, 19 Nov 2020 08:55:03 GMT
- Title: Inverse Reinforcement Learning via Matching of Optimality Profiles
- Authors: Luis Haug, Ivan Ovinnikov, Eugene Bykovets
- Abstract summary: We propose an algorithm that learns a reward function from demonstrations of suboptimal or heterogeneous performance.
We show that our method is capable of learning reward functions such that policies trained to optimize them outperform the demonstrations used for fitting the reward functions.
- Score: 2.561053769852449
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The goal of inverse reinforcement learning (IRL) is to infer a reward
function that explains the behavior of an agent performing a task. The
assumption that most approaches make is that the demonstrated behavior is
near-optimal. In many real-world scenarios, however, examples of truly optimal
behavior are scarce, and it is desirable to effectively leverage sets of
demonstrations of suboptimal or heterogeneous performance, which are easier to
obtain. We propose an algorithm that learns a reward function from such
demonstrations together with a weak supervision signal in the form of a
distribution over rewards collected during the demonstrations (or, more
generally, a distribution over cumulative discounted future rewards). We view
such distributions, which we also refer to as optimality profiles, as summaries
of the degree of optimality of the demonstrations that may, for example,
reflect the opinion of a human expert. Given an optimality profile and a small
amount of additional supervision, our algorithm fits a reward function, modeled
as a neural network, by essentially minimizing the Wasserstein distance between
the corresponding induced distribution and the optimality profile. We show that
our method is capable of learning reward functions such that policies trained
to optimize them outperform the demonstrations used for fitting the reward
functions.
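Below is a minimal sketch, not the authors' implementation, of the objective described in the abstract: a neural reward model is fit so that the distribution of discounted returns it induces on the demonstration trajectories matches a given optimality profile, using the closed-form 1-D Wasserstein-1 distance between two equal-size samples (mean absolute difference of sorted values). All names (RewardNet, induced_returns, fit_reward, gamma, profile_samples) are hypothetical, and the small amount of additional supervision mentioned in the abstract is omitted.

```python
# Sketch only: fit a reward network so that the induced distribution of
# discounted returns on the demonstrations matches the optimality profile.
import torch
import torch.nn as nn


class RewardNet(nn.Module):
    """Maps a state (or state-action) feature vector to a scalar reward."""

    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs).squeeze(-1)


def induced_returns(reward_net, trajectories, gamma=0.99):
    """Discounted return of each demonstration under the current reward model."""
    returns = []
    for traj in trajectories:                      # traj: (T, obs_dim) tensor
        rewards = reward_net(traj)                 # (T,)
        discounts = gamma ** torch.arange(len(rewards), dtype=rewards.dtype)
        returns.append((discounts * rewards).sum())
    return torch.stack(returns)


def wasserstein_1d(samples_a, samples_b):
    """W1 between two equal-size empirical 1-D samples: mean |sorted difference|."""
    return (torch.sort(samples_a).values - torch.sort(samples_b).values).abs().mean()


def fit_reward(reward_net, trajectories, profile_samples, steps=1000, lr=1e-3):
    """Minimize W1 between the induced return distribution and the profile.

    profile_samples: tensor of returns sampled from the optimality profile,
    assumed here to have one sample per demonstration trajectory.
    """
    opt = torch.optim.Adam(reward_net.parameters(), lr=lr)
    for _ in range(steps):
        loss = wasserstein_1d(induced_returns(reward_net, trajectories),
                              profile_samples)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return reward_net
```

The sorted-difference form of the 1-D Wasserstein distance keeps the objective differentiable in the reward parameters, so a standard gradient-based optimizer suffices for this simplified version.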
Related papers
- Inverse Reinforcement Learning with Sub-optimal Experts [56.553106680769474]
We study the theoretical properties of the class of reward functions that are compatible with a given set of experts.
Our results show that the presence of multiple sub-optimal experts can significantly shrink the set of compatible rewards.
We analyze a uniform sampling algorithm that is minimax optimal whenever the sub-optimal experts' performance level is sufficiently close to that of the optimal agent.
arXiv Detail & Related papers (2024-01-08T12:39:25Z) - Truncating Trajectories in Monte Carlo Reinforcement Learning [48.97155920826079]
In Reinforcement Learning (RL), an agent acts in an unknown environment to maximize the expected cumulative discounted sum of an external reward signal.
We propose an a-priori budget allocation strategy that leads to the collection of trajectories of different lengths.
We show that an appropriate truncation of the trajectories can improve performance.
arXiv Detail & Related papers (2023-05-07T19:41:57Z) - D-Shape: Demonstration-Shaped Reinforcement Learning via Goal Conditioning [48.57484755946714]
This paper introduces D-Shape, a new method for combining imitation learning (IL) and reinforcement learning (RL) that uses ideas from reward shaping and goal-conditioned RL to resolve conflicts between the IL and RL objectives.
We experimentally validate D-Shape in sparse-reward gridworld domains, showing that it both improves over RL in terms of sample efficiency and converges consistently to the optimal policy.
arXiv Detail & Related papers (2022-10-26T02:28:32Z) - Basis for Intentions: Efficient Inverse Reinforcement Learning using Past Experience [89.30876995059168]
This paper addresses the problem of inverse reinforcement learning (IRL): inferring the reward function of an agent from observations of its behavior.
arXiv Detail & Related papers (2022-08-09T17:29:49Z) - Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z) - LiMIIRL: Lightweight Multiple-Intent Inverse Reinforcement Learning [5.1779694507922835]
Multiple-Intent Inverse Reinforcement Learning seeks to find a reward function ensemble to rationalize demonstrations of different but unlabelled intents.
We present a warm-start strategy based on up-front clustering of the demonstrations in feature space.
We also propose an MI-IRL performance metric that generalizes the popular Expected Value Difference measure.
arXiv Detail & Related papers (2021-06-03T12:00:38Z) - Learning One Representation to Optimize All Rewards [19.636676744015197]
We introduce the forward-backward (FB) representation of the dynamics of a reward-free Markov decision process.
It provides explicit near-optimal policies for any reward specified a posteriori.
This is a step towards learning controllable agents in arbitrary black-box environments.
arXiv Detail & Related papers (2021-03-14T15:00:08Z) - Provably Efficient Reward-Agnostic Navigation with Linear Value Iteration [143.43658264904863]
We show how value iteration under a more standard notion of low inherent Bellman error, typically employed in least-squares value-iteration-style algorithms, can provide strong PAC guarantees on learning a near-optimal value function.
We present a computationally tractable algorithm for the reward-free setting and show how it can be used to learn a near optimal policy for any (linear) reward function.
arXiv Detail & Related papers (2020-08-18T04:34:21Z)